Streaming Audio: Apache Kafka® & Real-Time Data
Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community—from Kafka connectors and distributed systems, to data mesh, data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join our hosts as they stream through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®️, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Resurrecting In-Sync Replicas with Automatic Observer Promotion ft. Anna McDonald
As most developers and architects know, data always needs to be accessible no matter what happens outside of the system. This week, Tim Berglund virtually sits down with Anna McDonald (Principal Customer Success Technical Architect, Confluent) to discuss how Automatic Observer Promotion (AOP), a feature now available in Confluent Platform 6.1 and above, can help solve the 2.5-datacenter dilemma in Apache Kafka®. Many industries must have a backup plan not only to do the right thing by the data that they collect but because they are regulated by law to do so.
Anna has a knack for preparing operations that make replication of data possible both synchronously and asynchronously. To avoid roadblocks in stretch clusters, she’s found that you need to set both a replication factor and a minimum in-sync replicas (min ISR) value. You need not just one but multiple copies to meet your data protection criteria. If too few in-sync replicas are left when a datacenter fails, your application is down, and there’s no way to retrieve vital information during the outage. Observers are asynchronous replicas that don’t count towards that minimum ISR.
These ISRs work because they help recover data without invalidating any other standards. Architects should try to maintain topic availability during an event in a two-zone configuration. This assures that the writes go to both zones during normal operation without compromise. With the newest version of Confluent, you can get data in sync and within the minimum ISR. AOP is an excellent solution for developers who want to prepare for the unexpected and maintain accessibility across zones. When you can avoid manual interruption, you’re more likely to avoid errors and tedious operations, which would otherwise lead to a higher probability of data loss.
In other exciting news, Anna shares about discovering patterns in order to make the entire Confluent ecosystem more automatic.
EPISODE LINKS
- Automatic Observer Promotion Brings Fast and Safe Multi-Datacenter Failover with Confluent Platform 6.1
- Amusing Ourselves to Death
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Kafka streaming in 10 minutes on Confluent Cloud
- Use 60PDCAST to get an additional $60 of free Confluent Cloud usage (details)
Tim Berglund:
Multi-DC Kafka has come a long way in the last few years and Anna McDonald has been a part of that. So much so that there is an eponymous pattern, the Anna Pattern, for how to get that done. And she's going to explain that to us today in terms of hanging a curtain rod, and also tell us about automatic observer promotion in Confluent Platform 6.1 and above. It's all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
Tim Berglund:
Hello and welcome to another episode of Streaming Audio. I am as usual, your host, Tim Berglund and I am joined in the studio again by my friend and co-worker, Anna McDonald. Anna, welcome back to the show.
Anna McDonald:
Thank you and yet again I'm surprised I'm allowed back. It's just every time it's [inaudible 00:54:00]
Tim Berglund:
It is, as I've said recently, I think on your part, the triumph of hope over experience.
Anna McDonald:
It is. That's right.
Tim Berglund:
Yeah. I'm always delighted to have you on the show. You're famous, of course, now for your two Halloween episodes and we were thinking, how can we do an Easter episode? And I don't know that it's obvious. The Halloween thing, that makes a lot of sense. I just thought this is a holiday that we should observe. And to have you on as a special thing. To be honest, I don't think you need a holiday to be on the show. I think you should just be a regular feature. But here we are and we are going to talk about, today, what I like to call the Anna Pattern. There actually is, I can tell you in Confluent Slack, there is a whole channel called Anna Pattern.
Anna McDonald:
That's Matthias J. Sax.
Tim Berglund:
Yeah. That is Matthias J. Sax.
Anna McDonald:
Thank you for that. Dr. Matthias J. Sax.
Tim Berglund:
It is actually Dr. Matthias J. Sax and it is typical for Computer Science. One of my best friends is a professor of Mechanical Engineering and so they actually call each other doctor because students and things like that, because they're professors. It's funny in Computer Science, if you have a PhD, you don't talk about it, right? It's this weird thing.
Anna McDonald:
It's true.
Tim Berglund:
He's Dr. Sax, and there's Dr. Posner, and there's all these doctors running around. You don't even know it.
Anna McDonald:
It is. That's very true.
Tim Berglund:
The Anna Pattern, specifically this has to do with automatic observer promotion. And this is the thing, recently as of this recording, Confluent Platform 6.1 was released and automatic observer promotion was talked about there. I talked about the Anna Pattern by my fire pit.
Anna McDonald:
It looked cozy.
Tim Berglund:
It was, actually. Yeah, it was nice.
Anna McDonald:
Awesome.
Tim Berglund:
But tell us about this. What are the circumstances that give rise to it? What's the new feature? You're the one that this is named after, so [crosstalk 00:02:53]
Anna McDonald:
Yeah. The first thing, I just do a level set. In many industries, you want to be resilient and sometimes that's because you're a good citizen of your data, but more often than not, it's because you're required to by law, right? You're regulated. So, we're talking about banking industries, insurance, places where the law has stepped in and said, "Look", right? If you stop working, there are going to be economic impacts that are not fun.
Tim Berglund:
Yeah. It's not just a good idea but ultimately there are people with guns who will make it so that you can just-
Anna McDonald:
Yeah. I don't know if the Fed... I could see Janet Yellen though, literally stepping in there. She's pretty cool. But anyhow, yes. And so, one of the things that I noticed early on was that most people in the U.S. had two DCs. They only had two. And I kind of wanted to explain this and why it matters, because there's some confusion when we try to explain why this is a problem, right? And it can actually happen with any number of DCs. And I'm going to explain this at a high level, in terms of curtains, because I'm famous for my incredibly-
Tim Berglund:
The basic problem though is-
Anna McDonald:
...spot on analogies.
Tim Berglund:
...high availability.
Anna McDonald:
Well, right.
Tim Berglund:
And you are, amazing.
Anna McDonald:
Yeah. And I'm taking a little detour here to explain the problem, right? So let's say that your system is a curtain rod. And this happened because my six-year-old pulled down her curtains the other day. So, let's say your system is a curtain rod. Your curtain rod will stay up as long as you have at least two brackets. That's what you need, and in terms of DCs, that's how you're in compliance. You're in compliance if you've got at least two brackets, your curtain rod's up, and you're good to go.
Tim Berglund:
Mine's the tension kind where you screw it.
Anna McDonald:
No, [crosstalk 00:04:46] the banking industry is not allowed to use those. They must have all of the screwed brackets. That's their regulation.
Tim Berglund:
Got it.
Anna McDonald:
Yes.
Tim Berglund:
Okay. I'm thinking of a shower curtain rod. Anyway, you're probably thinking of a window.
Anna McDonald:
No. Curtains like windows. Yeah, on windows.
Tim Berglund:
Yeah, okay. The bracket kind. I'm with you. [crosstalk 00:05:03] I don't have that screw kind.
Anna McDonald:
Yeah, so screwed in. And so, if you've only got two brackets, I don't know if you've ever seen any of those money fancy windows that people have where they're super long, and then they actually have to have a third bracket in the middle to hold up the curtain rod because it's so heavy money and bougie?
Tim Berglund:
Yep!
Anna McDonald:
Yeah. So, if you have only two brackets because you don't have a bougie curtain rod, and your kid comes along and yanks one side of the curtain and it falls out of the wall because maybe you were in a hurry and you didn't [crosstalk 00:05:36],
Tim Berglund:
Hypothetically.
Anna McDonald:
I'm not saying that happened at my house, and then I got a funny look like, "You didn't put that in the stud". I was like, "Dude, I was..." Anyway, so it falls down. If you only have two, it falls down and one dies basically. So then, you have to get your husband or somebody like my husband, Luke, to hold up the curtain rod while you fix the bracket.
Tim Berglund:
Cool Hand Luke.
Anna McDonald:
Right. That's right. Cool hand by Luke. Underscore, because we're both wrapped in the underscore. But the thing is, if you've got one of those bougie windows and you've got three brackets on your curtain rod, and your kid comes along and pulls the curtain down and one bracket comes out, your rod still stays up because you've still got two brackets. Now, if they like sucker punched the middle one and you're left with only one bracket, it's still coming down and you need a Luke to like hold it up while you fix it. So in this analogy [crosstalk 00:06:29] you need it, right. The brackets are your DCs. The number of DCs you have. And Luke is the observers who can temporarily hold up the curtain while you screw it back in. So this is the analogy, right? So it's all about,
Tim Berglund:
I know how observers work and I actually didn't see that coming where Luke was the observer [crosstalk 00:06:49]
Anna McDonald:
No, I blew your mind.
Tim Berglund:
Yeah. Okay. Go on.
Anna McDonald:
Right. And the reason I say this is because I just gave a talk about this yesterday and people get really caught up in thinking this is only for a 2.5 DC scenario, where you've got two DCs and you've got a ZooKeeper stored somewhere else, like thrown in the closet just hiding. But it's not. It's any time you're left with one zone. So you could have like that bougie curtain rod with three zones or three DCs. If your resiliency requirement is that I still have to be operational if two of those DCs die, then you need observers and you need a Luke to hold that up.
Anna McDonald:
So it's really any number of DCs. It's about what your resiliency tolerance is. And that's how I want to state this problem. The problem is I've got multiple brackets. I need to make sure that those are there for resiliency requirements, but I also need to make sure if all but one go down, I can still produce while I'm fixing them. My curtain's still up, basically. So that's how I'm doing this. This could be the world's worst analogy, which wouldn't shock me. But I think it's pretty good. I think it's pretty good.
Tim Berglund:
No, it might be the world's best. This is the best analogy that is in there. Better than the other one.
Anna McDonald:
It is better than the cantaloupe one. That left everyone feeling melancholy.
Tim Berglund:
Yeah.
Tim Berglund:
Folks, she actually can do that all day. That's the funny thing, like she doesn't get this stuff.
Anna McDonald:
So, knowing what we know now about the numbers, why is this a problem for Kafka? Why is this a problem in Apache Kafka? And the reason is the way that you assure when you have a stretch cluster, and I always do this and for showing video, I always do DC 1, DC 2, right? Or zone 1, zone 2. When you have a stretch cluster there like this: there's one cluster for MRC, I'm going to have to like pop up and down. And so, when I'm producing to this one cluster, I don't want to say yes, I've acknowledged a message till it hits this one and this one, until it hits them both. And in order to do that, we have these two concepts in Kafka. One is replication factor, and the other is min. ISR. Replication factor is how many copies of this partition do I have?
Anna McDonald:
Min. ISR is how many copies do I have to have that are in sync. That means that they're available. Right? They're all up to date in order for me to want to produce. And the reason people do that is because if I only have one left, I might lose it. So I only want to produce if I know that it's going to go to multiple copies basically. That's kind of what the idea of min ISR is. So you set acks equal to all, you've got a min. ISR that is a protection. And if all goes well, and you meet those criteria, your produce request works. So what you do, and this is because Kafka basically has, you can either produce, your acks can be like none, one, or all. Those are the only options.
Tim Berglund:
Or all. And all the in-sync replicas.
Anna McDonald:
That's correct. Yep. And so what ends up happening in a 2.5 DC sitch is, when I produce I want to make sure it hits both DCs. Otherwise, I'm not getting any resiliency. If I produce and a copy of that message isn't sitting in both DCs, what am I doing? Why am I here? Right. Because that's the requirement.
Tim Berglund:
Lose the DC, I lose the data.
Anna McDonald:
Right. And so in order to do that, what you have to do is you have to say, I have to have my min. ISR one greater than the number of replicas or copies located in one DC. So if I've got two DCs and I've got like two replicas over here, and two over here, I have to have a min. ISR of three.
Tim Berglund:
I need three.
Anna McDonald:
Because that way I know that it hit at least two here and one here. And if your ISR drops below three, which can happen, let's say this DC fails, now I'm below my min ISR. I'm not allowed to produce. My curtain rod has fallen down.
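To put numbers on the rule Anna describes, here is a minimal sketch in Python. It assumes a two-DC stretch cluster with the same number of synchronous replicas in each DC; the function names are hypothetical and only illustrate the arithmetic, not any Kafka API.

```python
def min_isr_for_both_dcs(replicas_per_dc: int) -> int:
    """Smallest min ISR that forces every acknowledged write into both DCs.

    One more than a single DC can hold, so the in-sync set can never
    be satisfied by one DC alone.
    """
    return replicas_per_dc + 1


def can_produce_with_acks_all(in_sync_replicas: int, min_isr: int) -> bool:
    """With acks=all, a write is accepted only if enough replicas are in sync."""
    return in_sync_replicas >= min_isr


replicas_per_dc = 2                               # two replicas here, two over there
min_isr = min_isr_for_both_dcs(replicas_per_dc)   # 2 + 1 = 3

healthy = can_produce_with_acks_all(in_sync_replicas=4, min_isr=min_isr)
one_dc_down = can_produce_with_acks_all(in_sync_replicas=2, min_isr=min_isr)

print(min_isr, healthy, one_dc_down)
```

With both DCs up, all four replicas are in sync and produce requests succeed. Lose one DC and only two replicas remain, which is below the min ISR of three, so the curtain rod comes down.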
Tim Berglund:
The application is down.
Anna McDonald:
That's right. My curtain is down.
Tim Berglund:
Yes. There's no Luke, there's nothing.
Anna McDonald:
The neighbors are looking, it's uncomfortable. So it's just not good. So then, and this is the situation that we were in before we had the Anna Pattern for automatic observer promotion. You basically had to intervene. You basically had to lower your acks on your producer, or you had to lower your min ISR in every topic. And that's like an operator thing where an operator would have had to go in and do that, or a developer would have had to go in and redeploy their applications. That's not fun.
Tim Berglund:
Which means an application configuration change.
Anna McDonald:
During an outage. Who wants to do that? I mean, it's just right from sticks. So the question then is how do you get a Luke? How do you get somebody to temporarily hold up ISR while you fix your curtain, while you bring your DC back up. And so when I was thinking about this, when we first introduced MRC, we also introduced observers. And observers are an asynchronous replica. The thing about them is they don't count in the ISR because they're asynchronous. They're just sitting out there, they're a copy of the data.
Tim Berglund:
Asynchronous with respect to the produced-
Anna McDonald:
Correct.
Tim Berglund:
...of logic, right?
Anna McDonald:
Yes.
Tim Berglund:
They're taking replicas, but I'm not going to wait for them.
Anna McDonald:
Correct. Right. They may or may not be in-sync. And so when I was looking at this early on and I was like, well, why don't we do this? I said, in order to get around this problem, why don't we auto promote? Like if you fall below min ISR, so if you're in this situation where you used to have two replicas and two replicas, min in-sync of three, and this dies. If you had an observer sitting over here, why don't we have the leader say, "Hey, I'm below min ISR. I need to be three, I'm only two. Come help me. Hold up this curtain rod while I bring this back up." And so that's what we did. That's the Anna Pattern. An observer sits over here, and as soon as you drop below min. ISR, the leader will go, "Hey, get over here." And it's like, okay. And then it comes back up and you can produce until you can get this DC back up. When that happens, this observer will automatically be demoted. And you'll go back to your happy path, two and two. Your good replica placement policy.
Anna McDonald:
So they're really a temporary fix. They come in to help you until you recover, and then they go right back out. And that's kind of the entire idea behind this. And again, it was designed for a situation where it's like a 2.5 DC, where you have two DCs. But that's why I started out with that like bracket example, because you could have three DCs. If your requirement is that you have to still be up if two of those die, you're in the same area, you're in the same bucket- [crosstalk 00:13:54].
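The "two and two" placement with an observer standing by can be written down as a replica placement policy. The sketch below builds one as a Python dictionary and prints it as JSON. The field names (`version`, `replicas`, `observers`, `observerPromotionPolicy`) and the `under-min-isr` value follow Confluent's replica placement JSON as best recalled and should be checked against the Confluent Platform documentation; the rack names `dc1` and `dc2` and the choice of one observer per DC are assumptions made for illustration.

```python
import json

# Hypothetical placement for a two-DC stretch cluster:
# two synchronous replicas and one observer in each DC.
placement = {
    "version": 2,
    "replicas": [
        {"count": 2, "constraints": {"rack": "dc1"}},
        {"count": 2, "constraints": {"rack": "dc2"}},
    ],
    "observers": [
        {"count": 1, "constraints": {"rack": "dc1"}},
        {"count": 1, "constraints": {"rack": "dc2"}},
    ],
    # Ask the leader to promote an observer when the ISR falls below min ISR.
    "observerPromotionPolicy": "under-min-isr",
}

sync_replicas = sum(group["count"] for group in placement["replicas"])
observers = sum(group["count"] for group in placement["observers"])
min_isr = placement["replicas"][0]["count"] + 1   # one more than a single DC holds

print(json.dumps(placement, indent=2))
print(f"sync replicas={sync_replicas}, observers={observers}, min ISR={min_isr}")
```

Under these assumptions the topic would run with four synchronous replicas, two observers, and a min ISR of three, which is the happy path Anna returns to once the failed DC is back.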
Tim Berglund:
Yeah. Somebody's got to come hold the curtain rod.
Anna McDonald:
...You could, you can use this pattern for that as well. Exactly. Because the curtain rod will still fall down. So that's kind of like the long and short of this pattern.
Tim Berglund:
But historically, with observer promotion, historically, it's like you had to call Luke from downstairs or something and maybe he was busy and it took him a minute to get there. And you were hands on in getting that person to prop up the rod for you. [crosstalk 00:14:20]
Anna McDonald:
I'm going to be honest, I'm going to add to that. It was like he might've also been drunk and caused damage. Because the thing was, is you had to enable unclean leader election. You had to go eyeball and make sure that these observers were caught up per partition, which in a large cluster, that's thousands of partitions. You don't want like a buzzed Luke holding an expensive curtain rod above your china cabinet. It's just a dangerous situation. You want [crosstalk 00:14:48] on their game. That's right. So it was even a little bit worse than that. And now you don't have to think about it. That's what I love about this. You just don't even have to think about it. It just happens. And so that's the long and short of it.
Tim Berglund:
Nice.
Anna McDonald:
Thank you.
Tim Berglund:
With the automatic observer promotion. Now, do you know the logic inside the observer that, how do we get away with not having to have unclean leader election? Like I understand why that had to be enabled and really need to promote. How does that work now?
Anna McDonald:
So the coolest thing. Yeah. So again, I think I've said this before, my favorite class ever is Partition.scala. If you haven't looked at it, you should. It's a beautiful thing. And so in there, there's a method called maybeExpandIsr. It's because really, well replicas have the same issue. So you can have replicas that fall out of sync, right? We really call those under-replicated partitions. We hate those. And that's when ISR shrinks and expands because something's wrong with one of the replicas.
Anna McDonald:
So what ends up happening is now there's a check and that check says "Oh, are we under min ISR?" Well then go check observers too. Is there an in-sync observer? Because those observers will also report back whether or not they're in-sync. They still report back. They just don't count in ISR. So then the leader is going to go "Oh, hey, I see you. I need your help. Come on." And so that's all handled and [crosstalk 00:16:21]
Tim Berglund:
You're in-sync. You're eligible to be promoted.
Anna McDonald:
Right. And what I love about that is we've been running, right. Kafka has been running that code base, that algorithm forever. So it's battle tested. It's not something new. It's basically making use of an existing process that we know works great. And just letting observers become eligible if they're in-sync and you're under min ISR. It's actually very elegant.
Tim Berglund:
Right. Okay. So, that makes total sense. And it's obvious that it's battle-tested sort of older code because it's still in Scala. Right?
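The check Anna describes can be sketched as a small simulation. This is not the code in Partition.scala; it is a simplified Python model under stated assumptions: the `Replica` class and `effective_isr` function are hypothetical, and promoting only as many observers as are needed to get back to min ISR is a simplification chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    broker_id: int
    dc: str
    is_observer: bool
    caught_up: bool = True
    alive: bool = True


def effective_isr(replicas, min_isr):
    """Return the broker ids in the ISR after the promotion check."""
    isr = [r for r in replicas if r.alive and r.caught_up and not r.is_observer]
    if len(isr) < min_isr:
        # Under min ISR: caught-up observers become eligible to join.
        eligible = [r for r in replicas if r.alive and r.caught_up and r.is_observer]
        isr += eligible[: min_isr - len(isr)]
    return [r.broker_id for r in isr]


replicas = [
    Replica(1, "dc1", is_observer=False),
    Replica(2, "dc1", is_observer=False),
    Replica(3, "dc2", is_observer=False),
    Replica(4, "dc2", is_observer=False),
    Replica(5, "dc1", is_observer=True),
    Replica(6, "dc2", is_observer=True),
]

happy_path = effective_isr(replicas, min_isr=3)

for r in replicas:               # dc2 fails
    if r.dc == "dc2":
        r.alive = False
during_outage = effective_isr(replicas, min_isr=3)

for r in replicas:               # dc2 recovers
    r.alive = True
after_recovery = effective_isr(replicas, min_isr=3)

print(happy_path, during_outage, after_recovery)
```

Because the ISR is recomputed from the current state each time, the observer drops out again once the regular replicas are back, which mirrors the automatic demotion described earlier in the conversation.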
Anna McDonald:
Yeah.
Tim Berglund:
You see parts of the code base that are in Scala and you're like this is sort of the- I mean no, not that Scala is bad. I mean clearly insert Scala joke here, this would be the time to do that. If Viktor were on the show, he would definitely make some Scala jokes right now and they'd be funny. But what was I going to say? Oh yeah, that's sort of like the reptile brain, the brainstem of Kafka, the early layers laid down in Scala and now you've got all this Java. The medial prefrontal cortex of Kafka is not written in Scala, to use a suddenly very obscure neurological metaphor.
Anna McDonald:
I like it. I'm a fan.
Tim Berglund:
Yeah. I have to, I'm going to credit him. But I feel like I have to tell my favorite Scala joke of Viktor's, which is off topic, but slightly-
Anna McDonald:
I would like to hear it.
Tim Berglund:
...since we're talking about Scala, okay. This was at Kafka Summit a couple of years ago, he was showing some Kotlin code. This was like in the early days of him doing Kotlin, which is good, I'm glad he does that. And by the way, I'm referring to our colleague Viktor Gamov, he's a developer advocate and an occasional guest on this show.
Anna McDonald:
He has great taste in high tops.
Tim Berglund:
Yeah, no, strong, strong shoe game in that man. So he's showing some Kotlin code and he said, okay, this is Kotlin. And if you're a Java developer, you probably find that you're able to just read this, in contrast to Scala code, where even if you're a Scala developer, you can't read it.
Anna McDonald:
It was funny. I liked it.
Tim Berglund:
It was funny. Utterly gratuitous, right. There was no reason he had to go there, at all.
Anna McDonald:
No, and I like that about it.
Tim Berglund:
But he did.
Anna McDonald:
I do. I like that about him. That's funny.
Tim Berglund:
It was good. Yeah.
Anna McDonald:
I do enjoy that.
Tim Berglund:
Anna, are we done? I feel like you've explained it.
Anna McDonald:
I have. Yeah. I mean, I think I have explained it now. I've explained it at, like, an awesome level. Configuration wise, there are still things that you're going to need to do to take advantage of automatic observer promotion. There's monitoring. And then there's my favorite. I'm actually working right now on a tuning guide. Just in general for like low-level latency for this pattern. So there's other things you need to do to make this whole ecosystem awesome. But as far as like resurrecting your ISR automatically, that's it, man.
Tim Berglund:
See what you did there. Yeah.
Anna McDonald:
I do. I did. Did you see what I did there? It was good, right?
Tim Berglund:
That was good.
Anna McDonald:
Happy Easter.
Tim Berglund:
That's a theme I appreciate.
Anna McDonald:
That's right. So, so yeah. I mean, that's about it, man.
Tim Berglund:
I felt like it was going to take longer, but you made it make sense in 20 minutes. Like how do you do that?
Anna McDonald:
I don't know. I think that's why I was so excited about the curtain rod thing. Because before this, it took a lot longer. I was just trying to, it's a numbers game. And it's hard to explain that in a fashion that doesn't get you way too deep into like racks and acks and all of this like lingo. And so at a high level, I was.
Tim Berglund:
And you got to whiteboard that.
Anna McDonald:
Yeah. Yeah. And at a high level, I was just.
Tim Berglund:
It doesn't work well in podcast.
Anna McDonald:
No, it does not at all. It really doesn't. Can we talk, you know what, if we've got time, like I'm excited for Kafka Summit as an attendee.
Tim Berglund:
Let's talk about it a little bit.
Anna McDonald:
Okay.
Tim Berglund:
I'm excited.
Anna McDonald:
I'm excited to see Neil's talk. I'm so excited because we got all these new Kafka Streams metrics and he had made these incredible dashboards that are going to help people so much. And I cannot wait to see his talk. I absolutely cannot wait. I'm so excited.
Tim Berglund:
Yes. Referring to Neil Buesing, our friend. Yeah.
Anna McDonald:
Is there another Neil that we should really-
Tim Berglund:
The number extraordinary?
Anna McDonald:
Do you know another Neil? Cause I don't know.
Tim Berglund:
I know several other Neils. There's the late Neil Postman who certainly has had, I think I would say a formative intellectual influence on me-
Anna McDonald:
Neil Diamond?
Tim Berglund:
...his book, Amusing Ourselves to Death. Less of an influence on me in the case of Neil Diamond. There's my friend, Neal Ford, who has made a big impact on me as a communicator. There's Neil Avery. Our onetime coworker and awesome guy.
Anna McDonald:
There's my friend Neil who lives in Boston. He's pretty cool.
Tim Berglund:
Yeah?
Anna McDonald:
Yep. I don't have a lot of Neils. You got better? Yeah. I mean, I've got my Neil. That's why, I guess I don't ever qualify and say, Neil. You were saying?
Tim Berglund:
I'll make sure I link to Amusing Ourselves to Death in the show notes. It has nothing to do with Kafka, but just, I think super good book for thinking about media and sort of like Marshall McLuhan applied.
Anna McDonald:
I like it.
Tim Berglund:
But yeah. Yeah. That's not for this podcast. That'd be for a different podcast to really deep dive into that.
Anna McDonald:
I'm listening to a podcast, speaking of which: Oxford University has published all of their like lunchtime lecture series.
Tim Berglund:
Oh nice.
Anna McDonald:
Yeah. And they're hilarious and amazing. Like there's one, that's like 800 years of the history of mathematics at Oxford University. It is hilarious and awesome.
Tim Berglund:
I am currently getting my phone out because I needed to recall the name of a podcast. I'm just about done with an episode of Legends of Sales and Marketing by People.ai, because our own president of worldwide field operations, Erica Schultz, was just interviewed in that. That was interesting. Just like I'm not a sales guy, but I've been reading just some blogs and sort of like what's enterprise SaaS sales. How does it, what are all these words that people use? It's just good stuff to know. And of course the good old Lexicon Valley and EconTalk. Those are my go-tos.
Anna McDonald:
Yeah. You're always on your game, Tim.
Tim Berglund:
Well, and thank you Anna. But in the spirit of my long-time favorite podcast, EconTalk, I will say my guest today has been Anna McDonald. Anna, thanks for being a part of Streaming Audio.
Anna McDonald:
Thank you.
Tim Berglund:
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST, that's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available, so don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter at @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign up links for those things in the show notes, if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review and we think that's a good thing. So thanks for your support and we'll see you next time.