Streaming Audio: Apache Kafka® & Real-Time Data

Top 6 Worst Apache Kafka JIRA Bugs

December 21, 2022 Confluent, founded by the original creators of Apache Kafka® Season 1 Episode 249

Entomophiliac, Anna McDonald (Principal Customer Success Technical Architect, Confluent) has seen her fair share of Apache Kafka® bugs. For her annual holiday roundup of the most noteworthy Kafka bugs, Anna tells Kris Jenkins about some of the scariest, most surprising, and most enlightening corner cases that make you ask, “Ah, so that’s how it really works?”

She shares a lot of interesting details about how batching works, the replication protocol, how Kafka’s networking stack dances with Linux’s one, and which is the most important Scala class to read, if you’re only going to read one.

In particular, Anna gives Kris details about a bug that he’s been thinking about lately – sticky partitioner (KAFKA-10888). When a Kafka producer sends several records to the same partition at around the same time, the partition can get overloaded. As a result, if too many records get processed at once, they can get stuck causing an unbalanced workload. Anna goes on to explain that the fix required keeping track of the number of offsets/messages written to each partition, and then batching to force more balanced distributions.

She found another bug that occurs when the Kafka server triggers TCP congestion control under some conditions (KAFKA-9648). Anna explains that when a Kafka server restarts and then runs the preferred replica leader election, lots of replica leader changes trigger cluster metadata updates. Then all clients establish server connections at the same time, and lots of TCP requests end up waiting in the TCP SYN queue.

The third bug she talks about (KAFKA-9211) may cause TCP delays after upgrading…. Oh, that’s a nasty one. She goes on to tell Kris about a rare bug (KAFKA-12686) in Partition.scala where there’s a race condition between the handling of an AlterIsrResponse and a LeaderAndIsrRequest. This rare scenario involves the delay of AlterIsrResponse when lots of ISR and leadership changes occur due to broker restarts.

Bugs five (KAFKA-12964) and six (KAFKA-14334) are no better, but you’ll have to plug in your headphones and listen in to explore the ghoulish adventures of Anna McDonald as she gives a nightmarish peek into her world of JIRA bugs. It’s just what you might need this holiday season!



Kris Jenkins (00:00):

This week's Streaming Audio sees the return of one of our favorite guests, Anna McDonald, with her annual roundup of the year's most noteworthy Apache Kafka bugs. She's got scary ones, surprising ones, and a few enlightening corner cases, things that make you think, "Ah, that's how that works under the hood," that kind of thing.

Kris Jenkins (00:25):

Along the way, we're going to learn some interesting details about how batching works, how the replication protocol actually works, how Kafka's networking stack interacts with Linux's one, and which, in her opinion, is the most important Scala class to read, if you're only going to read one of them.

Kris Jenkins (00:44):

What can I tell you? Those of you who know Anna will know she's a force of nature. This is my first time talking to her properly and all I can tell you is hold on tight because she drives like a New Yorker. So this podcast is brought to you by Confluent Developer. More about that at the end, but for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. Joining me on Streaming Audio today, the infamous Anna McDonald. Hey, Anna. How you doing?

Anna McDonald (01:19):

I'm doing excellent. I would like to, again, say that I'm shocked every time I'm allowed back on this show. So I'm excited.

Kris Jenkins (01:28):

Since last time, we switched hosts. So I have no real idea what I'm getting myself in for.

Anna McDonald (01:31):

Yes. I may never ... After this, it'll finally be [inaudible 00:01:37]-

Kris Jenkins (01:37):

This could be the last one. Let's make the most of it. I checked, you have the honor of being the most frequent guest, probably because of your repeated annual series of Screaming Audio Halloween specials.

Anna McDonald (01:51):

Yes. I think I did that on pur ... Yeah, volume. Right? What do they say? Quantity over quality? There's some quality puns.

Kris Jenkins (01:59):

Yes, never mind the quality, feel the thickness.

Anna McDonald (02:00):

Yeah, quality puns in there. No, I do. I enjoy podcasts because you can listen to them while you're doing something else. So I've always been a fan of the medium.

Kris Jenkins (02:10):

Yeah, excellent. So we were going to try and get you in for this Halloween and life got in the way. So you're back here for what I think we'll call the Jira Nightmare Before Christmas episode.

Anna McDonald (02:21):

That's right. The Nightmare Before Jira, which, as you had said, it's always kind of a nightmare. Sorry. 

Kris Jenkins (02:26):


Anna McDonald (02:26):

[inaudible 00:02:28]. 

Kris Jenkins (02:27):

Sorry, Jira people.

Anna McDonald (02:29):

Yeah. They're nice people, they are, but still, it's paperwork. Yes. Nightmare Before Christmas. And hold on, I don't know if you can see this or not. I was supposed to be making something else, but I made an Apache Kafka Christmas tree.

Kris Jenkins (02:43):

Oh. For those who are just listening to this on audio, Anna has a Kafka logo made out of ... Is it holly and bauble?

Anna McDonald (02:51):

Yes. I chopped up, with these awesome planter shears I have, this fake greenery and then I stuck the Christmas little things together, your Christmas bulbs or whatever they are, on the end of them with a glue gun, and then I made the Apache Kafka logo. I was supposed to be doing something else. It was not as interesting and fun as that.

Kris Jenkins (03:17):

Yeah, I know that kind of feeling. All the good stuff happens while you're supposed to be doing something else. Right?

Anna McDonald (03:20):

Absolutely. Got to make your own fun.

Kris Jenkins (03:22):

But you do work hard. Let's not get off the topic here. You work hard finding some very obscure bugs with Kafka and fixing them. Right?

Anna McDonald (03:31):

I do. Well, sometimes they find me.

Kris Jenkins (03:34):

They know where to look.

Anna McDonald (03:35):

Yeah, that's usually ... And so I've got some really good ones for us to chat about. One of them, my note was you know this as it took up a lot of your life. So I didn't really have to take a lot of notes about that one, but some of them are very fun to dive into and I'm excited. Let's do it.

Kris Jenkins (03:55):

Yeah. I have a preview list. One of them is actually related to something that's been on my mind a lot lately, which is producer batching and sticky partition assignment.

Anna McDonald (04:05):


Kris Jenkins (04:06):

So let's start there. Tell us the bug.

Anna McDonald (04:08):

So what ended up happening is, if you ... you have the sticky partitioner and it had the best intentions. It didn't set out to sabotage people, but what ends up happening in a producer, just to level set, is you have this thing called batch size that you can set. There's a default. You have this thing called linger.ms. Those are your two weapons in terms of optimizing your throughput. Linger.ms, as the name might sound, I love it because, every time, you just think of that Cranberries song. [Singing 00:04:39].

Kris Jenkins (04:40):

It's not just me. Yeah.

Anna McDonald (04:41):

And sometimes I sing that and I don't care on Zooms because I need to entertain myself. But it's really like, how long should I wait in order to make sure this batch is full? Because, in some situations, maybe it never gets full, if you have spotty traffic, and you don't want to wait forever, so you send it. So what ended up happening is, for a sticky partitioner, let's say that you want your data to go as fast as it's coming in. Let's say you have a huge fire hose topic, it's fantastic, tons of data all the time, linger.ms set to zero, go, go, go, go. 

Anna McDonald (05:15):

But no. So what they ended up finding, and this is something that was readily apparent ... I ran into this more than once on customer calls, is that, all of a sudden, throughput would drop. It would be horrible, all of a sudden. And when looking at this inside of the producer guts, as I like to say, you would notice there's this thing, it's called the record accumulator. Shockingly enough, it accumulates records. And that's your producer batching, as you said, right, Kris? We try to batch up. In Kafka, we don't really ... You can use it, obviously, and people do use it for single message, but really, it's made to produce batches. Records are stored in a record batch, they're evaluated, they're compressed. It's a batching kind of thing when we produce.

Kris Jenkins (06:07):

Right. It's got a timestamp for when it started for the sake of lingering. Right?
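
The interplay between batch size and linger time described above can be sketched in a few lines of Python. This is a simplified model, not the producer's actual code: the `Batch` class and `ready_to_send` function are hypothetical names, and the real record accumulator checks more conditions than these two. The 16384-byte default matches the 16KB batch size mentioned later in the conversation.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    created_ms: int      # when the first record was appended
    size_bytes: int = 0  # bytes accumulated so far


def ready_to_send(batch: Batch, now_ms: int,
                  batch_size: int = 16384, linger_ms: int = 0) -> bool:
    """A batch goes out when it is full or has lingered long enough."""
    is_full = batch.size_bytes >= batch_size
    has_lingered = now_ms - batch.created_ms >= linger_ms
    return is_full or has_lingered
```

With the linger time at zero, every batch is immediately eligible to send, which is the "go, go, go" setting from the fire hose example.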

Anna McDonald (06:11):

Right. Yeah, exactly. And so when you look at this, it was fascinating. You would see, some partitions, it had accumulated two records and, other partitions, it accumulated 37,000. And you're like, "Wait, what now? What's going on here?" And what ends up happening is the original implementation of the sticky partitioner, if there was a slow partition ... And there are many things that can make a partition slow. Most of the time, it's something going on the node. Right?

Kris Jenkins (06:40):


Anna McDonald (06:41):

So if there was a slow partition, that would give that partition a ton more time to accumulate records because, especially if you're using acks=all, it's going over ... let's say that you have max in flight set to one. While you're waiting for that request to come back, the producer is accumulating records for the next send. So it takes forever for it to come back and say, "Yeah, I sent that batch." Then you're just going to sit there being like, "And more and more."

Kris Jenkins (07:14):

And filling up this buffer, larger and larger.
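
A toy simulation can make this accumulation effect concrete. It assumes one in-flight request per partition and a steady stream of records arriving for every partition, which is a deliberate simplification: it leaves out the partitioner's own feedback loop entirely. The function name and all the numbers are illustrative.

```python
def largest_batches(latencies_ms, records_per_ms=1, duration_ms=1000):
    """Return the largest batch each partition sent during the run."""
    n = len(latencies_ms)
    pending = [0] * n      # records waiting to be sent, per partition
    busy_until = [0] * n   # time at which the in-flight request completes
    largest = [0] * n
    for now in range(duration_ms):
        for p in range(n):
            pending[p] += records_per_ms       # records keep arriving
            if now >= busy_until[p]:           # previous request was acknowledged
                largest[p] = max(largest[p], pending[p])
                pending[p] = 0                 # everything waiting goes out as one batch
                busy_until[p] = now + latencies_ms[p]
    return largest
```

Running `largest_batches([2, 200])` gives batches of 2 records for the fast partition and 200 for the slow one: the batch grows in proportion to how long the previous request takes to come back.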

Anna McDonald (07:15):

Exactly. Absolutely. What I would say is you know how somebody's making you dinner, but you're super hungry, so you start eating a box of crackers? And the longer dinner takes, the more Triscuits you're munching on. You're like, "Yeah, okay, you said 15 minutes, but it's been 30 and I'm starving," like a small child. So that's how this works. The longer it takes, the more crackers you're like popping in, to the point-

Kris Jenkins (07:42):

Yeah, and you're absolutely full before it's even got the chance to ship out the next meal.

Anna McDonald (07:46):

Absolutely, to the point where, in some cases, it's almost unrecoverable, depending on that slowness. By the way, we should've said, and this is normally how we start, this is Kafka JIRA 10888. We are going to have links. I'm going to update a document with them.

Kris Jenkins (08:08):

We'll put them in the show notes.

Anna McDonald (08:09):

Yes. So people can see that, but it is, in case, again, if you're just listening to this, you might even be driving, don't stop and look. That's not safe. That's bad. Don't do that. But later, while you're on your laptop, sitting on your bespoke couch, that was for Dennis, who keeps telling me, the Def Witkin, who's been on the show, keeps telling me the definition of bespoke and says I use it improperly. I think it's a fun word. He said it means there's only one of them in the world and it's made by hand, because I said I would give someone a bespoke beer. And he was like, "Are you going to brew it yourself?" And I was like, "I don't know, maybe I am."

Kris Jenkins (08:49):

I thought bespoke meant custom made.

Anna McDonald (08:52):

That's what I thought. 

Kris Jenkins (08:53):


Anna McDonald (08:54):

Yeah, but there could be two of them that were ... Right?

Kris Jenkins (08:59):

Yeah. If you do a small batch of beer, especially for that person, that's fine.

Anna McDonald (09:02):

That's what I'm saying. And I brought my friend, Chris Matta, who's awesome, he also works for us, I bought all his brewing equipment and brought it up to my farm so I could. So suck it, Dennis. That's right.

Kris Jenkins (09:13):

Okay. We have derailed slightly from the Jira bug-

Anna McDonald (09:16):


Kris Jenkins (09:16):

... to throw shade on Dennis.

Anna McDonald (09:18):

And that's okay. I love Dennis to death. Also, I want to ... when we're talking about ... So the sticky partitioner, to get back to it, what do we do about this? So there's another person that works for us. His name's Artem. He's awesome. He is one of the best dancers I've ever seen. I asked him if I was allowed to share that on this call or on this podcast. He said yes, so I'm allowed to share it. And what ended up happening is KIP-794 came out of this. And if you haven't read KIP-794, it's great. And it talks about how we can have ... it's called the Uniform Sticky Partitioner. It's almost like, except for me, who likes vi better, vi and Vim. I don't use Vim, though, because I just feel like it's too easy.

Kris Jenkins (10:10):

How do you feel about ed?

Anna McDonald (10:11):

Yeah. Well, it's not bad. What about sed? It's like, yeah, exactly. But it's like Vim because it's better and it's great. And the KIP is wonderful. So one of my favorite things to do is to read ... I always read KIPs. If you want to understand Kafka internals, reading Kafka Improvement Proposals, which is what KIP stands for, is one of the best ways to do it. People spend a lot of time and effort putting details in there. They're updated based on feedback from the dev mailing list. I strongly recommend that everybody read KIPs. And this KIP is no exception. This was all taken into account in the KIP and, now, we have the Uniform Sticky Partitioner, which does not have this problem.

Kris Jenkins (10:57):

How does it work? How does it solve the problem?

Anna McDonald (11:02):

In many ways. There's a couple different mechanisms. Let me find the best way to say this because it is a very ... I'm thinking of a good analogy for this. So instead of saying, "Hey, every time we create a batch, we're going to switch partitions," it's every time this batch size got produced to the partition. And there's a really good ... Now, I'm just going to read this. So if you're producing to partition one using the default batch size of 16KB, if that got produced to partition one, we switch to partition, let's say, 42. And also nice use of the 42. When I turned 42, I was the answer to everything. It was amazing. I had been waiting so many years to say that.

Kris Jenkins (11:58):

I'd assumed you're in your late 20s. So I have no opinion on that.

Anna McDonald (12:01):

Oh, no, I'm 43, man. I'm old. That's why I'm allowed to use these things. And to anyone who's older than me, don't be offended. I'm jealous. Angela Lansbury, and I should say this, saddest day of my ... She's my hero, has been my hero since I was a kid. And when your hero is an elderly-ish lady solving mysteries on a bicycle, you really grow up wanting to be old because I'm like, "Man, not there yet," but I'll get there. 

Kris Jenkins (12:30):

Yeah. Inevitably.

Anna McDonald (12:31):

So embrace your oldness. It's awesome, man. So anyway, back to this. So after they produce 16KB to partition 42, you go to partition three and so on. You just kind of do it, regardless. And this is the idea with uniform. The distribution is uniform of the records because, if you look at the original issue with the sticky partitioner, the batching was incredibly non-uniform if you had a slow node, or a slow partition is better to say. Look, spoiler alert, I'm getting into our next one.

Kris Jenkins (13:06):


Anna McDonald (13:06):

I know, right? It's foreshadowing. So yeah, if you have a slow node, it was like batch size, batch size, batch size. Ooh, I went outside the frame. Batch size, batch size, batch size. And so this is an attempt to keep those batches uniform. And again, we could talk about just this KIP on the entire episode.
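
The switching rule described above, moving on once a batch size worth of bytes has been produced to the current partition, can be sketched as follows. This is a minimal illustration under stated assumptions, not the partitioner shipped with Kafka: `UniformStickyChooser` is a hypothetical name, the next partition is picked at random here, and the other mechanisms Anna alludes to in KIP-794 are left out.

```python
import random


class UniformStickyChooser:
    """Stick to one partition until batch_size bytes were produced to it."""

    def __init__(self, num_partitions, batch_size=16384, rng=None):
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self.rng = rng or random.Random()
        self.current = self.rng.randrange(num_partitions)
        self.bytes_on_current = 0

    def _next_partition(self):
        if self.num_partitions == 1:
            return self.current
        # Pick any partition other than the current one.
        step = 1 + self.rng.randrange(self.num_partitions - 1)
        return (self.current + step) % self.num_partitions

    def partition_for(self, record_size):
        # Switch based on bytes produced, not on when a batch happened to be created.
        if self.bytes_on_current >= self.batch_size:
            self.current = self._next_partition()
            self.bytes_on_current = 0
        self.bytes_on_current += record_size
        return self.current
```

With 1 KB records and the 16 KB default, the chooser hands out exactly 16 records per partition before moving on, no matter how slowly any partition's requests are acknowledged.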

Kris Jenkins (13:26):

Hang on. Now, I'd misunderstood something here because I thought the whole point of a sticky partitioner was that you stuck to the same partitions, by and large. So how come you can move partitions in batches?

Anna McDonald (13:39):

No. There used to be round robin. The sticky partitioner is basically when ... And again, it would be good to read that KIP, too, the original sticky partitioner. I just want to make sure I could find the KIP number. So that's KIP-480. 

Kris Jenkins (13:58):


Anna McDonald (13:59):

Yes, KIP-480. So basically, we used to have this thing, and it was called Round Robin Fashion. So we just would go produce this one, this one, this one. This is when there's no key. The sticky partitioner sticks to a partition until a batch is full. It's the idea of, instead of just going, "Boo, boo, boo, boo, boo, boo, boo, boo," let me think about a good analogy for this, it's the idea of ... I know what it is. So let's say that you're at one of them things that people do where you have to make a cookie tray. What do they call them? Cookie swaps. I'm sorry, I love Christmas. You ever been to that, where everybody brings cookies and you're not supposed to-

Kris Jenkins (14:42):

I can imagine it, but I don't think it's a thing we have over here in England.

Anna McDonald (14:45):

Oh, my gosh, you should just do it every day. It's great. So you bring cookies and you make up plates of cookies. So it's the idea of saying, "Okay, I have to make up eight plates of cookies." I can eat and I have this big, huge tray of cookies I made. Let's say I know that I need to put eight cookies on all of my trays. Is it better to say, "One tray, one for you, one for you, one for you, one for you, one for you, and again, one for you," or is it better to stand there and go, "Here's eight cookies, here's eight cookies, here's eight cookies, here's eight cookies, here's eight cookies?" Which one's faster? It's the batching. It's the one where we're trying to make sure that, when we produce, we produce full batches to everyone.
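
The cookie-tray comparison can be put into numbers with a small sketch. It counts how many batches go out when a handful of keyless records arrive within one linger window. The function names are made up for illustration, and the sticky variant walks partitions in order for simplicity rather than choosing them as the real partitioner does.

```python
from collections import Counter


def round_robin(num_records, num_partitions):
    """One record per partition in turn: 'one for you, one for you'."""
    return [i % num_partitions for i in range(num_records)]


def sticky(num_records, num_partitions, records_per_batch):
    """Fill a whole batch for one partition before moving on."""
    return [(i // records_per_batch) % num_partitions for i in range(num_records)]


def count_batches(assignments, records_per_batch):
    """Batches needed if every partition flushes what it holds."""
    per_partition = Counter(assignments)
    # Ceiling division: a partially filled batch still costs a request.
    return sum(-(-n // records_per_batch) for n in per_partition.values())
```

For 8 records, 8 partitions, and room for 8 records per batch, `count_batches(round_robin(8, 8), 8)` comes to 8 single-record batches, while `count_batches(sticky(8, 8, 8), 8)` comes to 1 full batch.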

Anna McDonald (15:28):

That's why they call it sticky, but you're right that it sounds like, "Hey, I'm going to stick to these partitions and tell you what to do," but really, it's about trying to equalize batching, sticky partition, but you're 100% right. What you also may be thinking of, too, a little bit, which is very close, same name, is static partitioning. So static ... yeah.

Kris Jenkins (15:52):

I am. You're right.

Anna McDonald (15:53):

Well, but sticky sounds like you're sticking. Maybe we should ... The names are very ... So static partitioning, that's more perhaps what we're talking ... yeah.

Kris Jenkins (16:07):

In your original example, then, if you've got two messages in one batch and 37,000 in another batch, what's going to happen, post this fix?

Anna McDonald (16:17):

Oh. So the 37,000 would not be accumulated.

Kris Jenkins (16:22):

It would get to 20 and then move to the next batch, the next partition.

Anna McDonald (16:26):

Correct. Right. So when we accumulate the right batch size, then we hop. We don't just say, "Wait until you get the batch size and go," then go ... because all those partitions are accumulating records in the background. And again, maybe we should. You should get Artem on here and Justine to discuss this, both Artem and Justine. Justine wrote the original KIP for sticky partitioner. Artem, again, great dancer. Oh, I'm such a bad egg. He did a Kafka Summit talk about this, too, which I'll put in the show notes, and goes over the whole thing, but you should have him on.

Kris Jenkins (17:00):


Anna McDonald (17:01):

I don't know if Artem's ever been on the show, but he's fantastic. So he should come on. And I don't think he'll dance, though, but if you get a chance-

Kris Jenkins (17:09):

It won't really work on radio, but we'll figure something out.

Anna McDonald (17:12):

No, that's true. So the idea, I think the takeaway is that the original sticky partitioner did not maintain a uniform distribution for batches. When you have a slow partition, for whatever reason, that can really get you in a cycle you're never going to get out of because of the backlog of records.

Kris Jenkins (17:36):

Yeah. Okay. Hey, Anna, what kind of things can lead to a slow partition or a slow node?

Anna McDonald (17:45):

Yay. And again, I did so much foreshadowing, SYN cookies, because I was just talking about cookies.

Kris Jenkins (17:52):

Oh, wow. Yeah. Over here on the TCP stack, we call them SYN biscuits, but it's the same idea.

Anna McDonald (17:57):

That's okay. Yeah, exactly. That's fine. I love that. And maybe there's a SYN biscotti somewhere as well.

Kris Jenkins (18:05):

Oh, nice. I think somewhere around Rome in the Vatican, I assume biscottis are very different.

Anna McDonald (18:08):

Yeah, they're dunking it in their coffee in smaller and smaller pieces. Okay, so this is Kafka 9648.

Kris Jenkins (18:14):


Anna McDonald (18:17):

Yeah. 9648. And I picked this one because this is the actual fix for this issue. But this is kind of the condition that caused me to realize and kind of understand and spend so much of my life on the sticky partitioner, right? And basically, here's what happens. And this is also why I like this. So what you find in Kafka, and I think you find this in any distributed system, probably in most systems. I personally though think that in a distributed system, probably just because they're more tricky, this just shows up so much more prevalently, is that certain types of configurations and use cases will be the ones that hit things where other people will be absolutely fine. And that's just computers, but man does it show up in a distributed system.

Anna McDonald (19:12):

So this one in particular has to do with the number of connections. So in Linux, Linux is amazing, lovely, friendly, tries to be fair about things. And so in Kafka, in SocketServer.scala, we have this thing and there's a backlog, a backlog queue, and it's kind of like, "Hey, I'm accepting connections." I can't accept at the same time everything that's getting sent to me. So if I'm busy for a sec, put it in the backlog queue. As one might imagine, that backlog queue has a size. The default size that we had, I believe, was 50. Yes. Okay. Yeah. Queue length. So the queue length is 50, right?

Anna McDonald (20:02):

And so the question comes up and it says, "Okay, if I'm running my Linux distribution, Kafka is running on a Linux node, and I roll my cluster..." Which by the way should be an everyday occurrence. If you're afraid to roll your cluster, it's not the position you want to be in. You should not be operating Kafka or you should make some changes to make sure that you feel good operating Kafka. It's a huge red flag for me. One of the things I ask people a lot of times is: when was the last time you rolled your cluster?

Kris Jenkins (20:33):

And are you saying because it's a good thing to... The old problem with Windows, you rebooted the service every week or they crash after a fortnight?

Anna McDonald (20:40):


Kris Jenkins (20:41):

Are we saying that or are you just saying you should be so confident in the recovery abilities?

Anna McDonald (20:45):

Absolutely. Kafka's a distributed system, it's by default durable. And if you're afraid to roll your system because of outages, you're not configured properly. That is kind of a huge red flag for issues. Every time I roll my cluster, my customers have an outage. What's your RF?

Kris Jenkins (20:58):


Anna McDonald (21:00):

It better be... Your replication factor. Is it one? You know what I mean? There's all these things that pop up when people are afraid to roll a cluster. And so rolling your cluster, your customers shouldn't notice. They should be like, "Yeah, whatever." Outside of some SLA things, which yes, they're working on it and hopefully with KRaft it gets better. I mean, LeaderAndIsr time for metadata refresh takes a while, and a while is subjective. That's neither here nor there. That's another podcast.

Kris Jenkins (21:31):

That's another podcast. But okay, so you're saying that regular chaos engineering monkey thing.

Anna McDonald (21:37):

Well, plus the world that we live in, a lot of times you've got to do patching of your OS. It's mandated in regulatory invoice. So you better be comfortable rolling it, and picking up new Kafka updates. All kinds of good reasons to roll your cluster that are not the restart your Windows machine. Good reasons to roll it.

Kris Jenkins (21:58):

So here we are with the Linux Kernel. We've just done a rolling restart.

Anna McDonald (22:02):

And so when that happens, and we move back to preferred leadership, so leader election occurs, first of all, leader is going to change to a non-preferred. When we take down this node, all the leaders on that node go boom. Then when we bring it back up, we move back to preferred leadership. What ends up happening is all the producers that were producing to that node that had leadership, whether it's going forward or back, whether it's moving back to preferred or it's going away from preferred, all of a sudden they change and they start producing. Now, normally if they're doing a... And this is why I say too, it's a very bespoke situation. Not bespoke. Again, see, haha, Dennis. Let me see. It's a situation that people will find themselves in if they have a lot of clients, is the way to say that. When there's a metadata update, clients will, many times they'll reestablish their connection. There are always exceptions to this, if you're using your own logic, yada, yada, yada. But by default usually it forces a connection to be reestablished. If all of a sudden you have a crap ton of clients establishing a new connection to a Linux server, and you're using the default SYN cookie settings, and you only have a backlog of 50 in your back channel, what ends up happening is Linux is like, "Hello, I am the arbitrator of fairness so I'm going to engage SYN cookies." And so when that happens, and this is if tcp_syncookies equals one too, by the way, which I think most people have this set. There's not a lot of people who say, "Hey, I'm going to set my SYN cookies to zero, just reject any new requests." Most of the time people want this type of auto scaling. They don't want it to kick in this scenario, but there's-

Kris Jenkins (24:05):

For those of us who haven't delved into the TCP stack recently remind me what SYN cookies do?

Anna McDonald (24:10):

So basically what they're doing is they're acting as almost like a throttle. So they're making sure a server doesn't get overwhelmed when it has a burst of network connections. You can think of them as DDoS protection. They're like, "Don't try. I don't think so. Not today people." So that's what they're doing. They're a scale. And so there's this parameter, an external parameter, it's called wscale, and it's kind of windowing.

Anna McDonald (24:34):

So wscale is your friend when it comes to batching. It lets you send a lot more bytes per TCP packet. As soon as the SYN cookie mechanism gets triggered, that goes away usually. Usually wscale, it's like, "No." So all of a sudden, because it wants to be fair to all the network connections that are bursting on. And it's also a way because if you're trying to DDoS someone, you don't want to let somebody send you a crap ton of bytes. That's not what you're about. So basically, and this is kind of the worst part, once your wscale goes away and you don't have that, that persists until the connection's closed.
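
Some back-of-the-envelope arithmetic shows why losing window scaling hurts so much. The sketch below relies on two general TCP facts rather than anything specific to Kafka: the window field in the TCP header is 16 bits, so an unscaled window tops out at 65,535 bytes, and a sender can have at most one window of unacknowledged data in flight per round trip. The shift of 7 and the 10 ms round trip are assumed example values.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit TCP window field


def scaled_window(shift: int) -> int:
    """Window size in bytes when a window scale shift of `shift` is negotiated."""
    return MAX_UNSCALED_WINDOW << shift


def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bytes/second: at most one window in flight per round trip."""
    return window_bytes / rtt_seconds


# Assumed example: 10 ms round trip, with and without a scale shift of 7.
without_scaling = max_throughput(scaled_window(0), 0.010)
with_scaling = max_throughput(scaled_window(7), 0.010)
```

Under those assumptions a connection without scaling is capped at roughly 6.5 MB/s, while the scaled one has a ceiling 128 times higher. Since the setting is fixed for the life of the connection, the slow connection stays slow until it is closed and reopened.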

Kris Jenkins (25:23):

Oh God, really?

Anna McDonald (25:24):

Yes. So you're never going to recover. So if I'm a producer, and again, let's take us back to the previous one we talked about, and I'm like, "Hey, I have this original, the OG, sticky partitioner," and I am producing, and they're rolling a cluster, and all of a sudden SYN cookies just kicked in on this cluster, and wscale's gone, and now all of a sudden I have a slow partition that will never speed up.

Kris Jenkins (25:53):

So let me check I've got this right. So you are saying new leader election, ordinarily that Linux box would let you send larger packets than usual, and that's great for throughput, but you hit over that 50 connection window trying to connect to Kafka, which is a natural consequence of there being a new leader in town. And Linux not only says, "Whoa, back off a second," but it also says, "We're not going to let you have large window sizes at all until you reconnect,"?

Anna McDonald (26:23):


Kris Jenkins (26:23):

Ooh, ouch.

Anna McDonald (26:26):

Oh yes. And the best part is, so if anybody ever wonders... If anyone's sitting here listening to us and going, "Is that happening to me?" Just run dmesg. It's like all over there.

Kris Jenkins (26:37):

dmesg?

Anna McDonald (26:37):

Yeah. You could see it. It'll say, "SYN cookies, SYN cookies, SYN cookies." And it's the holiday season, so don't get mad at it. Just eat a cookie and then fix it. So again, the reason why I picked this one and not the flurry of mystery problems... Actually, if you wanted to see the first one that discovered this, it was Kafka 9211. I'll put that in there too, the original problem ticket. Is because how do we fix this? Well, we allow you to extend your backlog. That backlog size for the acceptor socket, making that configurable is really the rubber stamp for this.

Anna McDonald (27:14):

Because what ends up happening is it allows you to continue to have a protective mechanism, which you do want. Because it isn't only... And I'm not thinking of people with mustaches who are like, "Ha ha ha." I'm going to angrily... But a poor and misbehaving client can be awesome. You know what I mean? There are ways you can configure a client that could take your frigging node down. So you do want SYN cookies there to protect against people deciding to aha or not caring, so to speak. But you don't want that to impact good clients. So this is my favorite type of fix, where we're allowing and enabling the correct behavior, but we're not losing the original protection. So you can bump this way up from a hundred.

Anna McDonald (27:59):

And one thing I will say is trying to remember what backlog size... So when you increase a backlog size, I think, I'm not even going to say, because I have to look it up again, there's a safe number that you can increase it to without messing with any of the other TCP stuff. If you go over that number, you will hose your node. So be careful when you increase this to make sure that you're following recommended settings for the other things that need to go along with increasing this backlog size. And I should have looked that up and I did not. I was probably eating cookies instead.

Kris Jenkins (28:33):

Send us some notes and we'll stick those in the show notes.

Anna McDonald (28:36):

Yes. I have to. Well see, but the thing is that it really comes down to what your individual Linux customization is. So I want to say off the top of my head, you could bump it to a hundred. I think that's probably, but other than that, anything else, make sure you're looking into what your configuration is and adjusting those other parameters in order to account for it.

Kris Jenkins (28:55):

Right. So just to recap that then, if someone's seeing on their cluster that some partitions, some nodes are getting really slow, especially during new leader elections, then they're going to check dmesg, and they're going to see this whole tray of SYN cookies.

Anna McDonald (29:10):

Oh yeah, with frosting.

Kris Jenkins (29:12):

And then they're going to adjust... With frosting, with evil, evil frosting. Which parameter are we going to look at next? What's it called?

Anna McDonald (29:18):

So again, too, this is only fixed in 3.2.0, in AK 3.2.0. So another reason to upgrade and be comfortable with cluster rolls. If you don't want to upgrade your cluster, don't operate Kafka. We have a whole cloud service and a team of experts, just let us run it. Do it well, or don't do it at all. Can you tell us...?

Kris Jenkins (29:41):

That's a very fair if slightly aggressive pitch for Confluent Cloud.

Anna McDonald (29:43):

I'm from Western New York. I think that's our motto; Fair, but slightly aggressive. Go Bills. I think that is kind of who we are as a people.

Kris Jenkins (29:55):

Okay. Yeah.

Anna McDonald (29:57):

Also nice, but slightly aggressive. I think you should combine that, nice, fair, and slightly aggressive. Welcome to Western New York.

Kris Jenkins (30:05):

We're going to get you that t-shirt made for the next Kafka Summit.

Anna McDonald (30:07):

That would be amazing. It would be bespoke, Dennis.

Kris Jenkins (30:10):

It would be bespoke.

Anna McDonald (30:12):

He's going to kill me. I'm going to love it.

Kris Jenkins (30:13):

He really is.

Anna McDonald (30:17):

And again, I think it's important that we did these two in order because it's kind of like, "Well, what could cause a slow partition? And how long could it last?" It's like, "Well, rolling your cluster, and forever, until you restart."

Kris Jenkins (30:30):

That's nasty.

Anna McDonald (30:30):

Yeah. And that's why I like to do these podcasts too, is because there could be somebody out there sitting around going, "Why is this happening to me?" And maybe here's something to look at. And even if it's not this problem, something to look at as you're batching. Check your producer throughput, your timing. Check dmesg, always check dmesg if you have a problem on your node. I'm consistently surprised... Now, Grafana, Prometheus, very important metrics. Your OS is also very important. Check it. Run netstat. Figure it out. I think having... And that's the thing, operating distributed systems is not easy, operating Kafka is not easy. You also need to know about your OS and need to understand this type of stuff on a server level. People have to connect to Kafka in order for it to work.

Kris Jenkins (31:16):

Yeah. Yeah.

Anna McDonald (31:16):

All right.

Kris Jenkins (31:18):

Nice. But before we move on, you haven't told me the name of that parameter.

Anna McDonald (31:23):

Oh, the backlog? It's just backlog. It's just tcp_max_syn_backlog.

Kris Jenkins (31:30):

Okay, cool. Thank you.

Anna McDonald (31:32):

That's the second parameter for bind. Yes.
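A small sketch of where that backlog number goes. In Java the backlog is the second argument to ServerSocket.bind; in Python, used here, it is the argument to listen(). On Linux the value requested by the application is capped by net.core.somaxconn, and net.ipv4.tcp_max_syn_backlog sizes the queue of half-open connections. The requested value of 200 is hypothetical. To my understanding the broker setting introduced for this fix is socket.listen.backlog.size, but treat that name as an assumption and confirm it against the broker configuration reference for your version.

```python
import socket


def effective_backlog(requested: int, somaxconn: int) -> int:
    """Linux silently caps a listen() backlog at net.core.somaxconn."""
    return min(requested, somaxconn)


def read_somaxconn(path: str = "/proc/sys/net/core/somaxconn"):
    """Return the kernel's accept-queue cap, or None if it can't be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def open_listener(host: str, port: int, backlog: int) -> socket.socket:
    """Bind and listen with an explicit backlog, as an acceptor socket would."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(backlog)
    return server


requested = 200  # hypothetical value, not a recommendation
cap = read_somaxconn()
if cap is not None and effective_backlog(requested, cap) < requested:
    print(f"backlog {requested} would be capped at {cap}; raise net.core.somaxconn too")
```

This is the "other TCP stuff" caveat in practice: raising the application backlog alone does nothing if the kernel limits stay lower.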

Kris Jenkins (31:34):

So the next one I thought we'd talk about, we've foreshadowed leader election input, leader election protocols. Tell me something about that that goes wrong.

Anna McDonald (31:44):

Okay, so this is KAFKA-12686. And so this one is very, very, very, very deep internals of Kafka. It's also in my favorite class, which is Partition.scala. Which I love. It is, it's my favorite class out of all of them.

Kris Jenkins (32:02):

Why is it your favorite class?

Anna McDonald (32:03):

Because it does so much. If you want to understand the way that Kafka works specifically for resiliency and anything about partitions, shockingly enough, go look at Partition.scala. Partitions are what make Kafka work. It's what's consumed from, what's produced to, what falls in and out. And that class is just a wonderful wealth of understanding about Kafka.

Kris Jenkins (32:25):

Oh, okay.

Anna McDonald (32:26):

And sometimes things go wrong. Partition.scala has featured prominently in many of these, I think previous to this. It's not the first time I've ever said that on this show. I like to bring it back, that and purgatory. I love Purgatory.

Kris Jenkins (32:41):

We're getting to purgatory.

Anna McDonald (32:42):

Yes, we are. Yes. Foreshadowing again. So this one we're going to talk about it at-

Kris Jenkins (32:47):

We've talked about SYN, right, so we have to talk about Purgatory.

Anna McDonald (32:49):

That's right. See, look at this. How Halloween is this? Or Nightmare Before Christmas.

Kris Jenkins (32:53):

Nightmare Before Christmas.

Anna McDonald (32:54):

That's right. So this one, I call this attack of the overloaded cluster because-

Kris Jenkins (32:59):

Attack of the overloaded cluster.

Anna McDonald (33:00):

And that may be... Is that judgmental? Yes, yes it is. Because the only time we see this is really a race condition. And the only time we would see this problem was on clusters that had, I would say north of 200,000 partitions for sure. And the reason being because believe it or not, when you roll a cluster like that, that has that many partitions, especially, and I'm not talking about a 25 node cluster either by the way, I'm talking about a cluster that maybe has seven nodes at most, maybe eight nodes maybe at most. So there is a ton of leaders that are sitting on every node.
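A back-of-the-envelope sketch of why such a cluster is crowded. It assumes the 200,000 figure counts partitions rather than replicas, a replication factor of three, an even spread of leaders, and seven brokers; only the 200,000 and the seven come from the conversation.

```python
def per_node_load(partitions: int, replication_factor: int, nodes: int):
    """Average leaders and total replicas hosted by each broker."""
    leaders = partitions / nodes
    replicas = partitions * replication_factor / nodes
    return leaders, replicas


# Assumed inputs: 200,000 partitions, RF 3, 7 brokers, evenly balanced.
leaders, replicas = per_node_load(200_000, 3, 7)
print(f"~{leaders:,.0f} leaders and ~{replicas:,.0f} replicas per node")
```

Every restart in a roll therefore moves tens of thousands of leaderships at once, which is the wide field the race condition needs.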

Anna McDonald (33:41):

So when we roll these nodes, the number of leader and ISR changes are enormous. So when you look at a race condition hitting it's usually a little more likely how wide your field is. So that's kind of what this goes in. And what happens is... And this is a very, like I said, again, this is a really, really deep kind of KIP. So what happens is, anytime you want to look at ISR... So ISR by the way, in-sync replicas, right?

Kris Jenkins (34:15):


Anna McDonald (34:16):

So when you're defining Kafka topic, you say, "I want this many partitions. I want the RF, replication factor on this topic to be this."

Kris Jenkins (34:25):

Yeah. Three being very standard, right?

Anna McDonald (34:28):

Yes. Three being very standard. Absolutely. Two being an abomination. One being right out. Don't do that.

Kris Jenkins (34:38):

Shades of the holy hand grenade of Antioch.

Anna McDonald (34:41):

Yeah. See, exactly. That was Monty Python, right?

Kris Jenkins (34:46):

That was Monty Python and the Holy Grail.

Anna McDonald (34:47):

Yeah. You may call me Tim. Is that right?

Kris Jenkins (34:50):


Anna McDonald (34:50):

Yeah. Look at me. What's up? My dad loves that.

Kris Jenkins (34:53):

Oh yeah.

Anna McDonald (34:53):

He's a huge fan.

Kris Jenkins (34:54):

Do you know why he is called Tim? I learnt this recently.

Anna McDonald (34:56):


Kris Jenkins (34:57):

It's because I think it's John Cleese, he forgot the line. The character had a much longer name and he forgot the name. And so he just went, "Tim."

Anna McDonald (35:05):

Oh my gosh. That's amazing.

Kris Jenkins (35:07):

And that's what... Yeah. Anyway. So three years-

Anna McDonald (35:09):

I really like that. It's better than Burt. Tim is funnier than Burt. I like that too.

Kris Jenkins (35:14):

It is inherently.

Anna McDonald (35:16):


Kris Jenkins (35:16):

We can say that now. The old host has gone from this podcast.

Anna McDonald (35:19):

Well yeah. Tim is... Yeah, that's right. Yeah. Hi Tim, by the way. How are you doing? Look at me. We're surviving without you. This has been going very well. We're having fun.

Kris Jenkins (35:29):

We are.

Anna McDonald (35:32):

No, Tim's amazing. Yeah. And first of all, he let me do this. Now it's kind of a thing. So I think people can't really... It's like it's a tradition. But he was the first person who allowed me to do this. And although he made it very clear in the first one, I don't know if you've ever heard it, he's like, "Just to be clear, this was all Anna's idea." And I know he said that because he was on the fence as to how it was going to go. So this one is technically named Race Condition in AlterIsrResponse handling.

Anna McDonald (36:00):

So when we look at ISR, that in-sync replica, we've got an RF of three, let's say. We've got a leader and then we've got two replicas. Those replicas' job is to become leader, should something happen to the original person. They're like, what's his name over there? Prince Tim, the guy in England over with you. If something happens to the queen, he steps up and is like, "Hey, I'm Charles."

Kris Jenkins (36:27):

Yeah, we have a redundant array of monarchs.

Anna McDonald (36:30):

You do. You have a redundant array. Exactly, see, there you go.

Kris Jenkins (36:34):

Oh yeah. There used to be the rule, you wouldn't like this because the number is two, there used to be the rule that you would give birth to an heir and a spare.

Anna McDonald (36:40):

Oh, I've heard this. See, I watch a lot of English television, so I... Yep, yep. Mostly mysteries where people are killed in small villages and stuff like that. I do. That's how I understand cricket. I think we talked about this, because I got so mad with all the cricket episodes where someone's killed with a cricket bat. And I'm like, I don't understand what they're talking about. So I watched a whole documentary on cricket, and now I'm like, oh yeah, I see that. Right? So I got it now.

Kris Jenkins (37:04):

So improving on monarchs, we have a replication factor of three.

Anna McDonald (37:09):

Right, you do.

Kris Jenkins (37:09):

And we are sending all the information to the other two.

Anna McDonald (37:10):

Correct. And much like the monarchy, because she's been dead for a while, he's still not been crowned king officially, has he? Does he have to have-

Kris Jenkins (37:18):

No, he hasn't.

Anna McDonald (37:19):

Do you see what I'm saying? This is actually perfect. This happens, and it's a pretty rare scenario... By the way, what's up David Arthur, how you doing? Because David wrote this, and I like David Arthur. He's a cool egg, cool beans.

Anna McDonald (37:33):

But he said this is a pretty rare scenario, and it involves alter ISR, the response, being delayed for some time, much like the coronation of that guy Charles. Right?

Anna McDonald (37:44):

And what ends up happening is, you can think of alter ISR like having a state. So when it does, there is this kind of thing where there's an in-flight state. And this is a great way to think about anything that's async. You've got to be very, very clear on state, right? And have good state machines and good coverage for this kind of stuff, if things are asynchronous in your system, right?

Kris Jenkins (38:20):


Anna McDonald (38:21):

Kafka's very asynchronous. I love it that way. Synchronous things are annoying, and brittle, and I do not enjoy them. I don't want to have to wait. I like that thing where you call and they're like, if you want to call back, just push... Yeah, call me back. Why do I have to sit on the phone? Same thing with this.

Kris Jenkins (38:36):


Anna McDonald (38:37):

I'm not going to sit here. I'm going to send you something, eventually it'll come back. And so the bug in this is there is an alter ISR manager. That is in charge of altering ISR. It takes things in, it takes them out, right? Changes that ISR set, the in sync replica set.

Anna McDonald (38:56):

And the problem is, it's not checking to see if there's anything in the in-flight state. It doesn't go look, and it doesn't say, hey, do I have stuff in flight? It doesn't do that before it clears away pending items. Right? Which is not good.

Kris Jenkins (39:18):

I feel like you ought to explain why we have an alter in-sync replica message.

Anna McDonald (39:23):

Well, so let's say that a node dies, that replica sitting over there on the dead node is no longer in sync. It's dead.

Kris Jenkins (39:29):

So we have a different list of in-sync replicas?

Anna McDonald (39:31):


Kris Jenkins (39:32):


Anna McDonald (39:32):

Yep. And there's some settings you can tune on that too. You can say, okay, this is how long I want to wait. This is what I consider in sync, basically. Some people set it to be more, less, depending on your tolerance. So when we look at a partition, the issue that you end up with is the state doesn't match what we're expecting to find in unsent items.

Anna McDonald (39:59):

So basically it says, hey, I'm in flight. There's something going on with me. There's an in-flight change to my ISR. I'm a partition, like ooh, what's going on? I don't know. We should wait and find out.

Anna McDonald (40:10):

But when that happens, there should be something pending, because you're in flight. There's something in flight, right? There should be something, a corresponding thing in pending. And when we call partition, and this is why I said, again, it's the number of these that happen, because as soon as you increase the number, too, it's more likely that there might be a delay, because it's doing so much at once.

Anna McDonald (40:34):

So for example, that cluster roll, when we're calling make leader, and that's in there, it's Partition.makeLeader, it basically says, okay, get rid of this, and no pending items. But the in-flight state was still there. So basically what happens is there was still something outstanding going on, and somebody came in and just said, eh, and clobbered it. Now any time, and this is also why I brought up the async system thing, anytime... And we do a lot of protection in Kafka. A lot of it has to do with epochs. We'll look at an epoch and we'll say, okay, you have been out to lunch for a long time, and what you're trying to deal with, the leader you're trying to deal with, is from four epochs ago. Refresh your metadata and carry on with your life, right? You don't get to make any decisions, you're old.
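The epoch protection can be sketched as a simple comparison. The function name is hypothetical; the returned strings mirror Kafka protocol error names (FENCED_LEADER_EPOCH, UNKNOWN_LEADER_EPOCH), but this is an illustration of the idea, not broker code.

```python
def check_leader_epoch(request_epoch: int, current_epoch: int) -> str:
    """Compare the epoch a caller believes in with the broker's current one."""
    if request_epoch < current_epoch:
        return "FENCED_LEADER_EPOCH"   # caller is stale: refresh metadata and retry
    if request_epoch > current_epoch:
        return "UNKNOWN_LEADER_EPOCH"  # broker has not caught up with the caller yet
    return "NONE"                      # epochs agree, the request may proceed


# A caller that is four epochs behind gets fenced.
print(check_leader_epoch(request_epoch=6, current_epoch=10))
```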

Kris Jenkins (41:29):


Anna McDonald (41:30):

And there's a similar blocking concept that happens here, right?

Kris Jenkins (41:34):


Anna McDonald (41:35):

So what we do is basically kind of block. We say, hey, now this in-flight partition response has come back, but whoa, there's been things that have happened since this thing. Because we've cleared stuff, we've done... So even if that delayed response finally comes back, it's like, no, things have changed, so get rid of anything for this partition.

Anna McDonald (42:07):

And the way this looks like in real life, and I've seen it in multiple real life scenarios, is you run, you restart your cluster. Nothing ever moves back to your preferred leadership, ever, because your ISR is basically locked. Every single ISR request you're ever going to send, once while this leadership is owned, is batted to the ground. Because it's like, no, something's still in flight. I don't know. It's my state, my state's still in flight. Even though there's no pending items. So anything you send after that, it's like no, no.

Anna McDonald (42:41):

And so that ISR, I like to call it frozen in time. So if you ever find anything where your ISR looks like it's frozen in time, this could be what's going on.
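The frozen ISR can be reproduced in a toy model. Everything here is hypothetical (class names, method names, the flag), and the "guarded" variant is just one plausible way to keep the two pieces of state consistent, not the actual patch. The point is only to show how clearing pending items without clearing the in-flight flag leaves a partition that rejects every later ISR change.

```python
class ToyIsrManager:
    """Holds ISR changes that were sent and still await a response."""

    def __init__(self):
        self.pending = {}  # partition name -> proposed ISR


class ToyPartition:
    def __init__(self, name, isr, clear_in_flight_on_reset):
        self.name = name
        self.isr = set(isr)
        self.in_flight = False
        self.clear_in_flight_on_reset = clear_in_flight_on_reset

    def propose_isr(self, manager, new_isr):
        """Send an ISR change unless one is already outstanding."""
        if self.in_flight:
            return False
        self.in_flight = True
        manager.pending[self.name] = set(new_isr)
        return True

    def make_leader(self, manager):
        """Leadership change: drop pending items for this partition."""
        manager.pending.pop(self.name, None)
        if self.clear_in_flight_on_reset:
            self.in_flight = False

    def on_response(self, manager):
        """A (possibly delayed) response arrives."""
        proposed = manager.pending.pop(self.name, None)
        if proposed is None:
            return False  # nothing pending: treated as stale and dropped
        self.isr = proposed
        self.in_flight = False
        return True


def roll_scenario(clear_in_flight_on_reset):
    manager = ToyIsrManager()
    partition = ToyPartition("orders-0", {1, 2, 3}, clear_in_flight_on_reset)
    partition.propose_isr(manager, {1, 2})  # broker 3 restarts, shrink the ISR
    partition.make_leader(manager)          # leadership change clears pending items
    partition.on_response(manager)          # the delayed response is dropped
    return partition.propose_isr(manager, {1, 2, 3})  # try to expand again


print("buggy variant accepts a new ISR change:", roll_scenario(False))
print("guarded variant accepts a new ISR change:", roll_scenario(True))
```

In the buggy variant the last proposal is refused because the flag still says something is in flight, even though nothing is pending any more.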

Kris Jenkins (42:52):

Right. Okay.

Anna McDonald (42:56):

And like I said, this is very, very... Trying to, and this is why I'm going to read more about this. I'd love to have real life analogies to describe everything. This one, it's kind of like it is a decoupling of the state machine. It breaks, right?

Kris Jenkins (43:12):


Anna McDonald (43:12):

The state that I'm holding, does it match what I expect to see? And so the fix for this, the workaround I should say for this... By the way, this has been fixed. It was fixed in 3.0. But the workaround for this is to force leader election. So once you force leader election on this, you move leadership away from the node that has this kind of state, all this stuff is cleared out. Your ISR state is reset and you're good to go.

Kris Jenkins (43:46):

Could it happen again when you try and move it back again?

Anna McDonald (43:48):

Absolutely. Not back again, but it could happen again the next time you roll your cluster if you don't get the fix for this.

Kris Jenkins (43:54):

Oh right. Okay.

Anna McDonald (43:54):

Yes, yes. And this is, the other thing too is, I have never... This is definitely due to scale. This race condition gets triggered by people who are doing a crap ton of leader and ISR changes per node. Right?

Kris Jenkins (44:09):


Anna McDonald (44:09):

And I think that has to do with the nature of a delayed response. And also, I don't know how to feel about this, David, that you said this is a pretty rare scenario and I've seen it multiple times. I think it's just me.

Kris Jenkins (44:26):

The queen of rare scenarios.

Anna McDonald (44:30):

But kudos for getting on it and fixing it. But if you haven't done and read, if there's one class that you're going to read ever...

Kris Jenkins (44:37):

Anna McDonald (44:37):

... in AK, it's Partition.scala.

Kris Jenkins (44:38):

Okay. Okay. I'm going to confess my biases here, I try to avoid reading Scala if I can. But for you...

Anna McDonald (44:45):

It's Java Scala though. I think many people would.

Kris Jenkins (44:48):

Ah, Java.

Anna McDonald (44:49):

I don't think I'm going to offend anybody by saying it's Java Scala.

Kris Jenkins (44:52):

Java Scala? Okay. But that's a classic thing. That's a classic pattern, isn't it? Like the combination of a state machine which assumes a series, an exact series of transitions, and asynchronicity?

Anna McDonald (45:04):


Kris Jenkins (45:04):

Screwing that up.

Anna McDonald (45:05):


Kris Jenkins (45:06):

Yeah, absolutely.

Anna McDonald (45:08):

All right.

Kris Jenkins (45:09):

Okay. So that takes us to... Oh, now this one stuck in my mind because the ticket number is 12964, which is how you can start dialing my mother.

Anna McDonald (45:24):

That's awesome. Don't you love it when that happens?

Kris Jenkins (45:27):

Yes. Like is that spooky or are there just lots of integers in our lives?

Anna McDonald (45:30):

I know, right? I don't know, but I like it when that happens. It makes me happy. Whenever I see, I'm like, hey.

Kris Jenkins (45:37):

But this gives us an opportunity to learn something about segments.

Anna McDonald (45:40):

Yeah, so if you were to go look on any Apache Kafka node, you would go look at the file system, you would see log segments. That's how we store data. I know I think some people are like, that's not what... But it is.

Anna McDonald (45:53):

Kafka is a durable log. That's all it is. As much as, perhaps, I don't know how to say this in a way that isn't... Some people like to have a different spin on that, but technically at its heart Kafka is a durable log. Sorry, that's what it is. Right?

Kris Jenkins (46:08):


Anna McDonald (46:09):

And so when you go look at it, there are log segments. And so this is, the title of this one I called, A Killer From the Past Strikes When You Least Expect It. And yeah, it's KAFKA-12964.
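For readers who have not looked inside a partition directory: segment files are named after the segment's base offset, zero-padded to 20 digits, with a suffix per file type. The helper below is a hypothetical illustration of that naming convention, and the dictionary keys are my own labels.

```python
def segment_file_names(base_offset: int) -> dict:
    """File names for one segment, keyed by what each file holds."""
    prefix = f"{base_offset:020d}"  # base offset, zero-padded to 20 digits
    return {
        "log": prefix + ".log",                     # the records themselves
        "offset_index": prefix + ".index",          # offset -> file position
        "time_index": prefix + ".timeindex",        # timestamp -> offset
        "producer_snapshot": prefix + ".snapshot",  # producer state at that offset
    }


for kind, name in segment_file_names(1234).items():
    print(f"{kind:18} {name}")
```

Because the name is derived purely from the base offset, two different files can end up with the same name over time, which is what the rest of this bug hinges on.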

Kris Jenkins (46:23):


Anna McDonald (46:25):

And the actual title of it is Corrupt Segment Recovery Can Delete New Producer State Snapshots.

Kris Jenkins (46:32):

That sounds scary. By the time you've got corrupt and delete in there, I'm already worried.

Anna McDonald (46:37):

So here's the thing, right? Again, I don't know what this says about me, but in Kafka, and this is, if you go to reboot a node, shut down a node, there is an amount of time which, if you exceed it, and that's configurable, the amount of time, if you exceed it, it will just shut the node down. We call that an unclean shutdown. When Kafka comes back up, it assumes that the segment files are corrupt. And it goes through and does this type of a thing, which is another thing to look at. If you see really long startup times, when you roll your cluster for a node, go see if you're getting a clean shutdown.
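One way to check for a clean shutdown is to look for the marker file the broker leaves in each log directory when it stops cleanly. The sketch assumes the marker is named .kafka_cleanshutdown, which is my understanding of the broker source and should be verified for your version. It also only makes sense while the broker is stopped, since to my knowledge the marker is consumed on startup. The demo uses throwaway directories in place of real log.dirs paths.

```python
import tempfile
from pathlib import Path

# Marker file name as I understand it from the broker source; verify for your version.
CLEAN_SHUTDOWN_MARKER = ".kafka_cleanshutdown"


def dirs_without_marker(log_dirs):
    """Return the log directories that lack the clean-shutdown marker."""
    return [d for d in log_dirs if not (Path(d) / CLEAN_SHUTDOWN_MARKER).is_file()]


# Self-contained demo with throwaway directories standing in for log.dirs.
with tempfile.TemporaryDirectory() as clean, tempfile.TemporaryDirectory() as unclean:
    (Path(clean) / CLEAN_SHUTDOWN_MARKER).touch()
    suspicious = dirs_without_marker([clean, unclean])
    only_unclean_flagged = suspicious == [unclean]

print("only the directory without a marker was flagged:", only_unclean_flagged)
```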

Anna McDonald (47:24):

One of the things which I think would be great to have would be a property, an additional property that's a flag that says, hey, if you can't do a clean shutdown, don't shut down at all.

Kris Jenkins (47:34):

Yeah, okay.

Anna McDonald (47:35):

Right? We don't have that. We just have an amount of time.

Kris Jenkins (47:38):

It shuts down... For some reason, for that reason, it comes back up and it assumes the files are corrupt. And it's going to what? Try and fix them? Re-sync them from other nodes?

Anna McDonald (47:47):

Yeah. So it looks and it's like, hey, what do I need to do here? Do I need to truncate the log? What am I looking at? It runs through all this code. And that's in Log.scala too, just in case you would like to know. Right?

Anna McDonald (47:59):

And when we do this, there's... Because think about it, this node could have been down for a very long time.

Kris Jenkins (48:05):


Anna McDonald (48:06):

And so we're also doing cleanup, right? So we're like, hey, let's look at, do some cleanup, figure this out. Because again, it's replaying from the leader. It's figuring this stuff out from the current leader, for the partition segment. So this is the scenario we're in.

Anna McDonald (48:24):

So maybe that node was down for, I don't know, long enough where we also have some segment files we can delete because they're no longer valid. They've rolled off due to the amount of time, or settings, or whatever it is.

Anna McDonald (48:37):

And so it doesn't hurt anything for us to schedule that delete asynchronously. Again, I love asynchronousness.

Kris Jenkins (48:46):


Anna McDonald (48:46):

Right? It's awesome, it's like, whistle while you work. I'm over here doing my work and this is just going to delete asynchronous.

Kris Jenkins (48:52):

When things are quiet, you can get rid of those.

Anna McDonald (48:55):

So I'm going to read this verbatim.

Kris Jenkins (48:57):


Anna McDonald (48:58):

So we make sure to do this. We cover this for log, again for our log segment files, by renaming. We basically said, hey, if there's anything we're going to delete asynchronously, rename those to have a .deleted file suffix.

Kris Jenkins (49:14):


Anna McDonald (49:16):

And the reason for this is because, if we truncate the log, the actual, this is why we did it for the segment files, it may result in deletions for segments with matching base offsets, to segments which will be written in the future.

Anna McDonald (49:35):

And the reason for that is, you could-

Kris Jenkins (49:39):


Anna McDonald (49:40):

Yes. So there's a case where the base offset for a log segment file, if we didn't rename it, in the future... And this is, again, you have to think in an async world, where anything can happen.

Kris Jenkins (49:55):

Yeah, yeah.

Anna McDonald (49:58):

Right? So this is also before we had anything like topic IDs, or anything like that, right?

Kris Jenkins (50:04):


Anna McDonald (50:04):

So you could rename, delete, rename a topic, recreate it from start. Let's say that the async stuff hadn't run yet. Who knows what could happen because deletes happen asynchron-

Anna McDonald (50:14):

There's all kinds of things that can happen, right?

Kris Jenkins (50:16):


Anna McDonald (50:17):

And so just to be safe, we say, okay, while we're running this, we know that this is what we want to delete. So we're going to rename it and have a suffix. And this is for log segment files. Unfortunately, we were not doing that for producer state snapshots.

Anna McDonald (50:31):

So producer state snapshots would have, and this again, it says, it leaves us vulnerable to a race condition. We could end up deleting snapshot files for segments written after log recovery. And producer state snapshots, the reason we take those, and I love this, is because of people who are esteemed, like Kafka Streams.

Anna McDonald (50:54):

And so Kafka Streams aggressively deletes stuff and truncates stuff when it doesn't need it anymore. After you do a repartition topic, we used to actually say, okay, well producer state is based on the last time this producer ID produced to this topic. But if you're aggressively deleting out of that source topic, it invalidates producer state super fast.

Anna McDonald (51:16):

And so instead of that we're like, okay, that doesn't work. Instead of that, we use these snapshot files. So we take a snapshot, that's an example where we do this for transactional producers, right?

Anna McDonald (51:26):

And so that's what this is saying. It's like, hey, take a snapshot, make sure that this is more durable and long-lived than the actual topic, because there's reasons for aggressively deleting. And if you think about this, just like a log segment could suffer from asynchronous delete, where other things have happened and now you've got something named the same, so also could a producer state. So we weren't doing that.

Anna McDonald (51:54):

And I'm betting, and this is a little disappointing, I never saw this, which is... Sometimes it's like seeing a dodo bird in the wild. Or I think those are all gone. What are those called? Oh, like a pileated woodpecker in the wild. Because those are really cool, they're huge. Like Woody Woodpecker, that's what they look like, but real life. They're crazy. You just want to see one. You don't want it destroying your house, but you're kind of like, it would kind of be neat to see that.

Anna McDonald (52:24):

And I haven't seen this one. I'm not happy either way.

Kris Jenkins (52:26):

I see what you're saying.

Anna McDonald (52:27):

Either I see them too much, or I don't. But I think this is really cool, and I'm just glad. Anything that helps us, so EOS and Kafka Streams is a passion of mine. A lot of people use it. It bundles a transactional producer and a read-committed consumer inside of it. So anything that hardens transactions is pretty important to me. So I was really glad that we found this.

Kris Jenkins (52:48):

Okay, so let me make sure I've understood this. You're going through, you're recovering the file. You say, I'm doing recovery, those files can be deleted at any time. So you take a note of those and delete them asynchronously. Then I come along and I re-sync a file, which happens to be the same file.

Anna McDonald (53:07):

It's not the same file.

Kris Jenkins (53:08):

It was not the same file?

Anna McDonald (53:08):

But the base offsets, right? Base offset-

Kris Jenkins (53:11):

The same base offset, yeah.

Anna McDonald (53:12):

Yeah, the same base offset-

Kris Jenkins (53:13):

The thing that identifies that file.

Anna McDonald (53:13):

And there are things that can happen to... You could reset. You could do all kinds of... There are reasons why another file with the same base offset could exist and that one would not want to be deleted.

Kris Jenkins (53:25):

Yeah. And so, I've marked it for deletion, I've written a new one, and then the deletion happens, and it takes out my new one.

Anna McDonald (53:32):


Kris Jenkins (53:32):

Yeah, okay. I'm with you.

Anna McDonald (53:33):

Yep. Exactly, exactly. And we cover for that in log segments. We just didn't cover for it in producer state snapshots. And that was resolved in 3.0 too. So that made me happy.
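The difference between the two behaviors can be shown with ordinary files. This is a sketch of the rename-before-asynchronous-delete pattern described above, not Kafka's code: the function names are hypothetical, the "asynchronous" delete is simulated by calling the returned function later, and the snapshot file name is illustrative.

```python
import os
import tempfile


def schedule_delete_by_name(path):
    """Unsafe: remember the name and delete whatever has it later."""
    def run():
        if os.path.exists(path):
            os.remove(path)
    return run


def schedule_delete_with_rename(path):
    """Safer: rename now, so the later delete can only hit the old file."""
    doomed = path + ".deleted"
    os.rename(path, doomed)

    def run():
        if os.path.exists(doomed):
            os.remove(doomed)
    return run


def new_file_survives(schedule):
    """Replay the race: mark old file, write new file, then run the delete."""
    with tempfile.TemporaryDirectory() as log_dir:
        snapshot = os.path.join(log_dir, "00000000000000000042.snapshot")
        with open(snapshot, "w") as f:
            f.write("old state")
        run_delete = schedule(snapshot)   # recovery marks the old file
        with open(snapshot, "w") as f:    # a new file reuses the base offset
            f.write("new state")
        run_delete()                      # the delayed delete finally runs
        return os.path.exists(snapshot)


print("delete by name keeps the new file:", new_file_survives(schedule_delete_by_name))
print("rename first keeps the new file:  ", new_file_survives(schedule_delete_with_rename))
```

Deleting by name takes out the new file; renaming first means the delayed delete can only ever touch the file that was marked.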

Kris Jenkins (53:42):

3.0.2 or 3.0 as well?

Anna McDonald (53:45):

3.0 as well. 3.0.0.

Kris Jenkins (53:46):

Right, okay.

Anna McDonald (53:47):

As well. Thank you for clarifying that.

Kris Jenkins (53:49):

Just to be sure, just to be sure. This is an aside, but one of my favorite albums of all time is called Soft Music to Do Nothing To. And the musician just released a sequel, annoyingly called Soft Music to Do Nothing 2. I mean, that's really not helpful to anyone.

Anna McDonald (54:06):

Oh, I wish it was To-Two.

Kris Jenkins (54:09):

Oh no, he went just with... I mean, I'm never going to get Alexa to play it.

Anna McDonald (54:12):

Or Two Squared.

Kris Jenkins (54:14):

Yeah. But-

Anna McDonald (54:16):

That would be cool. He should put math in there.

Kris Jenkins (54:17):

Anyway. Yeah. That was an aside we didn't need, but it was very [inaudible 00:54:23]-

Anna McDonald (54:23):

Did he do it on purpose, like a pun?

Kris Jenkins (54:25):

I think so. I think so.

Anna McDonald (54:26):

That's awesome. I like that. That's kind of a devious mind, like good luck getting anything to play this.

Kris Jenkins (54:31):


Anna McDonald (54:32):

I kind of want to now design unplayable album titles. This is amazing.

Kris Jenkins (54:37):

If you come up with an unplayable album title, I'll write the unlistenable album to go with. How about that?

Anna McDonald (54:41):

Oh, that'd be great, because I am not musical in any sense of the word, so that would be awesome. All right, so the last one.

Kris Jenkins (54:50):

Yeah. So, the last one, Sin, Purgatory. This is a really dark note to end the podcast on.

Anna McDonald (54:55):

Yes, it is. It's kind of like my favorite thing in the entire world. And Jeff Kim found this and when he did, I Slacked him immediately. I was like, "What, son? How did we not do this?" You don't ever get that thing where you're like, woo. And here is what happened, that was me. I was delighted. I was delighted, because I was like, boom, I'm putting that on the podcast. So this is... I love this.

Kris Jenkins (55:21):

Airing your dirty laundry.

Anna McDonald (55:22):

Yeah. Well, AK, it's open source. That's the best part about open source, is there's no dirty laundry to air. It's everybody's laundry in a mass pile. You see it. It reeks of this. And also it's just being transparent, and it's nobody's fault, because the other thing I love about open source is there's a natural prioritization that occurs, and the initial design, and the initial intent of this feature, there is a reason why I believe that this didn't come up. So, that's kind of also interesting to see, is that in open source, squeaky wheel gets the grease.

Kris Jenkins (56:01):

Okay. Yeah, yeah. Yeah.

Anna McDonald (56:01):

And so I don't find this embarrassing at all for anybody. It's just really like, oh, it's one of those. I did, I'm going to giggle about it again. So it's KAFKA-14334. And I call it, "Whoops, I forgot to buy you a gift by Christmas," because what it is in fetching, when you're a consumer, we want to be good stewards. And I always say this, everything in Kafka is a request. There's a consumer fetch request, producer produce request, replica fetches. Everything is a request. And that is, if you can't make requests, if your request pool is saturated, then Kafka doesn't work.

Anna McDonald (56:41):

And there are things that you can do on the consumer side to be a good steward, to make sure that when you're fetching it's worth it. It's almost like setting constraints. One of them is Min Bytes. So that means, you know what, don't keep giving me these piddly Fetch requests from a consumer. Don't actually send the data back to me until you have at least a meg.

Kris Jenkins (57:04):

Yeah. Make it worth my time.

Anna McDonald (57:06):

If you get really annoyed about it, you could look at your fetch and be like that. That's what I would do, is like please, what is this? This is nothing. I need more data. So that's another one. And people do this, often to be good stewards of their infrastructure, and also because for whatever processing they're doing, they want it to be worth it. There's a lot of reasons why you might want to set Min Fetch Bytes, right?

Kris Jenkins (57:29):

Yeah. I've seen recently, like just for catching up on consumer lag, that makes a huge difference. 
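
For readers following along, the two consumer settings under discussion are `fetch.min.bytes` and `fetch.max.wait.ms`. Here is a minimal sketch, assuming a plain Python dictionary holds the property names instead of any particular client library. The 1 MiB value is only an illustration of "at least a meg", and `should_respond` is a hypothetical helper, not broker code:

```python
# Consumer fetch settings, keyed by the standard Kafka consumer property names.
# The values are illustrative, not recommendations.
consumer_config = {
    "fetch.min.bytes": 1024 * 1024,  # hold the response until about 1 MiB is available
    "fetch.max.wait.ms": 500,        # but never hold it longer than this (500 ms is the default)
}


def should_respond(available_bytes, waited_ms, config):
    """Hypothetical helper: the condition a waiting fetch is expected to satisfy."""
    enough_data = available_bytes >= config["fetch.min.bytes"]
    timed_out = waited_ms >= config["fetch.max.wait.ms"]
    return enough_data or timed_out
```

The point of the pairing is that a fetch comes back either because it is "worth it" or because the wait limit has been reached, whichever happens first.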

Anna McDonald (57:35):

Yeah. Yeah. It's for throughput too. Let's batch it up. Like death by a thousand paper cuts, we don't like that. So that works great when you're fetching from a leader. And what happens is you say, "Hello, I have a criteria for my Fetch. And when that criteria is not met, I must go somewhere to wait." And where does one wait? Purgatory.

Kris Jenkins (57:57):

In purgatory. Yes.

Anna McDonald (57:59):

It's a definition of it. It's my favorite thing in the world. By the way, Lucas Bradstreet, they better not change the name of Purgatory.

Kris Jenkins (58:11):

Okay. You're serious. You are very serious about that. Yeah.

Anna McDonald (58:12):

I am serious. I'm watching you. I'm watching you. This is live, not live but it's taped. But it will be on the air. So, I love Purgatory.

Kris Jenkins (58:21):

It's indelible, it's immutable.

Anna McDonald (58:22):

It is, that's right. So when you're fetching from a leader, it works great. So I have a criteria for my fetch request. I go and I go sit in Purgatory and I just wait. When that criteria is met, I'm popped out of Purgatory and the data goes back. All is well and good for leaders. We never did that for followers.

Anna McDonald (58:42):

So Follower Fetch was introduced. It was introduced as a way to... There are a couple of reasons why it was introduced, I believe. If you look, it's really had to do more with location. It had to do with spanning multiple data centers, and we wanted to make sure you could have your consumers consume from the closest data center, or the closest availability zone, stick there for cost purposes.

Kris Jenkins (59:12):

Yeah, yeah.

Anna McDonald (59:12):

Absolutely. But there's another reason to do it. And this reason really wasn't... And that's, by the way, if anyone cares, it's KIP-392. That is, "Allow consumers to fetch from closest replica." One of the things that came up, and has come up since, is Fetch from Follower is also used to scale out. So, one of the things in Kafka is your lowest unit of scale is a partition. So if I am trying to span out consumption, I can only have one instance of a consumer group consuming from a partition at a time.
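
As a rough illustration of what KIP-392 asks you to configure, the sketch below lists the relevant property names (`broker.rack` and `replica.selector.class` on the broker, `client.rack` on the consumer) as plain dictionaries. The rack names and broker ids are made up, and `pick_replica` is a deliberately simplified, hypothetical stand-in for rack-aware selection, not the broker's actual selector logic:

```python
# Property names come from KIP-392; rack names and broker ids are invented for the example.
broker_config = {
    "broker.rack": "us-east-1a",
    "replica.selector.class": "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}
follower_fetch_consumer_config = {"client.rack": "us-east-1a"}


def pick_replica(client_rack, leader_id, replica_racks):
    """Simplified selection: prefer a replica in the client's rack, else use the leader."""
    for broker_id in sorted(replica_racks):
        if replica_racks[broker_id] == client_rack:
            return broker_id
    return leader_id


replica_racks = {1: "us-east-1a", 2: "us-east-1b", 3: "us-east-1c"}
closest = pick_replica("us-east-1b", leader_id=1, replica_racks=replica_racks)
```

With these made-up racks, a consumer in `us-east-1b` reads from broker 2 even though broker 1 is the leader, which is the "stick to the closest availability zone" behavior described above.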

Anna McDonald (59:55):

If I use Fetch from follower, then I can span out my consumption. Not so much... And this is actually, if you look at it, not so much... I mean you can do it this way, but I think something like a threaded consumer, like a parallel consumer is a better fit for when you actually need to consume and then process, thread that kind of processing.

Anna McDonald (01:00:24):

But let's say that you're running something and you're like, "Hey, I've got a ton of consumers that need to consume from this topic, and I need some way to scale that out, and it's not enough to have one consumer group. I need multiple consumer groups," if you don't have Fetch from Follower, you concentrate all that on a single node. 

Anna McDonald (01:00:44):

So let's say I have 7,000 consumer groups and they all need to consume. Another way to scale this out, because again, remember, we have those replicas that are just sitting out there on other nodes, is to do Fetch from Follower. When we do that primarily, a lot of times it's a performance issue, because if not, you would just have one consumer group. So it can be a performance issue. And that is why I think this was found, because basically, because we ignore any criteria you've set if you're doing Fetch from Follower, you're like in Purgatory with that wait music. I'm looking at an imaginary watch. That's like who you are, except for you're absolutely going to get bounced out of Purgatory when you hit Fetch Max Wait Milliseconds. That's still enforced.

Kris Jenkins (01:01:38):

Right. Okay.

Anna McDonald (01:01:39):

So it's really like we're like, "Yeah, yeah, yeah. We know about Fetch Max Wait Milliseconds. We'll bounce everybody out, because that's the longest you can wait." But any of those other criteria-

Kris Jenkins (01:01:49):

... any criteria, they'd just all be ignored and you sit there until next time.
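
To make the behavior concrete, here is a toy model of a fetch waiting in purgatory. It is a sketch built only from the description in this conversation, not from the broker source: the function name, the `criteria_rechecked` flag, and the arrival times are all hypothetical. With the flag on, the fetch is released as soon as enough bytes have arrived (the leader case). With it off, only the timeout releases it (the follower case described here).

```python
def time_in_purgatory(min_bytes, max_wait_ms, arrivals, criteria_rechecked):
    """Return the time in ms at which a waiting fetch is released.

    arrivals: list of (time_ms, num_bytes) events in time order.
    criteria_rechecked: True models the leader; False models the follower
    behavior described in the episode, where only the timeout applies.
    """
    available = 0
    for time_ms, num_bytes in arrivals:
        if time_ms >= max_wait_ms:
            break  # the timeout fires before this data arrives
        available += num_bytes
        if criteria_rechecked and available >= min_bytes:
            return time_ms  # popped out of purgatory as soon as the criteria are met
    return max_wait_ms  # bounced out by fetch.max.wait.ms


# A "fire hose" topic: more than 1 MB arrives within 20 ms.
arrivals = [(10, 600_000), (20, 600_000)]
leader_ms = time_in_purgatory(1_000_000, 500, arrivals, criteria_rechecked=True)
follower_ms = time_in_purgatory(1_000_000, 500, arrivals, criteria_rechecked=False)
```

In this model the leader fetch returns at 20 ms while the follower fetch sits until 500 ms, which is the flat line in the perf test that Anna goes on to describe.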

Kris Jenkins (01:01:56):

Oh, fun. Oh, God. 

Anna McDonald (01:01:56):

I'm sorry, but all I can picture is somebody setting that and then looking at a PERF test and being like, what the heck?

Anna McDonald (01:02:03):

And I think the default Fetch Max Waits is either 50 milliseconds or 500. I have to remember which one it is, but whatever it is, it's just going to be a line. That's your PERF test. That's when they all come back. Wouldn't that be hilarious?

Kris Jenkins (01:02:16):

I'm sort of surprised that didn't get picked up.

Anna McDonald (01:02:18):

Do you see what I mean? And that's where I think the usage comes in. So if all I care about is fetching from the closest, maybe I'm not tuning on performance. Maybe I don't really care. I think that this came about because of that second use case. Where we're going, we really need multiple instances. We need to consume from multiple instances of a partition for speed. And that's the kind of thing you end up PERF testing, and that's the kind of thing you might end up tuning something, making it worth your while and setting a criteria.

Anna McDonald (01:02:48):

And then all of a sudden you're like, what the hay, son, I've got a fire hose topic, I've got plenty of throughput. I'm hitting my Min Fetch Bytes.

Kris Jenkins (01:02:57):

And there's some imaginary wall there. 

Anna McDonald (01:02:59):

Yeah. Why are my consumer fetch requests like doo doo doo? And huge props to Jeff Kim for finding this too. 

Kris Jenkins (01:03:12):

When was that fixed? Which version did that get fixed in?

Anna McDonald (01:03:15):

3.4.0. This is my most recent one. Yeah. And 3.3.2.

Kris Jenkins (01:03:19):

That's not even out yet, is it?

Anna McDonald (01:03:20):

It was backported to 3.3.2. Yeah.

Kris Jenkins (01:03:24):

Okay. 3.3.2. Cool. Crikey. Do you think if we do a Halloween podcast next year, do you think you're going to find any new interesting bugs, or do you think they'll all be fixed by then?

Anna McDonald (01:03:33):

Oh my gosh, yes. It was really difficult this year to pick, because I still have so many other ones. Yes. I do try to pick ones that are fixed, unless they're really interesting and good. I do try to pick one, because that's always a nice ending. You know what I mean? Nobody wants a cliffhanger, like, "Yeah, this sucks. Bye." That's not good. So I try to pick ones that have been fixed. 

Anna McDonald (01:03:53):

I always take recommendations and suggestions. I will add my... I am on Mastodon. I'm on Hachyderm, because it's like the best pun ever. So I'm JB Fletch on Hachyderm, on Mastodon, if you ever want to hit me up with a thing.

Kris Jenkins (01:04:09):

Awesome. We'll put your link in the show notes.

Anna McDonald (01:04:11):

Yes. Yeah. But thank you very much for letting me come on again, because this is always-

Kris Jenkins (01:04:18):

No, I've learned some interesting things. I've learned some scary things. 

Kris Jenkins (01:04:21):

And I'm sure our listeners have. Very cool. How long have you been learning about this? How did you get this much knowledge?

Anna McDonald (01:04:31):

So I think, and I always say this, I work as a customer success technical architect. And our job is pro... I know, it's the longest title ever. It's proactive. It's really proactive support. We like to stop people from running into problems. So I play with use cases all the time. I get to go and look at real things that people are doing, which is why I went here over Eng. And so in order to be able to talk about the entire, and I always say this, by the way, I am the Apache Kafka Jeopardy champion. I don't know if you knew that. Like current.

Kris Jenkins (01:05:10):

I did not know that.

Anna McDonald (01:05:11):

I am the Apache Kafka Jeopardy champion too. And I think it's because in my job I am exposed to the whole horizontal aspects of the entire AK ecosystem, all the client libraries, all this kind of stuff. And to me, I want to understand things so I can explain it to my kids, and if I can't, then I don't understand them good enough. I have to be able to understand them. And I think when you do, you can reason about those, and you can give valuable insight into our roadmap and our direction and stuff like that. 

Anna McDonald (01:05:41):

So, I just really enjoy... I don't like surface level knowledge. So I think it's more of a me thing, where I'll read KIPs, I'll read JIRAs, I'll go look at AK source code. I'll figure out the underlying framework for this kind of stuff. Replica Fetchers is a great example. Those are highly misunderstood. And I did a talk about them, and that's kind of what I try to do too. I also try to do talks to demystify parts of AK that I feel aren't understood enough.

Anna McDonald (01:06:12):

But again, I think, please, anybody feel free, if I've got something wrong, to speak up and pipe up, because again, yeah, I am sharing what I know at this point. Some things I know very well. Other things, I haven't deep dived into to the extent that I have in other areas. But I think it's fun too. It's much more fun to be on a call and be able to know the actual deep internals, so if people ask you the what ifs you can answer them. I just think that's part of our job, is to understand and be able to reply to "just ask me anything about AK."

Kris Jenkins (01:06:50):

Yeah. Yeah, especially when you are the Jeopardy champion.

Anna McDonald (01:06:52):

I know. I've got to defend that. I don't know where I'm going to defend my title, but I need to, and I'm happy to.

Kris Jenkins (01:06:58):

That leads me to my last question for you, because you might have to defend it at Kafka Summit London, which is coming up, and the call for papers is now open. Do you have a topic in mind for Kafka Summit London?

Anna McDonald (01:07:07):

So, I kind of do. I might do one on pragmatic event streaming patterns for legacy industries. And I'm not really sure how to say, "You have an on-prem DC," other than to say, "Legacy industries," or I could just say, "For companies that have an on-prem DC," because I've been getting... I think it's very unfortunate that you have people who have never worked in an actual place that has an on-prem DC, or existing code base, or existing infrastructure. We're talking about companies, and that's where I grew up, I worked at SAS Institute.

Anna McDonald (01:07:50):

There are pragmatic patterns that are best in class, best practices, and pretending like everybody is greenfield and that's what moves them forward is nonsense. And so I am about done with that. So, I may present on that because I think... People ask me all the time, "Well, where are those patterns written down," and I'm like, "In my head," which isn't helpful. So it's another way for me to document. So that's kind of what I'm thinking I might do now. And of course I'll talk about Kafka Streams, as usual.

Kris Jenkins (01:08:18):

Yeah, as always. Cool. Well that's one to look forward to. Anna, it's been a pleasure. I wish we could have got you in time for Halloween, but it's nice to do before Christmas.

Anna McDonald (01:08:26):

This is fun though, because look, can I go show this? I just want to show it because-

Kris Jenkins (01:08:30):

Yeah, absolutely. And I will describe it for the people who are just listening. Hold that up high. Oh, it's the Apache Kafka, bauble, K.

Anna McDonald (01:08:39):

Yeah, it's a K. People should know this. Don't put the stickers on the wrong way like I did.

Kris Jenkins (01:08:45):

Are you going to stick that on your Christmas tree?

Anna McDonald (01:08:48):

I don't... I think I might keep it in my office, because it's like a tiny Christmas tree and it makes me happy. 

Kris Jenkins (01:08:52):

Yeah. That's good. And next year, next year maybe at Current we'll give away little Christmas baubles based on that design.

Anna McDonald (01:08:59):

See, that would be cool. That would be really cool. And I can bring my glue gun. I have a glue gun. It's awesome. 

Anna McDonald (01:09:02):

Everyone should have a glue gun.

Kris Jenkins (01:09:04):

Anna, until then, thank you very much.

Anna McDonald (01:09:06):

No problem. Thank you so much. It's been fun.

Kris Jenkins (01:09:09):

Cheers. Catch you again.

Kris Jenkins (01:09:11):

The one, the only Anna McDonald there. Shall I tell you my favorite Anna McDonald fact? This is how dedicated she is in her fandom of Angela Lansbury; Anna owns a boat and she named her boat Murder She Floats, which I think is genius. If you want to get more from Anna's brilliant and unique mind, then head to, which is our free education site for all things Kafka. There you will find her complete course called Thinking and Events, which will help you to design better event-driven systems. It's there along with a raft of other useful free courses. So go and take a look when you get a chance. 

Kris Jenkins (01:09:52):

Meanwhile, if you have the knowledge you need but not the Kafka, then take a look at our Kafka as a service, Confluent Cloud. You can get a cluster up and running in minutes and let our engineers worry about maintaining it for you. And if you would like to get a hundred dollars of extra free credit to your account, then use the code Podcast100 after you've signed up and it will be added on behalf of us at Streaming Audio. And with that, it remains for me to thank Anna McDonald for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.

Kafka JIRA-10888: The sticky partitioner
Kafka JIRA-9648: SYN cookies with evil frosting
Kafka JIRA-9211: TCP delays after upgrading
Kafka JIRA-12686: Attack of the overloaded cluster
Kafka JIRA-12964: A Killer From the Past Strikes When You Least Expect It
Kafka JIRA-14334: Whoops! I forgot to buy you a gift by Christmas
It's a wrap!