Streaming Audio: Apache Kafka® & Real-Time Data

Apache Kafka 2.6 - Overview of Latest Features, Updates, and KIPs

August 06, 2020 · Confluent, original creators of Apache Kafka® · Season 1, Episode 113

Apache Kafka® 2.6 is out! This release includes progress toward removing the ZooKeeper dependency, new client quota APIs in the admin client, new disk read and write metrics, and support for Java 14. 

In addition, there are improvements to Kafka Connect, such as allowing source connectors to set topic-specific settings for new topics and expanding the Connect worker's internal topic settings. Kafka 2.6 also augments metrics and adds emit-on-change support for Kafka Streams, among other updates. 


Tim Berglund (00:00):
Welcome to Streaming Audio, a podcast about Kafka, Confluent, and the cloud. We've got some great stuff for you today. Hope you enjoy.

Tim Berglund (00:12):
Hi. I am Tim Berglund with Confluent. I'm here by Clear Creek in the scenic Guanella Pass in the mountains of Colorado to tell you all about Apache Kafka® 2.6. As usual, I'd like to divide the update into three parts: Core Kafka, Kafka Connect, and Kafka Streams. I want to highlight the most important KIPs that have been merged in for this release. And remember, if you don't know, a KIP is a Kafka Improvement Proposal. That's a package of change, sometimes small, sometimes big, that gets merged into each new release of Kafka. There are a lot of great KIPs in 2.6, so let's get started. KIP-546 makes it a little bit easier to administer quotas. Now, quotas are fundamentally complex to administer because you've got this combinatorial matrix of user and client, and you can have different quotas at each intersection in that matrix.

Tim Berglund (01:00):
So there can be a lot of things to keep track of. Basically, what this KIP does is add that functionality to the admin client. So it's now a native API, part of the native admin API as you see here, and there's the kafka-client-quotas shell command that you can use to do this from the command line if you don't want to code it in Java directly.
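Just to make that concrete, here's a minimal sketch of the KIP-546 admin API from Java. The bootstrap address, user, client ID, and quota value are all made up for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class QuotaSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (Admin admin = Admin.create(props)) {
            // One intersection of the user/client matrix: user "alice" with client.id "my-app"
            ClientQuotaEntity entity = new ClientQuotaEntity(Map.of(
                    ClientQuotaEntity.USER, "alice",
                    ClientQuotaEntity.CLIENT_ID, "my-app"));
            // Cap produce throughput for that pair at 1 MB/s
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                    entity,
                    List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```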

Tim Berglund (01:50):
KIP-551 exposes some new disk read and write metrics. Disk activity can be a fundamentally limiting factor for bandwidth and for latency; it's a thing you need to keep your eye on, so this just improves our visibility there. As you see, we can see the total number of bytes read and the total number of bytes written by a broker, and that's from all the disks that broker may have to babysit. We don't know how many disks that is for any given broker, but this is an aggregate of all that activity. KIP-568 gives us an API for triggering a consumer group rebalance. Now, normally, if you're a consumer, you're always part of a group, whether it's a group of one or a group of many, and you kind of want a hands-off attitude toward consumer group rebalancing. That is a thing that's supposed to happen for you. It's driven by the client library on the consumer side, and you don't touch it a lot from an API standpoint. That's how it's supposed to be. But sometimes you have to, because of a fundamental design principle: abstractions always leak. So all that stuff that's abstracted away in the layer below you in the API, with this KIP you've now got a way to reach down there and trigger a rebalance yourself. There's a method called enforceRebalance. There could be some condition in the system that you detect that tells you you have to trigger a rebalance.
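Here's a minimal sketch of what that could look like in a consumer loop; the topic name and the trigger condition are hypothetical stand-ins:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RebalanceSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("group.id", "sketch-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.value()));
                if (somethingLooksWrong()) {
                    // New in 2.6 (KIP-568): explicitly ask the group to rebalance
                    consumer.enforceRebalance();
                }
            }
        }
    }

    // Stand-in for whatever application-specific condition tells you a rebalance is needed
    private static boolean somethingLooksWrong() {
        return false;
    }
}
```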

Tim Berglund (02:40):
Now you have an API for doing that. KIP-574 is an improvement to the kafka-configs shell script. Now, this has always let us set individual config parameters one at a time. So you've got this fairly interesting array of command-line switches that lets you pick which config you want to set, and then other switches that let you set that config according to its own particular idiom. But if you wanted to, like, script a bunch of those, maybe, I don't know, have your configuration act like code, you'd have to have a shell script that just called kafka-configs over and over and over again. One might ask, what year is this? Well, the opinion of KIP-574 is that it's this year, it's 2020, and configuration might be code. Now we can point to a file that will pull in those configs rather than invoking the config command over and over again.
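As a sketch, the new invocation looks something like this; the entity and the file name are made up, so check kafka-configs --help in your build for the exact flags:

```sh
# KIP-574: apply every key=value pair in the file in one invocation
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 0 \
  --alter --add-config-file my-configs.properties
```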

Tim Berglund (03:29):
It just sucks in the configs from that file. It's a much easier way to read and write and collaborate around that little bundle of config. So, a nice improvement to kafka-configs. KIP-606 adds metadata to metrics reporting. Now, this is a cross-cutting thing, right? It affects clients, the broker, Streams, Connect, really all parts of the system, but we put it in Core Kafka because I think that's really where it chiefly belongs. The idea is that when metrics are being reported, you might not know where a given event is coming from. There could be a lot of things you're looking at: it could be something from Connect, it could be something from a Streams application. Now, with this KIP, that context is reported, so you don't have to go fishing, or find hacks and ways of injecting your own context to kind of back that out. This KIP makes that official, as it should be.
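If you ship your own metrics reporter, that context arrives through a new callback. This is only a sketch; the method names follow KIP-606 (contextChange and MetricsContext.contextLabels()), so double-check them against the 2.6 Javadoc:

```java
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsContext;
import org.apache.kafka.common.metrics.MetricsReporter;

public class ContextPrintingReporter implements MetricsReporter {
    // New in 2.6 (KIP-606): labels identifying where these metrics come from
    @Override
    public void contextChange(MetricsContext metricsContext) {
        metricsContext.contextLabels().forEach((k, v) -> System.out.println(k + "=" + v));
    }

    // The rest of the MetricsReporter contract, left as no-ops for this sketch
    @Override public void configure(Map<String, ?> configs) {}
    @Override public void init(List<KafkaMetric> metrics) {}
    @Override public void metricChange(KafkaMetric metric) {}
    @Override public void metricRemoval(KafkaMetric metric) {}
    @Override public void close() {}
}
```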

Tim Berglund (04:21):
Now let's talk about some Kafka Connect KIPs. First up is KIP-158. This deals with systems where topic auto-creation is enabled. Now, a lot of admins shut that off, right? So if you want a topic to be created, you have to create it through some administrative process; maybe there's a form you have to fill out to get permission. Every organization has its own approach to this. But some clusters have it turned on, and so if Connect is able to auto-create a topic when a new connector starts up, or when a connector detects some new input and needs to create a new topic, you'd maybe like to apply some configuration to those new topics. Now, with this KIP, you can put those new configuration parameters in the connector configuration itself, and they will be applied to any new topics that the connector creates.
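In connector-config terms, that could look something like this sketch. The connector is the FileStreamSource example that ships with Kafka, the values are made up, and the worker's topic.creation.enable setting (which defaults to true) has to be left on:

```properties
name=example-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
file=/tmp/input.txt
topic=example-topic

# KIP-158: settings applied to any topic this connector has to create
topic.creation.default.replication.factor=3
topic.creation.default.partitions=10
topic.creation.default.cleanup.policy=compact
```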

Tim Berglund (05:09):
KIP-585 gives us the ability to apply predicates to SMTs. Now, an SMT, if you don't know, is a Single Message Transform. That's a little function, basically, that you can apply to every message as it passes through Kafka Connect, whether from a source or a sink perspective, coming in or going out; we can modify that message in some stateless way. Super handy thing to do. What this KIP lets us do now is say, here's some logic that we're going to use to determine whether any given Single Message Transform should be applied to a message. And you're able to define custom predicates; there's an interface you can implement to do that. The predicates that are included are TopicNameMatches, so if the message is from or to a particular topic, then the SMT kicks in; HasHeaderKey, which checks whether a particular Kafka message header is present; and RecordIsTombstone, which applies the SMT (or not) if the record is a tombstone.
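Here's a sketch of how that wiring could look in a connector config; the transform, the predicate name, and the topic pattern are all made up for illustration:

```properties
# Mask a field, but only for records on topics whose names start with "pii-"
transforms=maskSsn
transforms.maskSsn.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskSsn.fields=ssn
transforms.maskSsn.predicate=isPiiTopic

predicates=isPiiTopic
predicates.isPiiTopic.type=org.apache.kafka.connect.transforms.predicates.TopicNameMatches
predicates.isPiiTopic.pattern=pii-.*
```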

Tim Berglund (05:59):
Those are the ones that come out of the box, and you can write your own. One note: do be careful with this kind of thing. SMTs are an absolutely essential part of Kafka Connect as a data integration technology, but you don't want to go writing, like, code in your SMTs. If you find yourself doing that, or having trouble wrestling with the fact that they're stateless, you want to look to something like ksqlDB or Kafka Streams to actually do stream processing, and not do that in Connect. But predicates on SMTs: huge improvement, great thing for a lot of use cases. KIP-610 gives us a little bit more flexibility with the so-called dead letter queues in Kafka Connect. Now, previously, if there was a problem processing an SMT, a Single Message Transform, or deserializing on the way out, then you could take that message and automatically write it to a topic you designate, called a dead letter queue (or, as some people say, a dead letter topic).

Tim Berglund (06:57):
If there were problems in another part of the sink's processing chain, not in the SMT and not in the deserializer, then the dead letter queue wasn't an option for you. So now, KIP-610 says: for any problem dealing with a message in a sink connector, whether it's in those two parts or somewhere else in the chain, you can still write that message to a dead letter queue.
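For connector developers, KIP-610 also exposes this through an ErrantRecordReporter that a sink task can use to route a bad record to the dead letter queue itself. A hedged sketch; the external-write method here is hypothetical, and the reporter is only available when the worker is on 2.6 and a dead letter queue is configured:

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.ErrantRecordReporter;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public abstract class ExampleSinkTask extends SinkTask {
    private ErrantRecordReporter reporter;

    @Override
    public void start(Map<String, String> props) {
        // Null when no dead letter queue is configured (and absent entirely on pre-2.6 workers)
        reporter = context.errantRecordReporter();
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            try {
                writeToExternalSystem(record); // hypothetical sink-specific write
            } catch (Exception e) {
                if (reporter != null) {
                    reporter.report(record, e); // the framework routes it to the DLQ
                } else {
                    throw new ConnectException(e);
                }
            }
        }
    }

    abstract void writeToExternalSystem(SinkRecord record);
}
```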

Tim Berglund (07:47):
All right, let's look at some Kafka Streams KIPs. First up is KIP-441, which helps scale-out of a Kafka Streams application work a little more smoothly. Now, when you're scaling out a Kafka Streams application, there are kind of two things going on. One, there's something very much akin to a consumer group rebalance, where you're taking the partitions that had been assigned to, say, three nodes and reshuffling them among the five nodes you've got in your newly scaled-out Kafka Streams cluster. So you have to rebalance partitions. You also have to move state, right? Since most Streams operations are stateful, you've got your RocksDB instances, or just in general your state stores, holding that state in each node of the Kafka Streams cluster, and that state has to be moved to the new nodes, the two new nodes in this hypothetical three-to-five-node scale-out we're talking about here. As those partitions move from the three nodes to the five, the state has to go with them. Previously, without KIP-441, you actually had to pause the availability of the cluster; you had to stop processing messages while that state was being moved. Now, while the data is being moved from its previous node to its new node, the previous node is still able to serve requests. So you're still handling traffic, and you don't have to stop stream processing while you're scaling out.
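KIP-441 also comes with a few tuning knobs. The names here are per the KIP, and the values shown are just the 2.6 defaults, so treat this as a sketch:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class ScalingConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app"); // hypothetical app
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // How far behind a replica's state may lag and still count as "caught up"
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);
        // How many extra "warm-up" replicas may be restoring state at once
        props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);
        // How often to probe whether warm-up replicas have caught up
        props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 600_000L);
    }
}
```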

Tim Berglund (08:38):
So, a huge benefit to elasticity in Kafka Streams clusters. KIP-447 is an improvement to the scalability of Kafka Streams applications with exactly-once semantics enabled. This is kind of a subtle thing, and I encourage you to read the KIP for the details. But basically, for a Streams application that's dealing with a large number of source partitions on an input topic, there could be scalability issues in terms of memory usage in the Streams cluster and things like that. So again, dig into the details here; I'm not going to take you through all of that, but KIP-447 makes it work a little bit better. So if you've got a big input topic and kind of a beefy Streams application, you might see better results now. Check it out.
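Opting in is a one-line config change; 2.6 adds a new processing-guarantee value for this. A minimal sketch, and note that per the KIP your brokers need to be on version 2.5 or newer for this mode:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class EosSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-eos-app"); // hypothetical app
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Classic EOS ("exactly_once") uses one producer per task;
        // KIP-447's "exactly_once_beta" uses one producer per thread, which scales better
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
    }
}
```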

Tim Berglund (09:27):
All right, that's all I've got for you in this summary. But what do you do if you want to know more? You always read the release notes. Those are going to have all the changes in 2.6; you won't miss anything there. There will also be a release blog post, which is a good summary that might go into a little more detail on a few KIPs, so you want to check that out too. But no matter what you do, get started: download 2.6, start putting some of these changes into effect, and we always look forward to hearing what you build. So connect with us, however you do that, Twitter, Slack; we'd love to hear from you. Thanks. 

And there you have it. I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me at @tlberglund on Twitter. That's @tlberglund. Or you can leave a comment on the YouTube video or reach out in our Community Slack. There's a Slack sign-up link in the show notes if you want to register there. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through iTunes, be sure to leave us a review there. That helps other people discover the podcast, which we think is a good thing. So thanks for your support, and we'll see you next time.