Streaming Audio: Apache Kafka® & Real-Time Data

Scaling Apache Kafka Clusters on Confluent Cloud ft. Ajit Yagaty and Aashish Kohli

May 11, 2022 · Confluent, original creators of Apache Kafka® · Season 1, Episode 214

How much can Apache Kafka® scale horizontally, and how can you automatically balance, or rebalance, data to ensure optimal performance?

You may require the flexibility to scale or shrink your Kafka clusters based on demand. With experience engineering cluster elasticity and capacity management features for cloud-native Kafka, Ajit Yagaty (Confluent Cloud Control Plane Engineering) and Aashish Kohli (Confluent Cloud Product Management) join Kris Jenkins in this episode to explain how the architecture of Confluent Cloud supports elasticity. 

Kris suggests that optimal elasticity is like water from a faucet—you should be able to quickly obtain as many resources as you need, but at the same time you don't want the slightest amount to go to waste. But how do you specify the amount of capacity by which to adjust, and how do you know when it's necessary?

Aashish begins by explaining how elasticity on Confluent Cloud has come a long way since the early days of scaling via support tickets. It's now self-serve and can be accomplished by dialing up or down a desired number of CKUs, or Confluent Units for Kafka. A CKU corresponds to a specific amount of Kafka resources and is consistent across all three major clouds. You can specify the number of CKUs you need via the API, CLI, or Confluent Cloud UI.

Ajit explains in detail how, once your request has been made, cluster resizing is a two-step process. First, capacity is added, and then your data is rebalanced. Rebalancing data on the cluster is critical to ensuring that optimal performance is derived from the available capacity. The amount of time it takes to resize a Kafka cluster depends on the number of CKUs being added or removed, as well as the amount of data to be rebalanced. 

Of course, to request more or fewer CKUs in the first place, you have to know when it's necessary for your Kafka cluster(s). This can be challenging, as clusters emit a large variety of metrics. Fortunately, there is a single composite metric that you can monitor to help you decide, as Ajit explains in the episode.

Other topics covered by the trio include an in-depth explanation of how Confluent Cloud achieves elasticity under the hood (separate control and data planes, along with some Kafka dogfooding), future plans for autoscaling elasticity, scenarios where elasticity is critical, and much more.


Kris Jenkins: (00:00)
What do you always want more of until you've had enough? Cake, obviously, but also computing resources. Scaling is today's topic. And on today's Streaming Audio, we're going to talk about scaling Kafka clusters, adding and removing capacity dynamically. And then doing that at scale, because we're not just resizing one cluster, we're going to resize a whole cloud's worth of them. To take us through it, we've got Ajit Yagaty and Aashish Kohli, who've been taming Kafka and Kubernetes to make the most seamless scaling experience they can. But before we begin, let me tell you that Streaming Audio is brought to you by developer.confluent.io, which is our educational site for Kafka.

Kris Jenkins: (00:43)
It can teach you something new, whether you're a complete beginner or an event systems guru. And while you're learning, you can easily get a Kafka cluster started using Confluent Cloud. Sign up with the code PODCAST100, and we'll give you $100 of extra free credit. And with that, I'm your host, Kris Jenkins, this is Streaming Audio. Let's get into it. My guests today both work on cluster elasticity and capacity management for Confluent Cloud. We have Aashish Kohli, who is a product manager, and Ajit Yagaty, who is an engineer for those teams. Gentlemen, welcome.

Aashish Kohli: (01:29)
Thanks, Kris.

Ajit Yagaty: (01:29)
Thank you.

Kris Jenkins: (01:30)
Good to have you here. So, we're going to talk about elasticity and specifically cluster elasticity. But maybe we should start with the word elasticity. Maybe I can give you my lay definition of it and you can tell me what I'm missing. I think of elasticity as two constraints. I never want to pay for more than I need and I always want to be able to buy as much as I want. Aashish, do you think that's about the size of it?

Aashish Kohli: (01:59)
Sure. I think you got that right, Kris. When we think of elasticity, it's table stakes for a cloud service. Any cloud service today has to have the elasticity capability fully satisfied. And the idea here is, just think of any utility that you use. You pay for a utility in terms of how much you use. You don't have extra capacity lying around all the time. Same way for a cloud service with elasticity. The idea is that you can very closely align how much you're using with how much you need, so that you never have excess wasted capacity, while at the same time, you don't have situations where you need that excess capacity and you don't have it available. So, it's the close alignment of your demand to how much is available to you as a consumer.

Kris Jenkins: (02:51)
It should be like turning on a tap in the kitchen. You want some, you get some, you turn it off. You can always get as much as you want and no more ideally.

Aashish Kohli: (03:02)
Exactly.

Kris Jenkins: (03:06)
Okay. Oh, I've just thought of another thing, which I think is important. There should be minimal manual user intervention, right?

Aashish Kohli: (03:14)
That is true. That goes into the domain of auto-scaling. Elasticity is basically the capability of being able to add or remove capacity. But removing manual intervention goes into the aspect of auto-scaling, which I think we can talk about later in the discussion as well. But yes, you're right.

Kris Jenkins: (03:35)
Okay. Let's translate this into our favorite world of Kafka. Give me an overview... Before we get into the cloud, what do I need to do when I'm hitting capacity problems? Where do I start?

Aashish Kohli: (03:52)
Start with Kafka you mean?

Kris Jenkins: (03:54)
Yeah. So my Kafka cluster is somehow reaching its limits, what limits am I going to see? And what would I, if I were managing it myself, do next?

Aashish Kohli: (04:04)
When you think of Kafka clusters in Confluent Cloud, we have this abstraction called a CKU, or Confluent Unit for Kafka. The way a CKU works is, it gives you some capacity along seven different dimensions. And the reason we have this notion of a CKU is because we've abstracted away the underlying infrastructure details that you might worry about if you're self-managing Kafka, whether open source or even running it on-prem. The idea is that when we give you this abstraction, you can expect the same kind of capacity or performance as you migrate across clouds.

Aashish Kohli: (04:42)
So, if you were self-managing Kafka on AWS versus GCP, you would have to go and try and pick different kinds of instances, whether they're memory optimized or compute optimized. And you would have to understand what performance you can get from those instances and what you should expect from your Kafka cluster. We've abstracted all of that away, so you, as a customer, don't have to worry about all of these differences between clouds. And we give you a CKU as a unit of capacity, and you can get X number of CKUs on your cluster. And that guarantees the same kind of performance across different clouds. So, you don't have to worry about all of those underlying details and you can just focus on getting the right capacity from your cluster.

Kris Jenkins: (05:20)
Okay. I see what you're saying. And that actually works in practice? I can stop worrying so much about the balance of disk capacity and CPU?

Aashish Kohli: (05:31)
Absolutely.

Kris Jenkins: (05:32)
It's got a fairly predictable workload in that way.

Aashish Kohli: (05:34)
Yeah, absolutely. That's the whole promise of us as a fully managed SaaS service, that we don't want you to worry about those underlying infrastructure details. We also have benchmarks that we publish saying, "This is a two CKU cluster and this is the kind of performance you can expect." But we also recommend customers run their own benchmarks, because the performance you can get from a cluster also depends on your workload. How many clients you have, how many partitions you're creating and how you're distributing that load. So we always recommend that you go with our benchmarks, but you should also run your own workloads and see what kind of performance you get.

Kris Jenkins: (06:12)
Okay. I was thinking in terms of CPU or disk, but you are saying I can start thinking about hitting some limits in this abstract unit. But what's the shape of a limit? I find that things are slowing down. I find that it's taking... Is it as simple as my throughput drops and I think, oh, I need some more capacity?

Aashish Kohli: (06:39)
Sure. The seven dimensions of a CKU, and let me see if I can remember them. I probably won't be able to rattle off all of them, but there's storage, which is becoming mostly unlimited across all three clouds because we've moved to tiered storage. Then there's ingress and egress, so throughput, and then the number of partitions in the cluster, connection attempts per second, request rate. And I forget, there's a seventh one. But the idea is that, for each of these dimensions, we specify that if you have a one-CKU cluster, you can get 4,500 partitions or you can get an ingress of 50 megabytes per second.

Aashish Kohli: (07:17)
As you up the number of CKUs, you get more and more from each dimension, just multiplying it by how many CKUs you have. So, as you're running a workload, it depends on how much you are consuming along each dimension. In general, you are not able to max out all of the dimensions at the same time. So, how much you are getting from your cluster on one dimension depends on your overall workload. And what we've done to make that easier for customers to understand is, we've recently launched what's called the Kafka cluster load metric. And that tells you what the overall load on your cluster is. It's a holistic metric that tells you the performance of your Kafka cluster.

Kris Jenkins: (08:05)
Where do I see that?

Aashish Kohli: (08:07)
You can see that in the UI, and it's also available from our metrics API. So, if you go to the Confluent Cloud UI, there's a capacity settings page for the Kafka cluster that has a gauge that tells you how much load you have on your cluster. It's a value between zero and 100. And it's available through our metrics API, which is how we really think most customers might end up using the load metric, by querying the metrics API and getting the value.
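
As a concrete illustration, here's a minimal Python sketch of querying the Confluent Cloud Metrics API for the cluster load metric Aashish describes. The endpoint, metric name, and payload shape are my best reading of the public Metrics API docs, not something stated in the episode, so verify them against the current API reference before relying on this.

```python
import requests

# Hypothetical values based on the public Metrics API docs; check before use.
METRICS_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"
LOAD_METRIC = "io.confluent.kafka.server/cluster_load_percent"

def get_cluster_load(cluster_id: str, api_key: str, api_secret: str) -> float:
    """Return the most recent cluster load as a 0-100 value, like the UI gauge."""
    query = {
        "aggregations": [{"metric": LOAD_METRIC}],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": cluster_id},
        "granularity": "PT1M",
        "intervals": ["PT5M/now"],   # the last five minutes
        "limit": 5,
    }
    resp = requests.post(METRICS_URL, json=query,
                         auth=(api_key, api_secret), timeout=30)
    resp.raise_for_status()
    points = resp.json().get("data", [])
    # The metric is reported as a 0.0-1.0 fraction (assumption); scale to percent.
    return points[-1]["value"] * 100 if points else 0.0
```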

Kris Jenkins: (08:36)
Okay. So, there are a few of these different metrics I would then need to watch when worrying about stepping up my capacity.

Aashish Kohli: (08:42)
Yeah. I think in general, when you're thinking of stepping up or stepping down capacity, it's the load metric you want to worry about. The other metrics, the seven dimensions of a CKU, are basically your usage. And you might get throttled along some of those dimensions if your load is approaching a high value of, say, 80 to 90%.

Kris Jenkins: (09:01)
Okay. I know you've been working on the elasticity of this, but tell me what the world was like before you started work. What would I do when I ran out of capacity?

Aashish Kohli: (09:17)
I can probably sit back and tell you about the journey we've taken in Confluent Cloud for our customers. I think years ago, when we first launched Confluent Cloud, we didn't even have the ability for customers to self-serve and provision their clusters. We had just launched the service, so we would provision clusters for them. The first step we took along that dimension was really giving customers the ability to self-serve and provision their clusters, through the UI, API and CLI. And then the next step for us was cluster elasticity, so expanding and shrinking clusters. Prior to launching the load metric or full elasticity, you would probably have to look at the CKU dimensions and see what the usage is. And if you were running out of headroom along some of those dimensions, you would probably contact us, contact our support team. And we would potentially do something to add capacity. But today, you have the ability to do that yourself as a customer. You can go to the UI or call our APIs and add or remove capacity depending on your usage.
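
To make the self-serve path concrete, here's a hedged sketch of resizing a dedicated cluster through the Confluent Cloud API. The cmk/v2 path and payload shape are assumptions drawn from public docs, not from the episode; verify against the current API reference.

```python
import requests

API_URL = "https://api.confluent.cloud/cmk/v2/clusters"  # assumed public endpoint

def resize_cluster(cluster_id: str, env_id: str, new_cku: int,
                   api_key: str, api_secret: str) -> dict:
    """Ask Confluent Cloud to expand or shrink a dedicated cluster to new_cku."""
    payload = {
        "spec": {
            "config": {"kind": "Dedicated", "cku": new_cku},
            "environment": {"id": env_id},
        }
    }
    resp = requests.patch(f"{API_URL}/{cluster_id}", json=payload,
                          auth=(api_key, api_secret), timeout=30)
    resp.raise_for_status()
    # The returned status stays in an expanding/shrinking phase until the
    # capacity change and data rebalance described later have completed.
    return resp.json()
```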

Kris Jenkins: (10:24)
So it would've been a support call before this work.

Aashish Kohli: (10:28)
Yes.

Kris Jenkins: (10:29)
Okay. That's bad for everybody.

Aashish Kohli: (10:32)
Yeah. We are-

Kris Jenkins: (10:33)
As much [crosstalk 00:10:34] nice to talk to engineers.

Aashish Kohli: (10:36)
Our goal as a fully managed SaaS service is that we want you to be able to control the aspects of your cluster, while we take over as much of the burden of running the service as possible, so that you can focus on your business. We never want you to have to reach out to support, unless it's a dire situation, like the service is down or something. But other than that, we want customers to be able to fully self-serve. And in most cases, not even have to worry about having to increase or remove capacity.

Kris Jenkins: (11:12)
Ajit, let me grab you for some technical details [inaudible 00:11:16] software engineer. Let's start with what you would've done behind the scenes if you got that support call to change it, and then you can start telling me how you automated this. What does it take to expand a cluster with Kafka?

Ajit Yagaty: (11:33)
I think before we get into that, one thing that we have to clear up is that Confluent Cloud is a Kubernetes shop. I mean, almost all of our infrastructure runs on Kubernetes, including the customers' Kafka clusters or KSQL clusters. So, all of those things also run on Kubernetes. Prior to actually having this specific feature, a customer would see that their load is increasing, they'd look at those specific metrics, and they'd decide that they have to expand the cluster. So they would open up a support ticket and that support ticket would come to the engineers. Typically, the on-call folks would be receiving that ticket. It was a manual process that we had. You had to update state in our database, then generate certain configurations, and then go ahead and actually bring up the compute, storage, and networking on the Kubernetes cluster.

Ajit Yagaty: (12:33)
In the Kubernetes world, we make use of a pattern called the operator pattern within Kubernetes; it allows us to extend the Kubernetes API. We have a service that does that. So, all of our Kafka clusters... A customer-created Kafka cluster or a KSQL cluster, they're all defined by an entity called a custom resource definition. This is again a Kubernetes concept, where you can define what your resource looks like. For us, a resource would be like a Kafka cluster or a KSQL cluster. And let's say, if you're talking about Kafka itself, you would specify the number of brokers that need to be part of the cluster, how much storage each broker should have, what our networking story is there, and the configuration of the Kafka cluster itself.

Ajit Yagaty: (13:22)
So all of those things are encapsulated in this custom resource, which we would generate, and then this operator service that I spoke about would go ahead and expand the Kafka cluster for the customer. The customer's cluster would be expanded, and we would monitor it until it was all done. And by the way, there's also a data balancing part of it. You're adding new resources to it, so you need to balance the data out, or the load out, across all the new brokers that have come up. So we need to do that, and once that is all done, we would actually let the customer know. We would close the support ticket saying, hey, your cluster has been expanded, and the customer would get to know. So, as you can see, this was a multi-day ordeal, multiple teams had to come together to make this all happen, which is a considerable toil. That's how we used to do it earlier.
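
For listeners unfamiliar with custom resources, here's an illustrative sketch of creating one with the Python Kubernetes client. The group, version, kind, and field names are invented for the example; they are not Confluent's actual internal CRD, which isn't public.

```python
from kubernetes import client, config

config.load_kube_config()

# A hypothetical custom resource describing a Kafka cluster, in the spirit
# of the spec Ajit describes: broker count, per-broker storage, networking.
kafka_cluster = {
    "apiVersion": "example.confluent.io/v1",   # invented group/version
    "kind": "KafkaCluster",                    # invented kind
    "metadata": {"name": "customer-abc-kafka", "namespace": "tenants"},
    "spec": {
        "brokers": 6,                  # desired broker count for this CKU size
        "storagePerBrokerGi": 500,     # storage attached to each broker
        "networking": {"type": "private-link"},
        "kafkaConfig": {"num.partitions": "6"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="example.confluent.io", version="v1",
    namespace="tenants", plural="kafkaclusters", body=kafka_cluster,
)
```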

Kris Jenkins: (14:16)
Okay. Crikey. We should split that up into pieces, because I definitely want to talk separately about the kind of logical rebalancing once you've got the resources. I want to dig into that. But I know precious little about Kubernetes beyond the marketing. So take me through how you changed that into an automatic process.

Ajit Yagaty: (14:39)
Kubernetes is a container orchestration platform that's almost the cloud-native standard these days. It has a simple yet powerful pattern called the controller pattern, where you have a specific resource. You define the desired state of that resource declaratively, typically in a YAML file. And you present it to Kubernetes saying, this is how my resource should look. That is an input to Kubernetes, and there'll be a controller running specifically for that resource. That controller would get that input. And what it actually does is, it compares the actual state of that resource with the desired state. And if the actual state has deviated from it, it would try to bring the actual state towards the desired state. It's a continuously running process. If the actual world deviates from the desired world, it'll bring it back to the desired state. So, it's a declarative API that is there.

Ajit Yagaty: (15:43)
That's the nicety about Kubernetes. And for us, specifically, one of the value adds from Kubernetes is that it serves as a cloud provider abstraction layer. We are available on AWS, Azure and GCP. We don't ever have to call the cloud provider APIs ourselves to get the resources. We use Kubernetes. Kubernetes abstracts that all out, depending upon which cloud provider it is running on. We have a standard Kubernetes API where we would say, "Hey, get me some compute." And Kubernetes underneath would talk to the right cloud provider API to get the resources for us. So, that's one of the value adds that we get out of Kubernetes. Kubernetes in general has its rough edges, but I think the benefits that we get out of it far outweigh the pain of dealing with its idiosyncrasies. That's our standard, that's the infrastructure layer that we have. And we put it to good use. We extend the Kubernetes API to manage our resources.
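
Here's a toy Python sketch of the controller pattern Ajit describes: compare the desired state of a resource to its actual state and converge. All the cluster operations are stubs invented for illustration, and a real controller reacts to watch events from the Kubernetes API rather than sleeping and polling.

```python
import time

def read_custom_resource(name: str) -> dict:
    return {"brokers": 9}        # stub: the declared desired state

def observe_cluster_state(name: str) -> dict:
    return {"brokers": 6}        # stub: what is actually running

def add_brokers(n: int) -> None:
    print(f"provisioning {n} broker pod(s), storage, networking")

def remove_brokers(n: int) -> None:
    print(f"draining and tearing down {n} broker pod(s)")

def reconcile(name: str) -> None:
    """Drive the actual broker count toward the desired one."""
    desired = read_custom_resource(name)
    actual = observe_cluster_state(name)
    diff = desired["brokers"] - actual["brokers"]
    if diff > 0:
        add_brokers(diff)
    elif diff < 0:
        remove_brokers(-diff)

if __name__ == "__main__":
    while True:
        reconcile("customer-abc-kafka")
        time.sleep(30)   # stand-in for reacting to watch notifications
```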

Kris Jenkins: (16:54)
Presumably in the old days you were just changing the controller definition on a per-customer basis, and that got you to first base?

Ajit Yagaty: (17:01)
Exactly. The way things are laid out... Maybe we can talk about the overall cloud architecture that we have, then that would actually tie the story together. Like I said, everything runs on Kubernetes, and Confluent Cloud itself is a decoupled architecture, with two main components. One of them is a centralized control plane. It's a global control plane, with its own database where it stores its state. And the control plane is responsible for orchestrating the provisioning requests and de-provisioning requests. With the advent of the elasticity features that we added, those are also orchestrated by the control plane. So let's say, as a user, you are interacting with the Confluent Cloud UI or the CLI or API, all of those requests funnel into the control plane.

Ajit Yagaty: (18:02)
That's where the journey starts. And then we have something called a data plane. The control plane is referred to as the mothership in our parlance, and then we have this data plane, which is again Kubernetes clusters. That's where the customers' clusters would be running. A customer's Kafka, KSQL, or Connect cluster would be running in the data plane. And now these two disjoint entities are tied together by a Kafka bus. We dogfood our own technology there. So the Kafka bus is responsible for ferrying the instructions from the control plane to the data plane. It also shunts the status updates back to the mothership from the [crosstalk 00:18:46]. That's a very high level overview of the Confluent Cloud architecture itself.

Ajit Yagaty: (18:54)
And within the data plane, that is where the operator service component runs. That is the component that extends the Kubernetes API. We have a CRD definition, which encapsulates what a Kafka cluster should look like. Like I was explaining, details about compute, storage, networking, all of that is encapsulated in it. Now, for every single customer cluster that we have, there'll be one instance of that definition. We can think of the CRD as a class in the OOP paradigm, and the custom resource is like an instance of that class. So we have one instance for every single customer cluster that we maintain.
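
A minimal sketch of the "Kafka bus" idea: the control plane publishes the updated desired state, and the data plane consumes it. The topic name, message shape, and bootstrap address are invented for illustration; only the pattern (instructions down, status back, over Kafka) comes from the conversation.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "mothership-bus:9092"})  # hypothetical

spec_change = {
    "cluster_id": "lkc-abc123",
    "resource": "KafkaCluster",
    "spec": {"brokers": 9, "storagePerBrokerGi": 500},  # expanded from 6 brokers
}

producer.produce(
    topic="control-plane.desired-state",   # hypothetical instruction topic
    key=spec_change["cluster_id"],         # key by cluster for per-cluster ordering
    value=json.dumps(spec_change),
)
producer.flush()
```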

Kris Jenkins: (19:43)
Is there one class, if I can use the same term, for all customers, or are there slightly different ones for different needs?

Ajit Yagaty: (19:56)
The fields in the class are common across all types of resources that we support, but then you would instantiate one for every single customer's cluster. As a customer, if I have 10 clusters, there would be 10 instances of that CRD running in our infrastructure. So, it's-

Kris Jenkins: (20:15)
Do you have, like, an enterprise class that gets instantiated for enterprise customers?

Ajit Yagaty: (20:22)
Exactly. Basically, it depends upon... Like you said, it's an enterprise class. We have different types of offerings. As a customer, if I'm using the UI, I could create a standard Kafka cluster, which is essentially a multi-tenant offering. You would get a sliver of that big cluster that we run, and you would interact with that. The other option is that you could create a dedicated cluster, where the entire cluster is just designated for a given customer. Of these two, for the multi-tenant offering, there'll be one instance of that CRD representing that multi-tenant cluster that we run internally. And for every dedicated cluster that gets created, there will be a separate instance managing just that Kafka cluster.

Kris Jenkins: (21:14)
Okay.

Ajit Yagaty: (21:17)
And the configuration that we add within it differs depending upon what type of cluster it is. A Kafka cluster's configuration will be completely different from a KSQL cluster's configuration. But the class that we have encapsulates all of these things. And the data that we store within the objects themselves differs based on the cluster type.

Kris Jenkins: (21:44)
That seems like a lot to figure out, both technically and from a product perspective, Aashish, deciding what shapes you want to offer to people.

Aashish Kohli: (21:53)
Yeah. I think the idea there is that, we want to give people the right kind of capability. If customers need dedicated clusters for certain reasons, whether it be compliance or regulatory, we want to make sure that they have that available. While at the same time, our multi-tenant clusters are fairly powerful to handle a lot of workloads.

Kris Jenkins: (22:20)
The scalability part, you're changing that instance... You are flicking a switch that changes which instance you've got of that class, right?

Ajit Yagaty: (22:31)
Yeah. That's [crosstalk 00:22:32].

Kris Jenkins: (22:31)
And then Kubernetes just magically fills in the rest of the details. Is that right?

Ajit Yagaty: (22:38)
Yeah. There is some amount of shepherding that we have to do. In the sense that, yes, we need to change the custom resource definition. Let's say you were running a two-CKU cluster; that maps to our internal representation in terms of number of brokers. And let's say you go from two CKUs to three CKUs, then that translates to the additional brokers that are needed, additional storage, networking capability, and the compute that is needed. So all of those things get encapsulated, the CRD gets modified. Maybe we should just walk through the whole flow, that way it's clear. So as a user, I jump onto the Confluent Cloud UI. Let's say I just choose the UI to be my initiating point.

Ajit Yagaty: (23:23)
And in the UI, I say, "Hey, expand my Kafka cluster from two CKUs to three CKUs." And I hit the expand button. And that request comes to the control plane. At the control plane level, what we do is... Like Aashish was explaining, the dimensions of metrics that we maintain, those are actually quotas for us internally. We have certain things like, for every CKU we can actually support 4,500 partitions. And there's ingress and egress throughput that we can support for a given cluster. If it is an expansion, then we have to update those quotas, because the customer is asking for more resources, so naturally the quotas will also get updated.
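
Since quotas scale linearly with CKUs, the update is simple arithmetic. This back-of-the-envelope sketch uses the two per-CKU numbers mentioned in the episode (4,500 partitions and 50 MB/s ingress); the other dimensions and exact values vary, so treat it as illustrative.

```python
# Per-CKU quota figures taken from the conversation; illustrative only.
PER_CKU = {
    "partitions": 4500,
    "ingress_mbps": 50,
}

def quotas_for(ckus: int) -> dict:
    """Quotas scale linearly with the number of CKUs."""
    return {dim: ckus * value for dim, value in PER_CKU.items()}

print(quotas_for(2))  # {'partitions': 9000, 'ingress_mbps': 100}
print(quotas_for(3))  # {'partitions': 13500, 'ingress_mbps': 150}
```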

Ajit Yagaty: (24:11)
And then the configuration also gets updated, because we are now saying, "Oh, you'll be adding more brokers into the cluster, so those configuration details will get updated." Then the control plane would store this information in its database. Then we have this event-driven approach, where once the state is stored in the mothership DB, based off of that state, using change data capture, we generate the CRD instance that I mentioned, the custom resource definition with the new details in it. And that gets pushed down to the data plane, which is the satellite, through the Kafka bus. And that CRD arrives in the Kubernetes cluster itself. Once that is there, the operator service that is running, which essentially runs the controller for this resource, gets notified that a change has happened.

Ajit Yagaty: (25:14)
Then the operator would get in and figure out what exactly has changed. In this case, we are expanding, so we need to bring up more compute. We have to allocate more storage and also set up the networking part of it. The operator goes ahead and does that. In Kubernetes, compute is represented by a pod; that's where the Kafka broker would run. It's containerized. A pod can actually run multiple containers, so one of the containers would be the Kafka broker itself. It'll go ahead and bring up those new brokers. As part of this bring-up, it'll also attach the storage that is necessary. And the networking part of it is taken care of. Now, at this juncture, now that the resources have been brought up, we need this additional step of balancing the data. That's where-

Kris Jenkins: (26:09)
Let me just check I've got the first bit, and then I want to move onto the logical rebalancing. So you're saying that the control plane is writing a YAML file and sticking it in a topic. And eventually, the data plane is going to pick up its new YAML file and just do as it's told.

Ajit Yagaty: (26:23)
Exactly. At the very high level, that's exactly what happens.

Kris Jenkins: (26:29)
Okay. That's cool. Now you are in the position where you've got a cluster that has a lot more capacity, but there's a logical step. You've now got to make use of that capacity.

Ajit Yagaty: (26:39)
Exactly.

Kris Jenkins: (26:40)
Take me through that.

Ajit Yagaty: (26:43)
One of the features... This is where we make use of a feature in Kafka called Self-Balancing Clusters. We put that to good use here. The name suggests that it's a self-balancing feature. One of the cool things about it is that, when you actually add new brokers into the cluster, it has a way to figure that out. Typically, in the ZooKeeper world, these new brokers would register with ZooKeeper. Then the SBC component running inside the brokers themselves would know that new brokers have arrived in the cluster. And then what it does is, it adds them into the cluster. It's called an add-broker step.

Ajit Yagaty: (27:28)
What it essentially does is, it figures out what the current utilization of the cluster is, and that there are new resources that have been added to it. So it generates a plan to figure out what the rebalancing should look like. Then it starts moving the data over to the new brokers that have been initialized. This happens automatically. The moment Kubernetes brings up the new resources from the data plane services, SBC kicks in. It starts the process of rebalancing the data. While this is happening, from a customer's point of view, the cluster is still not completely rebalanced. So we wouldn't tell the customer that their cluster is already expanded.

Ajit Yagaty: (28:17)
As a customer, in the UI I would see expanding as my status. So now the operator would have to poll for the progress. The rebalancing means that we are actually moving data around. While that is happening, the operator actually has to wait for the rebalancing to complete. And once the rebalance is completed, that's when we send status information back to the control plane, [inaudible 00:28:46] saying that the rebalancing is all done now. Then at that point in time, there'll be a notification sent out to the customer saying that your cluster has actually expanded. And the UI would also show the status accordingly.
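
A sketch of that shepherding step: the operator waits for the self-balancing rebalance to finish before reporting "expanded" back to the control plane. Both helper functions are stand-ins for internal APIs, not anything public.

```python
import time

def get_rebalance_status(cluster_id: str) -> str:
    # Stub: would ask SBC, returning "IN_PROGRESS" until the rebalance is done.
    return "COMPLETED"

def notify_control_plane(cluster_id: str, status: str) -> None:
    print(f"{cluster_id}: {status}")   # stub: would publish on the Kafka bus

def await_rebalance(cluster_id: str, poll_seconds: int = 60) -> None:
    while get_rebalance_status(cluster_id) == "IN_PROGRESS":
        time.sleep(poll_seconds)       # the customer still sees "expanding" here
    notify_control_plane(cluster_id, "EXPANDED")   # billing switches over only now

await_rebalance("lkc-abc123")
```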

Kris Jenkins: (28:59)
Okay. Makes sense.

Ajit Yagaty: (28:59)
One of the key things to remember here is that... Like Aashish was saying, the customer would be charged for the resources that they use. So for an expansion, we would start charging them only after we have rebalanced the cluster completely. Even though we have brought up the resources upfront, we don't charge for that; we charge them towards the end. For a shrink, it's the other way around: we reduce their bill upfront. That's the lifecycle. Once the operator figures out that everything is rebalanced, it will send a notification back to the control plane, and the control plane will then notify the user saying that the expansion is completed.

Kris Jenkins: (29:43)
Okay. Out of interest, how does it happen the other way around? What's the flow of commands and control [crosstalk 00:29:50]?

Ajit Yagaty: (29:51)
Predominantly, the flow remains the same. I think the magic is in the operator component. The interaction with Kafka is where the heavy lifting is. Again, a customer clicks a button saying shrink my cluster, and we update the quotas again, because we are reducing the size, so naturally we have to update the quotas for the CKUs that have been chosen. Then we send out a notification to our billing service to start charging this customer less, because they've shown their intent to reduce their cluster size. So it's favorable to the customer that way. Then the request flows down to the data plane. Here, in the shrink part, first we have to move the data out of the existing brokers.

Ajit Yagaty: (30:45)
We can't just bring them down, because that would lead to availability issues. So we wait for the data to be moved out. Again, we use the self-balancing cluster feature here. For this, both the SBC team and our team worked together to come up with APIs that make this integration much smoother. One of the guarantees that is needed is that any of these things can go down at any point in time. It's a distributed system. And in the interactions between the operator and SBC, there are some guarantees that we would need. One of the things that we have ensured with these APIs is that they're idempotent in nature.

Ajit Yagaty: (31:38)
Because you call an API and the operator can go down and come back up, for whatever reason. And the operator picks up... It just stores enough state so that it can restart the state machine of any of these operations that are being performed. If those APIs are idempotent in nature, that is helpful. That is one thing that we have built into the system. And the other thing is that we remove multiple brokers at a time. Because a single CKU can actually map to any number of brokers, it makes sense to remove multiple brokers at the same time. Let's say, just for example's sake, one CKU maps to 10 brokers. This is just an example. The user's intent is to remove all 10 brokers. Now, if you remove them one by one, there's a lot of wasted effort going on. We keep [crosstalk 00:32:37].

Kris Jenkins: (32:37)
If you rebalance 10 different times.

Ajit Yagaty: (32:38)
Exactly. So it takes a long period of time, and also we are moving the same data over and over. That's why we try to remove multiple brokers at the same time. And this API that we... The SBC folks did a wonderful job there. One of the design aspirations was that this API to remove multiple brokers at the same time would be atomic. If anything fails in between, none of them will be removed. But if it succeeds, all the brokers will be removed successfully.
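
A sketch of what idempotent-plus-atomic removal buys the operator: the same call can be retried safely after a crash, and either every broker in the batch is removed or none are. remove_brokers here is a stand-in for the internal SBC API, not a public interface.

```python
import time

class TransientError(Exception):
    """Stand-in for a recoverable failure (operator restart, timeout, etc.)."""

def remove_brokers(broker_ids: list[int]) -> None:
    # Stub for the atomic SBC call: all of broker_ids are removed, or none.
    print(f"removing brokers {broker_ids} as one batch")

def remove_with_retry(broker_ids: list[int], attempts: int = 5) -> None:
    for attempt in range(attempts):
        try:
            remove_brokers(broker_ids)   # idempotent: safe to call again
            return
        except TransientError:
            time.sleep(2 ** attempt)     # back off, then resume from stored state
    raise RuntimeError("removal did not complete")

remove_with_retry([7, 8, 9])   # one CKU's worth of brokers in a single batch
```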

Kris Jenkins: (33:07)
That's going to be hard to...

Ajit Yagaty: (33:07)
I'm sorry.

Kris Jenkins: (33:07)
That's going to be hard to implement.

Ajit Yagaty: (33:14)
Yeah. That's why I said I think the heavy lifting was done on the SBC side, so the operator's life became easier because of that. There are very few things to reason about. If an API fails, we know exactly what has happened and we can just continue from where we left off. So in the shrink part, again, we call the remove-broker API from the SBC world to remove the brokers. I mean, to move the data over first. Once all the data has been moved over, we bring down the brokers in that particular cluster. One thing to keep in mind is that, while this is all happening, we don't want newer data to go onto the brokers that are being removed. So SBC again provides this exclusion part.

Ajit Yagaty: (33:59)
The moment we say, "Hey, remove these brokers from the cluster," SBC would ensure that no new topic creation or configuration changes would place newer data onto the brokers that are being removed. That way, there's only a finite amount of movement that we have to do. Otherwise, you can be in this phase where you're trying to move data out, but there's new data arriving, and you'll never converge. That is another feature that is quite important here. The overall flow is just that: you exclude and remove the brokers, you wait for the removal to complete, and once that is all done, you bring down the storage, compute, and networking for those brokers. And then you mark the operation as successful and report it back.
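
Putting the shrink sequence together as described: exclude the outgoing brokers from new replica placement, wait for SBC to drain them, then release their compute, storage, and networking. Every function is a stand-in for an internal step, not a real API.

```python
def exclude_from_placement(broker_ids: list[int]) -> None:
    print(f"no new partitions will land on {broker_ids}")

def drain_completed(broker_ids: list[int]) -> bool:
    return True   # stub: SBC reports when all data has moved off these brokers

def release_resources(broker_ids: list[int]) -> None:
    print(f"tearing down pods, volumes, networking for {broker_ids}")

def shrink(broker_ids: list[int]) -> None:
    exclude_from_placement(broker_ids)   # bounds the data movement so it converges
    while not drain_completed(broker_ids):
        pass                             # in practice: poll or wait on status events
    release_resources(broker_ids)        # only after the data is safely elsewhere

shrink([7, 8, 9])
```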

Kris Jenkins: (34:45)
Okay. On the way down, it's actually structurally very similar, except the controller needs to know to rebalance first and then take the resources away.

Ajit Yagaty: (34:56)
Exactly. [crosstalk 00:34:56].

Kris Jenkins: (34:56)
And most of that, it actually delegates to the SBC.

Ajit Yagaty: (34:59)
Yeah. The heavy lifting, the actual data movement, is done by SBC. The operator just shepherds the whole process. The operator waits for the API to say it's all done, and then it goes ahead and removes the Kubernetes resources that are not needed anymore. So, in all of this, if you think about it, the data balancing part is quite heavy, in the sense that it is a function of how much data you have. One of the nice things about Confluent Cloud is that we have this tiered storage feature. Only a hot set will be stored locally. That means that the amount of data stored locally on the storage attached to every single broker is very small. And the whole-

Kris Jenkins: (35:44)
Define very small.

Ajit Yagaty: (35:49)
This is Kafka; the typical use-case is an event streaming kind of thing. In such use-cases, there's a producer and a consumer. And the consumers are mostly interested in the latest data that has arrived. So we define that as the hot set. Let's say the last 13 minutes of data is what we are interested in. So only the last 13 minutes of data will be stored locally. And beyond that, everything gets tiered to an object store in the backend. So the amount of data stored locally for every single broker will be small. And this aids in the data movement too. We can be done very quickly. Removal or addition of new brokers will be very quick, because there's very little data to move around. So, that is another feature that makes the elasticity story quite good. Otherwise, there would be huge amounts of data to move, and the amount of time it takes to handle these requests would be much larger.
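
Rough arithmetic on why the hot set matters, using the numbers from the conversation (50 MB/s of ingress per CKU, a 13-minute hot set). These are illustrative figures, not Confluent's actual sizing.

```python
# How much data actually has to move during a rebalance with tiered storage.
ingress_mb_per_s = 50          # one CKU's worth of ingress, per the episode
hotset_seconds = 13 * 60       # only the recent "hot set" stays on local disks

local_hot_set_mb = ingress_mb_per_s * hotset_seconds
print(f"~{local_hot_set_mb / 1024:.1f} GiB to rebalance locally")   # ~38.1 GiB

# Without tiering, a week of the same ingress would all live on the brokers:
week_mb = ingress_mb_per_s * 7 * 24 * 3600
print(f"~{week_mb / 1024**2:.1f} TiB if nothing were tiered")       # ~28.8 TiB
```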

Kris Jenkins: (36:58)
Cool. Okay. That relates back to the product end of it. Aashish, how long does this process take? What's the user experience of this?

Aashish Kohli: (37:09)
I think per CKU, it can take a couple of hours to expand or shrink the cluster. And it varies. As Ajit mentioned, if you step back and think of this whole expansion or shrink process, there are two main steps. One is the addition or removal of brokers, and the second step is this data rebalancing before we tell the customer, "Your cluster is ready with the expanded or removed capacity." The cluster is available through this whole process. But the second step is very critical. The time it takes for a cluster to expand or shrink really depends on the size of the cluster.

Aashish Kohli: (37:45)
So, how many CKUs you're adding or removing, because that determines the number of brokers you need to add or remove. That's one step that takes time. And the second step is, how much data is there on the cluster? As Ajit mentioned, the concept of tiered storage, decoupling compute and storage, allows us to have much less data on the brokers that needs to be rebalanced. So it's definitely made things much faster. But even so, depending on one cluster versus another, the amount of data on the brokers that needs to be rebalanced will affect how long the overall expansion or shrink process takes to complete.

Kris Jenkins: (38:23)
But it's a matter of hours rather than days.

Aashish Kohli: (38:26)
Yeah.

Kris Jenkins: (38:26)
[crosstalk 00:38:26].

Aashish Kohli: (38:26)
It's definitely not days. We are always optimizing things to make it faster and faster, but if you think about it, in these two steps, we could... Let's say you're expanding a cluster. As a service, we could really say, "Hey, you're going from two CKUs to four CKUs, we've added these brokers and here's your cluster that's expanded." And that would be super fast, much faster than having to rebalance the cluster and saying expansion is done. But in that case, we would be doing our customers a disservice. The idea is that, we are a fully managed SaaS service. We want you to focus on your business, it's our job to give you a cluster that performs well. So that's why we add the brokers, we rebalance the data. We don't want to give you a cluster where you need to then go ahead and rebalance the data to optimize it for performance. That's why it takes a little bit longer, but it's the right customer experience where we take care of both of these steps rather than having the customer worry about any of the operational aspects.

Kris Jenkins: (39:23)
That makes sense. Speaking of customers, you probably can't tell me specific customers. But tell me, generally, how customers are using this out in the real world.

Aashish Kohli: (39:34)
Pretty much all customers using a cloud service that have any sort of variable workload will want to use cluster expansion or shrink. The scenarios fall into certain categories. I think Black Friday comes to mind as the most obvious use-case, where retail customers might want to add more capacity going into Black Friday, and then after the season is done... So, there's a seasonality aspect to it. And after the season's done, they might want to shrink back down to their average workload. At the same time, there are other aspects: maybe airlines going into the summer might want to do something like this, because in the summer-

Kris Jenkins: (40:21)
It's a very wide season.

Aashish Kohli: (40:23)
Yeah. Travel season and stuff. Gaming companies might want to do it right before a game launch, because when a game launches, there's all this interest. And then over time, they can reduce their capacity. I think going back to the timing aspect of this, how long it takes. I think when we talk to our customers, we tell them not to think of this as a daily or hourly experience, where you're just adding and removing capacity all the time. You have to plan a little bit, because adding capacity, rebalancing the data does take at least a couple of hours. And the more capacity you're adding, the longer it'll take. So I think we want them to think about this more in terms of days or weeks, than just like every hour you are increasing or removing capacity.

Kris Jenkins: (41:07)
Okay. Is that going to take us into a future where things are a little bit more fine-tuned?

Aashish Kohli: (41:14)
Again, I think I touched on this earlier in the conversation. When we think of our customers' journey, we've given them the ability to self-serve their clusters. We've also now given them the ability to expand and shrink their clusters. And that experience is the same from the UI, APIs or CLI, whichever you might want to choose. Most customers, we think, will likely use the APIs to include this expansion and shrink in their overall workflow. And we could stop here. We have given customers the ability to create clusters. We've given them the APIs, they can program these APIs. They can watch the load metric and they can expand and shrink their clusters. So technically, we could say we are done, we've given customers all of these primitives. But again, thinking of the customer's experience, a customer that wants to do auto-scaling, that doesn't want to have to worry about tracking the cluster load and expanding and shrinking, would have to instrument something.

Aashish Kohli: (42:15)
Where they stitch together these APIs and say, "Hey, by the way, when my cluster load goes above 80% and stays there for 24 hours or 72 hours, add two CKUs to expand the cluster." That would then be on them. But that would be a disservice to them, so we don't want to stop here. Our next step, which we are thinking about, is what we refer to as policy-based scaling or auto-scaling. And the idea there is that, customers can define some minimum or maximum CKUs for their cluster. And we can, based on the load, basically say, "Anytime the load drops below 30% and stays there for some time, we shrink the cluster down by a CKU or two CKUs. If the cluster load has been over 70% and sustained for some time, we start expanding the cluster." So, that's auto-scaling, that customers can set their policies and we'll do it for them.
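
A sketch of the kind of policy loop Aashish describes, reusing the hypothetical resize_cluster sketch from earlier. The bounds, thresholds, and "sustained for some time" window mirror the examples in the conversation; none of this is a shipped Confluent feature.

```python
MIN_CKU, MAX_CKU = 2, 10                  # customer-defined bounds
EXPAND_ABOVE, SHRINK_BELOW = 70.0, 30.0   # load thresholds, percent

def decide(load_history: list[float], ckus: int) -> int:
    """Return the new CKU target given recent load samples (e.g., hourly)."""
    if min(load_history) > EXPAND_ABOVE and ckus < MAX_CKU:
        return ckus + 1           # load sustained high for the whole window
    if max(load_history) < SHRINK_BELOW and ckus > MIN_CKU:
        return ckus - 1           # load sustained low for the whole window
    return ckus

current = 4
target = decide([82.0, 85.5, 79.1, 88.0], current)
if target != current:
    # resize_cluster(cluster_id, env_id, target, api_key, api_secret)
    print(f"resize from {current} to {target} CKUs")
```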

Kris Jenkins: (43:10)
Cool. What's your timeline for that, or you're not giving deadlines?

Aashish Kohli: (43:15)
As a product person, I am very careful about giving timelines, but that's something we are actively-

Kris Jenkins: (43:20)
Very wise.

Aashish Kohli: (43:20)
... thinking about. I think our goal is to, as I said, take over all of the burden of running the service or tuning the service from our customers. So, it's definitely something we are very strongly thinking about, but I cannot go into timelines, unfortunately.

Kris Jenkins: (43:38)
Don't make commitments. That's fair. It does make me want to check one technical thing, and either of you can field this. If you are in this world where it's automatically shrinking and growing, or today when you're manually doing it, does the process of shrinking or growing affect performance? In that, if I say grow my cluster, will it slow down a bit before it speeds up?

Aashish Kohli: (44:06)
Well, I can say something and then maybe Ajit can chime in. One thing I want to call out about expansion and shrink, which Ajit touched on, is the billing aspect: when we expand a cluster, we start billing at the higher CKUs at the end of the process. When we shrink, we start billing at the lower CKUs at the beginning of the process. So, it's very customer friendly. At the same time, when we think of cluster expansion, we can be a little bit more aggressive about it, because we are adding capacity. From the cluster's performance perspective, we are adding capacity.

Aashish Kohli: (44:41)
So we can add more capacity faster. When we are shrinking a cluster, we try to be a little bit more conservative, because again, we don't want to affect the load on a cluster while we are doing these operations. So, when we are shrinking your cluster, we'll warn you in terms of the cluster load being high or certain limits that you might be hitting. And there are safeguards in place to prevent changes that will affect the performance of the cluster. Other than that, while the process is happening, the cluster is available and running. So there should not be any noticeable impact on the cluster's performance in terms of the capacity that's available. I don't know, Ajit, if you want to chime in [crosstalk 00:45:26].

Ajit Yagaty: (45:27)
Let's say you're expanding. From the UI, let's say you initiated an expansion. The UI says that now the cluster is expanded. That's when we expect the customers to start pushing more data. But while this is happening, we are rebalancing stuff. So in Kafka, the way people store data is that there is a topic, and every single topic has a number of partitions. And these partitions are evenly balanced across the number of brokers that you have. So as part of the rebalance, we would be moving some partitions over to the new brokers. Once the movement actually happens... So from a client's perspective, when they're trying to access that...

Ajit Yagaty: (46:10)
We have smart clients that figure out, oh, the partition does not reside on broker number X, we have to go to broker number Y. So the client figures that out and goes to the right broker. But that also means that for that particular partition, if there were any performance deficiencies you were seeing, you can start pushing more data into the cluster at that point. So, it depends upon where we are in the rebalancing phase. That's why we say that once the cluster says it has expanded, that's when you'll be able to realize the full potential of the cluster.

Kris Jenkins: (46:50)
So there is going to be a slight rebalancing phase, just like you would if you added more consumers [crosstalk 00:46:56].

Ajit Yagaty: (46:56)
Yeah. So whatever performance they were getting before, they would continue to get. It's only that the guarantee of getting more performance out of the cluster kicks in after we say the cluster is completely expanded.

Kris Jenkins: (47:12)
That makes [crosstalk 00:47:13].

Ajit Yagaty: (47:13)
The rebalance completed.

Kris Jenkins: (47:14)
Okay, cool. It's obvious why it's an important feature and I can see it's usefully done. And I think I've got a good sense of how I could be dangerous enough to try and implement it myself, in the mythical business that exists inside my head. Thank you very much. That was very enlightening. Ajit, Aashish, thank you very much for joining us on Streaming Audio. And I hope you'll join us another time.

Aashish Kohli: (47:42)
Thank you, Kris.

Ajit Yagaty: (47:43)
Thanks a lot. Thanks for having us. Cheers.

Kris Jenkins: (47:46)
That brings us to the end of another episode of Streaming Audio. The episode with so much scale, we needed to double the number of guests. If that leaves you wanting to kick the tires of cluster scaling, you can give it a try at Confluent dot cloud. Sign up with the code PODCAST100, and we'll give you $100 of extra free credit. And if you're still learning how this world all fits together, well, take a look at Confluent Developer, which will teach you everything you need to know about Kafka and event streaming. And if it doesn't, let us know and we'll add it. For that or anything else you want to let us know, head to the show notes, for our contact details and my Twitter handle. Or if you're watching this on YouTube, leave a comment or click one of the friendly thumb icons they have, we'd love to hear from you. And with that, it remains for me to thank Aashish Kohli and Ajit Yagaty for joining us, and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time.

Intro
What is elasticity?
Elasticity in the Cloud
Kafka cluster performance metrics
Self-service ability
What does it take to expand a cluster with Kafka?
Confluent for Kubernetes
Architecture Overview
Self-balancing cluster
Cluster data rebalancing
Cluster expansion/shrink behaviors
User experience
What's next
It's a wrap