Streaming Audio: Apache Kafka® & Real-Time Data

Using Apache Kafka as the Event-Driven System for 1,500 Microservices at Wix ft. Natan Silnitsky

September 21, 2020 Confluent, original creators of Apache Kafka® Season 1 Episode 119

Did you know that a team of 900 developers at Wix is using Apache Kafka® to maintain 1,500 microservices? Tim Berglund sits down with Natan Silnitsky (Backend Infrastructure Engineer, Wix) to talk all about how Wix benefits from using an event streaming platform. 

Wix (the website that’s made for building websites) is designing a platform that gives people the freedom to create, manage, and develop their web presence exactly the way they want as they look to move from synchronous to asynchronous messaging. 

In this episode, Natan and Tim talk through some of the vital lessons learned at Wix through their use of Kafka, including common infrastructure, at-least-once processing, message queuing, and monitoring. Finally, Natan gives Tim a brief overview of the open source project Greyhound and how it's being used at Wix. 

Tim Berglund (00:00):
You've probably heard of Wix.com, but did you know they're a Kafka shop? They are, and they use it, as we say, at scale. I talked to Wix backend infrastructure engineer Natan Silnitsky about how he serves 900 developers maintaining 1,500 microservices and what lessons the organization has learned. All on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the Cloud.

Tim Berglund (00:31):
Hello, and welcome to another episode of Streaming Audio. I am, as ever, your host Tim Berglund, and I'm joined in the studio today by Natan Silnitsky of Wix. Natan, welcome to the show.

Natan Silnitsky (00:45):
Hi, Tim. Thank you so much for having me.

Tim Berglund (00:49):
You bet. First of all, I'm going to assume Wix is a company that needs no introduction. It's a website builder. It's Wix.com. If you don't know what it is, go look at it. I am personally not a Wix user, but I can say I am married to a Wix user, so this is a thing that gets spoken of in my own house. So, exciting for me to have you on.

Tim Berglund (01:14):
Natan, you're, by your own description, a backend infrastructure developer at Wix, but tell us a little bit about what you do and how you came into the job. What did you do before this, and how did it lead to what you do?

Natan Silnitsky (01:28):
Sure. So I started off at Wix working on Wix payments solutions, so how those site visitors pay those site owners' stores or restaurants, stuff like that. There were a lot of challenges to do basically the payment infrastructure for all the different verticals at Wix. But then I got an offer to work on migrating our build system for all the backend to Google's Bazel build tool, and I jumped on it. I thought, "Okay, so that's going to be very different from what I do day-to-day." Focus on infrastructure, helping developers instead of thinking about the product.

Natan Silnitsky (02:21):
And it was a really great experience, and once that migration was over, I thought, "Wow, I really like developing infrastructure, thinking about writing libraries and tools to help developers achieve their goals. It's very interesting, and also you get exposed to a lot of new, fascinating technologies."

Natan Silnitsky (02:47):
And once I heard that there was an opening to work on the data streams team at Wix, which was a relatively newly formed team dedicated solely to Kafka and data streaming, I jumped on it because I thought, "Wow, this is great." About event-driven style of development. It's cutting edge, working on a lot of different areas, basically, because we have expanded, but more about that later on.

Natan Silnitsky (03:26):
But we're working on libraries, working on production tools, interacting with our developers, making sure they have a good experience when they work with event-driven style with message passing, with streaming of data.

Natan Silnitsky (03:44):
So, I was really excited to join the team, and I've been there for a year and a half now.

Tim Berglund (03:53):
Cool. In the backend infrastructure role?

Natan Silnitsky (03:57):
Yeah. Specifically for Kafka-related area.

Tim Berglund (04:02):
Oh, wow. Okay. Cool. So that puts you in the top few sigmas of Kafka users in the world, probably, just in terms of direct exposure. And scale, really, Wix is not a small-scale company.

Tim Berglund (04:18):
The question I'm about to ask, I'm aware, is rather broad. But how does Wix use Kafka? Where does it fit into the stack, to the extent that you want to and can describe the architecture? Lay that on us and tell us where Kafka fits in.

Natan Silnitsky (04:34):
Yeah, sure. I know that Kafka is used tremendously by a lot of big data companies. It's the entire pipeline of processing it with ETL and all of those buzzwords on big data processing. At Wix, there is a different sort of scale here. We use it for message passing between all our different micro-services, and we have a lot of them. We have roughly 1,400 micro-services communicating. And while some of the communicating is done via gRPC, I would say it's 50/50 now, with the other half going through events and Kafka. So, we produce over a billion messages every day in the system. It's big-scale data, and it keeps growing for sure.

Natan Silnitsky (05:40):
So there are a lot of services working with it, a lot of developers. We have around 300 backend developers and a lot of front-end developers as well, who run Node.js middleware and also use Kafka for their services. That's a lot of developers that we need to support on our team. So, the scale here is huge, not necessarily trillions upon trillions of messages every day, but a lot of different use cases, a lot of different needs for our various products out there.

Tim Berglund (06:27):
You said some services still talking over gRPC. I appreciate you being willing to confess that in public. I think that we can still be friends. Kidding aside, is the direction from synchronous to asynchronous? Is that the desired next state?

Natan Silnitsky (06:45):
Yeah. Definitely. The way I see it, when you deploy a service globally, it's found in multiple data centers and points of presence around the world. And then you have a service that calls a service, that calls a service, that calls a service. And you expect to get the results then and there. You end up with cascading failures and a lot of issues where... You have a brittle system, the way I see it, and the more you use a message broker, the better fault tolerance you'll have. You'll be able to decouple these services from one another.

Natan Silnitsky (07:37):
I really like the series of blog posts by Confluent on how Confluent sees the data, data unbundled, or something like that. Where you take the database, the traditional database, and you change it conceptually completely, and each service can consume its own streamed, materialized view of what it needs. And in an event-driven manner, where it's less dependent on other areas of the system, so it's much more resilient and, I would say, more decoupled and easier to reason about in each small granular component of it, and that's the recommended way. So, of course, gRPC is not going to completely vanish, but I see the benefits are clear to most of our developers, and we witness it by the growth of the usage. It's phenomenal.

Tim Berglund (09:00):
I just loved every word of that. I say those words. There were certain key phrases you use that I say a lot, and every time I say them I'm aware that I'm playing that record again. And I always wonder, do people get tired of hearing me say that? Does it feel like a cliché at this point? And then when you come here and say that, you're like, "No, really, we're a billion messages, 1,400 micro-services. I run this thing, and these are architectural outcomes." It's just nice to hear. So, thank you.

Tim Berglund (09:36):
Don't let me... We're here to talk about infrastructure. We're not here to talk about the broader architectural implications of all this. Maybe that's a future episode because it sounds like you've got good things to say about that as well. But seriously, good stuff. If we did, some podcasts open with a little cold open of a quote from the thing, something from what you just said would have been that cold open, because that was gold.

Natan Silnitsky (10:00):
Cool, and sometimes a cliché becomes the mainstay of the industry, so I think that's a good way forward.

Tim Berglund (10:10):
Yeah, where it's actually true. That's why it gets said all the time.

Tim Berglund (10:16):
But scaling like this. Wix, obviously a large global company. One would probably call it web-scale, if we could use that term without irony. And certainly a large Kafka installation. You've given a talk about lessons learned in using Kafka at Wix, and I've got a list of those here. And I'd like to just talk through some of them, especially from your infrastructure perspective, where you are providing infrastructure tools to application developers. And I heard you say Node. Are there other languages in the stack that you need to take care of?

Natan Silnitsky (10:58):
Yeah, definitely. So I would love to talk about these lessons. I would say the main stack is JVM Scala based, but Node.js is growing.

Tim Berglund (11:12):
Nice. Okay. So I've got a list of these here from your talk, and I'm just going to go through as many of them as we have time for. So number one, your first lesson was that I think you found it desirable to build common infrastructure, by which I think you mean something like a client wrapper. Is that right?

Natan Silnitsky (11:34):
Yeah, so it was apparent from the very beginning that we use the Kafka SDK as is, but as a growing company... We started using Kafka in 2015, and already then there were quite a few services. I'm not sure. I think it was already called micro services back then. And from the beginning, it was clear that we can have each service do its own thing with Kafka, but there are broad features that can be put into all the services, especially if you want to have a coherent way of messaging between them.

Tim Berglund (12:20):
How many application developers would you be serving with that layer?

Natan Silnitsky (12:24):
Today?

Tim Berglund (12:26):
Yeah.

Natan Silnitsky (12:28):
Today, we have around 300 backend developers and a few hundred more front-end developers with Node.js. So I would say 500, 600.

Tim Berglund (12:41):
Got it. And I just want to say, it's very typical at this scale for organizations to do what you're doing. Some people have discomfort with this, wrapping the clients. And honestly, Natan, I'm one of them. I always feel like your wrapper is going to be too opinionated. You're not going to be able to maintain it enough. New features aren't going to get exposed to your application developers in time. And blah blah blah, all the reasons you don't want to do it. And I feel those.

Tim Berglund (13:08):
But it's interesting that basically 100% of organizations who have on the order of 10 to the two developers using Kafka, as you guys have, 100% of them seem to create a wrapper like this to solve all the problems that you just mentioned. And expose features that are Wix-specific, in your case, to your internal developer community. Everybody does this, and it seems like it's the right thing to do.

Tim Berglund (13:39):
You may have had more to say. I interrupted you there. I couldn't help it.

Natan Silnitsky (13:42):
It's great. You mentioned there are new features coming in the Kafka SDK, but think about the need to upgrade the versions. So, you want to have that taken care of for you. If there's maybe a breaking API change or stuff like that, you really want to have more control of it when you have hundreds upon hundreds of developers that rely on it. And they really want to have it just work, right? Even on upgrades. They don't want to mess around with it too much, so there's that as well.

Natan Silnitsky (14:29):
You mentioned that it's also nice to have extra features. For instance, our client wrapper, it's called Greyhound, by the way. It will inject automatically the broker locations, so the user does not need to worry about that. And it does the message commit for our users.

Natan Silnitsky (14:56):
We, from the very beginning, thought that having at-least-once processing is the way to go instead of at-most-once because, like I said before, it's not big data like sensor data. It's really Wix user requests, incoming user requests now. You want to handle them without missing any messages.
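
A minimal sketch of the distinction, assuming a plain list stands in for a Kafka partition and an integer for the committed offset. It is not Greyhound's implementation, and the names `consume`, `make_flaky_handler`, and `run` are hypothetical. The only difference between the two modes is whether the offset is committed before or after the handler runs.

```python
# Illustrative simulation only: a list stands in for a Kafka partition and an
# integer stands in for the committed offset. No real Kafka client is used.

def consume(log, committed, handler, commit_before_processing):
    """Process records from `committed` onward; return the new committed offset.

    Stops at the first handler failure, mimicking a consumer crash.
    """
    offset = committed
    while offset < len(log):
        if commit_before_processing:
            committed = offset + 1  # at-most-once: commit, then process
        try:
            handler(log[offset])
        except RuntimeError:
            return committed  # crash: only the committed offset survives
        if not commit_before_processing:
            committed = offset + 1  # at-least-once: process, then commit
        offset += 1
    return committed


def make_flaky_handler(processed, fail_on):
    """Handler that fails the first time it sees `fail_on`, then succeeds."""
    state = {"failed": False}

    def handler(record):
        if record == fail_on and not state["failed"]:
            state["failed"] = True
            raise RuntimeError("simulated crash")
        processed.append(record)

    return handler


def run(commit_before_processing):
    log = ["order-1", "order-2", "order-3"]
    processed = []
    handler = make_flaky_handler(processed, fail_on="order-2")
    committed = consume(log, 0, handler, commit_before_processing)
    # Restart after the crash, resuming from the committed offset.
    consume(log, committed, handler, commit_before_processing)
    return processed


at_least_once = run(commit_before_processing=False)
at_most_once = run(commit_before_processing=True)
print(at_least_once)  # all three orders are handled
print(at_most_once)   # order-2 is skipped after the restart
```

In this run the at-least-once mode handles every record exactly once because the simulated crash happens before the handler finishes. A crash after the handler finishes but before the commit would cause the same record to be delivered again, which is why at-least-once handlers generally need to tolerate duplicates.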

Natan Silnitsky (15:27):
So, Greyhound handles the commit for our developers, who just specify the processing code in a lambda, basically. It also handles the message [inaudible 00:15:47] in case maybe they have a lot of code there that's kind of slow and messages keep building up, message queuing keeps building up. So, it will stop requesting, pulling more messages from Kafka, stuff like that.
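
The "stop pulling more messages" behavior can be sketched as a bounded in-memory buffer between the poll loop and the handler. This is an assumption about the general technique, not a description of Greyhound's internals, and `BufferedConsumer`, `max_buffered`, and `skipped_polls` are hypothetical names. A real consumer would typically pause and resume fetching on the client rather than skip polls.

```python
from collections import deque


class BufferedConsumer:
    """Polls from a source only while its in-memory buffer has room."""

    def __init__(self, source, max_buffered):
        self.source = deque(source)  # stands in for records waiting in Kafka
        self.buffer = deque()        # records fetched but not yet handled
        self.max_buffered = max_buffered
        self.skipped_polls = 0       # polls skipped because the buffer was full

    def poll(self, batch_size):
        """Fetch up to `batch_size` records, never exceeding the buffer limit."""
        room = self.max_buffered - len(self.buffer)
        if room <= 0:
            self.skipped_polls += 1
            return 0
        fetched = 0
        while self.source and fetched < min(batch_size, room):
            self.buffer.append(self.source.popleft())
            fetched += 1
        return fetched

    def handle_one(self, handler):
        """Hand the oldest buffered record to the handler, if there is one."""
        if self.buffer:
            handler(self.buffer.popleft())


consumer = BufferedConsumer(source=range(10), max_buffered=4)
handled = []
max_seen = 0
# A slow handler: the loop polls three times for every record it handles.
while consumer.source or consumer.buffer:
    for _ in range(3):
        consumer.poll(batch_size=2)
        max_seen = max(max_seen, len(consumer.buffer))
    consumer.handle_one(handled.append)

print(handled, max_seen, consumer.skipped_polls)
```

Even though polling runs three times as often as handling, the buffer never grows past `max_buffered`, and records are still handled in their original order.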

Natan Silnitsky (16:04):
And also parallel message handling. So, we want to utilize our CPUs as much as possible. We don't want them to be like 20% idle. We're paying our cloud providers for these CPU units, for these core units, so we want to have as much utilization as possible with as few instances as we need.

Natan Silnitsky (16:38):
So, in Greyhound, you get parallel processing out of the box. You can scale it to be more. Of course, it's limited to the number of partitions that the Kafka topic has. But other than that, you can get a very high degree of parallelism, and with Scala, it's really easy to actually implement that.
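
One way to picture the partition limit, as a sketch under assumptions rather than Greyhound's actual Scala code: each partition gets one sequential task so ordering within a partition is preserved, and the tasks run on a thread pool whose useful size is capped at the partition count. The names `process_partition` and `process_in_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition, records, handler):
    """Handle one partition's records sequentially, preserving their order."""
    return partition, [handler(record) for record in records]


def process_in_parallel(records_by_partition, handler, max_workers):
    """Run one sequential task per partition on a bounded thread pool."""
    # Workers beyond the partition count would sit idle, so cap the pool size.
    workers = max(1, min(max_workers, len(records_by_partition)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(process_partition, partition, records, handler)
            for partition, records in records_by_partition.items()
        ]
        results = dict(future.result() for future in futures)
    return workers, results


records_by_partition = {0: [1, 2, 3], 1: [10, 20], 2: [100]}
workers, results = process_in_parallel(
    records_by_partition, handler=lambda value: value * 2, max_workers=8
)
print(workers, results)
```

Asking for eight workers on a three-partition topic still yields three concurrent tasks, which is the ceiling described above.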

Tim Berglund (17:02):
It's funny you say that. I was about to stop and provide a content warning to the audience. I was going to say, "Ladies and Gentlemen, we have a Scala developer talking about using multiple cores. I can't be responsible for what gets said next."

Tim Berglund (17:16):
But no, you're right. This is a thing that a Scala team is going to be thinking about this. A good Scala team is going to deliver on this. It's just going to happen.

Natan Silnitsky (17:26):
I mean, Scala has its issues. It has a very rich set of styles, so what happens is that your code becomes a bit unreadable at parts for some developers. But almost all developers will share the immutability and the ability to provide concurrency in a safe manner, and that's what we do on our client wrapper, Greyhound, for Kafka.

Natan Silnitsky (17:56):
Another critical feature, related to what I talked about, the need to never lose messages, never lose processing, at-least-once processing, is to retry processing upon errors both for the consumer side and producer side. We support a few flavors for that. That's a critical use case for us for Kafka in general.

Natan Silnitsky (18:24):
You can create a gRPC request and maybe even retry locally a few times, but you don't have the Kafka message broker that has the message, and you know it's not going to go away, and you can safely retry... In offline jobs, maybe you can even retry after an hour and still succeed. It's just a wonderful, important feature for a serial system that needs to work flawlessly, eventually.
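
A minimal sketch of retrying a handler with increasing waits, assuming a simple in-process loop. The transcript only says Greyhound supports "a few flavors" of retry, so this is a generic illustration and not one of those flavors; `handle_with_retries` and the backoff values are hypothetical. The `sleep` function is injected so the example runs instantly.

```python
def handle_with_retries(record, handler, backoffs, sleep):
    """Call handler(record); on failure wait for each backoff in turn and retry.

    Returns the number of attempts made. Re-raises the last error if every
    attempt fails, so the caller can decide what to do with the record.
    """
    attempts = 0
    last_error = None
    for delay in [0] + list(backoffs):
        if delay:
            sleep(delay)
        attempts += 1
        try:
            handler(record)
            return attempts
        except Exception as error:
            last_error = error
    raise last_error


calls = {"count": 0}


def flaky(record):
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("downstream unavailable")


waits = []  # records the requested waits instead of actually sleeping
attempts = handle_with_retries(
    "payment-42", flaky, backoffs=[1, 60, 3600], sleep=waits.append
)
print(attempts, waits)
```

Because the message stays in the broker until its offset is committed, a long final backoff such as an hour is feasible in a way it would not be for a caller blocked on a synchronous request.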

Tim Berglund (18:58):
Having built this sort of layer, this client wrapper... Oh, wait, no. I almost forgot the question. Do you guys, being partially a JVM shop, is there any Kafka Streams involved in the clients at all?

Natan Silnitsky (19:12):
Good question. So, we started off in 2015, and at that point, I think Kafka Streams maybe already was out in the open, was really just starting. So, we still used producers and consumers solely. There are separate units at Wix that are not part of the main bulk part of the product services that utilize Kafka Streams. It's mostly also in the business intelligence unit and the data organization part, but not on the microservices product-oriented part of Wix.

Tim Berglund (20:02):
Got it. How about any KSQL? We didn't talk about this at all, so this is just a random curiosity on my part.

Natan Silnitsky (20:12):
Okay, so you're saying that there are a lot of common use cases for KSQL without Kafka Streams? I wonder.

Tim Berglund (20:21):
No, it tends to be a trade-off where like a Java shop, everybody's got to commit to Java. It's typical for those Java developers to learn Kafka Streams and use it. And then when presented with KSQL, they'll usually say, "Well, we learned Streams. We're okay. We can get this done with Streams," because they've made the investment, and they want to continue to have that investment pay off.

Tim Berglund (20:44):
A lot of the things that you can do with Streams, you can do with KSQL, and so in non-JVM shops, like you said there were lots of Node.js, KSQL tends to pop up for stream processing in those use cases or those contexts more.

Natan Silnitsky (20:59):
So, we haven't utilized KSQL on our side. I'm not sure about the other parts of our organization. Wix is a big one. I think they're more Kafka Streams oriented than KSQL.

Tim Berglund (21:12):
Now, having built a robust client wrapper that does all those things, and having literally hundreds of developers who are using it, that is an internal product development effort. You don't know all those people. You're not communicating, in any sort of reciprocal way, with all of the people who use the wrapper, which brings up internal tooling, internal documentation, and things like that for your wrapper, and those aspects of productizing Greyhound.

Tim Berglund (21:48):
I should start calling it Greyhound since that's what you guys call it. So, tell me about tooling and docs around this.

Natan Silnitsky (21:54):
Sure. So, like you said, we have this whole additional layer infrastructure, and while I earlier praised the event-driven style of development, you do need to have the tools in place to debug production in an easy manner. Because with gRPC requests, it fails, you get the failure. But here, if you produce a message, you need to somehow understand...

Natan Silnitsky (22:29):
That's the end of your job. Then you have the consumer side that needs to make sure that the processing is done correctly. And when you have it end to end, sometimes it gets a bit tricky to understand what's going on. So, you definitely need to have the tool in place and the education, basically, of your developer corps, I would say.

Natan Silnitsky (22:56):
And we only have a small infrastructure team around Kafka and a lot of product developers. So, we decided to create self-service tools. The most important one is we created our own control plane user interface. There are, of course, other tools out there. There's the famous Kafka CLI with Kafka Console Consumer, et cetera. But we thought that a graphical user interface will help our developers use it easily and not get into the details of different flags that you need to support in the CLI, including broker location and [inaudible 00:23:41] location.

Natan Silnitsky (23:43):
And there are graphical user interface control planes out there, both open source and commercial. I know that Confluent has one, right?

Tim Berglund (23:52):
Do indeed.

Natan Silnitsky (23:54):
But because of our unique infrastructure layer, we decided to create our own tool, and we called it Greyhound Admin. So, we have all kinds of features there, some typical, like configuration of topics, streaming the messages, debugging messages for a topic.

Natan Silnitsky (24:21):
But interestingly, one of the key issues with debugging Kafka production is lag monitoring. The producer produces the messages at some rate, and the consumer is not really dealing with it in a speedy manner, and the lag is built up between the producer and the consumer, and you want to investigate it. You want to know, wait a minute. Is it one partition that is lagging now, or is it all the partitions?

Natan Silnitsky (24:55):
And so, we built a nice feature on this control plane user interface where you can check out, okay, this topic. Who consumes it? Oh, nice. Okay. Oh, here I see my consumer group. Okay, let's look at the partitions and see... Oh, wait a minute. It's only one partition that is stuck, so let me now switch over and stream messages for this partition and understand why this partition has this lag.
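
The per-partition check walked through above boils down to simple arithmetic: lag is the latest offset in the partition minus the offset the consumer group has committed. A minimal sketch with made-up numbers follows; `partition_lag`, `lagging_partitions`, and the threshold are hypothetical and not taken from Greyhound Admin.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest offset in the log minus the committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the threshold, in ascending order."""
    return sorted(partition for partition, count in lag.items() if count > threshold)


# Made-up offsets for one consumer group on a four-partition topic.
end_offsets = {0: 1200, 1: 1180, 2: 1215, 3: 1190}
committed = {0: 1198, 1: 1179, 2: 310, 3: 1190}

lag = partition_lag(end_offsets, committed)
stuck = lagging_partitions(lag, threshold=100)
print(lag, stuck)
```

Here only partition 2 stands out, which points at a specific stuck message or handler rather than a consumer that is too slow overall.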

Natan Silnitsky (25:36):
We also have monitoring [inaudible 00:25:40] in place to see if handlers are stuck for each application. And this way, it's really, in a few clicks, the developer can basically understand what's going on. I hope I put the example in a coherent manner.

Tim Berglund (25:58):
Absolutely.

Natan Silnitsky (25:58):
Good.

Tim Berglund (26:05):
Can you tell me, Natan, how many topics you guys have, approximately? Is that a thing that you're allowed to say?

Natan Silnitsky (26:13):
Sure. I think... Let me refresh my memory. I'm not sure the order of magnitude for some reason, but I think it's either something like 5,000 or 50,000.

Tim Berglund (26:26):
It's always the exponent. [crosstalk 00:26:27]

Natan Silnitsky (26:29):
Sorry?

Tim Berglund (26:29):
You know the mantissa, but those pesky exponents.

Natan Silnitsky (26:36):
I probably remember it's something like 5,000 topics and 50,000 partitions approximately.

Tim Berglund (26:43):
There you go. That sounds reasonable. That's big, and that introduces... With 1,400 services and, we'll say, 600 developers, that introduces the question of schema management and schema discovery. So, just basically, how do you handle that? Does Greyhound participate in that? How do you do that from a tooling and documentation perspective?

Natan Silnitsky (27:09):
It's a very important topic because you don't want your developer to go and ask each team, "Hey, how are you doing? What's your message schema?" And stuff like that. It's really non-scalable in a larger organization.

Natan Silnitsky (27:33):
Because we use gRPC, our message schemas are in protobuf. Across all of Wix, everything is protobuf, and so Kafka messages are in protobuf format. And basically, each service has a swagger output that provides for the message details, what are the events, and what do they have. And then, it's the job of each developer that wants to consume it to go to a site that we have... Everything on Wix is a site, so they go to a site, and they can see all of the different services and all of the events, messages that they produce. So, the developer can consume it and see exactly how to consume it.

Natan Silnitsky (28:35):
But we don't have a very good versioning story. There's a basic protobuf compatibility that you can have. If you have a new version, it's always difficult to get the developers to switch to the new version and stuff like that. It doesn't happen a lot, but when it happens, it will probably be a very manual process.

Natan Silnitsky (29:07):
So, I heard about the schema registry that Kafka has, right?

Tim Berglund (29:16):
Yes, that's the Confluent Schema Registry. It's not a part of Apache Kafka, but it's a free-to-use, source-available thing.

Natan Silnitsky (29:24):
I understand that the versioning there is quite good, versioning management.

Tim Berglund (29:32):
There's kind of two things, as I see the value to schema registry. One is it encourages you to use what amounts to an interface description language, a file to describe your schema. And then, tooling to derive the actual objects, to generate source to define the objects in the language of your choice.

Natan Silnitsky (29:54):
That's similar to protobuf.

Tim Berglund (29:55):
So, you've got... Right, you can do it with protobuf. You can do it with Avro. The other thing is migration when you are evolving schemas. It doesn't automate that, by any stretch of the imagination. Migrating schemas is kind of always a terrible thing to do. But when you've got lots of consumers and producers who are potentially uncoordinated and don't communicate well... Human beings don't communicate well, not that that would ever happen ever, but imagine. Use your imagination. It will help make runtime assertions and build time assertions that producing that message will fail, or consuming that message will be a problem, based on your compatibility rules.

Tim Berglund (30:38):
However, it was like three months ago, as of the time of this recording, that the schema registry started supporting protobuf. Don't feel bad for missing the boat. That's a thing you guys could look into now. It's an option, but literally half a year ago, it was not an option for you as a protobuf shop.

Natan Silnitsky (30:56):
Oh, that's good to know.

Tim Berglund (31:01):
It's a good future thing, and it is not a boat you have necessarily been missing, given your other commitments.

Tim Berglund (31:08):
Is Greyhound a thing, publicly? Is it open source?

Natan Silnitsky (31:11):
Yeah, I forgot to mention that Greyhound is open source. We open sourced it a few months ago, and it has all of the features I mentioned. Of course, we have an additional Wix layer on top of it that is private, in a private repository. But all the features I mentioned and more are found in the open source version, and you can go and check it out on GitHub. It's on the Wix organization, just called Greyhound. So, GitHub.com/Wix/Greyhound.

Tim Berglund (31:48):
Awesome. I could be persuaded to put that into the show notes.

Tim Berglund (31:52):
And you mentioned you document these things internally with sites. How do you build this? All those sites?

Natan Silnitsky (32:02):
That's an entire show on its own, I guess.

Tim Berglund (32:10):
Exactly. The short answer is dogfooding, and that's a good thing. Wow, you can do cool stuff, too. If one team wanted to charge for access to premium documentation, you've got all the eCommerce stuff baked in already. You can just do that.

Natan Silnitsky (32:23):
That's a good idea. You should be on our business development team.

Tim Berglund (32:29):
Little hustle on the side.

Natan Silnitsky (32:34):
We do a lot of dogfooding. A lot of our back office tooling is done using the Wix platform. Actually, for the Greyhound control plane, we didn't use the Wix platform itself, but we did use a Wix open source library called Wix-style-react that allows you to write React and not even think about CSS and styling at all because you get it out of the box. That was nice.

Tim Berglund (33:10):
Nice, yeah. You had me at don't think about CSS.

Natan Silnitsky (33:17):
Yeah, I hate CSS.

Tim Berglund (33:17):
There are people who don't, and they are a gift to human flourishing, that they love CSS and are good at it. The rest of us need them and should encourage them and send them a fruit basket during the holidays is what I think.

Natan Silnitsky (33:35):
Absolutely.

Tim Berglund (33:38):
How about another lesson that you've got. You talk about it in your talk. It's about monitoring, and this is a thing that comes up.

Tim Berglund (33:49):
Monitoring, your architecture doesn't matter. If you've got a monolith, if you've got synchronous microservices, if every single thing in the universe is asynchronous and reactive, like it should be, monitoring is still a problem.

Tim Berglund (34:04):
But it feels like a bigger problem with an asynchronous system because at the point that you have some request that you have been made aware of from the outside world, and you validate that request and do some computation around it, and then you produce that unit of work to some topic that you know 15 other services are going to dance around until the saga is complete, or however you want to put it... Monitoring feels like a bigger problem because the code that you write to process that user request does not have direct responsibility for completing the action, and you're just trusting all these services to be adults and fulfill their responsibilities.

Tim Berglund (34:52):
So, tell me about monitoring and whether that's part of the Greyhound story or other stuff you've built.

Natan Silnitsky (34:58):
It's half and half. Dev culture at Wix is you have complete ownership of your service, including monitoring it. But of course, there are teams to support you. There's the data stream team that I'm part of, and there's a monitoring team.

Natan Silnitsky (35:20):
So, what we do with Kafka-related monitoring is that there's an automatic dashboard for each service that is provided in terms of Kafka-related metrics, and it includes a lot of stuff you'd expect, like producer throughput, consumer throughput, and stuff like that. But also something that can be monitored and metered because of our Greyhound wrapper layer is [inaudible 00:35:59] current handler.

Natan Silnitsky (36:01):
What do I mean by that? The developer provided Greyhound with the processing code so Greyhound can measure how long it took. It can also offer a gauge to see how long is it taking right now when you actually look at the graph. And then you can say, "Okay, I see a lag is building up because I got an automatic alert on our system for incoming lags." And now I can check out this automatic dashboard and see whether I have a stuck handler. Then I can go and open that specific application. I can see the message that is currently being consumed and decide what to do with it.
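
Because the wrapper is the one calling the developer's handler, it can time that call. A minimal sketch of the idea, assuming one in-flight invocation at a time and no thread safety; `HandlerTimer`, `durations`, and `current_elapsed` are hypothetical names, not Greyhound's metrics API. The clock is injected so the example is deterministic.

```python
import time


class HandlerTimer:
    """Wraps a handler to record finished durations and expose an in-flight gauge."""

    def __init__(self, handler, clock=time.monotonic):
        self.handler = handler
        self.clock = clock
        self.durations = []     # seconds taken by completed invocations
        self.started_at = None  # start time of the invocation in flight, if any

    def __call__(self, record):
        self.started_at = self.clock()
        try:
            return self.handler(record)
        finally:
            self.durations.append(self.clock() - self.started_at)
            self.started_at = None

    def current_elapsed(self):
        """Gauge: how long the in-flight invocation has been running (0 if idle)."""
        if self.started_at is None:
            return 0.0
        return self.clock() - self.started_at


# A fake clock makes the timings exact instead of depending on real time.
now = {"t": 0.0}
samples = []


def slow_handler(record):
    now["t"] += 5.0                          # pretend five seconds pass
    samples.append(timed.current_elapsed())  # what a dashboard would read mid-flight
    now["t"] += 2.0                          # two more seconds before finishing


timed = HandlerTimer(slow_handler, clock=lambda: now["t"])
timed("message-1")
print(timed.durations, samples, timed.current_elapsed())
```

The completed-duration list corresponds to "how long it took", and the gauge corresponds to "how long is it taking right now": a gauge that keeps climbing while lag builds is the signature of a stuck handler.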

Natan Silnitsky (36:57):
So, you need to think about it end to end, and you need to think about the developer. When they run into an alert, do they have the tools in place to actually fix it? It's very important. You get the dashboards. You get the alerts. And everything has to be simple and out of the box, otherwise, you can't really make sure that all 1,400 microservices have the monitoring story as well as they should.

Natan Silnitsky (37:32):
And of course, on top of all that, we have the control plane, where they can continue debugging and view a lot of different aspects as well.

Tim Berglund (37:40):
Great point, and I think a broader architectural principle there for organizations of your scale. It's not just that you can't trust application developers to be competent. That's not the framing that you want. Which is typically... I was always on the application side of things, and the infra and app dichotomy, sometimes toxic narratives build up there where people who get beeped and paged in the middle of the night and are responsible for uptime just think that the people who write application code are all completely stupid and make mistakes all the time.

Tim Berglund (38:21):
You don't want that kind of thing, and I like the way you have framed this. You can count on them.

Natan Silnitsky (38:29):
Well, at Wix, actually, the application owner will get up at 3 AM if their application is burning.

Tim Berglund (38:35):
Oh, nice. That's dev-ops where it counts. Or, as we say, skin in the game. But at your scale, you're just not going to be able to expect that community of hundreds of application developers to also assume responsibility for those infrastructure concerns. Those are actually cross-cutting framework things, and part of what you've done is build that framework.

Tim Berglund (39:06):
As the infrastructure team, that's your problem to solve. You've solved it. You've given this tool to those people, rather than saying, "Oh, yeah. You have to do this other super complex job that is a calling all in itself and a career specialty all in itself in addition to understanding how restaurants operate and what they need in their front end." All the stuff that the front end people need to care about.

Natan Silnitsky (39:28):
Absolutely.

Tim Berglund (39:30):
I like that. That's a good way to handle monitoring responsibilities, and just cross-cutting concerns generally. Extract them into a framework. At this scale, it's worth the investment. And then support that framework, which is what Greyhound is all about if I am hearing you correctly.

Natan Silnitsky (39:48):
Yeah, it's a nice way of putting it.

Tim Berglund (39:50):
Let's see, we're coming up against time, and I'm thinking maybe we have one more thing. I want to let you pick. You've got a list here. What's the next lesson you'd like to talk about?

Natan Silnitsky (40:00):
Well, I think I'll be greedy, and I hope we have time for two more lessons.

Tim Berglund (40:09):
All right. Let's go.

Natan Silnitsky (40:10):
One of them is that we talked about being a Scala shop and also a Node.js shop. So, for the JVM, we have the great Kafka SDK, and for Node.js we used to work with... I think the library's called No Kafka or Yes Kafka, something like that. I know it wasn't officially supported by Confluent or the Kafka open source project. It had its own problems, a specific library for edge cases.

Natan Silnitsky (40:52):
So, that was the situation. We have the Greyhound in Scala services, and we have this other library for Node with a little extra infrastructure work by our Node platform developers. And we saw that this is not great because Node services should be first-class citizens and enjoy all the event-driven reactive style development with all the features that Greyhound offers.

Natan Silnitsky (41:25):
But porting all the code from Scala to Node.js doesn't really make sense, so we decided to create a sidecar inside the [inaudible 00:41:36] pod where the Node service runs. It just takes Greyhound as a service, basically, in a sidecar. It's a proxy, basically, where all the produce and consume is done via gRPC calls back and forth between the two containers in the pods.

Natan Silnitsky (42:07):
So, not only does the sidecar need to be a server that gets produce requests, but also the Node app is a server that gets the consumer requests. And we decided to do that and not use some commercial provided [crosstalk 00:42:25] proxy.
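
The two-way arrangement Natan describes can be sketched roughly as follows. This is a minimal in-process simulation, not Greyhound's actual implementation: the class and method names (Sidecar, NodeApp, handle_produce, handle_consume) are hypothetical, plain method calls stand in for the gRPC calls between the two containers, and an in-memory list stands in for Kafka.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class NodeApp:
    """Stands in for the Node service: a server for consume requests."""

    def __init__(self) -> None:
        self.received: List[Tuple[str, str]] = []

    def handle_consume(self, topic: str, payload: str) -> bool:
        # In the setup described, this would be a gRPC handler in the Node app.
        self.received.append((topic, payload))
        return True  # acknowledge so the sidecar can advance


class Sidecar:
    """Stands in for the JVM sidecar: a server for produce requests and
    a client that pushes consumed records to the app."""

    def __init__(self, app: NodeApp) -> None:
        self.app = app
        self.log: Dict[str, List[str]] = defaultdict(list)  # stand-in for Kafka
        self.subscriptions: Set[str] = set()
        self.next_offset: Dict[str, int] = defaultdict(int)

    def handle_produce(self, topic: str, payload: str) -> int:
        # Direction 1: app -> sidecar. Returns the offset of the new record.
        self.log[topic].append(payload)
        return len(self.log[topic]) - 1

    def subscribe(self, topic: str) -> None:
        self.subscriptions.add(topic)

    def poll_and_deliver(self) -> int:
        # Direction 2: sidecar -> app. Deliver every record not yet acknowledged.
        delivered = 0
        for topic in sorted(self.subscriptions):
            records = self.log[topic]
            while self.next_offset[topic] < len(records):
                payload = records[self.next_offset[topic]]
                if not self.app.handle_consume(topic, payload):
                    break  # not acknowledged; retry on the next poll
                self.next_offset[topic] += 1
                delivered += 1
        return delivered
```

The point of the sketch is the shape: each side is both a client and a server, so the Node process never talks to Kafka directly and everything Kafka-specific stays on the JVM side.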

Tim Berglund (42:26):
The Blizzard Node.js library or any of the... There are a couple of other options that Node shops tend to use. So, in other words, you are not directly using a Kafka library in your Node clients, but you've got this sidecar. And you know what, Natan, I'm going to say, if there has to be a synchronous interface, if it's two containers in a pod, I'll allow it. That seems okay to me. And then, it's the sidecar that is using your... That's Greyhound, then, basically.

Natan Silnitsky (42:56):
Yeah, it just uses Greyhound and the Kafka JVM SDK.

Tim Berglund (43:03):
Oh, yeah, sure, because it's not just that, okay, there are Node.js libraries. Go bite the bullet and use the Blizzard one because everybody else does. But it's Greyhound, and you have to port all of Greyhound to wrap that in Node. Okay, that makes sense. Cool.

Natan Silnitsky (43:22):
But I'll go check out the Blizzard one. My curiosity's piqued.

Tim Berglund (43:29):
Totally, do. I mention this because this has come up. Somebody asked me on Twitter recently, "Hey Confluent, when are you guys going to have a formally supported Node.js library?" There's noise about that and, I think, growing community desire for a quote-unquote "official" one.

Tim Berglund (43:52):
The Blizzard one is the de facto official community library, just based on people I talk to. That's a thing, but that still wouldn't solve your Greyhound problem. You've got all this value wrapped around the JVM.

Tim Berglund (44:03):
Anyway, you said you wanted two, and the way we get to two is by me talking less. So, talk to me about proactive broker maintenance and lessons you learned there.

Natan Silnitsky (44:14):
We currently maintain our own Kafka brokers, of course, on the Cloud, but by ourselves. In terms of lessons learned here, it's really important to monitor the growth of your clusters. Wix is quite a big shop, and there's a famous blog post about the optimal amount of partitions and brokers in a cluster.

Natan Silnitsky (44:48):
I know that once the Zookeeper KIP will be fully blended, we'll have a lot more threshold, but I think it's currently something like 4,000 partitions per broker and 200,000 partitions per cluster. Correct me if I'm wrong.

Tim Berglund (45:12):
That would be prior to, I think, Kafka 2.0.

Natan Silnitsky (45:14):
Oh, okay.

Tim Berglund (45:17):
With more up-to-date versions of Kafka within the last year, or Confluent Platform, you'd be around 20,000 partitions per broker. And mileage varies on that. I'm saying that number, and the operator might say, "Nope. Sorry, we tested it, and it was 12,000." And that's all totally cool. [crosstalk 00:45:33]
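
As a rough illustration of how the figures quoted above might be checked against a cluster, here is a small sketch. The thresholds are the numbers mentioned in the conversation and are deliberately parameters, since mileage varies; the function name check_limits and the input shapes are assumptions made for this example, not part of any Kafka tooling.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Limits:
    per_broker: int
    per_cluster: Optional[int] = None  # None means no cluster-wide figure given


# Figures Natan quotes; Tim notes they predate more recent Kafka versions.
OLDER_GUIDANCE = Limits(per_broker=4_000, per_cluster=200_000)


def check_limits(per_broker_counts: Dict[int, int],
                 cluster_partitions: int,
                 limits: Limits) -> List[str]:
    """Return one warning per threshold that is exceeded."""
    warnings = []
    for broker, count in sorted(per_broker_counts.items()):
        if count > limits.per_broker:
            warnings.append(
                f"broker {broker}: {count} partitions exceeds {limits.per_broker}"
            )
    if limits.per_cluster is not None and cluster_partitions > limits.per_cluster:
        warnings.append(
            f"cluster: {cluster_partitions} partitions exceeds {limits.per_cluster}"
        )
    return warnings
```

An operator who measured a different ceiling, such as the 12,000 in Tim's example, would pass Limits(per_broker=12_000) rather than rely on either quoted figure.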

Natan Silnitsky (45:33):
You would agree that Zookeeper is a bottleneck here.

Tim Berglund (45:37):
I would agree, Natan, that Zookeeper is a bottleneck. This is why we have KIP-500 and why we're excited about the future, but go on.

Natan Silnitsky (45:44):
I'm also excited about that. The main lesson here is to be proactive, and add additional brokers, and delete unused topics, and if you need to split up your cluster because it reached the limit, then go ahead and do that.
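
The proactive steps listed here, deleting unused topics and adding brokers before a limit is hit, could be supported by a check along these lines. This is only a sketch under stated assumptions: the 90-day idle window, the 80% target utilization, and the function names are made up for illustration, and the last-write timestamps are assumed to come from whatever monitoring is already in place.

```python
import math
from datetime import datetime, timedelta
from typing import Dict, List


def unused_topics(last_write: Dict[str, datetime],
                  now: datetime,
                  idle_days: int = 90) -> List[str]:
    """Topics with no writes inside the idle window: candidates for deletion."""
    cutoff = now - timedelta(days=idle_days)
    return sorted(topic for topic, ts in last_write.items() if ts < cutoff)


def brokers_needed(total_partition_replicas: int,
                   per_broker_limit: int,
                   target_utilization: float = 0.8) -> int:
    """Brokers required to keep each one below a fraction of its limit,
    assuming replicas are spread evenly across brokers."""
    usable = per_broker_limit * target_utilization
    return math.ceil(total_partition_replicas / usable)
```

For example, 100,000 partition replicas against a 4,000-per-broker limit at 80% utilization works out to 100,000 / 3,200 = 31.25, so 32 brokers. Running a check like this on a schedule is one way to act before the limit is reached rather than after.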

Natan Silnitsky (46:05):
Actually, we are now considering switching over to Confluent Cloud. That will...

Tim Berglund (46:12):
Go on.

Natan Silnitsky (46:15):
So, that will hopefully release us from the need to do the broker maintenance because like I said, as our scaling requirements keep growing, you need to be on the nose and keep maintaining and checking that everything is okay. So, if someone else can do that for us, why not? That would be great.

Natan Silnitsky (46:44):
But basically, we also want to split up the cluster. We have one big giant cluster now for each of our data centers, and it's reached the limits, I think, even if we probably move to Confluent Cloud. It's still something that we want to do. Then we have all the concerns of how to split the clusters and make sure that it's easy on the Greyhound API level to use it, maybe without even going to the API level, having it behind the scenes to go between clusters, et cetera.

Natan Silnitsky (47:29):
So, it's going to be an interesting time for us.

Tim Berglund (47:32):
My guest today has been Natan Silnitsky. Natan, thanks for being a part of Streaming Audio.

Natan Silnitsky (47:36):
Thank you so much, Tim. I really enjoyed the conversation, and I hope that some of the lessons I described will help any of our listeners.

Tim Berglund (47:46):
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 21st, 2021, and use it within 90 days after activation. And any unused promo value on the expiration date will be forfeit, and there are a limited number of codes available, so don't miss out.

Tim Berglund (48:15):
Anyway, as always, I hope this podcast was helpful to you. If you want to discuss it or ask a question, you can always reach out to me @tlberglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or, you can leave a comment on a YouTube video or reach out in our community Slack.

Tim Berglund (48:32):
There's a Slack sign up link in the show notes if you'd like to join. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover us, which we think is a good thing.

Tim Berglund (48:50):
Thanks for your support, and we'll see you next time.