Streaming Audio: Apache Kafka® & Real-Time Data

Using Apache Kafka and ksqlDB for Data Replication at Bolt

August 26, 2021 Confluent, original creators of Apache Kafka® Season 1 Episode 173

What does a ride-hailing app that offers micromobility and food delivery services have to do with data in motion? In this episode, Ruslan Gibaiev (Data Architect, Bolt) shares the story of Bolt's road to adopting Apache Kafka® and ksqlDB for stream processing, replicating data from transactional databases to analytical warehouses.

Rome wasn't built in a day, and neither was the adoption of Kafka and ksqlDB at Bolt. Initially, Bolt saw the need to standardize its systems and to replace its unreliable query-based change data capture (CDC) process. As an experienced Kafka developer, Ruslan believed that Kafka was the right foundation for adopting change data capture as a company-wide event streaming solution. Getting the team at Bolt to buy in was hard at first, but Ruslan made it happen. Eventually, the team replaced query-based CDC with log-based CDC from Debezium, built on top of Kafka. Shortly after the implementation, developers at Bolt began to see precise, correct, real-time data.

As Bolt continues to grow, they see the need for a data lake or a data warehouse for OLTP system data replication and stream processing. After carefully considering several solutions and frameworks, such as ksqlDB, Apache Flink®, Apache Spark™, and Kafka Streams, ksqlDB shone brightest for their business requirements.

Bolt adopted ksqlDB because it is native to the Kafka ecosystem and a natural fit for their use case. They found it particularly well suited to replicating all of their data to a data warehouse for a number of reasons, including:

  • Easy to deploy and manage
  • Linearly scalable
  • Natively integrates with Confluent Schema Registry 

Tune in to find out more about Bolt's adoption journey with Kafka and ksqlDB.


Tim Berglund:
When you want to adopt Kafka and ksqlDB, but there's some resistance inside your organization, what do you do? Well, Ruslan Gibaiev, who's a data architect at European startup Bolt, faced exactly that problem. I talked to him about how he tackled it and where he's gotten with that adoption effort so far, on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.

Tim Berglund:
Hello, and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined in the virtual studio, stretching over an entire ocean and the better part of two continents by Ruslan Gibaiev of Bolt. Ruslan, welcome to the show.

Ruslan Gibaiev:
Hey Tim, thanks for having me here. Glad to be here with you.

Tim Berglund:
Excellent. Hey, tell us about what Bolt does and tell us about what you do there?

Ruslan Gibaiev:
Sure. You may think of Bolt as the European competitor to Uber. Basically, we started in the ride-hailing space, but over the years we gradually expanded to micro-mobility and food delivery as well. And right now, we are the leading micro-mobility platform across Europe and Africa.

Tim Berglund:
Awesome. And what is your role there at Bolt?

Ruslan Gibaiev:
I work as a data architect in the data engineering department. Mostly I focus on two opposite sides of the world. One is everything related to data warehousing technologies, like ETLs, data lakes, and so on. And the other one is everything related to real-time stream processing and real-time data.

Tim Berglund:
All right. That's what we want to talk about today. Specifically, what I want to focus on in your story is how you drove Kafka and ksqlDB adoption. That's a question I get a lot from other developers who are thinking about event-driven architecture, new trends, and how to be in the right place in the future: how do I convince my company to adopt this technology if it's not already into it? And you've got that story. It starts with change data capture, but it's not really a story about change data capture. Tell us how you did that. What was your motivation, and how did you start the process?

Ruslan Gibaiev:
Yeah, that's a very good question, actually. So before joining Bolt, I had very good experience working with Apache Kafka and I really liked it. I understood how it makes life easier for developers in the microservices world and how it brings the power of real-time data to a company. When I joined Bolt, we were a fast-growing startup, and that means at some point we were in a state where we had all the different flavors of technologies: some services talking over APIs, some services directly calling the databases of other services. There were some messaging queues in place too, but I would say we were lacking standardization. Thinking a few years ahead, I understood that something had to change and that we needed to adopt Apache Kafka.

Ruslan Gibaiev:
And honestly, it took me some time to convince people and make them believe in the technology. And that's exactly where the use case of change data capture comes into play. At some moment, we realized that our data analytics platform was not very reliable. At that time we had a process in place which people now usually call query-based CDC, which is actually a fancy name for a dumb process that hammers your databases, polling them every 10 or 15 seconds. And this process had plenty of disadvantages. It was not very reliable, there were duplicates, it was hard to maintain, and so on. And the biggest problem was that people didn't trust the numbers, because every other day something was wrong and some numbers were missing. At that moment, I proposed implementing a new process, which is log-based CDC, or binary-log-based CDC.

Ruslan Gibaiev:
Basically, instead of polling the database, you can simply read the internal binary logs of the databases, convert them to messages, send them to Apache Kafka, and then do whatever you want with the data. We started on this project, and it took us some time to migrate to the new pipeline, but I would say people started to trust it pretty much instantly, because all the numbers were precise. Everything was correct. Everything was real-time. There were no delays. And it created a very good reputation for Apache Kafka as a technology, and for the rest of the stack we adopted along the way, ksqlDB being one of them.

Tim Berglund:
Which we'll get into in a minute, was that Kafka and Debezium? Did you use Debezium for the CDC?

Ruslan Gibaiev:
Yeah. So hats off to all the folks from Debezium. They did a really good job implementing all the Debezium connectors. Internally we run on MySQL, and when I started digging into this project, I found the Debezium project. I didn't know about it before, and I realized that's exactly what we needed. And luckily, Debezium was developed to match perfectly with Apache Kafka, and that brought in the whole stack.
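To make that concrete, here is a minimal, hedged sketch of what registering a Debezium MySQL source connector can look like when it's driven from ksqlDB, which is one way to do it when ksqlDB is pointed at a Kafka Connect cluster. The hostnames, credentials, and table names are hypothetical placeholders, not Bolt's actual configuration:

```sql
-- Hypothetical example: register a Debezium MySQL source connector from ksqlDB.
-- Assumes ksqlDB is configured to talk to a Kafka Connect cluster (ksql.connect.url)
-- and that the Debezium MySQL connector plugin is installed on the Connect workers.
CREATE SOURCE CONNECTOR rides_cdc WITH (
  'connector.class'                          = 'io.debezium.connector.mysql.MySqlConnector',
  'database.hostname'                        = 'mysql.internal.example.com',  -- placeholder host
  'database.port'                            = '3306',
  'database.user'                            = 'cdc_user',                    -- placeholder credentials
  'database.password'                        = '********',
  'database.server.name'                     = 'bolt',         -- logical name, becomes the topic prefix
  'table.include.list'                       = 'rides.orders,rides.drivers',  -- placeholder tables
  'database.history.kafka.bootstrap.servers' = 'broker:9092',
  'database.history.kafka.topic'             = 'schema-changes.rides'
);
```

Each included table then lands as its own change-event topic in Kafka, which is what the downstream ksqlDB queries read from.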

Tim Berglund:
Yep. We've got a number of other episodes on CDC and I think at least one on Debezium, so we'll make sure to link to those in the show notes. But again, I'm interested not so much in CDC, because that's a known thing, but in your adoption journey. I mean, this is a solid use case to start a Kafka deployment inside a company: there are things happening here, we need them over there, and hey, if they're inside a database, that's fine, there are Kafka connectors that'll get them out and we can get them to the other place. That's a great place to begin. It's easy to think about. It's like solving a known problem, and all that makes sense. But what was the resistance to Kafka to begin with? You described an adolescent or young adult startup with a bunch of different tech stacks and, the way life is, just different things everywhere. And you wanted to bring Kafka in because you'd had success with it. You knew this was going to be a good thing. But what was the pushback?

Ruslan Gibaiev:
Well, I would imagine pretty much every time someone wants to introduce a new technology, other people, and that's actually a good thing, will challenge them and simply ask: why do we need to adopt this technology? Why do we need to learn it? Why do we need to support it? And speaking of Apache Kafka, at the time we were starting this project there was no cloud offering on the market yet. Confluent Cloud was not generally available to the public, which meant we had to run the infrastructure ourselves. And as you know, Kafka is a highly available distributed system, which means you usually run it on more than one machine. So my colleagues simply started challenging me and asking questions, because, surprisingly to me, none of them had previous experience with Kafka.

Ruslan Gibaiev:
And that was the first obstacle to tackle. After that, it gradually propagated to other teams and higher up the hierarchy, and other people started to challenge me: why do we really need to adopt a new technology? Because it will take some time. I started describing all the benefits, speaking about real-time data, about connecting all the silos, having a single central place that streams the data in real time with low latency, and being able to do stream processing. And honestly, it didn't happen overnight. It took some time for people to understand. And I'm really glad that we had this use case, because it allowed me to show them a real example: how it works, what it does, what the benefit is, and what more we can do.

Tim Berglund:
Yeah. And I guess having to operate it on your own and not having a fully managed service like Confluent Cloud at the time was a pain (this message brought to you by Confluent Cloud). And that would honestly be a barrier, because there's a lot of operational lift there. But once you got over that and you had some buy-in, you had Kafka deployed, you had Debezium deployed, it seems like there's a natural evolution. And if you want, you can dig a little bit more into the use case. You said before you were doing query-based CDC and the results were unreliable. Was this an analytics thing? Was it some kind of reporting thing? You switched to log-based and it got better. So take us from there. You mentioned ksqlDB, so it seems like it's an analytics thing maybe, but yeah, keep taking us through your adoption journey.

Ruslan Gibaiev:
Cool. The initial use case was purely analytical. Basically, people wanted to get the data from multiple OLTP databases, at that time there were not that many of them, but more than one already, and put it into a single place where they could query all the data, join it, enrich it, and calculate different stats. Because, like I said, we work in the micromobility space, and usually people are interested in how many rides we had over the last week per every city we operate in, and so on. We had plenty of such dashboards, plus reporting, management reports, investor reports, some kinds of forecasts, and all the other analytical use cases. Honestly, at that time there were not that many of them, but still, the question was whether people could trust the data or not.
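As an illustration of the kind of dashboard metric Ruslan describes, here is a hedged ksqlDB sketch of counting rides per city over a weekly window. The stream and column names are hypothetical, not Bolt's actual schema:

```sql
-- Hypothetical example: weekly ride counts per city, computed over a stream of
-- ride events that the CDC pipeline lands in Kafka. Names are illustrative only.
CREATE TABLE rides_per_city_weekly AS
  SELECT city,
         COUNT(*) AS ride_count
  FROM rides_stream
  WINDOW TUMBLING (SIZE 7 DAYS)
  WHERE status = 'finished'      -- only count completed rides
  GROUP BY city
  EMIT CHANGES;
```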

Ruslan Gibaiev:
And I realized that if we did this thing properly, we could get Kafka adopted, and over time this whole pipeline would get much more adoption. All the other use cases could be built on top of it. We could switch microservice communication to this pipeline. We could also do different kinds of stateful stream processing, anti-fraud prevention, machine learning; pretty much all the ETL-based jobs or any other workloads could be switched to real-time. I mean, it's a no-brainer, it would be much better. Right?

Tim Berglund:
Absolutely. That's the goal, but you have to persuade people one step at a time. So you're extracting OLTP data and you've got this OLAP pipeline. How did you think about how to do that real-time analysis? You said ksqlDB, but obviously Kafka Streams presents itself as an option there. How did you work through that trade-off, and what did you end up building? What exists right now in ksqlDB?

Ruslan Gibaiev:
Yeah, that's a very good question. And like I said, at the time, compared to now, we didn't have that many, let's say, databases to replicate, or even source tables to replicate. Right? But to me it was clear that if you assume Bolt keeps growing at the rate we had at that time, over a year it would already turn into a huge number. And I realized that when you're talking about replicating OLTP system data to some kind of data lake or data warehouse, you also have to understand that you cannot simply replicate all the data. Some data is sensitive; you want to anonymize it, maybe hide it. Maybe you want to drop some fields and not replicate them. Maybe you want to filter out some records. Right? And it means that you have to do some kind of stream processing.
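Here is a minimal, hypothetical sketch of the kind of per-table stream processor Ruslan is describing in ksqlDB: one declarative statement that masks a sensitive field and filters out records before they reach the warehouse. The stream, topic, and column names are placeholders, not Bolt's actual schema:

```sql
-- Hypothetical example: clean and filter one CDC-sourced stream before replication.
CREATE STREAM orders_clean
  WITH (KAFKA_TOPIC = 'dwh.orders', VALUE_FORMAT = 'AVRO') AS
  SELECT order_id,
         city_id,
         MASK(user_phone) AS user_phone,   -- anonymize a sensitive field
         price,
         created_at                        -- fields not selected are simply dropped
  FROM orders_raw
  WHERE is_test_order = FALSE              -- filter out records that should not be replicated
  EMIT CHANGES;
```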

Ruslan Gibaiev:
And that made me think: right, which options do we have on the table? Obviously Kafka Streams is candidate number zero, but what else did we have? There were other frameworks like Apache Flink, Apache Spark, and so on. But personally, I wanted to stay native to the Kafka ecosystem, because the closer all the software is to each other, the more native it is, the better. And that's exactly when I started thinking about ksqlDB. So honestly, I was choosing between two candidates, Kafka Streams or ksqlDB. And imagine this: you have, let's say, hundreds of tables to replicate, so you want something like hundreds of stream processors, right? And you assume that over maybe a year or two this number can grow to a thousand stream processors. And when you're talking about database CDC, you also have to think about database migrations, because, well, sometimes people add new columns, drop them, change types. Right?

Ruslan Gibaiev:
And it means that if we're talking about a strictly typed pipeline, you would also need to change your streams. Right? And when I thought about how Kafka Streams works, you have to deploy your own apps and your logic runs inside of them. And when you think about having to change 10, 20, 50, 100 such stream processors in Kafka Streams, it would be a pain. I would honestly rather kill myself than do that. And that's exactly where ksqlDB shines, because, and this is what I really like, I honestly believe ksqlDB is a perfect fit for this use case. It is declarative and it's easy to deploy. Basically, the ksqlDB server is like an engine: it runs your logic for you. You simply give it a config file with the queries and it runs them. And it's very easy to modify, change, add new queries, and drop obsolete queries. So honestly, when I compared those two options, it was a no-brainer for me. I decided to go with ksqlDB.

Tim Berglund:
That makes so much sense. And that's a great dimension. It's funny, that's a dimension that doesn't come up very often in the trade-off between ksqlDB and Kafka Streams when people are considering it, but it does in this classic data pipeline use case. Wow. Yeah. I mean, you're going to version a hundred or a thousand apps, and, I mean, you could get a different job before you kill yourself, but honestly, that's on the list of rational next steps. So I'm with you on that. And yeah, that brings ksqlDB to the top of the list.

Ruslan Gibaiev:
Yeah. To add even more: when I was doing my investigation and research, even then and even now, I honestly don't see that many articles or companies saying that they use ksqlDB for this use case. But now that we have implemented it, I honestly think that over the years it will become a de facto, canonical, industry-standard tool for this job. Because, like I said, it's so easy to use and it does the job extremely well; it has a bunch of advantages over Kafka Streams. I have even written a blog post on the Confluent blog where I listed all the advantages, in my humble opinion. So I really believe, yeah, that's the perfect tool for this job.

Tim Berglund:
Tell us about serialization format and Schema Registry and that kind of thing. Were those a part of the pipeline?

Ruslan Gibaiev:
Yeah.

Tim Berglund:
I mean, sure, there was a serialization format. Was it Avro? Was Schema Registry involved? Because you're talking about the key advantage being that it's more resilient in the face of schema changes. So what did the schema look like at a lower level here for you?

Ruslan Gibaiev:
Yeah. When I started working on this project, I also realized that... Maybe a bit of context: internally at Bolt, most of our backend applications run on TypeScript over Node.js, and don't get me wrong, it works perfectly well. This language and this runtime have plenty of advantages, but they also have one small disadvantage, which is the lack of type checks at runtime. And I understood that very well. Given that I had just switched to data engineering at the time, it was quite obvious to me that keeping all the pipelines strictly typed, start to end, is a must, because come on, how can you guarantee anything, or rely on anything, if you cannot really impose the proper types? And that's why I realized that this whole pipeline has to be strictly typed.

Ruslan Gibaiev:
So we had to have some kind of format which keeps the types. And God, how happy I was when I realized that ksqlDB is seamlessly integrated with Schema Registry. Come on, both of the technologies are developed by the same company. It's not that you have to integrate them yourself; it works out of the box. And yes, at that time Schema Registry supported only the Avro format, which for our use case was absolutely fine. I mean, we are still using it, even though Schema Registry has since added support for other formats as well, but Avro is a perfectly good format for our use case. And so yes, for this whole project we adopted Kafka, Kafka Connect, Schema Registry, and ksqlDB, which was called ksql at the time. And yeah, it works seamlessly well.
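As a small, hedged illustration of that out-of-the-box integration: when the ksqlDB server is configured with a Schema Registry URL and a topic carries Avro data, a stream can be declared without listing its columns, because ksqlDB reads the schema from Schema Registry. The topic name below is a hypothetical placeholder:

```sql
-- Hypothetical example: no column list needed, because VALUE_FORMAT = 'AVRO' tells
-- ksqlDB to fetch the value schema from Schema Registry (ksql.schema.registry.url
-- must be configured on the ksqlDB server).
CREATE STREAM payments_src WITH (
  KAFKA_TOPIC  = 'bolt.payments.payments',   -- placeholder Debezium-style topic name
  VALUE_FORMAT = 'AVRO'
);
```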

Tim Berglund:
It's like chocolate and peanut butter, plus marshmallows and graham crackers and a few other good things that just taste better together. Each one is individually good, but yeah, there's a sort of gestalt thing where you have them all together and the whole is greater than the sum of its parts. One last question about schema changes, just out of curiosity. When they happened, and of course they happened, did you find that you always had to, say, tear down the old ksql query and create a new one? Or did you ever have schema changes where, thanks to Schema Registry and Avro, the old ksqlDB query still worked because it was only picking out certain fields? What was the experience of going through a schema change actually like?

Ruslan Gibaiev:
Yeah, this is a very good question. And first things first, it was not just that... First of all, we had to become friends with the DBAs, because previously, and I guess it works the same way in many companies, DBAs believe that they live in a separate world; they protect their databases and they can do whatever they want, like all types of migrations, anytime they want. But now we started relying on them, we started depending on them, because we had to react to every single schema change. And yes, you mentioned it correctly: there are some types of schema changes which are, let's say, not compatible, where you have to restart the ksqlDB query. And there are some, the majority of them, which are compatible, like, let's say, adding a new column or dropping some optional column. That's totally fine.

Ruslan Gibaiev:
But, for example, if you change the type of a column, let's say you want to extend it from an integer to a big integer: in the Java world and in the ksqlDB world those are two different types, right? And that's why you have to create the new version of the schema internally and redeploy your query, which I guess brings us to the topic of deploying ksqlDB and having to manage all those queries. At the time we started, and even now, ksqlDB supports two modes of operation, right? One is called interactive and the other one is called headless. Right?
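To make the incompatible case concrete, here is a hedged sketch of what redeploying a dependent query after an INT-to-BIGINT change might look like in interactive mode. The query ID, stream names, and columns are hypothetical:

```sql
-- Hypothetical example: an upstream column changed from INT to BIGINT, so the
-- dependent persistent query is terminated, dropped, and recreated against the
-- new schema version registered in Schema Registry.
TERMINATE CSAS_ORDERS_CLEAN_0;   -- query ID is a placeholder; list real IDs with SHOW QUERIES
DROP STREAM orders_clean;

CREATE STREAM orders_clean
  WITH (KAFKA_TOPIC = 'dwh.orders', VALUE_FORMAT = 'AVRO') AS
  SELECT order_id,               -- now read as BIGINT from the new schema version
         city_id,
         price
  FROM orders_raw
  EMIT CHANGES;
```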

Tim Berglund:
Yes.

Ruslan Gibaiev:
And initially, we started working in the interactive mode. I picked it because I liked the name: it's interactive, you can simply input something and see the results instantly.

Ruslan Gibaiev:
And we had it working in production for a few months, I would say. But then at some point I ran into a case... Imagine the situation where, as I said, we change the type of a field in some table, right? But this specific column is used as a foreign key in 10 other tables, which means that when its type changes, you have to modify and restart 11 queries altogether. Right? And in interactive mode that means I first have to drop those 11 queries and then recreate them. And I realized that, with all due respect, this approach is not very scalable for our use case, because if I had 50 or hundreds of such dependent tables, it would take a considerable amount of time. So we switched to the headless mode.

Ruslan Gibaiev:
And for this specific use case, in my opinion, it has plenty of advantages, like, for example, the vertical isolation, or the ease of redeploying queries in chunks, not one by one but in groups. And we have all the queries versioned in a version control system, and once someone changes them, we simply have CI deploy the new file and restart the ksqlDB engine.
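Here is a hedged sketch of what such a versioned headless queries file could look like. All file, stream, and column names are hypothetical; the detail that matters is that the whole set of statements lives in one SQL file that the ksqlDB server runs in headless mode (for example, via the ksql.queries.file server property), so CI can ship a new file and restart the engine:

```sql
-- queries.sql (hypothetical), kept in version control and deployed by CI.
-- In headless mode the ksqlDB server executes every statement in this file on startup.
-- Source streams such as rides_raw and drivers_raw are assumed to be declared earlier
-- in the same file over the Debezium CDC topics.

CREATE STREAM rides_clean
  WITH (KAFKA_TOPIC = 'dwh.rides', VALUE_FORMAT = 'AVRO') AS
  SELECT ride_id, city_id, started_at
  FROM rides_raw
  EMIT CHANGES;

CREATE STREAM drivers_clean
  WITH (KAFKA_TOPIC = 'dwh.drivers', VALUE_FORMAT = 'AVRO') AS
  SELECT driver_id, city_id, MASK(phone) AS phone   -- anonymize before replication
  FROM drivers_raw
  EMIT CHANGES;
```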

Tim Berglund:
Now, we're recording this on the first day of June 2021, and migrations are, I think, about a month old, in version 0.17 of ksqlDB. Has that been something you've been able to take a look at yet?

Ruslan Gibaiev:
Yeah, that's true. Migrations arrived, at the very least, a few versions ago in ksqlDB. No, honestly we haven't had a chance to look into them yet, but as I said, so far we are quite happy operating in headless mode. I agree that the majority of such changes and interactions imply only small changes, maybe to one or two tables, but sometimes we indeed have to change queries in batches of 5, 10, 15, 20. And for that use case, the headless mode is very, very nice, in my opinion. We will definitely evaluate migrations, but so far, as I said, it's not really urgent for us.

Tim Berglund:
Right. Did they come out in 0.15, not 0.17? I'm just the host here. What would I know?

Ruslan Gibaiev:
Yeah. I guess even prior to maybe 0.17.

Tim Berglund:
Okay. All right. Because before I said 0.17, I was actually debating with myself whether it was 0.15 or 0.17. We'll find out, and we'll put a link in the show notes to the blog post. For the last question: you mentioned a future direction. You start with CDC and you end up with reactive microservices. How far along are you guys? How far have you been able to push Kafka adoption, and really, more broadly, event-driven architecture inside Bolt?

Ruslan Gibaiev:
Yeah, that's a very good question. And like I said, the broader picture was my dream when I set out to get this adopted in the company two years ago. And right now we've won the trust of many of my colleagues, and I see a pattern of people from other teams coming to me and asking questions like, "Hey, do you think it would be better if we build this whole new microservice straight away using the events we have?" So to answer your question, we are still happy users of log-based CDC. Over time, our analytical platform has grown 10 times; we have many more reports, dashboards, and people making decisions on top of this data, because, like I said, all of them trust it now.

Ruslan Gibaiev:
Also, we opened this data up to the backend microservices, and we also started doing stateful stream processing for different use cases like anti-fraud or machine learning. And the main benefit is that it's real-time. So you don't have to wait, you don't have to operate on data that's a day or four hours old; it's at most 5 or 10 minutes stale, and that's an extreme case.

Tim Berglund:
My guest today has been Ruslan Gibaiev. Ruslan thanks for being a part of Streaming Audio.

Ruslan Gibaiev:
Thank you, Tim it was a pleasure talking to you.

Tim Berglund:
And there you have it. Thanks for listening to this episode. Now, some important details before you go. Streaming Audio is brought to you by Confluent Developer, that's developer dot confluent dot io, a website dedicated to helping you learn Kafka, Confluent, and everything in the broader event streaming ecosystem. We've got free video courses, a library of event-driven architecture design patterns, and executable tutorials covering ksqlDB, Kafka Streams, and core Kafka APIs. There's even an index of episodes of this podcast. And if you take a course on Confluent Developer, you'll have the chance to use Confluent Cloud. When you sign up, use the code PODCAST100 to get an extra hundred dollars of free Confluent Cloud usage.

Tim Berglund:
Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me at @TLBerglund on Twitter. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on the YouTube video if you're watching and not just listening. Or reach out in our Community Slack or Forum; both are linked in the show notes. And while you're at it, please subscribe to our YouTube channel and to this podcast wherever fine podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there. That helps other people discover the show, which we think is a good thing. So thanks for your support, and we'll see you next time.