Streaming Audio: A Confluent podcast about Apache Kafka®

Becoming Data Driven with Apache Kafka and Stream Processing ft. Daniel Jagielski

February 22, 2021 Confluent, original creators of Apache Kafka® Season 1 Episode 145

When it comes to adopting event-driven architectures, a couple of key considerations often arise: the way that an asynchronous core interacts with external synchronous systems and the question of “how do I refactor my monolith into services?” Daniel Jagielski, a consultant working as a tech lead/dev manager at VirtusLab for Tesco, recounts how these very themes emerged in his work with European clients. 

Through observing organizations as they pivot toward becoming real time and event driven, Daniel identifies the benefits of using Apache Kafka® and stream processing for auditing, integration, pub/sub, and event streaming.

He describes the differences between a provisioned cluster vs. managed cluster and the importance of this within the Kafka ecosystem. Daniel also dives into the risk detection platform used by Tesco, which he helped build as a VirtusLab consultant and that marries the asynchronous and synchronous worlds.

As Tesco migrated from a legacy platform to event streaming, determining risk and anomaly detection patterns has become more important than ever. The company needs the flexibility to adjust to usage patterns that are changing with COVID-19. In this episode, Daniel talks about integrations with third parties, push-based actions, and materialized views/projections for APIs.

Daniel is a tech lead/dev manager, but he’s also an individual contributor for the Apollo project (an ICE organization) focused on online music usage processing. This means working with data in motion, breaking the monolith (starting with a proof of concept), migrating ETL to stream processing, and ingesting via multiple processes that run in parallel with record-level processing.


Tim Berglund: 
As people adopt event-driven architectures, a couple of themes sometimes arise. One is the way that the asynchronous core interacts with external synchronous systems, always a struggle. The other is the good old "how do I refactor my monolith into services?" question. Well, today I get to talk to Daniel Jagielski about both of those things and how they came up in his consulting work with a couple of European clients. It's all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.
 
Tim Berglund: 
Hello, and welcome to another episode of Streaming Audio. I am, once again, it seems persistently your host, Tim Berglund, and I'm joined in the virtual studio today by Daniel Jagielski. Daniel is a member of the Kafka community, sometimes seen at meetups. He's a consultant working with a firm called VirtusLab, and he's got actually quite a bit of interesting recent Kafka experience in all of the areas I like talking about. So, Daniel, welcome to the show.
 
Daniel Jagielski: 
Hi, Tim. Thanks for having me.
 
Tim Berglund: 
You got it. Tell us a little bit about your background. How did you get to do what you're doing? We're going to dig into what you're doing right now throughout the show, but how did you get here? How did you get started with software, how did you get started with Kafka? Tell us about you.
 
Daniel Jagielski: 
Sure. So, I'm a consultant, and I'm really passionate about distributed systems. I have been in the industry for a while, and during that time I was working for companies from different domains and different business sectors. Currently, as you mentioned, I'm involved in two initiatives closely related to stream processing, and I have been heavily involved in these organizations' journey to become event driven. And I was lucky enough to participate in that journey from the very beginning.
 
Daniel Jagielski: 
At the moment, I am a tech lead and development manager at VirtusLab, where my responsibility is to lead and manage a team within Tesco, specifically in the identity department, where we are working towards providing a risk-based adaptive authentication experience for our users. Besides that, I'm also consulting on a stream processing project at ICE, an organization which works in the music industry, where we are replacing an existing batch solution with an event-driven system.
 
Daniel Jagielski: 
I've done a few talks and written a few blog posts about Kafka, and this is the main area of my interest at the moment, because I love to share my findings and discuss them with the community. Back in the day, I also did some development of web applications. There's the classic stack based on Spring, relational databases, sometimes Hibernate. I embraced the DevOps mindset, as we were deploying applications that we created ourselves using public clouds or on premises, depending on the use case. And I also jumped early on the Big Data bandwagon, as I worked on a few prototypes on top of Hadoop, and I even wrote my master's thesis on that topic.
 
Tim Berglund: 
Oh, cool. I didn't know about the master's thesis. That's really interesting. So, we're going to dig into the Tesco and the ICE stuff in a little bit, and again, that's ICE. That's not the United States Department of Homeland Security ICE. What's the American version? Oh boy, my daughter-in-law's going to kill me for not knowing this. It's copyright clearinghouse stuff, so you use a song and this is a way that you can pay a fee to the organization for it. CCLI maybe is the American one. Anyway, it doesn't matter. ICE is that music copyright clearance.
 
Tim Berglund: 
Before we dig into those, I want to pick your brain on a few other things. The overall trend, you've been around long enough to have seen the rise of the trend towards event-driven architectures and the pressures of systems that have to deliver real-time results. Reflect on that a little bit, I have a story that I tell and a story that I see and I know what the people I talk to say, but I'm always interested in hearing other people who have been watching it and in your case, helping to drive it, helping to build things and helping to nurture the community. What's going on, why the move to event-driven architectures, and separately why a move to real time?
 
Daniel Jagielski: 
Sure. So ultimately, moving in that event-driven direction changes the way organizations operate. There are multiple different motivations for organizations to move this way, and different use cases and different organizations probably have different triggers for that, but the main benefit of this solution and this architecture is that it really unchains the data within the organization.
 
Daniel Jagielski: 
So, it unleashes the potential that was always there but that, due to some technical or process limitations, could not be utilized 100%. In general, in the classical approach, the data just lies somewhere. However, with stream processing and event-driven architecture, you put the data in motion, and finally you can have multiple derivatives of the same information. It also brings other benefits, specifically decoupling.
 
Daniel Jagielski: 
So, there is no longer coupling between the side that emits the information and the side that consumes it. There's this simple yet very powerful concept of producers and consumers within event-driven systems, and Kafka specifically. With that, you can achieve great benefits. You can, for example, implement new features in a way that is non-disruptive to the running production system. And there's no centralized place where all the logic runs. Back in the day, people had this habit, or there was this pattern, of moving application logic to the database, which didn't work out well for a variety of reasons.
 
Daniel Jagielski: 
So, with event-driven architecture, with stream processing, you can have multiple applications running on top of the data, and you push data towards them and they do their thing. They emit results, which can be interesting for different teams, different parts of your organization. And this is what unlocks the potential in use cases that weren't identified before.
 
Tim Berglund: 
So, playing devil's advocate for a moment here. I happen to be completely sympathetic to the point of view you just expressed, obviously, but one might be a skeptic and say, this fancy event-driven architecture, it's just this trend, and I drink my coffee straight, and all of you people with your coconut milk lattes and fancy new ideas, that type. From that perspective, why do you think this didn't happen with databases? With databases, all the stuff's in one place, I could integrate things. So, why not just do it that way? And why do it with event streaming? "It" in this case being integration.
 
Daniel Jagielski: 
Sure. So, when you reach a certain scale with databases, for example relational databases, it's quite difficult to scale further than that. Obviously, there are techniques like sharding, which may help up to some degree. However, at internet scale, for example, this is often not enough to run queries on the database or run some analytics. So what you really would like to have is the paradigm shift which comes with event-driven systems: your services, your applications, react to the data which is being pushed towards them. It's not that the application is polling or querying one central place that runs all this computation for you. So, there's a huge difference here.
 
Tim Berglund: 
That makes a lot of sense. So, it's easier to scale up. I guess we should say partitioning a topic does less violence to the contract that a topic gives you. Sharding a relational table does a lot of violence to the relational model. It's a big thing that you have to think about. Partitioning topics, which is how we scale topics in Kafka, isn't as big of a deal. I mean, it's a thing you have to think about sometimes, but you still have the basic ordering contract.
 
Daniel Jagielski: 
Exactly [crosstalk 00:10:22]. Yeah. You should operate things as they were designed. So, Kafka and event streaming architecture are perfectly designed for just that use case, to scale using different partitions. And this strategy works really well here.
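
To make the ordering contract discussed above concrete, here is a minimal in-memory sketch (not Kafka client code). It assumes records are routed to a partition by hashing their key; `NUM_PARTITIONS`, `partition_for`, and the sample records are hypothetical, and CRC32 is only a stand-in for whatever hash a real client uses. The point is that all records with the same key land in one partition, so their relative order survives even though the topic is split.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the sketch


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_all(records):
    """Append (key, value) records to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in records:
        logs[partition_for(key)].append((key, value))
    return logs


records = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add-to-basket"),
    ("user-2", "logout"),
    ("user-1", "checkout"),
]
logs = append_all(records)


def events_for(key: str):
    """Read back one key's events from the single partition that holds them."""
    return [value for k, value in logs[partition_for(key)] if k == key]


print(events_for("user-1"))
print(events_for("user-2"))
```

Ordering across different keys is not guaranteed once they land in different partitions, which is the trade-off Tim alludes to.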

Tim Berglund: 
Yeah. Nice. We mentioned real time also. Maybe you covered this and I'm just spacing here, but why do you think that's a pressure? Given that it is, everybody is under pressure to build systems that get answers faster. But what do you think is driving that in the broader world or in the technology world, the business world? Why does everything have to be real time? 

Daniel Jagielski: 
I think the pace of how we live has increased to a great extent over the years, and everyone wants to have the information right now. Without waiting, for example, one day or one week to gather some information, you would like to have real-time insights into how the business operates, how your new strategy works, and this type of real-time insight can help you make better decisions and change the way you operate as a business.

Daniel Jagielski: 
So, this is the main driver. There are different notions of what real time means, obviously, because for some, going from one month of processing to one hour is just enough. However, for some, you would really like to get as close to real time as possible, as with the things that we do, for example, at Tesco, where we have a real-time risk assessment platform. And we would like to have the information as soon as possible to block potential damage caused by attackers, for example.

Tim Berglund: 
How about other benefits architecturally? Those are the big things, there's a move towards event-driven architecture, there's pressure that I think I'd agree with you is even cultural pressure towards real-time systems. It's not a thing that a CIO or some software architect is making up. It's like, that's what's happening with the world and the computer programs we write have to conform to that, because those are the tools that people want. 

Tim Berglund: 
But, zeroing in a little bit more specifically on Kafka what are the things it does for you when you're designing a system? Where do you use it? Obviously, this is a Kafka podcast and I'm a person who's enthusiastic about it, and I will tend to see uses for Kafka everywhere, but where in your experience have been the problems that you've solved in systems that you've built? 

Daniel Jagielski: 
There are different use cases that Kafka can help you with. Some people start their Kafka journey with sending auditing events, for example. So, this is a perfect example. You have your application running and you would like to have some traceability. You would like to have maybe some analytics run on your data, on events that your system emitted, and you start from sending audit events. 

Daniel Jagielski: 
Then, for example, you also have multiple great opportunities when it comes to integrating different systems together via Kafka. And Kafka Connect stands out here, with tons of connectors created by Confluent, the community, or third-party companies. Some may use Kafka as a pub/sub mechanism, but we all know that it's much more than that. And the ultimate goal that everyone is leaning towards at some point is event streaming and event-driven architecture.

Daniel Jagielski: 
So, that's where Kafka and the eventing system become the central nervous system of your organization. It is a great point for exchanging information across different departments. And you can leverage that easy access to different information, which previously was locked down in the databases or silos of a certain department.

Tim Berglund: 
Yes. Referring back to our discussion of a few minutes ago on how event streaming makes integration easier. And I think there's another note on that, and this just came to me, but there seems to be a safety in sharing immutable things versus sharing mutable things. And obviously you can manage access rights, such that there's only one writer on a table and everyone else is a reader. It's not that that's hard to do or something, but fundamentally tables are mutable and applications tend to want the ability to mutate them.

Tim Berglund: 
And the people who maintain those applications tend to use the persuasive and political mechanisms inside an organization to get their applications the right to mutate the table. And then, let's say, you end up with a pile of shared mutable state, which is never a good life choice. With event streaming, you are explicitly sharing immutable things, because they're events, they're descriptions of things that have happened. They can't change.

Daniel Jagielski: 
Yeah, exactly. That's what an event is. It's a statement of fact. There is this discussion sometimes whether Kafka is good for CQRS, and obviously, if a different style of development is applied, the goals of CQRS can be achieved. However, in general, when you design and implement an event-driven architecture, you would like to emit facts and react to these facts, not commands really. And facts are immutable. You cannot change the past, right?

Tim Berglund: 
No, no, you can't. You can't, he says ruefully. So, if I could keep picking your brain a little bit: provisioned versus managed, or self-managed versus fully managed, or on-prem versus cloud. On-prem is a little misleading, because often that's in the cloud, just a different way of doing it. But that dichotomy, in other words, in my worldview, Kafka cloud, or Confluent Cloud, or I'm running it myself. You've done it both ways. Could you tell me about your experience there, and just how you found it to be, and what the trade-offs are? Help people listening think through that.

Daniel Jagielski: 
Yeah. As you mentioned, there are always trade-offs, and it's a general rule in software development and distributed systems. You need to deal with trade-offs, and building a self-hosted cluster versus using a managed solution is also the type of decision that you need to make at the earliest stage of your project. So, depending on the path that you take, it may change your schedule, the roadmap of the project, and it may obviously impact how fast you are able to progress.

Daniel Jagielski: 
So, whenever you would like to progress fast, to prove your idea, to start delivering business value, it's obvious that using a managed solution from experts in the technology, who will run the thing for you in a reliable way, is the better choice to make. At ICE, where we are building this new event-driven system that is going to replace some legacy application, we wanted to jump straight into delivering business value, into replacing some functionalities that were painful, that were not scaling on the previous version of the system.

Daniel Jagielski: 
So that's when we decided on using a managed solution. And it works pretty well, I would say, because we don't have any ops stuff regarding the Kafka cluster, which is a fun thing to learn about. However, it's also stopping you from delivering features and is impacting your timelines. So, it's just a different way of working. And in the other project, at Tesco, we ran, and we are still running, a self-hosted cluster.

Daniel Jagielski: 
It takes time to set up and, as mentioned, it's really great to learn about Kafka internals, how it scales, and what the mechanisms within are. However, there's a cost in that. And it takes time: when you haven't heard about Kafka, or you are just a user of Kafka, it's a totally different story to provision your own cluster. There's a maintenance cost as well, so depending on how sophisticated your tooling around this cluster is, it's still probable that you will need to do some manual work around this cluster.

Daniel Jagielski: 
So, updating the Kafka version, for example, or, if your cluster doesn't scale automatically, adding new brokers: these activities also take your time. Oftentimes, when you run a self-hosted cluster within an organization, and it's not centralized, it's not one platform for the whole organization, and there's no specific team that owns this cluster, you end up in a situation where you have some cluster godfathers, a certain group of people who can operate the cluster, who can answer questions regarding its performance or some degradation of performance, or solve issues when they happen.

Daniel Jagielski: 
So, you don't want to have this knowledge silo, and either you propagate this knowledge or you build some center of excellence as [inaudible 00:21:49] called it in one of your previous podcasts, or you end up in the situation where some team needs to develop features and maintain the cluster simultaneously, which obviously impacts its effectiveness. 

Tim Berglund: 
Well, no, that's just DevOps. Right? Of course, everyone is taking on the full-time operational responsibilities of complex systems and also developing complex applications at the same time.

Daniel Jagielski: 
Yeah. Yeah. We all have this DevOps mindset at the moment. This is absolutely the right thing to do. However, as mentioned, at some point it becomes painful, and this responsibility may become blurry as well. So, there are these hidden costs here.

Tim Berglund: 
There very much are. And to be clear, I'm hardly a skeptic of DevOps as a set of sensibilities for responsibilities for how developers should think about the deployment of their code, and skillsets that it's desirable to develop, and so on. Those are good things. It's just an incredibly debated term though. If you want to start a fight among a group of software developers, just ask them to define DevOps and they'll be in a fight in 90 seconds. It doesn't take long at all. 

Daniel Jagielski: 
Yeah. As you know, use cases grow and as you develop new applications using your cluster and you produce more and more messages or events, at some point you will need to make someone responsible for keeping it stable, right? Or assessing whether current cluster capabilities are good enough to accommodate this new use case that you are just about to build. 

Tim Berglund: 
Uh-huh. That's very, very true. And I think the real end game here is that there is specialization, and that's okay. DevOps has been a good push for what had been silos to understand each other better and cross-train in skillsets. There's my minimal definition of the term. But operating a Kafka cluster is hard, and at a certain scale, it is a full-time responsibility. And that's a good thing to recognize if you're going the provisioned route.

Daniel Jagielski: 
Exactly. 

Tim Berglund: 
And tell us, did you get to talk much or think much about the managed end of things, how has that gone? What's your experience there? 

Daniel Jagielski: 
So, generally the experience is good. It's extremely easy to provision the cluster. It's just a few clicks and you are there, you can use the cluster. You don't need to worry about cluster security, for example, which always takes time and requires some knowledge to be done properly. So, with managed solutions like Confluent Cloud, you get it built in. As these platforms are still evolving, there are some places where a few improvements can be made and a few new features can be developed.

Daniel Jagielski: 
Like, at the moment, you cannot, for example, run your own connectors on such a platform. So, even though you have the platform and you are using some connectors that are provided there, you are not able to run your custom connectors, which may be limiting at some point or may push you out of this fully managed experience, right? Because you would still need to provision infrastructure for these use cases yourself.

Tim Berglund: 
Exactly. You're fully managed for what can be fully managed, but you need a long tail connector, or just one that isn't available in Confluent Cloud yet. And so you have to run a connect cluster, which we're trying to avoid doing, but that's absolutely a thing. 

Daniel Jagielski: 
Yeah, yeah. The direction is clear, if it's managed, everything should be ultimately managed, but obviously it takes time. And sometimes there are other reasons for some connectors not being available, like licenses, for example. 

Tim Berglund: 
Right. We are getting there. Oh, and licenses. Well, yeah. Maybe GoldenGate isn't going to be there unless there's some future marketplace thing or something.

Daniel Jagielski: 
Yeah, yeah. This is the one that we also experimented with. 

Tim Berglund: 
Did I say [inaudible 00:26:36]? I'm sorry, Daniel.

Daniel Jagielski: 
Yeah, this is the one. But finally, if you provision the cluster yourself, there are ways to provision Kafka Connect, for example using an operator, or there are patterns in Helm, which are generally available and which you can use to deploy your own Kafka Connect cluster, which also helps to some extent with running and provisioning infrastructure for that.

Tim Berglund: 
Tell us about Tesco. Moving on to that, I think you said it was a fraud detection or risk mitigation thing. What's that system like, and what special problems have cropped up in building that? 

Daniel Jagielski: 
So, when it comes to the work I do at Tesco, which is mainly within the identity department, we would like to ultimately reduce friction for legitimate customers and make authentication as difficult as possible, or even impossible to finish, for attackers. There are different tools and different modules that we have. We have a layered security model, which helps to identify threats, for example. However, the other side of the coin, reducing friction for legitimate customers, is not yet supported by the vendors that we use.

Daniel Jagielski: 
So, we decided to create a component called the Risk Engine, which would assess the risk, which would calculate some properties of authentication session. And based on these properties, we could assess the risk related to this particular attempt. So for example, as we observe user coming from the same device, we could potentially require less inputs on this very session whereas, if it's coming from unknown device from unknown country, for particular user, we may want to test them or challenge them with some more difficult inputs at the beginning. 
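
The device and country example above can be sketched roughly as follows. This is an illustrative assumption, not Tesco's Risk Engine: the `LoginAttempt` and `UserContext` types, the factor weights, the thresholds, and the challenge names are all hypothetical, chosen only to show properties of a session feeding a risk score that decides how much input to demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginAttempt:
    user_id: str
    device_id: str
    country: str


@dataclass
class UserContext:
    """What has been seen before for one user (hypothetical shape)."""
    known_devices: set
    known_countries: set


def risk_score(attempt: LoginAttempt, context: UserContext) -> int:
    """Sum hypothetical factor weights; a higher score means a riskier attempt."""
    score = 0
    if attempt.device_id not in context.known_devices:
        score += 50  # unknown device for this user
    if attempt.country not in context.known_countries:
        score += 30  # unknown country for this user
    return score


def required_challenge(score: int) -> str:
    """Translate a score into how much input to ask for (hypothetical thresholds)."""
    if score >= 80:
        return "strong"
    if score >= 50:
        return "extra"
    return "minimal"


context = UserContext(known_devices={"laptop-1"}, known_countries={"GB"})
familiar = LoginAttempt("alice", "laptop-1", "GB")
unfamiliar = LoginAttempt("alice", "phone-9", "BR")

print(required_challenge(risk_score(familiar, context)))
print(required_challenge(risk_score(unfamiliar, context)))
```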

Daniel Jagielski: 
After that, maybe we would like to reduce the risk as we identify new contexts for this given user. So, in general, from the top, this is how this system works. There are different rule sets that we can discuss here. And we obviously started from a very simple use case. We wanted to re-implement a static rules mechanism that we had initially, that was our legacy. It was based on top of JMS, so also a messaging platform. It was very common in the past; it's not that common currently.

Daniel Jagielski: 
So yeah, that's what we took as our first use case, and we rebuilt it from scratch. We decided to use Kafka there, and we were wondering how to join this legacy platform with the new event streaming platform. There was an API that was written some time ago, and we couldn't really integrate Kafka clients, the Kafka producer, into it. So, we decided on using a REST proxy that would just serve a REST endpoint and would then emit events to our Kafka cluster.

Daniel Jagielski: 
So, this is how we started. We emitted events from the legacy system. We built the thing, the new version of the risk engine module, on the side. And we had a really good testing environment, because the previous platform had its problems, but most of the time it produced correct results, so we were able to assert whether the new platform that we had built was actually meeting the success criteria. And not only were we producing the same results, but we were also producing more value in terms of being highly available. We could scale horizontally as we liked. And we didn't have data loss, which was the case with the previous system.
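
Running the new platform alongside the old one, as described above, amounts to comparing the two systems' outputs for the same inputs. A minimal sketch of such a comparison follows; the session ids, the decision labels, and the `compare_decisions` helper are hypothetical and only illustrate the idea.

```python
def compare_decisions(legacy: dict, candidate: dict):
    """Compare decisions keyed by session id.

    Returns the agreement rate over sessions both systems saw, plus the
    list of (session_id, legacy_decision, candidate_decision) mismatches.
    """
    shared = sorted(legacy.keys() & candidate.keys())
    mismatches = [
        (sid, legacy[sid], candidate[sid])
        for sid in shared
        if legacy[sid] != candidate[sid]
    ]
    agreement = 1.0 if not shared else (len(shared) - len(mismatches)) / len(shared)
    return agreement, mismatches


# Hypothetical outputs of the two systems for the same four sessions.
legacy_out = {"s1": "allow", "s2": "challenge", "s3": "allow", "s4": "block"}
candidate_out = {"s1": "allow", "s2": "challenge", "s3": "challenge", "s4": "block"}

agreement, mismatches = compare_decisions(legacy_out, candidate_out)
print(agreement, mismatches)
```

Each mismatch still has to be judged by a person: it may be a bug in the new system or a case the old one got wrong.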

Tim Berglund: 
Excellent. And you said you had the old system still there to compare against, so you could see rate of false negatives and false positives. You could compare the new system to it before switching over, which is a rare gift. 

Daniel Jagielski: 
Yeah. It's a dry-run mode around that. And this is the perfect situation to be in, because you have something that you can compare against. With time, and with more features that we developed, with more factors, as we call them, these differences only increased, and increased in our favor. And at some point the old platform was just disabled and the cleanup was performed.

Tim Berglund: 
Withered on the vine, just like you always hoped that it would. Another quick question about this, because when I talk to people about the fraud detection use case in general, it seems there's a synchronous/asynchronous bridging opportunity here. In other words, you have this nice asynchronous fraud detection system and all these services dealing with streaming data, and then you talked about a REST proxy getting into it, but you also have to expose the risk scores somewhere.

Tim Berglund: 
And that could be a stream, right? That then gets joined with some transaction and evaluated. So, if everybody is asynchronous after the fact, that's fine, there's no friction there, but often in integrating with legacy systems, you have to provide an API. Also, these systems tend to take a fairly rich set of inputs. And some of those inputs may themselves be synchronous APIs, where your fraud detection code is the requester and not the requestee. So, did you have those problems of either having to call a synchronous service or be called as one?

Daniel Jagielski: 
Both, I would say. When we are talking about exposing the data that is calculated by this risk engine mechanism, based on Kafka and stream processing, as you mentioned, we need to expose it somehow for synchronous services so that they have an easy way to access the data. And our decision was to create materialized views out of the factors that we calculate out of the risk scores that we calculate. 

Daniel Jagielski: 
So, we have topics, we have certain values and these topics are exported into Cassandra using Cassandra connector, and our application when it's running, when it serves some API calls, it's just going to Cassandra. Most of that information can be calculated upfront. If we take this device recognition feature, whenever we see the device for the first time it obviously won't be present in Cassandra materialized view. However, as soon as the first login session finishes successfully, we would update this materialized view. So, on the subsequent attempt it's present and it can be accessed, which means that it will impact the risk score. 
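
The materialized view flow described above can be sketched in memory as follows. This is an assumption-laden illustration, not the Tesco implementation: a plain dictionary stands in for the Cassandra table, a list stands in for the topic, and the event shape (`type`, `user_id`, `device_id`, `timestamp`) and the `DeviceView` class are hypothetical.

```python
class DeviceView:
    """In-memory stand-in for the materialized view table (hypothetical)."""

    def __init__(self):
        # (user_id, device_id) -> timestamp of the last successful login
        self._rows = {}

    def apply(self, event: dict) -> None:
        """Fold one event from the topic into the view.

        Only successful logins update the view; other events are ignored.
        """
        if event["type"] == "login_succeeded":
            self._rows[(event["user_id"], event["device_id"])] = event["timestamp"]

    def is_known(self, user_id: str, device_id: str) -> bool:
        """Synchronous lookup, as an API handler would do at request time."""
        return (user_id, device_id) in self._rows


view = DeviceView()

# First attempt: the device has never been seen, so the lookup misses.
seen_before_first_login = view.is_known("alice", "laptop-1")

# The stream then delivers the events for that session.
topic = [
    {"type": "login_started", "user_id": "alice", "device_id": "laptop-1", "timestamp": 100},
    {"type": "login_succeeded", "user_id": "alice", "device_id": "laptop-1", "timestamp": 105},
]
for event in topic:
    view.apply(event)

# Subsequent attempt: the device is now present and can lower the risk score.
seen_before_second_login = view.is_known("alice", "laptop-1")
print(seen_before_first_login, seen_before_second_login)
```

The synchronous API only ever reads the view, so the stream processing stays off the request path.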

Daniel Jagielski: 
So, this is how it works for us. And we have quite nice results using that approach. When it comes to calling external services, obviously with stream processing, you would like to avoid [inaudible 00:35:41] the APIs in the middle of your processing pipeline, because it may slow down the processing. There may be some errors of this external service, and you would like to have all the data in Kafka. 

Daniel Jagielski: 
So, in order to achieve that, when we integrated with a third-party vendor that provides some of the information, some of the risk factors for us, we also integrated via Kafka. So, we have topics with our information that we expose to their streaming platform. They do their magic and get back to us with an event. And we consume that event, and also update our views and update our statements, which impacts the risk scoring.

Tim Berglund: 
Excellent. So, those impedance-matching transformers, as I call them, from the asynchronous core to the synchronous outside, they always exist. At some point, something becomes synchronous. Somebody is going to ask you a question and wait for the answer, or somebody on the outside is expecting to be asked and promising to tell you real fast when asked. And so, you've always got those interfaces and it's good to hear how people solve them. 

Daniel Jagielski: 
Yeah. Definitely, and then again, always trade-offs, right? So, we need to consider different things. 

Tim Berglund: 
So there are no solutions, there are only trade-offs. Onto the other project you mentioned before, this is called Apollo. And it was for an organization called ICE, which, if you've just jumped in now and didn't hear us before, is not Immigration and Customs Enforcement in the United States, but is a German copyright clearing house akin to, I think, ASCAP is what it's called in the U.S. I said CCLI before, which is a humorous domain-specific mistake. If you know, you probably got a chuckle out of that when I said it. But yeah, it's an organization that helps people use copyrighted music in a way that has permission and pays the artist, I think, if I understand correctly. Tell us about that project. 

Daniel Jagielski: 
Let me start with quoting our CTO here. "We are going to disrupt the online music processing, using event streaming, and that's Apollo. The main point of the project is to rebuild heavy ETL batch processes using modern approaches, like event-driven architecture in the first place. We are just at the beginning of our journey of stream processing adoption, which you have nicely covered in one of the previous episodes about the streaming maturity model. But we definitely have an appetite for much, much more. 

Daniel Jagielski: 
We have started with a successful POC, and now we are approaching productionizing it and connecting different systems together via events. After initial considerations, we picked data ingestion and data export as perfect candidates for the start. The end goal here is to break the monolithic system and rebuild other processes that are mission critical for this domain. As we started spreading the word around the organization, we reached the point where the business gets really creative and is able to identify some new use cases and places which are unlocked by this new approach. For instance, the existing batch process was using a single data source for data enrichment, which is the process we call matching. With new services and new data pipelines, we are able to switch to layered matching using multiple different data sources that are piped to our Kafka topics. This will reduce the amount of manual matching, and thus reduce the cost of the whole process and improve the data quality as well. 
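
One way to picture layered matching is sketched below. It rests on an assumption that "layered" means trying several reference sources in priority order and falling back to manual matching only when none of them resolves the record. The function name, the catalogue dictionaries, and the title normalization are all hypothetical and do not describe the Apollo implementation.

```python
from typing import Optional


def layered_match(usage_title: str, layers: list) -> Optional[tuple]:
    """Try each reference source in priority order.

    `layers` is a list of (source_name, catalogue) pairs, where each catalogue
    maps a normalized title to a work identifier. Returns (source_name, work_id)
    for the first source that knows the title, or None if no source matches.
    """
    key = usage_title.strip().lower()  # deliberately naive normalization
    for source_name, catalogue in layers:
        work_id = catalogue.get(key)
        if work_id is not None:
            return source_name, work_id
    return None  # nothing matched: the record is left for manual matching
```

With a single source, every miss goes to manual matching. Each additional layer can only shrink that remainder, which is the cost reduction described above.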

Daniel Jagielski: 
As we are consuming lots of data from external sources, it reminded us about the importance of schemas and their evolution, as they are guarding correctness and compatibility of your events. And that's really crucial when you deal with different data formats. The other reminder, or finding, from this project is that you need to handle poison pills with care. They effectively break your pipelines and stop events from being processed. So, you should really consider what to do when your events are malformed for some reason. 

Daniel Jagielski: 
Anyway, the opportunities that we gain out of stream processing in Apollo are increasing ingestion throughput by scaling services horizontally with proper data partitioning, and going from batch-level to record-level processing. It makes a huge difference, as in some cases, let's say, individual usage lines from our batches can be corrupted. So in such a case, we do not necessarily need to reject the whole batch anymore. What we can do instead is just put those records aside and still progress with the remaining data and process that data in downstream processors. Later on, we can potentially request these individual records to be resent, but the majority of records will get processed and can produce some results. 
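
The record-level idea can be sketched as follows: each usage line is parsed on its own, and a malformed line is set aside with enough context to request a resend, instead of failing the whole batch. The JSON line format and the track_id and plays fields are assumptions made for illustration, and the returned list stands in for whatever dead-letter destination a real pipeline would use.

```python
import json


def process_batch(lines):
    """Parse each usage line independently.

    Returns (good_records, set_aside). A corrupted line never stops the
    remaining lines from being processed.
    """
    good, set_aside = [], []
    for offset, raw in enumerate(lines):
        try:
            record = json.loads(raw)      # malformed JSON raises ValueError
            plays = int(record["plays"])  # missing or non-numeric field raises
            good.append({"track_id": record["track_id"], "plays": plays})
        except (ValueError, KeyError, TypeError) as err:
            # Keep the raw payload and position so the record can be re-requested.
            set_aside.append({"offset": offset, "raw": raw, "error": repr(err)})
    return good, set_aside
```

Catching the failure per record is also what keeps a single poison pill from blocking everything behind it.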

Daniel Jagielski: 
There are many more use cases and pipelines to be revealed in event-driven style. And it is a really, really challenging project, as the data volume that we deal with is extremely huge. It's real big data. Online music streaming is getting very popular these days, and usage of such services will only increase in the future. We've got really interesting work to do here, which will provide real value for all interested parties, including artists, of course." 

Tim Berglund: 
And this is a legacy migration type of project, correct? Where, there was an existing system. 

Daniel Jagielski: 
Yes. 

Tim Berglund: 
So, I'm going to assume that was some monolithic architecture. Are you also refactoring to services as you refactor to streaming? 

Daniel Jagielski: 
So, we are creating it from scratch. We are building new services, new stream processors, obviously re-implementing some of the logic that was there, while also fixing old bugs, adding new features, and improving the general stability of the platform. 

Tim Berglund: 
Nice. Cool. And is the cut over from the old to the new, is that going to be a big bang thing, or is that a thing that you do service by service? 

Daniel Jagielski: 
No. No. One thing that I've learned over my career is that you certainly would like to avoid [crosstalk 00:42:43] It's no different here. We would definitely like to integrate it smoothly with the existing platform, so that we always have a way to go back. And we first validate our assumptions on the side, and then gradually take over responsibility from the previous platform. 

Tim Berglund: 
Got you. Biggest lesson you've learned from, take either Tesco or ICE. What's the most exciting thing from those two projects that you would like developers and architects building event streaming systems on Kafka to know? 

Daniel Jagielski: 
I would say that when you are transitioning from this relational database oriented programming, where you use a relational database and have this pattern of accessing the data, you develop some habits over the years. If you map those habits directly to the event-driven architecture, to how you use Kafka, for example, it's not going to work. This is what we have done in one of our projects, and I discussed it during my Kafka Summit talk, so feel free to check it. But in general, you need to embrace this new way of thinking, this new way of developing applications and thinking about your system. So you can't just take one pattern and map it to this new world. You need to learn, you need to experiment, and you need to wrap your head around this concept before designing the solution. What is more, you would also like to check the data distribution when you design your data pipelines, because choosing proper event keys and payloads is really important when it comes to achieving final success and great performance improvements. You benefit the most when you have a properly distributed platform, which doesn't have a data skew, and you achieve that by checking the distribution first and planning accordingly. 
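
Checking the distribution before committing to an event key can be as simple as the sketch below. It hashes a sample of candidate keys into partitions and reports how far the busiest partition is from the average. Note the assumption: zlib.crc32 is used here only as a stand-in hash and is not the hash Kafka's default partitioner uses, so the exact counts will differ from a real cluster even though the skew pattern for low-cardinality keys will show up the same way. The function names are hypothetical.

```python
import zlib
from collections import Counter


def partition_counts(keys, num_partitions):
    """Count how many sampled records would land on each partition."""
    counts = Counter({p: 0 for p in range(num_partitions)})
    for key in keys:
        # Stand-in hash for illustration; not Kafka's partitioner.
        counts[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return counts


def skew_ratio(counts):
    """Busiest partition's load divided by the mean load; 1.0 is perfectly even."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean if mean else 0.0


# A low-cardinality key (for example a country code) versus a per-user key.
by_country = ["GB"] * 900 + ["PL"] * 100
by_user = [f"user-{i}" for i in range(1000)]
country_skew = skew_ratio(partition_counts(by_country, 4))
user_skew = skew_ratio(partition_counts(by_user, 4))
```

Running a check like this on a realistic sample of production keys shows whether a candidate key would concentrate traffic on a few partitions before any topic is created.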

Tim Berglund: 
My guest today has been Daniel Jagielski. Daniel, thanks for being a part of Streaming Audio. 

Daniel Jagielski: 
Thank you. That was a pleasure. 

Tim Berglund: 
Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST, that's 6-0-P-D-C-A-S-T, to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value on the expiration date will be forfeit, and there are a limited number of codes available, so don't miss out. 

Tim Berglund: 
Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund. That's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on the community Slack or on the community forum. There are sign-up links for those things in the show notes if you'd like to sign up. 

Tim Berglund: 
And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there, that helps other people discover it, especially if it's a five-star review. And we think that's a good thing. So, thanks for your support. And we'll see you next time.