Streaming Audio: Apache Kafka® & Real-Time Data
Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community—from Kafka connectors and distributed systems, to data mesh, data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join our hosts as they stream through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®️, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Data-Driven Digitalization with Apache Kafka in the Food Industry at BAADER
Coming out of university, Patrick Neff (Data Scientist, BAADER) was used to “perfect” examples of datasets. However, he soon realized that in the real world, data is often either unavailable or unstructured.
This compelled him to learn more about collecting data, analyzing it in a smart and automatic way, and exploring Apache Kafka® as a core ecosystem while at BAADER, a global provider of food processing machines. After Patrick began working with Apache Kafka in 2019, he developed several microservices with Kafka Streams and used Kafka Connect for various data analytics projects.
Focused on the food value chain, Patrick’s mission is to optimize processes specifically around transportation and processing. In consulting one customer, Patrick detected an area of improvement related to animal welfare, lost revenues, unnecessary costs, and carbon dioxide emissions. He also noticed that often machines are ready to send data into the cloud, but the correct presentation and/or analysis of the data is missing and thus the possibility of optimization. As a result:
- Data is difficult to understand because of missing units
- Data has not been analyzed so far
- Comparison of machine/process performance for the same machine but different customers is missing
In response to this problem, he helped develop the Transport Manager. Based on data analytics results, the Transport Manager presents information like a truck’s expected arrival time and its current poultry load. This leads to better planning, reduced transportation costs, and improved animal welfare. The Asset Manager is another solution that Patrick has been working on, and it presents IoT data in real time and in an understandable way to the customer. Both of these are data analytics projects that use machine learning.
Kafka topics store data, provide insight, and detect dependencies related to why trucks are stopping along the route, for example. Kafka is also a real-time platform, meaning that alerts can be sent directly when a certain event occurs using ksqlDB or Kafka Streams.
As a result of running Kafka on Confluent Cloud and creating a scalable data pipeline, the BAADER team is able to break data silos and produce live data from trucks via MQTT. They’ve even created an Android app for truck drivers, along with a desktop version that monitors the data inputted from a truck driver on the app in addition to other information, such as expected time of arrival and weather information—and the best part: All of it is done in real time.
EPISODE LINKS
- Learn more about BAADER’s data-in-motion use cases
- Read about how BAADER uses Confluent Cloud
- Watch the video version of this podcast
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Kafka streaming in 10 minutes on Confluent Cloud
- Use 60PDCAST to get an additional $60 of free Confluent Cloud usage (details)
Tim Berglund:
Now feeding billions of people at reasonable cost is a big problem. One might say an industrial scale problem. BAADER is a German company that makes machines that help with this process. Mostly poultry and fish. Patrick Neff is a data scientist there who developed a really cool pipeline to take all kinds of inputs, all kinds of measurements and facts about poultry being processed on this line to make predictions that help make decisions to take better care of the chickens on the input and provide presumably tastier chicken tenders on the output. Now, I don't normally provide content warnings on Streaming Audio, but I will say we talk about the industrial processing of animal products on today's episode. So if that sounds like a bad idea to you, you should skip. Otherwise, this is a great discussion of machine learning, Kafka, Confluent, and the cloud as per usual, listening. [00:00:54]
Tim Berglund:
Hello and welcome to another episode of Streaming Audio. I am, again, your host, Tim Berglund, available on audio and on video. Streaming Audio is now available on YouTube as well, if you want to watch us talk in addition to just listening. I'm joined in the virtual studio today by Patrick Neff. Patrick is a data scientist at a company called BAADER. Patrick, welcome to Streaming Audio.
Patrick Neff:
Hello Tim. It's great to be here.
Tim Berglund:
Great to have you. Now, BAADER is not a super well-known brand in the U.S. I happen to know, so tell us what BAADER does, and obviously I always want to know, tell us about what you do and how you got to be there.
Patrick Neff:
All right. Okay. So as you said, I'm Patrick and I'm a data scientist as well as a software developer for BAADER. BAADER is a German company located in Lübeck, a city close to Hamburg in the northern part. Mainly it is a global provider of processing machines for fish and poultry. That's what we mainly do. We create those machines, and I am working for the digitization team, which is responsible for innovative solutions. So we develop applications based on, for instance, Kafka.
Patrick Neff:
I came to BAADER in March 2019. At that time I was still studying applied statistics and I joined BAADER to do my master's thesis. For my master's thesis I just tried to take the data we had so far, which was basically just some transport data and machine data, and tried to predict the meat quality of poultry with some machine learning methods, with some success but also some uncertainties. Back in those days, my colleagues came up to me and said, "Yeah, we are using Kafka." Actually I just knew Kafka from my German lessons at school, because there we had to read a book by the original Franz Kafka. So this was actually how my journey started at BAADER, where I stayed after my thesis, and also how my journey started with Kafka.
Tim Berglund:
All right. Very cool, very cool. So I'm glad that a thesis turned into a job after graduation. That's always nice.
Patrick Neff:
Thank you.
Tim Berglund:
So, in terms of American companies, Tyson is one that comes to mind that does large scale processing of poultry, at least. I'll say shout out to my youngest brother when he was three or four years old, he lived an entire, I think, year eating nothing but Tyson chicken patties, which is a whole interesting set of circumstances that could give rise to something like that. But good thing that we have this kind of food supply. You mentioned using machine learning to predict the quality of a poultry production line, the labeled data set there.
Patrick Neff:
Yeah.
Tim Berglund:
But the mind wanders into what that might look like. It strikes me that there could be unpleasant parts of the labeled data there.
Patrick Neff:
Yes.
Tim Berglund:
You have to know what bad chicken looks like, right?
Patrick Neff:
So during the process, we actually take some pictures of the chicken and detect some defects, which obviously have a huge impact on the final quality rating. But when I did this analytics and had those data sets, you do not just have one big data set where you can say, "Okay, now we have some [inaudible 00:05:03] variables and some dependent ones," and do some stuff. You have so many different sources and so many different timestamps within the processing line. First, it's a huge challenge to connect all of these, to know, okay, this information from machine A, for instance, doing something with the poultry is also the same poultry, or the same chicken, two machines later on and, for instance, 20 minutes later.
Tim Berglund:
Okay. Can we dig into that a little bit? Do you know how that works? How do you know the identity of the chicken? Because I assume this is in various stages of processing. So at some point it's not going to be a coherent bird anymore. It's going to be bird parts, right?
Patrick Neff:
That's true. Of course we can dig into that. So what you need, especially in this food value chain network, you can say, is to understand the processes. That's very important, because otherwise you don't know how long the bird takes for which process and what a machine actually does. So that's really important. But you also have some data points where you really have a measurement for one particular chicken, and you also have, for instance, the load information, where you just have the information per batch and you just know, okay, these 5,000 chickens are coming from farm eight and arrived at 2:00 PM. Let's see [crosstalk 00:06:46]
Tim Berglund:
Batch is interesting too. There's that batch of chicken, because there could be properties that are common among the birds in that batch.
Patrick Neff:
Exactly. Yeah. So it was a huge part of my thesis to figure out if the farms or the duration of the ride on the truck might have an impact on the final quality or on some defect at queue.
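The matching problem discussed here, tying a measurement taken at some machine back to the batch that arrived on a truck, can be sketched as a time-based lookup. This is only an illustration under stated assumptions: the batch list, the machine names, and the fixed per-machine delays are hypothetical, and the transcript does not describe how BAADER actually performs this join.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

# Hypothetical batch arrivals, sorted by time: (arrival time, farm, bird count).
BATCHES = [
    (datetime(2021, 5, 3, 14, 0), "farm-8", 5000),
    (datetime(2021, 5, 3, 15, 30), "farm-3", 4200),
]

# Assumed fixed delay between a batch arriving and its birds reaching a machine.
MACHINE_OFFSET = {
    "machine-a": timedelta(minutes=40),
    "machine-c": timedelta(minutes=60),
}


def batch_for(machine, measured_at):
    """Return the batch whose birds were most likely at `machine` at `measured_at`."""
    # Shift the measurement back to the moment the bird entered the line.
    entered_line_at = measured_at - MACHINE_OFFSET[machine]
    arrivals = [batch[0] for batch in BATCHES]
    # Latest batch that arrived at or before that moment.
    idx = bisect_right(arrivals, entered_line_at) - 1
    return BATCHES[idx] if idx >= 0 else None


reading_a = batch_for("machine-a", datetime(2021, 5, 3, 14, 50))
reading_c = batch_for("machine-c", datetime(2021, 5, 3, 15, 10))
print(reading_a[1], reading_c[1])
```

With these made-up offsets, a reading at machine A and a reading at machine C twenty minutes later resolve to the same batch. Real lines would need measured, variable transit times rather than constants.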
Tim Berglund:
Okay. Yeah, because, I mean, backing up a step, you're doing prediction, and you're trying to predict: is it going to be tasty, and can we put it into patties to serve to preschoolers, or whatever the product is, chicken tenders? Is it going to be a quality product that we should continue to invest in by sending it down the line, because we're going to want to sell it to customers and make them happy and make their tummies full? Or do we want to stop that? Right? Is there something that we can find early on in the line that says we should reject this now and not keep spending money on it? That's kind of the basic problem, right? [crosstalk 00:07:52]
Patrick Neff:
Yeah. Finding solutions for that problem would have a huge benefit. But it's not too easy to say, okay, now we have some results from the upstream processes and can say, okay, the downstream services need to reset the machines or know that there might be a batch missing. Unfortunately it's not so easy.
Tim Berglund:
Obviously, by the way, Patrick, this always happens when there's some really cool industrial process behind what we're talking about. I'm like, yeah, there's Kafka, that's cool, tell me about chickens. I will continue to ask questions about the process, because I know you had to become an expert in that to get this done. I won't be able to stop myself. But we are talking about basically real-time machine learning here, or a model that is evaluated in real time. So let me try to control my curiosity about the plant. What are the inputs? You mentioned an image before; you said there's a picture. Is most of the data that comes into your process, that you're training on and then classifying on later, pictures? Are there other kinds of inputs?
Patrick Neff:
No. So the pictures actually were just a tiny part. Mainly it is numerical data. So either really filled out by hand and then produced to Kafka; for instance, as I said, the load information is all entered by the truck driver. Then you have some machine data, like, for instance, temperature and pressure, what the machines actually measure. And this is also produced, mostly as numeric data.
Tim Berglund:
There are sensors on machines. It's doing something to the chicken and it measures the pressure that it takes to do the thing, and it's too squishy or not squishy enough or whatever. And that's also an input. So I'm really going to have to put a content warning on this episode, because there are probably people who do not want to think about industrial meat processing, and I'm going to let them know. So if you're at this point, you've already made it through. I always record the introduction after the episode, everybody, so you can learn how the sausage is made. So if you've gotten this far, you've been warned and it's okay. So yeah, that makes sense. Machines are doing things to chicken. There are sensors, there's maybe an image and a lot of numerical data, and this is a real-time problem. Can you give me a sense of how long it takes from stock coming in, these little birdies, chickens, coming in from a truck, to making it to the output of the process?
Patrick Neff:
And finally packed.
Tim Berglund:
Yeah. How long does that take?
Patrick Neff:
Up to seven hours.
Tim Berglund:
Okay.
Patrick Neff:
Because you first need to kill the chicken, and then it needs to be processed and cut and also chilled, and finally cut and packed.
Tim Berglund:
Gotcha. Packed and presumably frozen after that.
Patrick Neff:
Yeah.
Tim Berglund:
Yeah. Okay. So cool. Walk us through this. I'm interested to know a little bit about this from a machine learning perspective. There are definitely going to be people in the listening audience who are interested in that more than they're interested in the factory stuff. Not everybody is like me. So tell us about the machine learning models that you had to create. And obviously there's a Kafka interaction here, so just kind of walk us through that. What did you build and what's it look like?
Patrick Neff:
So starting with really a basic data science project, where you just have some historical data, you transfer them into the software you use to analyze, and first try to answer some questions. And for this particular prediction of the quality, I just really had the data in R and Python and did some gradient boosting and some other cool machine learning methods which I wanted to use. But here I need to distinguish somehow between having a model which is working in production in real time and is being retrained, for instance, every two weeks, and some applications or microservices we build based on our analytics results. So talking about the latter first, we discovered some answers to our questions. So one big issue is always the animal welfare.
Patrick Neff:
So you have the poultry, which is driving from the farm to the factory, and some chickens die along the way because it is badly planned. There is some uncertainty because they do not know how many birds the truck is loaded with; even though that should be clear, it is not in reality, which is a huge problem. And also somehow the scheduling of the truck arrivals: they say, okay, this truck might arrive at 2:00 PM, and it actually arrives at 3:00 PM. All these problems we figured out; we took the data, tried to answer those questions, and developed some solutions. Then we implemented those solutions in some microservices using Kafka Streams and [inaudible 00:14:01]
Tim Berglund:
Well, tell me more about that. So I want to hear what you do with ksqlDB. That's super cool. But specifically, when you say you implemented those solutions, do you mean you had a static set of data that was labeled, you built models, you did gradient boosting, you did engaging and interesting machine learning things, and then created a model, and you're just running that in real time, like as a ksqlDB UDF? Talk me through what the services look like and what [inaudible 00:14:31] looked like there.
Patrick Neff:
So we are one step before having the actual machine learning models themselves running in production. So we figured out that the temperature might have an impact on the animal welfare. And so we said, okay, let's get that weather information in real time. And we implemented this as a UDF. Later on, we needed to transfer it to a Kafka Streams application, because we are running ksqlDB on Confluent Cloud. Right now, unfortunately, it is not developed so that you can use UDFs.
Tim Berglund:
By the time of this recording.
Patrick Neff:
True. [crosstalk 00:15:13] So that's what we mainly did. We also saw those variances between the actual arrival times and the planned ones. And we said, okay, we need to somehow say to the factory manager: this truck, even though it is scheduled to arrive at 2:00 PM, we need to say when it's actually arriving and that it's coming late, so that you can take action. The next step, obviously, would be to help him take action and to improve the process itself, because right now it is basically just presenting some data to him. But here I'm talking just about the first step.
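The planned-versus-actual arrival comparison can be illustrated with a small calculation. This is a naive sketch, not the Transport Manager's method: the distance and average-speed inputs and both function names are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def estimate_arrival(now, remaining_km, avg_speed_kmh):
    """Naive ETA: remaining distance divided by an assumed average speed."""
    return now + timedelta(hours=remaining_km / avg_speed_kmh)


def delay_minutes(planned, estimated):
    """Positive when the truck is expected later than planned."""
    return (estimated - planned).total_seconds() / 60


planned = datetime(2021, 5, 3, 14, 0)  # the schedule says 2:00 PM
eta = estimate_arrival(datetime(2021, 5, 3, 13, 30), remaining_km=90, avg_speed_kmh=60)
print(eta, delay_minutes(planned, eta))
```

A positive delay is what the factory manager would need to see early enough to act on. A production estimate would presumably also account for stops, traffic, and weather.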
Tim Berglund:
Got it, got it. That's interesting that there's the reported arrival time and there's the actual arrival time. I mean, it's like events that have timestamps, you know; they have an opinion about when they arrived, and then there's when they actually arrived.
Patrick Neff:
True.
Tim Berglund:
So that makes sense. And also the animal welfare aspect is interesting there because I mean, there's an economic motivation. That's going to affect process yield, right? If the animals are stressed, the end product is not going to be good. They're going to die or whatever, you'll lose some percent, but also there's an ethical angle. You know, we're speaking again broadly as people who eat chicken, at least many of us, but you don't want the chickens to have a terrible life before you eat them.
Patrick Neff:
Exactly, yeah.
Tim Berglund:
That journey on the truck, there's more than one reason that you care about that. And so looking at things like weather, which you sort of have an objective view of and some outside objective report on when the truck arrives versus reported arrival time, these become inputs to your model.
Patrick Neff:
Exactly. Yes.
Tim Berglund:
Okay.
Patrick Neff:
Based on these first steps, we're going to develop some cooler, more advanced stuff and also try to connect more than just one part of the food value chain and use it. We'd like to connect at least the entire one, but that's not possible so far. But then we would have more data points and more data overall from different sources and could use them to gain insights.
Tim Berglund:
So back to the Kafka side of things, you said it ended up being a Kafka Streams application because you're running ksqlDB in the cloud, which is an extremely reasonable thing to do, but custom UDFs aren't there in the cloud yet. That's a super hard problem. And we don't talk about product backlogs here on Streaming Audio, or [crosstalk 00:18:04] say that features are going to show up in a product someday. That would be foolish of a guy like me to do. #Safe Harbor. I'm definitely not doing that. But just thinking about that problem as a general problem: I need a sandbox to execute arbitrary code in as a part of my [inaudible 00:18:29] application. So, you, Confluent Cloud, you go be Kafka, [crosstalk 00:18:33] you go [inaudible 00:18:35], accept, here's some code, and it's fine, run this on my messages. That's not impossible; that gets done. It's just, that's the problem of custom UDFs, basically. And one would not be surprised if that was a thing that at some point fully managed [inaudible 00:18:51] did. But right, you had to dial it back.
Patrick Neff:
Yeah.
Tim Berglund:
You're kind of instructing people here, as you talk, on how to do this with Confluent Cloud and with sort of Kafka-universe technologies. You mentioned the perfectly respectable, normal Python and R tools for building your model. Then how did you operationalize that? Whatever you're able to talk about. It became a Kafka Streams application, so I'm guessing it was some JVM language. How does the model get built and run in real time?
Patrick Neff:
Mainly, for most of our microservices, we use Kotlin.
Tim Berglund:
Oh, wonderful.
Patrick Neff:
Yeah. I like it too. We actually just use Kafka Streams to consume the data and do our modeling or our transformation. As I said, in the first step we basically just add information together, [inaudible 00:20:08], and also in very tiny steps. We are running, for one product, up to six microservices playing together and collecting all the data we need, and then just present it.
Tim Berglund:
Nice. All right. In that final product, the model is running on Kafka Streams. Is ksqlDB still doing something with the data? Is it still a part of the chain?
Patrick Neff:
So when we have a problem, we always first try to find a solution with ksqlDB, because it's quicker, it's simpler, and you just set up your query and let's go. So we use ksqlDB, for instance, for some aggregations where we want to aggregate the history of some data, because we want to present how some machine values were in the last 50 minutes, two weeks, whatever. And exactly for that we are using ksqlDB.
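The history aggregation mentioned here is, in essence, a windowed aggregate over a stream of machine readings. The following is a plain-Python stand-in for that idea, not the team's actual streaming query: the function name and the `(timestamp_seconds, value)` reading format are assumptions.

```python
from collections import defaultdict


def tumbling_average(readings, window_seconds):
    """Average (timestamp_seconds, value) readings per fixed, non-overlapping window."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [running total, count]
    for ts, value in readings:
        window_start = ts - ts % window_seconds
        sums[window_start][0] += value
        sums[window_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}


# Hypothetical temperature readings, one-minute windows.
readings = [(0, 2.0), (30, 4.0), (60, 10.0)]
print(tumbling_average(readings, window_seconds=60))
```

A streaming engine maintains such aggregates incrementally as events arrive, whereas this sketch processes a finished list. The grouping logic is the same.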
Tim Berglund:
It's nice. I like it. So what kind of results did you finally get with all this? I mean, it seems impactful. It seems like you've got a lot of data to play with, and you know how to make models. Are there business outcomes that you can share that you were able to generate? How much better is my chicken tender than it would have been [crosstalk 00:21:41] on average?
Patrick Neff:
So what I can say is that we could see some differences between the farms, for instance, but also in how our customers use our machines differently. So in terms of overall equipment efficiency, we can say that, okay, this customer has an availability of 80%, but another one has an availability of just 65%. So what's going on between them, or what does the first customer do differently compared to the second one? And then you come and talk to each other and try to figure it out and to optimize the process. So there will be business outcomes, and again, we are very much at the beginning, but we receive a lot of requests saying, okay, we'd like to have this feature, or could you do this and that? So we are on that, let's say, step by step.
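The availability figures compared here can be computed from planned production time and downtime. The sketch below uses a common textbook definition of availability (one of the factors usually multiplied together in an overall equipment effectiveness score). The minute values are invented to reproduce the 80% and 65% mentioned, and the transcript does not say how BAADER defines the metric.

```python
def availability(planned_minutes, downtime_minutes):
    """Share of planned production time the machine was actually running."""
    if planned_minutes <= 0:
        raise ValueError("planned_minutes must be positive")
    return (planned_minutes - downtime_minutes) / planned_minutes


# Two hypothetical customers running the same machine type for an 8-hour shift.
customer_a = availability(planned_minutes=480, downtime_minutes=96)
customer_b = availability(planned_minutes=480, downtime_minutes=168)
print(f"{customer_a:.0%} vs {customer_b:.0%}")
```

Putting the two numbers side by side is only the starting point. The gap itself is what prompts the conversation about what one customer does differently.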
Tim Berglund:
I love it. What's next? Anytime you build a system like this, there's always something that you're mad that it doesn't do yet. what are you hoping to do next?
Patrick Neff:
That's a good question. What we are trying to set up right now is a pipeline where you can do data analytics automatically. So we have machines from different customers, but also different machines as well, sending different data and having other sensors. And we try to find solutions where we can analyze the data from different machines automatically, because we do not want to just present the data. A dashboard is always cool, but what you really want to do is take some action, for instance when a threshold is reached or something like that. We want to install a pipeline which does this in an automatic way for nearly every machine that is connected to our platform, so that we do not have to analyze every machine when there is a new one. That's actually a huge challenge, because it's not so obvious. We are trying out different ways to see how we can use Kafka or, for instance, [inaudible 00:24:24] as a process engine, and [crosstalk 00:24:26] to play around with those.
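One simple way to picture a machine-independent check like this is a table of limits keyed by signal name, applied to whatever readings a newly connected machine sends. This is a hypothetical sketch only: the signal names, ranges, and alert format are invented, and the pipeline described in the episode was still being explored.

```python
# Hypothetical per-signal limits; in practice these would come from configuration.
LIMITS = {"temperature_c": (0.0, 4.0), "pressure_bar": (1.0, 6.0)}


def check_reading(machine_id, signal, value, limits=LIMITS):
    """Return an alert dict if `value` is outside the configured range, else None."""
    if signal not in limits:
        return None  # unknown signals are ignored rather than guessed at
    low, high = limits[signal]
    if low <= value <= high:
        return None
    return {"machine": machine_id, "signal": signal, "value": value, "range": (low, high)}


print(check_reading("chiller-1", "temperature_c", 7.5))
print(check_reading("chiller-1", "temperature_c", 3.0))
```

Because the check depends only on the signal name, a new machine that reports a known signal is covered without any per-machine analysis. The hard part is agreeing on signals and limits across different machines.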
Tim Berglund:
Nice. I have some great friends that come over. That's [crosstalk 00:24:29] pretty cool. And previous guests on this podcast. Because I'm speaking as somebody who's completely ignorant of industrial food processing, I don't know how this works, but my guess is they'd be like, well, this stuff is really good; this needs to go to consumers, or this is high grade, expensive; and this is still fit for human consumption, but it goes into this other kind of product; and this is still edible, but it goes to pet food or something like that. I imagine there are different ways that you get value out of that, based on all these quality measurements, and being able to be smarter about those through work like yours would be pretty cool.
Patrick Neff:
So it is a lot of fun. It is really cool. There are challenges, obviously, because, as I said, tracking this through the chain is hard. When you think about machine data, basically, to work with this data you need three people. You need one who engineered the machine itself, so knowing what the process actually looks like; someone who programmed the machine, because he can say, all right, the sensor is sending data for this particular part; and a data scientist who finds and uses the data. And this is often the more challenging part: to have all those three together and finally come up with something everybody can agree on, where everybody knows what to do next.
Tim Berglund:
My guest today has been Patrick Neff. Patrick, thanks for being a part of Streaming Audio.
Patrick Neff:
Thank you for being here.
Tim Berglund:
And there you have it. Hey, you know what you get for listening to the end? Some free Confluent Cloud. Use the promo code 60PDCAST—that's 60PDCAST—to get an additional $60 of free Confluent Cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available. So don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the Community Forum. There are sign-up links for those things in the show notes. If you'd like to sign up and while you're at it, please subscribe to our YouTube channel and to this podcast, wherever fine podcasts are sold. And if you subscribe through Apple podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review. And we think that's a good thing. So thanks for your support, and we'll see you next time.