Streaming Audio: Apache Kafka® & Real-Time Data
Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community—from Kafka connectors and distributed systems, to data mesh, data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join our hosts as they stream through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®️, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
If Streaming Is the Answer, Why Are We Still Doing Batch?
Is real-time data streaming the future, or will batch processing always be with us? Interest in streaming data architecture is booming, but just as many teams are still happily batching away. Batch processing is still simpler to implement than stream processing, and successfully moving from batch to streaming requires a significant change to a team’s habits and processes, as well as a meaningful upfront investment. Some are even running dbt in micro batches to simulate an effect similar to streaming, without having to make the full transition. Will streaming ever fully take over?
In this episode, Kris talks to a panel of industry experts with decades of experience building and implementing data systems. They discuss the state of streaming adoption today, if streaming will ever fully replace batch, and whether it even could (or should). Is micro batching the natural stepping stone between batch and streaming? Will there ever be a unified understanding on how data should be processed over time? Is the lack of agreement on best practices for data streaming an insurmountable obstacle to widespread adoption? What exactly is holding teams back from fully adopting a streaming model?
Recorded live at Current 2022: The Next Generation of Kafka Summit, the panel includes Adi Polak (Vice President of Developer Experience, Treeverse), Amy Chen (Partner Engineering Manager, dbt Labs), Eric Sammer (CEO, Decodable), and Tyler Akidau (Principal Software Engineer, Snowflake).
EPISODE LINKS
- dbt Labs
- Decodable
- lakeFS
- Snowflake
- View sessions and slides from Current 2022
- Stream Processing vs. Batch Processing: What to Know
- From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine
- Watch the video version of this podcast
- Kris Jenkins’ Twitter
- Streaming Audio Playlist
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Kris Jenkins: (00:00)
Hello. You're listening to Streaming Audio, and this episode is an unusual one because this one had a live audience. We recorded this live at Current, the conference that recently happened in Austin, Texas. The idea was, let's put on a panel discussion to close out the conference, and invite some experts in to answer the question, "Is streaming really the future? Or, will batch processing always be with us?" I'll keep this introduction short because you're about to hear Texas me set the scene properly. But, this was a great discussion. It has all the magic and one or two of the mishaps of recording live. I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it.
Kris Jenkins: (00:43)
Hello. Welcome to Close Out Current '22, The Great Debate, Streaming vs Batch. Thank you for joining us for this. This is an event streaming conference, we're all here for event streaming, but everybody I've spoken to is on a slightly different point of the adoption curve. You've got some people who are absolutely die hard, "Streaming is the only thing I want to build now," all the way through to people who are experimenting, some departments doing streaming, some doing batch, to people who are just doing batch, but are curious about what's happening. So we thought we'd get into the big debate. Is streaming the future? Is streaming a part of the future? Is it a fad? I didn't think so, but someone might, on our panel.
Kris Jenkins: (01:42)
We're going to bring on some excellent panelists to debate this. I hope you enjoy this. I'm certainly planning to. Will you please welcome to the stage, Adi Polak from lakeFS. Thank you. Eric Sammer from Decodable. We have Amy Chen from dbt Labs, and Tyler Akidau from Snowflake. Thank you very much. Okay. I thought I would kick this off with stretching back a little way into our history. I think something of a turning point for our industry was the announcement about the Lambda Architecture. To me, that was the first time it became, "Yes, you're going to have batch systems always, but real-time live streaming is now going to be a de facto part of your setup." I think that both opened the door to streaming being normal and made it a permanent, "We're always going to have batch." So I'm going to turn to Eric, because he's always great for a controversial opinion.
Eric Sammer: (02:55)
That's what I get paid for.
Kris Jenkins: (02:57)
Kick me off. Do you think that the Lambda Architecture is here to stay forever? Are we always going to have a combination of batch and streaming?
Eric Sammer: (03:08)
I think there's batch and streaming as siblings or subsequent stages in a larger data pipeline. Then I think that there's specifically the Lambda Architecture of this combination, batch component, streaming component. I hope the latter goes away. Fundamentally, reasoning about consistency and the usability of that system, I don't think that's human accessible, because you basically have this batch layer and this streaming layer. You have to think about how those things converge. More than anything, I think Lambda goes away. I think Kappa makes sense. I think we stream into many materialized views.
Eric Sammer: (03:52)
I use that term loosely where the data warehouse, like a Snowflake, is a very big materialized view. It's a collection of data sets. But I also think that a lot of that same data winds up in Redis caches, and operational databases, and event-driven microservices. My sense is that, and bias on the table, I'm a believer that most of that infrastructure goes real-time. Basically, you can serve both the low latency use cases and the high latency use cases off of a single system.
Eric Sammer: (04:26)
I think once you get into the data warehouse, there is actually a lot of value in having tools that match how analytics engineers think about things. Then maybe there, I think batch is fine. I don't really care. It depends, I mean I'm interested in Tyler's perspective on this. But, in many cases, there are high latency use cases that hang off of the data warehouse in the grand scheme of the SLA or SLO, around operational systems. Anyway, I'll stop there.
Kris Jenkins: (04:56)
Okay, I'm going to bounce that over to Adi, because you are talking about data warehousing, which is close to what you do at lakeFS, right? Is he right? Are you the end of the pipeline, or are you the heart of it?
Adi Polak: (05:09)
So Eric is a very good friend, so Eric is always right. Put it that way. And, sometimes he's wrong. Yes, Lambda, Kappa, Delta architectures, all of these are basically, at the end of the day, and very similarly to what you actually said, you need to do some enrichment of the data. Enrichment often comes from historical data that you're collecting. It might be that historical data has been shortened up a little bit. It could be the last day, the last 12 hours, the last hour. But even doing batching, doing any processing on the last hour is still batch processing.
Adi Polak: (05:50)
When we think about event-driven, we think about milliseconds. We have one event, we're processing it. When I think about streaming, then it's a micro batching. We need a group of events, and then we're processing them. That can be a minute, half a minute, two minutes, five minutes, 10 minutes. After we cross that chasm, we're getting into the realm of batch processing. Once we have our data that we need to process, oftentimes it depends on how much data we're accumulating, and we're crossing the chasm into the batch processing, we actually still need some systems to do that computation for us.
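As a rough illustration of the spectrum Adi describes, the Python sketch below groups timestamped events into fixed windows. The function name `micro_batches`, the sample events, and the window sizes are hypothetical choices made for this example and were not shown in the session. With a short window the grouping behaves like micro batching; widen the window far enough and the same logic is simply a batch job.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp_seconds, value) events into fixed-size windows.

    Returns a dict mapping each window's start time to the values that
    arrived inside that window. Hypothetical helper for illustration only.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start].append(value)
    return dict(sorted(windows.items()))


# Five events arriving over roughly two minutes (assumed sample data).
events = [(0, 1), (30, 2), (61, 3), (119, 4), (125, 5)]

one_minute = micro_batches(events, 60)    # three small groups
one_hour = micro_batches(events, 3600)    # one large group: effectively batch

print(one_minute)
print(one_hour)
```

The only thing that changes between the two calls is the window size, which is one way to read the point that the boundary between micro batch and batch is a matter of how much data accumulates before processing.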
Kris Jenkins: (06:27)
Okay. I'm going to throw it over to Amy. Do you think there's that whole real-time micro batch? I'm wondering, people trying to go from batch to streaming, is micro batch the natural thing? Is it a natural stepping stone, or is it just trying to do batch in another way?
Amy Chen: (06:52)
I think it is, especially given what we see, how dbt is being utilized on Snowflake, BigQuery. Because a lot of folks are limited by the tools they have, what tools their company will actually pay for, a lot of times you want to avoid having to provision a new vendor. So what folks are doing, is just setting up micro batches just to simulate a very similar streaming effect. What's pleasant about that is, that allows you to just bring over your existing batch framework, how your workflows are, and then just implement that. But, I also know it doesn't scale up. At the end of the day, it becomes a conversation of labor-intensive versus resource-intensive. We've noticed, as folks start to do more micro batching, as the data scales up, it just becomes too expensive.
Kris Jenkins: (07:46)
I'll throw this over to Tyler. Would you agree with that? Do you think that the thing holding people back from going to streaming is cost, or is it mental models, or is it habit?
Tyler Akidau: (07:59)
I think it's all the above, to be honest. When I think about streaming versus batch, I like to think of it in two different dimensions. One is semantics. For me, semantics, streaming answers everything. We should have a unified view of, how do you process data over time? Batch processing, as we know it, fits nicely into that. If we can come up with a clean understanding of how all that works, we should be able to fit everything within a single model. From an implementation perspective, there's totally this spectrum of fine grain, low-latency, single digit second, chunkier micro batch stuff up in the sub minute range. Then, like others are saying, way up in the even hourly or daily range, you want to process things.
Tyler Akidau: (08:40)
You can get economies of scale, you can implement certain operations differently when you're not constrained to dealing with such low-latency. If you don't have a system that allows you to kind of walk that spectrum, then you may find, "Oh, I can't use it because it's too expensive, I'm stuck. The only thing it's good at is low-latency." So folks will pedal backwards and head back towards the batch systems. If we have streaming systems that are a little more agile in what they can support from a latency perspective, you can get into a world where they accept streaming, and just realize streaming may have variance in latency.
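One simplified way to read Tyler's point about a single semantic model is that a batch result is the final state of the same computation run incrementally over a bounded input. The Python sketch below assumes a trivial aggregation (a running total) and made-up order values; the function names are hypothetical and this is not a description of how any particular engine works.

```python
from itertools import accumulate


def batch_total(values):
    """Batch view: wait for all the data, then compute one answer."""
    return sum(values)


def streaming_totals(values):
    """Streaming view: emit an updated answer after every record."""
    return list(accumulate(values))


# Assumed sample data: order amounts arriving over time.
orders = [20, 35, 15, 50]

running = streaming_totals(orders)
print(running)              # one intermediate result per record
print(batch_total(orders))  # matches the last streaming result
```

The semantics are the same in both functions. What differs is how often a result is produced, which is the latency and cost trade-off discussed above.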
Kris Jenkins: (09:17)
Yeah. Okay. I'm going to go back to Eric, but I'm not going to do you all in the same order every time. I shouldn't have said I was moving on in the same order because I've lost my own thread, and that's a terrible thing to say. But Tyler, yes. I knew we'd have fun with this.
Eric Sammer: (09:44)
It's going to be okay, we got it. We can admire your jacket in the meantime.
Tyler Akidau: (09:47)
We could ask you a question.
Kris Jenkins: (09:48)
Yeah. Oh, okay.
Tyler Akidau: (09:49)
What do you think, Kris?
Kris Jenkins: (09:50)
What do I think? That reminds me what my question was going to be. Thank you.
Amy Chen: (09:55)
Oh, that was avoidance.
Kris Jenkins: (09:59)
That was avoidance. Is streaming a super set? Is it a semantic model where, if we get streaming right, as Tyler says, then you'll be able to do everything you can do with batch and more?
Eric Sammer: (10:12)
I think the answer is, theoretically yes. When you start to look in practice, there are definitely optimizations when you're processing in batch that you can make, that you can't in a streaming context. The classic case is, in streaming context, I'll speak about Flink because that's the system I know the best. But you have checkpointing systems and all these other things. In a batch query, you don't need that because the unit of rewind is effectively the-
Kris Jenkins: (10:45)
The universe.
Eric Sammer: (10:45)
It's the universe. You can go backwards and you can start from the beginnings, so the state can effectively be ephemeral all the way through. So you can make a bunch of optimizations when you know something about it. Now I think there's a bunch of things that we can do in the tools to bring those worlds together. I know there's a bunch of people working on this stuff.
Eric Sammer: (11:02)
We're tangentially part of that world. But, I do think that, even if it's not the same system, there are questions like, "Well, maybe it's the same tools. Maybe it's dbt and SQL across both batch and streaming. Maybe it is the same system. Maybe somebody unifies everything. Or, maybe it's like two wildly different systems." I think we're mostly in that third category today. It needs to get closer to one of the other two categories.
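The difference Eric describes between checkpointed streaming state and ephemeral batch state can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Flink's actual checkpointing mechanism: the `checkpoint` dict, the `every` interval, and the `fail_at` crash simulation are all hypothetical names invented for the example.

```python
def run_batch(records):
    """Batch: state lives only for the run; on failure, rerun from the start."""
    total = 0
    for record in records:
        total += record
    return total


def run_streaming(records, checkpoint, every=2, fail_at=None):
    """Streaming: periodically save offset and state so a restart can resume.

    `checkpoint` is a plain dict standing in for durable storage.
    """
    offset, total = checkpoint["offset"], checkpoint["total"]
    for i in range(offset, len(records)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        total += records[i]
        if (i + 1) % every == 0:
            # Persist progress: the next record to read, and the state so far.
            checkpoint["offset"], checkpoint["total"] = i + 1, total
    return total


records = [1, 2, 3, 4, 5]
checkpoint = {"offset": 0, "total": 0}

try:
    run_streaming(records, checkpoint, fail_at=3)
except RuntimeError:
    pass  # the job crashed; only checkpointed progress survives

saved = dict(checkpoint)                       # progress at crash time
resumed = run_streaming(records, checkpoint)   # picks up from the checkpoint
print(saved, resumed, run_batch(records))
```

The batch function needs none of the bookkeeping because its unit of rewind is the whole input, which is the optimization Eric is pointing at.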
Kris Jenkins: (11:27)
So, you think we could evolve this classic trick we always do in computer science, have one unified thing to say, and then compile down to the different target.
Eric Sammer: (11:35)
Yeah, right. That's the abstraction. Sort of, fake it till you make it, kind of, argument. Yeah.
Amy Chen: (11:41)
Can I respond to that?
Kris Jenkins: (11:42)
Yeah, I was going to call on you, because you mentioned dbt Labs. Amy, what do you think?
Amy Chen: (11:46)
Oh my god, it's like I work there. I think, before we start diving into it, it's actually more of an agreement. I will admit that the majority of my time is spent on batch, but, even walking around, speaking to vendors, seeing the talks, there is a significant lack of agreement, it feels like, from what is the best practice. There's just too much noise, in some ways, from what I'm seeing from the streaming side of things. I think, in batch, we've kind of settled in. We have the modern data stack. We have agreed that dbt is the default transformation tool.
Kris Jenkins: (12:25)
Does anyone remember agreeing that?
Eric Sammer: (12:26)
Yeah. Yeah.
Kris Jenkins: (12:27)
We'll go with you Amy. We'll go with you.
Amy Chen: (12:28)
That's fine. I'm happy for spicy takes. Essentially, even if you aren't using dbt, we see a lot of competitors who essentially say the same things. You should be testing and documenting. This is how you should do this. This is what the workflow looks like. But, I haven't seen that level of unity on the streaming side of things, and, I think it will take someone to decide that this is it, and then align to it, and then, we can then unify a little bit more. Because, I don't agree that it has to be the same exact batch and streaming framework. It just has to be a framework. We can't have KSQL over there, and then PostgreSQL. It just needs to be someone.
Kris Jenkins: (13:11)
Yeah. Are we waiting for that moment when one of the different runners in the race takes the lead, and then we all just... Like, what happened with JSON, right?
Amy Chen: (13:20)
Yeah.
Kris Jenkins: (13:20)
Suddenly, everyone was doing JSON. That was the de facto.
Amy Chen: (13:22)
And, that's where dbt is, right? Essentially we specified that.
Kris Jenkins: (13:27)
I'll ban you from the stage for three minutes if you pitch too hard.
Amy Chen: (13:29)
I will. I know. Goodness.
Kris Jenkins: (13:32)
It's good that you believe.
Amy Chen: (13:34)
No, I validate that I am living in, sometimes, a silo. The 40K people silo. But, I think what's really fascinating is, you just have to have those folks who agree on one. I don't care. I don't care about Kimball modeling, or dimensional modeling. As long as you, at your company, agree on this is how you do it. So, it's less about a vendor thing, and more like, a time test. You've been burned enough times.
Kris Jenkins: (14:06)
Yeah. But, that's one of the hardest things in, literally, any profession, getting human beings to agree. Do you think, Tyler... I'm going to go with Tyler, because I'm trying to break the ordering pattern. Do you think that if you've got different departments in a company, and some of them want to go in that direction, some of them don't, which happens a lot, do you think that the way to do it is stop, wait, get the agreement? Or, do you just forge ahead, and then try and persuade people along the way? If we want to evolve to that agreement. Any tips on how to do it?
Tyler Akidau: (14:43)
No.
Kris Jenkins: (14:46)
Okay, thank you very much. Adi?
Tyler Akidau: (14:49)
I think, if I'm understanding your question right, I think this gets back to what I was saying earlier about having a common model that, sort of, describes everything in place. If you have folks that feel like they want to hang out in that batchy world, and then you have folks that want to forge ahead, if we're all speaking the same language, at the end of the day, it doesn't matter. I think, Amy, you mentioned, you spend your time in the batch world. I'm, sort of, surprised to hear you say that, to be honest. Because, to me dbt is the embodiment of streaming. Yes, it's oftentimes, sort of, a batchy implementation, but it's data over time. It's stream processing, and I think people need to come to accept that that's a valid form of stream processing. dbt is absolutely pushing forward a useful advancement in stream processing, and we need to accept this.
Kris Jenkins: (15:34)
Okay.
Amy Chen: (15:34)
I feel like [inaudible 00:15:37].
Tyler Akidau: (15:36)
I can do this. You can't say this. If you keep saying this, you're going to get kicked off. I won't.
Kris Jenkins: (15:42)
Well, okay, do you know what I am going to do? I'm going to give you a chance, though. I'll give you two sentences. Because, not everyone knows what dbt is, so, you get two sentences to explain it, as a developer, not as someone in marketing. Go.
Amy Chen: (15:51)
Oh. I'm not in marketing. What?
Kris Jenkins: (16:00)
You've used up your first sentence. I'm going to give you a bonus sentence.
Amy Chen: (16:02)
Oh, I don't... So, dbt is an open source transformation framework that, essentially, we have taken software engineering's best practices, and given them to the analyst persona. That was one sentence.
Kris Jenkins: (16:15)
That was one sentence?
Amy Chen: (16:16)
That was one.
Kris Jenkins: (16:17)
Okay, I heard.
Amy Chen: (16:17)
Yeah.
Kris Jenkins: (16:19)
Okay, I'll give you, that was a semicolon in the middle, then. How about that?
Amy Chen: (16:24)
It's still a sentence.
Kris Jenkins: (16:26)
You've got another sentence then. Give it to the analyst persona.
Amy Chen: (16:29)
Okay, so. Oh, there's now a delay on the podcast. So, essentially, one of the reasons why we, to build off of Tyler's, why we don't really care if you're on streaming or batch is, at the end of the day, we're a SQL compiler. We're adding the framework to give you the best practices. I know I'm over. I'm trying so hard, but, what I'm trying to push is, the idea that we shouldn't care about the tooling. I love the idea of being vendor agnostic, because you should be able to take your code from one platform to another, and run it the same way, because that's the only way you can get out of this world of specialization. You can start to have more people help you out, and not rely on just your data engineers to be your bottlenecks.
Kris Jenkins: (17:21)
Yeah, I do agree with that. But, I think, there's a fundamental thing. There are two different mental models going on in streaming and batch, and whilst you can get away from the tools, and you have standardization, as long as you've got one team working on one mental model of how software should be done, and another working on another, we're going to end up doing integration projects, but we're never quite going to be on the same page. And, I'm going to throw this over to Adi, because lakeFS, you are in the git model of what you do, which feels to me to be a very different thing to event streaming.
Adi Polak: (17:57)
Yes.
Kris Jenkins: (17:58)
What do you think? Do you think there is going to be a unified mental model?
Adi Polak: (18:02)
So, I think that's really interesting. What lakeFS does is, essentially... Semantics are good for data. They give the data experience for developers. And, if we think a little bit about how teams are organized within organizations, taking us a little bit to the process aspects of how we're combining people, technology, and processes, at the end of the day, if we're having functional teams who are building a specific product, it's a question of what do they need to get the product done? And, if that's streaming, that's great. If that's batch, that's great. It's what do they really need? And, that's a really interesting take on when we think about data gravity-
Kris Jenkins: (18:40)
Data gravity. Define that for us.
Adi Polak: (18:41)
So, data gravity, basically, what it means... It's a very old concept, but it's held itself true through time. It basically means, if I'm saving my data somewhere, more people, more teams are going to save their data in the same space. So, if I'm using Snowflake, for example. If one of the teams is very happy with Snowflake, because how teams work, and how people communicate, and collaborate, more people are going to save their data into Snowflake. That's data gravity at a very high level. What we're suggesting is actually a technology gravity. If I have a team that's very happy with a specific process, it puts the guardrails for the teams to have healthier data in their systems, and so, this is a technology gravity, and this is what lakeFS does. It enables, and enforces, and provides the guardrails for best practices for data.
Kris Jenkins: (19:31)
Right, okay. So, are you saying that you think people will be gradually drawn towards the technologies that work, and things will naturally gravitate to a unified solution for the company?
Adi Polak: (19:44)
That's a really good question, and that depends on how big the company is.
Kris Jenkins: (19:47)
Okay.
Adi Polak: (19:48)
Right.
Kris Jenkins: (19:48)
Yeah.
Adi Polak: (19:49)
So, in the data space, we talk about decentralization. We used to have all the data in one space with the data gravity, but, slowly, we're trying to break the model a little bit. We're starting to move it a little bit. We're trying. Let's see how well it's going to work for us, but, we're trying. And then, the question is, will the tools that enable us still work with the relatively similar aspects within the different teams? Because, this is what's going to stick in data. The knowledge, the skills, everything that I already have within my team, is probably going to be the engine that is going to push us forward, rather than trying to incorporate new technologies completely. So, for example, if I don't have anyone who's doing streaming, it'll be very, very hard for me to integrate streaming into my tech stack.
Kris Jenkins: (20:35)
Yeah, okay. Yes, I can see that. So, Tyler, respond to that. That's me being a host.
Tyler Akidau: (20:46)
You give me these questions where I just want to say, "Yes." "No." Okay, I'll respond to that.
Kris Jenkins: (20:53)
Do you think systems, with their gravity and people... I'll put the question, for you, this way. Everybody's happy with batch in their system, and it's working for the company. Is there a good external reason to, somehow, gently push them into what some of us think might be the future? Should we be... I use this word advisedly. Should we, as a community, be evangelizing streaming? Oh, you are itching to answer.
Tyler Akidau: (21:30)
So, this is a hard question for me, because I am an idealistic pragmatist. The ideal, to me, says, "Absolutely, we should be pitching. What are you people doing?"
Kris Jenkins: (21:38)
Yeah.
Tyler Akidau: (21:39)
"You could be doing it so much better." But then, there's a part that's like, "Well, screw it. You're happy with what you're doing, and you don't have problems. Why would you bother changing?" But, I think we need to evolve our models to the point where there isn't a question. Where it's just obvious that this is better, I should be doing this, and, if you're doing it the old way, you're crazy, because it just works out better.
Kris Jenkins: (21:57)
But, you have just stepped into evangelization, haven't you? By saying, "If you're doing it the old way, you're crazy."
Tyler Akidau: (22:03)
I guess.
Kris Jenkins: (22:06)
You want the technological world to catch up so far that it's obvious.
Tyler Akidau: (22:10)
I'm saying, if we actually progress the technological world far enough, there comes a certain point where you look at the new thing, you look at the old thing, why would I do the old thing? We are very comfortable here, but obviously this is a thousand times better. When it's a little bit better, it's really hard to move away, it's technological gravity. When it's ten hundred thousand times better, it's not going to keep you stuck.
Kris Jenkins: (22:35)
Yeah. Amy, I could see you itching to pick up on...
Kris Jenkins: (22:40)
This reminds me of, because we can be very slow as an industry to evolve, right? Because that makes me think instantly of C and Rust. Why would you write in C and have memory segfaults? Sorry if that's super controversial. But why would you start a brand new project in C these days when there's something that's basically C but it doesn't explode with security errors? And yet still we do.
Kris Jenkins: (23:04)
So Adi, I felt like you wanted to jump in there on the social developer's angle. Does anyone want to jump in on that?
Eric Sammer: (23:12)
I really do.
Kris Jenkins: (23:12)
Eric, give me your opinion.
Eric Sammer: (23:18)
I really do not.
Kris Jenkins: (23:19)
[inaudible 00:23:19].
Eric Sammer: (23:19)
Part of me thinks that maybe we're sort of falling into the trap of overthinking this a little bit. If we reframe this entire discussion around microservices, there's nobody who would argue that we should take all of our code, shove it into one process and demand that a multi-thousand person organization code on that one artifact or binary.
Eric Sammer: (23:41)
We've collectively decided that sometimes for some reasons there are parts of the system that need to be low level hyper optimized and that team needs to work in Rust. There's another team that needs to knock stuff out really fast and go... where Python is fine. And we've decided about lingua franca interfaces between these things and semantics to Amy's point about there needing to be some unifying concepts or stack, or however you want to put it.
Eric Sammer: (24:09)
And I think that data is fundamentally the same. This is, I think, about the disaggregation or the decentralization of the data thing. Whereas, I think people are going to put tons of data into Snowflake or Databricks or in many cases, large organizations both, let's be honest. Because companies acquire other companies and stuff like that. And they will just pick up different tech stacks and they'll always want to consolidate to save money and stuff like that. But it's fine that that happens.
Eric Sammer: (24:37)
What matters is the contracts and APIs and integration between these systems. And if some of them are batch and some of them are streaming, maybe that's fine, but they need to share a conceptual model: schema, time. All of these unifying concepts that are not necessarily streaming or batch specific.
Eric Sammer: (24:59)
So Tyler, I'm an idealist. Sure, it should all be in streaming up to the data warehouse and then it could be in batch and that's fine because that's outside of my domain, honestly. But I do think that realistically, there are just teams that just want different things and different languages and different tool sets. SQL versus Python is a whole other debate and I'm not going to touch that one with a 10-foot pole.
Eric Sammer: (25:25)
One of those groups is right, the other one's wrong. You can decide which one.
Kris Jenkins: (25:31)
Remember, you're going to be able to ask questions, you can force his hand.
Eric Sammer: (25:35)
Anyway, I'm just saying that I think it's about normalization of concepts and then strict API definitions and those kinds of things. And tooling that works across data quality should not be different for streaming and batch. That should work. We're not there. We're not there. And to Amy's point, I think streaming lags batch because it's just been weird for far too long. And I think it's getting there, I think it's getting there.
Kris Jenkins: (26:00)
Okay. So I think I agree with that. It's kind of settled. We're always going to be in a system where not everyone in a large organization will agree and that's fine. We can get there if we can collaborate with each other rather than coordinate.
Kris Jenkins: (26:15)
So that raises the question, I'm not sure who I'm going to ask this of. Are there sweet spots for streaming where those should be the first things we tackle? Are there obvious places where streaming data...
Kris Jenkins: (26:26)
You're nodding very heavily so you get to answer this one?
Amy Chen: (26:29)
Goodness.
Kris Jenkins: (26:29)
Are there obvious places where we're going to do that and everyone else can stay batch and that's fine, but these are things we need to do?
Amy Chen: (26:37)
Yeah, I see it in terms of what is that use case? Because we start seeing folks who are just running batch processes every minute. Essentially, you're trying to create streaming by just hacking it together.
Kris Jenkins: (26:50)
Yeah, you up the cron job a little bit faster every time.
Amy Chen: (26:52)
Exactly, you're just like, "Okay, how long can I keep this warehouse up? I'll just pay it." Because we see folks doing near streaming with just serving client information directly to their patients. We see folks tackling that with, in terms of running live auctions. The reason why they decided to go micro batch is because it's easier, they can do it on the same infrastructure as their batch processes. So they're just going to run it.
Amy Chen: (27:22)
But in those cases, lower latency is needed and you shouldn't hack your way around. Once you're trying to get into SLAs of two minutes and stuff, I just don't.... I get the technology's not there, but it should be there.
Kris Jenkins: (27:38)
Yeah, yeah. It's like if you're running your cron job every two minutes, it's time to admit what you're really trying to do.
Amy Chen: (27:43)
Exactly.
Kris Jenkins: (27:44)
Yeah. How about you Tyler? Do you think there are... I would like someone to give me some specific use cases and business cases, maybe user experience cases, where this is an obvious win for streaming, this is where we should start.
Tyler Akidau: (28:01)
Well, I think a lot of the... I'm going to give you a non-answer or I'm going to answer the things that are not in there.
Tyler Akidau: (28:09)
It's shocking how many people you speak to who say, "I want my data within 10 seconds." And you're like, "Really? Really you do? You need it within 10 seconds?" And you talk through it and you talk through the costs and the complexity and suddenly they're like, "You know what? An hour's fine, a day's fine."
Tyler Akidau: (28:25)
There's not actually that many use cases that need sub-minute latency. It's nice to have, I see...
Amy Chen: (28:36)
I'm just a reactive person.
Amy Chen: (28:37)
In my talk earlier I mentioned, if streaming's easier, where could we go? And one use case came from a project I worked on, which was essentially we had a promo code go out, we made a dashboard that wasn't real time. Dashboards, as a reminder, are for non-technical stakeholders to make realtime decisions. Generally, I would say you don't need a realtime dashboard but in this scenario, if streaming was accessible, the fact that this direct-to-consumer company didn't have a realtime dashboard actually hurt them in terms of customer satisfaction and trust.
Amy Chen: (29:12)
Because a promo code made it to Honey and it was around a thousand dollars during a holiday weekend when the majority of the technical folks were off. So you can imagine what happened when the promo code was on for three hours and how many orders they had to cancel from customers who had their eyeballs on the company. Their first experience was literally a canceled order.
Amy Chen: (29:36)
Granted, I don't know who would think a $0 order is correct, but it's the space of where we could go. Where it becomes, we can actually center around streaming in the conversation of, what is the business loss if we didn't have streaming? Right now we are thinking so high level because we have to, because the technology's not there. But you have to balance it.
Kris Jenkins: (30:01)
Eventually we're going to start focusing on follow the money. Where is the most business value?
Amy Chen: (30:07)
What's the business impact? We shouldn't talk about, should you have a real time dashboard? We should talk about, what is the business impact to non-technical stakeholders getting access to this data? Because it's not us looking at it, it's literally the person who can't write SQL.
Kris Jenkins: (30:23)
Yeah, I agree with that to a degree because in the end, we are there to serve the final outcomes of people who aren't like us. And yet, the simpler we make our own lives, the more we can enable that, right?
Adi Polak: (30:38)
I want to suggest another thing.
Kris Jenkins: (30:39)
Okay.
Adi Polak: (30:39)
Another view. And we can take Uber, I never worked for Uber, but we can imagine what's happening when we push a button to get a car. Basically behind the scenes, there's something that's going on in the system that says, "Hey, that person wanted a car in so and so place." So that's an event when you think about it. And after that event occurred, reaching that data, finding out where are the nearby cars that can get to that area, what's the time, what's going on on the streets, what the maps show, if there's any issues in that space. If there's a big event going on and everything is blocked, there's high traffic, all of that good stuff.
Adi Polak: (31:18)
So we can imagine how here actually, there's a combination of event driven architecture, an event got into my system, there's some streaming going on, what cars are available in the area, how many orders, maybe I need to combine a couple of orders together if I'm doing a ride share. And then I'm enriching my data with some more batch data because we also want to give the predicted price. And how often the price changes, it's not a matter of minutes, it can be a matter of half an hour, an hour so it's a more batch processing aspect.
Adi Polak: (31:48)
So here we can imagine how we have a system that actually combines all the three aspects. Event driven: I want to have an event, I want to process it, I want to give a result. I have streaming, I have some micro batching behind the scenes because I want to know the status of what's going on around in the area. And then I have a batch system because I want to get the price, I want to get prediction, I want to get something maybe even from my machine learning platform.
Adi Polak: (32:14)
So oftentimes, and that really connects to what you said, Amy, it's: what are the business requirements? What are my data product requirements down the line that connect immediately to the business KPIs, to what it is that we're building? So if we think about it from that angle, it shifts the conversation to what is right for the business, rather than what I think, as an engineer, is really fun to do.
Eric Sammer: (32:42)
Okay, can I jump on?
Kris Jenkins: (32:44)
You can absolutely jump in on that, Eric.
Eric Sammer: (32:47)
I'm actually really surprised to hear that. I actually think that this piece of data being processed in real time, this one being a micro batch and this one being a batch, sounds like actually the most complex version of what this is.
Eric Sammer: (33:03)
It is. And so a lot of these stream processing systems, I mean the way we would handle that at Decodable or with Apache Flink or something like that is that we would basically do change data capture out of the "batch" database and maintain that state as a materialized view and sort of do the join. I look at this and I just kind of go, I don't think that there's any batch in that. I think that that's a realtime application. I think that part of the challenge is that people don't have a mental model around what the capabilities of a realtime application or realtime stream processing system are. And to your point about, I was sort of surprised by Tyler's answer, I'm going to call you out. I was surprised by Tyler's answers because when people say 10 seconds, do you really need it? I don't think you need it on a dashboard.
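The pattern Eric describes here (change data capture feeding a materialized view, which the live event stream is then joined against) can be illustrated with a minimal Python sketch. This is not how Decodable or Apache Flink implement it; it is a toy, in-memory illustration, and every name in it (PriceChange, RideRequest, PriceView, enrich) is hypothetical and invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional


@dataclass
class PriceChange:
    """One change-data-capture record from a hypothetical pricing table."""
    op: str                              # "upsert" or "delete"
    zone: str
    price_per_km: Optional[float] = None


@dataclass
class RideRequest:
    """One event from a hypothetical ride-request stream."""
    rider: str
    zone: str
    distance_km: float


class PriceView:
    """Toy materialized view: the latest known price for each zone."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}

    def apply(self, change: PriceChange) -> None:
        # Each change record updates the view incrementally,
        # so no periodic batch reload of the table is needed.
        if change.op == "upsert" and change.price_per_km is not None:
            self._latest[change.zone] = change.price_per_km
        elif change.op == "delete":
            self._latest.pop(change.zone, None)

    def lookup(self, zone: str) -> Optional[float]:
        return self._latest.get(zone)


def enrich(requests: Iterable[RideRequest], view: PriceView) -> Iterator[dict]:
    """Join each incoming request against the current state of the view."""
    for req in requests:
        price = view.lookup(req.zone)
        estimate = None if price is None else round(price * req.distance_km, 2)
        yield {"rider": req.rider, "zone": req.zone, "estimate": estimate}


if __name__ == "__main__":
    view = PriceView()
    view.apply(PriceChange("upsert", "downtown", 2.0))
    for row in enrich([RideRequest("ana", "downtown", 3.5)], view):
        print(row)
```

In this sketch the "batch" pricing data never has to be re-read in bulk: each change updates the view as it arrives, and every ride request is joined against whatever the view holds at that moment. A real stream processor would also need durable, fault-tolerant state and a policy for requests that arrive before any price is known, both of which are left out here.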
Eric Sammer: (33:43)
But I do think that increasingly the trend that is happening here is, and I want to call this out, I do think there's probably a difference between the analytical data plane people and the operational data plane people. What you are talking about with the user on an app is very much to me the operational side of the system. You could argue that the dashboard back in Uber's office doesn't need to reflect reality up to that level. But my sense is that the use cases are increasingly obvious. I checked in for my flight to come to this conference and then 20 minutes later I got a text notification that I wasn't checked in and I wasn't going to be allowed on my flight. And I was like, well, am I on my flight or am I not on my flight? And the answer is, there's some weird batch process that had decided that I had not checked in. There's a late-arriving data problem, there's an out-of-order problem there, it's all these kinds of things.
Eric Sammer: (34:36)
And you have that and my confidence in whether or not I'm actually going to be on that plane goes down significantly and sharply and quickly. And then I think that there's all these use cases around anything location based, like Grubhub doesn't exist in batch. Lyft and Uber don't exist in batch. Dynamic pricing systems during Black Friday do not exist in batch and maybe micro batch and that's cool. But I think that the trend that we're ignoring here is that most of the people in the analytical data plane believe that the consumers are human. And most of the people in the operational data plane believe that the consumer's a machine. So, if the consumer's a machine, why on earth would you wait to provide it data when it naturally speaks and operates in real time?
Kris Jenkins: (35:30)
But I'm going to turn that around and say, why on earth would you make the user wait, the people wait for data when the machine has infinite patience?
Eric Sammer: (35:39)
Fair point, let's make it all streaming.
Kris Jenkins: (35:41)
Okay.
Eric Sammer: (35:41)
Great idea.
Tyler Akidau: (35:47)
Eric's got a very good point here. I think the distinction really is how quickly do you need the reaction? Forget if it's a machine or a human, if it's interactive, if this needs to happen transactionally, that's the case where this real time streaming makes sense. Anything that's not that, you're probably not needing that kind of latency.
Eric Sammer: (36:05)
One interesting thing just to think about is at some point Bloomberg terminals ticked faster and faster and faster. And then inevitably somebody went, "Why don't we make this a software application?" And high frequency trading was born. Anybody who thought I don't need real time didn't do that, and all of a sudden somebody was making a whole lot more money than everybody else and then they all went, holy whatevers, this is a real time problem. And I think that logistics companies and Amazon flying drones of pet food through San Francisco, or whatever craziness is going on, that kind of stuff causes customer expectations in both a B2B sense and a B2C sense to change. I wouldn't accept anything less, three to seven day shipping is not okay. I want one hour shipping. I want to pick it up now.
Kris Jenkins: (37:00)
Yeah. I think this is a tough one that makes me think of a question that I think almost everyone in some shape wants to know. It's a tough one and I'm going to give it to you, Amy-
Amy Chen: (37:10)
Oh God.
Kris Jenkins: (37:10)
... I'm going to let you bounce it to anyone else if you don't want it. So, all this stuff, I try to think, something like Uber, taking the example of Uber, you can now see your cab turning the corner on the block. But if we go back in time to when you ordered a cab and most of the time it showed up 20 minutes later. You checked into your flight and most of the time you got on. Is there any way from that "We don't need real time, batch is fine" position, is there any way to predict the places in an organization where it looks fine right now, but actually it would be game changing if we thought in a different way?
Amy Chen: (37:49)
I want to steal a term that, when I was working with Marta from Materialize on my own slides, she said something that I was like, oh shit, yeah. It's the concept of a gateway use case. It's essentially that one thing you're like, okay, if I do this, then oh my God, I can supply a new product to my customers. We see tons of folks who start setting up their stack and then they realize, oh, this is useful to my customers. Why don't I sell this? That becomes an entire other product model. That's why Uber exists, right? That's why Grubhub exists because it used to be you had to call up and do this, but people could make more money from it. It's really where the money is. I can't tell you what your company is going to make money off of or what's the next big idea. Obviously if I could, I would love that, that would be nice. But it's really just seeing what you have already and what you could go to. What can you make profit off of?
Kris Jenkins: (38:54)
Very [inaudible 00:38:56]. But this is Current, the next generation, right?
Amy Chen: (39:00)
That's what TikTok tells me.
Kris Jenkins: (39:02)
Okay. Yeah. If I can sum it up I think you are saying if you follow the money within your organization that's one of your core points, you'll start to find the use cases. I want to allow plenty of time for questions, so I am going to just ask you two sentences each maximum, especially you Amy. Current '23, we're going to be back here in a year, where do you think the industry will be and the community will be? Do you think there'll be any significant progress? Which direction? Tyler, you get to go first. Two sentences.
Tyler Akidau: (39:37)
Just [inaudible 00:39:39].
Kris Jenkins: (39:39)
Does anyone want to go first? Adi, well done. Go for it.
Adi Polak: (39:43)
I feel courageous today. I believe, and that speaks to what Amy said, that different companies adopt different best practices or what they're calling best practices. And because of the community and because of all the work that we do, because of companies that came up recently to help us make better decisions with our data, we're starting to see a kind of unified data ecosystem and best practices emerging for the streaming world, ones that we already know and love and use in the batch ecosystem.
Kris Jenkins: (40:21)
Can I say we don't know what the trends next year will be, but you're expecting there to be trends starting to emerge the patterns?
Adi Polak: (40:27)
That's my vision.
Kris Jenkins: (40:28)
Okay, cool. Does anyone else want to field it or should we throw it open to questions?
Tyler Akidau: (40:33)
More SQL, more ease of use.
Kris Jenkins: (40:36)
More SQL, more ease of use. That will be one of the Current '23 printable T-shirts. Eric? Amy?
Eric Sammer: (40:45)
Plus one to what Tyler said.
Kris Jenkins: (40:47)
Okay. You can have the last word before questions if you like.
Amy Chen: (40:52)
I'm going to be a pessimist here and just say we're going to be back here. This is more than two words, I'm so sorry. I think we're going to continue to debate batch versus streaming. Everyone has their own favorite tools because that's what they're comfortable with, and most folks don't want to be uncomfortable, and most folks don't have the company that will let them take the risk. So, I think we're going to be back here having this conversation hopefully with best practice and just inching along.
Kris Jenkins: (41:24)
Well, I'm happy to end on that note. I don't see that as a pessimistic thing because the conversation will go on. The exact details of that conversation will change, but we will keep talking next year. Ladies and gentlemen, thank you for joining me on the panel. That's the end. We're about to break for questions, but will you please put your hands together for Adi Polak, Eric Sammer, Amy Chen, and Tyler Akidau.
Kris Jenkins: (41:48)
And there we leave it. I've got to tip my hat to all four of them. It can be hard enough to stand on stage and give a talk to 100 or so people when you've spent weeks preparing it, weeks rehearsing it. But we asked them to step up and be clear and interesting and lucid with no idea of what questions they were going to be asked, only a title to prepare by. And I think they were excellent, I hope you agree. I think the only way you can be that excellent is with a lot of bravery and a heck of a lot of expertise. So, I tip my hat to all four of you. Rest assured that work is already underway to get all four of them on the podcast individually because I want to hear what they've been working on recently and I'm sure you do too.
Kris Jenkins: (42:34)
But in the meantime I shall remind you that Streaming Audio is brought to you by Confluent Developer, which is our site that teaches you everything you need to know about Apache Kafka and the real time streaming world at large. You've got blog posts there, you've got tutorials, courses, back episodes of this podcast and more. So, check it out at developer.confluent.io. And while you're there, you probably want a Kafka cluster to take those courses alongside. Easiest way to spin one of those up is to head to Confluent Cloud, sign up, you'll have a cluster running in minutes. And if you add the code PODCAST100 to your account you'll get $100 of extra free credit to run with. And with that, it remains for me to thank our guests, Adi Polak, Eric Sammer, Amy Chen and Tyler Akidau for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.