Confluent Developer ft. Tim Berglund, Adi Polak & Viktor Gamov

Hi, we’re Tim Berglund, Adi Polak, and Viktor Gamov and we’re excited to bring you the Confluent Developer podcast (formerly “Streaming Audio.”) Our hand-crafted weekly episodes feature in-depth interviews with our community of software developers (actual human beings - not AI) talking about some of the most interesting challenges they’ve faced in their careers. We aim to explore the conditions that gave rise to each person’s technical hurdles, as well as how their experiences transformed their understanding and approach to building systems.

Whether you’re a seasoned open source data streaming engineer, or just someone who’s interested in learning more about Apache Kafka®, Apache Flink® and real-time data, we hope you’ll appreciate the stories, the discussion, and our effort to bring you a high-quality show worth your time.

Real-Time Data Transformation and Analytics with dbt Labs

February 22, 2023 • Confluent, founded by the original creators of Apache Kafka® • Season 1 • Episode 259

dbt is known as being part of the Modern Data Stack for ELT processes. Being in the MDS, dbt Labs believes in having the best of breed for every part of the stack. Oftentimes folks are using an EL tool like Fivetran to pull data from the database into the warehouse, then using dbt to manage the transformations in the warehouse. Analysts can then build dashboards on top of that data, or execute tests.

It’s possible for an analyst to adapt this process for use with a microservice application using Apache Kafka® and the same method to pull batch data out of each and every database; however, in this episode, Amy Chen (Partner Engineering Manager, dbt Labs) tells Kris about a better way forward for analysts willing to adopt the streaming mindset: Reusable pipelines using dbt models that immediately pull events into the warehouse and materialize as materialized views by default.

dbt Labs is the company that makes and maintains dbt. dbt Core is the open-source data transformation framework that allows data teams to operate with software engineering’s best practices. dbt Cloud is the fastest and most reliable way to deploy dbt.

Inside the world of event streaming, there is a push to expand data access beyond the programmers writing the code, and towards everyone involved in the business. Over at dbt Labs they’re attempting something of the reverse: getting data analysts to adopt the best practices of software engineers, and more recently, of streaming programmers. They’re improving the process of building data pipelines while empowering businesses to bring more contributors into the analytics process, with an easy-to-deploy, easy-to-maintain platform. It offers version control to analysts who traditionally don’t have access to Git, along with the ability to easily automate testing, all in the same place.

In this episode, Kris and Amy explore:

How to revolutionize testing for analysts with two of dbt’s core functionalities
What streaming in a batch-based analytics world should look like
What can be done to improve workflows
How to democratize access to data for everyone in the business

SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo

Kris Jenkins (00:00):

In this week's Streaming Audio, we're speaking to Amy Chen of dbt Labs, and I had great fun recording this one. Their perspective on the world we're all trying to get to is really interesting. Much like us over here at Confluent, they're deeply immersed in this idea of building data pipelines and thinking about all the different fronts on which you can try and improve that process. But we're each coming at it from very different beginnings.

Kris Jenkins (00:28):

You could say that in the world of Apache Kafka, we started with programmers writing streaming code, and we are gradually moving into this space of making that more accessible for everyone. Whereas dbt seems to have started with making batch SQL jobs more accessible, and they're gradually moving into the world of streaming programmers. So, same targets, but with Amy we can get a fresh perspective on how to get to those targets. We ended up chatting about all sorts of things, from effective testing strategies and what a modern data tool set should look like to what we can do to improve people's workflow and access to data for everyone in the business.

Kris Jenkins (01:14):

So as ever, this podcast is brought to you by Confluent Developer and I'll tell you more about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it.

Kris Jenkins (01:31):

Joining us today is Amy Chen. How you doing Amy?

Amy Chen (01:35):

I'm good. So excited to be here today.

Kris Jenkins (01:38):

It's good to have you. You are one of the people on my hit list that did...

Amy Chen (01:45):

Oh my goodness.

Kris Jenkins (01:46):

No...

Amy Chen (01:46):

That sounds violent.

Kris Jenkins (01:50):

It really does. Let me put it... You're another one of my select guests from the panel we did at Current back in October '22.

Amy Chen (01:58):

Yeah...

Kris Jenkins (01:58):

We were there...

Amy Chen (01:59):

It was so fun.

Kris Jenkins (02:01):

It was fun, wasn't it? With a few cool people talking about batch versus streaming. And I thought I've got to get you on to the show so we can talk about what's really going through your head.

Amy Chen (02:10):

True. As long as I don't get too vendory, right?

Kris Jenkins (02:13):

Yes, yes. You got, if anyone's not heard that episode, you got into deep trouble for getting too sales pitchy. I almost had to throw you off the stage.

Amy Chen (02:22):

I know. I'm so happy that in this situation you really can't physically push me off the stage or take the mic from me, which is, I'm not going to lie, don't want to be ever in that state. As I mentioned earlier, it's like a fugue state where you're just like, "Oh God, this is amazing and also overwhelming and stressful to talk on stage where we haven't prepared anything." And it's just the comfort zone is like, "I know my tool."

Kris Jenkins (02:48):

Yeah, yeah. But is there a risk that if you put someone in front of a hundred people, put a spotlight on them and shove a microphone in their hand, they reveal who they truly are?

Amy Chen (02:57):

Oh...

Kris Jenkins (02:59):

You are a vendor shill Amy Chen.

Amy Chen (03:01):

Oh no. I'm like, don't curse. Don't curse on this. Especially because my marketing team has come to me multiple times saying, "Amy you need to stop cursing on things." I hope that, I don't think this is a drunken state where this is the real me. The reason why I am very mindful about what I post on Data Twitter is just because I know sometimes my real feelings just really shouldn't be revealed on what actually is useful.

Kris Jenkins (03:33):

Okay. It's going to get a bit too psychologically deep in a moment. So...

Amy Chen (03:39):

It does.

Kris Jenkins (03:39):

Let me talk about something else you have a habit of posting on Data Twitter, the acronym MDS.

Amy Chen (03:47):

Yeah.

Kris Jenkins (03:48):

Pick that apart for me. What's MDS, why do I care, what's it about?

Amy Chen (03:51):

So technically it stands for Modern Data Stack and so...

Kris Jenkins (03:57):

Linux...

Amy Chen (03:57):

LAMP?

Kris Jenkins (03:57):

Apache, MySQL, PHP. That's the modern data stack, right?

Amy Chen (04:02):

Of course. Yeah. And Informatica and whatnot, not...

Kris Jenkins (04:05):

That's fair.

Amy Chen (04:06):

Oh, oh goodness. It's still, overall, as I've said multiple times, it's very much a marketing scheme. Everyone wants to be part of the modern data stack because that's, like, what's trendy now. You see a lot of graphs of, "No, I'm part of the modern data stack." "No, I am." And then when you get to the point where you just have diagrams on top of diagrams claiming this is the modern data stack, I think it loses a lot of the oomph. But the core value, the leading principle behind calling something a modern data stack tool, is the idea that, "Oh, you're easy to deploy, you're easy to maintain, and you're accessible." Which, what tool doesn't want to call themselves that? That is how you make the money.

Kris Jenkins (04:59):

Yeah. But I mean, you do get tools where their clear priorities, their top values are speed or low cost...

Amy Chen (05:09):

Performance.

Kris Jenkins (05:09):

Of ownership.

Amy Chen (05:09):

Or costs.

Kris Jenkins (05:10):

Things like that. So yeah, I can see you would want some term as a rallying cry for ease of deployment, ease of use, that kind of thing.

Amy Chen (05:18):

Yeah. Rallying cry, wow, can't speak.

Kris Jenkins (05:25):

Rallying cry?

Amy Chen (05:26):

Rallying cry. I don't know why. English is difficult, but the way I think about it is, it's exciting because it's like, "Oh, okay." This really pushes tools to think about, what kind of personas do you want using it? Do you want people who've had to train on it for 10 years and become the experts of the field? Which, I'm not knocking any of those people. I have always aspired to be one of those people who are in the weeds and know everything. Versus something like, "Hey, I'm a data team of one. How do I pick up all the things and create a pipeline with a swipe of a credit card?" That is very attractive. And also, in terms of diversity, it's very exciting because more people can be part of this conversation.

Kris Jenkins (06:13):

Yeah, yeah. It's nice to have a value that absolutely puts the user in the core of the sentence, right?

Amy Chen (06:22):

Exactly.

Kris Jenkins (06:23):

You take a value like ROI and the problem with that is the person who cares about it probably isn't the person that's actually using the software. Whereas...

Amy Chen (06:33):

Oh, 100%.

Kris Jenkins (06:35):

But usability, ease of deployment, it's probably putting the user right in the center of the conversation.

Amy Chen (06:41):

I think what I post a lot about is with the modern data stack, it's really, a lot of people discount the developer time. They see the time to production, but they don't really, a lot of tools in the past, and also business stakeholders, have just traditionally not cared about the quality of life that these users have. Because even if they're band-aiding something or they're waking up at 2:00 AM because some number is wrong, that's not usually captured very easily in a metric. But the quality of life of, if that person stays at the company, is not as easily capturable. Like the fact that they're band-aiding something is not a pleasant experience and shouldn't be what we continue to support.

Kris Jenkins (07:28):

Yeah. Yeah. I've seen plenty of people in my career move jobs because they want to use a more pleasant tool set for whatever their definition of pleasant is.

Amy Chen (07:38):

This is probably my first ding of vendors, but that's something I've noticed, especially when I was consulting: people legitimately also choose which company they work at based on the tool stack itself. And at this point, there are folks who are true advocates and just will not work for a company that's still on legacy systems.

Kris Jenkins (08:03):

Yeah. I've seen someone make Jira their dividing line for a job, which is...

Amy Chen (08:07):

Oh gosh, I don't miss using Jira for data. I remember being tagged on so many Jira tickets and it was rightfully painful.

Kris Jenkins (08:17):

Yeah. I think I'm agnostic on bug trackers 'cause I've worked in the past with a company where their bug tracker was Excel and anything that's not Excel is like great.

Amy Chen (08:30):

Excel has its moments, but I'm not going to lie. I've written project plans on Excel and been on calls where I'm like, "Did you do this thing? Did you do this thing?" And it was my number.

Kris Jenkins (08:43):

Check this out. Yeah.

Amy Chen (08:45):

Because people, there's no tracking. Sorry.

Kris Jenkins (08:48):

So let's carry on into the idea of, so, usability of tools that takes us naturally to your next acronym, which is the tool you spend most of your time with, I gather. dbt.

Amy Chen (09:01):

Yeah. It's almost like I work for the company, but yes, dbt used to stand for data build tool, but our marketing department has nixed that. So it's just called dbt. It...

Kris Jenkins (09:14):

There should be a word for that. For something that has ceased to be an acronym. I think there is a word for that.

Amy Chen (09:20):

Is there?

Kris Jenkins (09:21):

Yeah, I think there is.

Amy Chen (09:22):

I'm sure a brand marketer knows.

Kris Jenkins (09:24):

There's backronym for the reverse when you start off with the word and get to the acronym.

Amy Chen (09:28):

To the acronym.

Kris Jenkins (09:29):

Yeah. Anyway, so you're now just dbt, but what is a dbt?

Amy Chen (09:34):

Yeah...

Kris Jenkins (09:35):

I've never actually used it. Give me the crash course.

Amy Chen (09:37):

Yeah, so the five-second spiel is, it's a data transformation framework that essentially allows data analysts to operate with software engineering's best practices. That essentially means what we're doing is, you're still doing the transformation on top of your data platform, but we're providing that framework to actually make it more reusable and scalable. Very fancy words, but at the end of the day, it just means like, "Hey, we're now giving version control to people who traditionally have not had access to Git, along with being able to easily automate testing and document things." We're just giving it all in the same place so that they're no longer forced either to go without or to tape together a bunch of different open source tools. And we're also open source.

Kris Jenkins (10:32):

Okay. So who's a typical user for that then? Is it the person that's currently sitting there, importing things into their Excel spreadsheet?

Amy Chen (10:42):

The persona that we are going for is what we call the analytics engineer. And I would like to very keenly note that you don't have to use dbt to be an analytics engineer. The goal here is that you have someone who's able to own more of the pipeline. They can handle enough of the extract and load, maybe using off-the-shelf tooling like, say, Fivetran or Airbyte, handle the transformations, and do enough of the BI side of things to make it ready for stakeholders. Just start jumping in on, say, self-service BI.

Kris Jenkins (11:19):

Okay. So somebody who, I can think of an example in our organization, someone who is trying to gather all the YouTube watch metrics for the different videos Confluent publishes, grab them out of Google's API, stick them somewhere and turn it into a report. Is that the kind of thing you did?

Amy Chen (11:40):

Yes, exactly. That's the workflow that we're after is that person is going to be responsible for making sure that that YouTube data gets into the data warehouse itself and then can handle the very usual transformations and denormalization and then get that into the downstream tool of choice.

Kris Jenkins (11:58):

And they're not non-technical, but they're not developers either. And in that mid-ground.

Amy Chen (12:04):

Well, we do call them developers because I think it is, developer is a little bit of a, what is it, spicy term in some ways. I know, right? Because there's a lot of, "Oh, SQL is not a real language." There's a lot of that, and we don't want to live in a world where people are very, like, gatekeeping, right? We want to acknowledge that people in the business who have the business logic should be able to own the business logic and actually write the transformations without having to go poke at, say, their devs with, "Hey, can I get access to this stream? Can you write this transformation for me?"

Kris Jenkins (12:45):

Right. Yeah. I think I could be accused of that one a bit. I'm of the level where I don't care which editor you use, but I expect you to have a strong opinion about the one you do use.

Amy Chen (12:58):

That's true. That's true. I mean, but I don't care. I don't care about your interface, I just care about, are you able to do this with the right set of best practices? I don't even care about what your SQL format is. I'm sure I'm ready for the dbt community to come after me for this, but it's very much, do you real... And I had this conversation with one of my partner engineers this morning: what is dbtonic? A lot of people know that we have a best practice in our SQL guide, but at the end of the day, I just want to make sure that you're thinking about your code and making sure that you're very intentional about how you're handling your modeling. Is it aligned across the board in the entire project? Are you actually testing and making sure it's of quality and that anyone jumping into your project actually knows what the [inaudible] they're doing? It doesn't take a month to...

Kris Jenkins (13:57):

You swore.

Amy Chen (13:58):

Onboard to your... Oh no.

Kris Jenkins (13:58):

Yeah.

Amy Chen (13:58):

First ding.

Kris Jenkins (14:01):

That's strike... You get three strikes.

Amy Chen (14:04):

I get what?

Kris Jenkins (14:05):

Strike you from the record, right? Sorry I interrupted you for the sentences.

Amy Chen (14:08):

No, it's fine. This is in the recording now. I apologize to everyone who was offended by that, but I'll do my best. I will write down no curse words. But, like...

Kris Jenkins (14:19):

So what we're doing is taking to the developers this, some of the things that we take for granted as programming developers like version control, testability, shareability.

Amy Chen (14:29):

Exactly.

Kris Jenkins (14:30):

How do we actually do that in practice?

Amy Chen (14:34):

So, essentially, someone will use either dbt Cloud, which is our SaaS offering, or they can do it on the command line. They're connecting to their data platform of choice and connecting to their Git repo, I don't care what your provider is. And then they're starting to write code. They're going to check out a branch just like a dev would and work in their own development sandbox and then develop. And when they're ready to merge the code, they're going to open a pull request and ask for approval, run CI/CD jobs and then merge that in. So very much like what developers have traditionally been used to, but in terms of analytics, this is pretty advanced from where we're coming from, where there's been a lot of GUI-based tooling or just pure Airflow for transformations, which have just not traditionally been accessible to an analyst persona.

Kris Jenkins (15:28):

Right. But how do you get people, I can see that persona saying, "What's a pull request? What's Git? What do I care about versioning? I only care about the modern one. Can I just save it to a file called .bak?"

Amy Chen (15:39):

Oh gosh, not going to lie...

Kris Jenkins (15:41):

How do you get into that world?

Amy Chen (15:44):

That's the fun part. So there's different ways. I would also like to say Git is, I think Git enough to be dangerous is actually not that difficult, but we also do provide a lot of handholding if that concerns you. Obviously, with our SaaS offering, we have that guided Git flow. You just click a button and then you can open a pull request. But we also have a lot of documentation where you need to know enough about Git to be powerful. So we've written a bunch of blog posts on this is how to write a good pull request. Early on when I joined the company, I didn't know Git either. I basically complained that, "How do I know what's good for a pull request? How do I make sure I write the right amount of documentation for my approver?" So what we did was just open up a pull request template.

Amy Chen (16:35):

We designed that and now it's part of a lot of our documentation, and a lot of people take that same one or alter it and put it in their own project. So, it's kind of guidance with the right amount of handholding. I also will note, we actually, I think the majority of our community do still identify as data engineers. So they're not going into this blind. They have been so sweet to also train a lot of folks and answer people's Git questions. Our community is the warm and fuzzy that I love. Because it's a little intimidating to enter this new mindset with all these different tools you have to know about, like what is the command line, what is Git? And outside of what we can facilitate, people are one-on-one helping people and answering questions like, "Hey, this is the best way to arrange a project. This is the best way to manage your Git branches even."

Kris Jenkins (17:38):

Yeah, and I find the almost trickier thing is, how do you convince people they care what their branches are? It's not just the how, it's the, what matters.

Amy Chen (17:51):

Yeah, I know. I think you would, the word tech debt probably strikes fear in you, but obviously, in some folks it's a little bit new. I kind of try to balance it a lot out when I'm thinking about it, because in some ways, if you're a data team of one, there is literally no one else who knows anything about that. Obviously, you're going to work a little bit differently where you might have to merge into production because you're literally the only person checking, but at least you're setting down the foundation. But obviously as you get bigger, I don't think folks really need that convincing because they've been burned so many times by different tools.

Kris Jenkins (18:34):

Yeah, yeah. You start to believe in backups after you've lost your first data set, right?

Amy Chen (18:41):

Yep.

Kris Jenkins (18:41):

Isn't that...

Amy Chen (18:42):

You basically, when you accidentally drop a table and you realize, "Oh god, I now need to recreate this." You kind of know the burn. You also, I don't know about you, but when I started with auditing, it was very much copy a subset of the data into a Google Sheets, then compare it to what I believe the data should look like. And writing some terrible, I fully admit I am not the best at VLOOKUP and I definitely wrote some terrible VLOOKUPs to make sure that things were aligned. And after you've been burned from that, I think you start to see a little bit of like, "Oh, I could have a better life just by putting in these better practices and kind of start building up the muscle and training yourself to do it in this process."

Kris Jenkins (19:33):

Okay. So yet again, in programming, the motivator is pain. And then we go looking for good solutions.

Amy Chen (19:40):

We are trauma bonded in how terrible some things have been in our lives. After you've had to...

Kris Jenkins (19:50):

That's a nice thing...

Amy Chen (19:50):

Sorry, go. No...

Kris Jenkins (19:50):

A nice thing...

Amy Chen (19:53):

Rock paper, scissors.

Kris Jenkins (19:54):

Yeah.

Amy Chen (19:55):

Go. You go first.

Kris Jenkins (19:55):

You go, you're the guest, you're the guest. You go.

Amy Chen (19:57):

Oh goodness, you're too kind. The key thing I was going to say is, I think after you've had to debug a really heavily nested sub-query of multiple 800 lines, that's when I feel like this is when we generally jump in really well and it's like, "Look, you don't have to do that anymore."

Kris Jenkins (20:23):

Yeah, we can lead you to a better land. Now you know what the dry arid desert looks like.

Amy Chen (20:31):

There is an oasis over there. Just follow the not yellow brick road. The orange brick road.

Kris Jenkins (20:38):

When you say oasis, and that gives me a lovely segue into streaming.

Amy Chen (20:42):

Yes.

Kris Jenkins (20:43):

So...

Amy Chen (20:44):

Water.

Kris Jenkins (20:44):

You're mostly, or your history at dbt, right? It's mostly batch or batch load.

Amy Chen (20:49):

It's a batch.

Kris Jenkins (20:50):

Load a big chunk of data from somewhere, SQL the heck out of it, and then send it somewhere. But, how does that fit into our streaming world?

Amy Chen (21:01):

Yeah, I think historically we have also been limited in what we run. At the end of the day, we are a SQL parser and compiler. Our limitations are the limitations of the platforms we sit upon. We can't execute any SQL if the platform can't. But I think times are changing, obviously with SQL-first platforms like Materialize, but also just traditionally batch-based ones like Snowflake and Databricks, they're starting to see that light, which also gives us, we get to hop on and continue on and develop where that space can go towards.

Kris Jenkins (21:45):

So what does that mean in practice, is that you are just... If one of the things you connect to supports streaming and SQL, you can compile to it and you are just magically safe from having to think about any major differences, is that what you're saying?

Amy Chen (22:03):

No, I think there is still going to be a mindset shift because what does it mean to actually have a DAG that has both batch and streaming pipelines on it? Obviously I know we can also debate on a streaming only system, but I like to look at the complex world of, at the end of the day, you're not going to very quickly migrate everything over from batch to streaming, right? Especially if we end up in the ideal state where everything is easy, but I think it really does change up how we think about testing and even running. The idea that you can run something once, create the object, and it's ready to go, constantly, is very powerful. And someone on Data Twitter recently posted predictions in 2023, orchestration becomes a key component. And I find that really funny because I've lived in the world where orchestration has always been very, very prevalent in the problem of managing a pipeline.
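
To make the idea of one DAG holding both batch and streaming models concrete, here is a minimal Python sketch. It is not dbt's implementation: the model names, the SQL, and the "kind" labels are all hypothetical, and the only dbt convention it borrows is that one model points at another with ref(). It collects those references, builds a dependency graph, and prints an order in which every model comes after the models it depends on.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: the names, SQL, and "kind" labels are illustrative only.
MODELS = {
    "stg_orders": {"kind": "streaming", "sql": "select * from raw_orders"},
    "stg_customers": {"kind": "batch", "sql": "select * from raw_customers"},
    "orders_enriched": {
        "kind": "streaming",
        "sql": "select * from {{ ref('stg_orders') }} o "
               "join {{ ref('stg_customers') }} c using (customer_id)",
    },
    "daily_revenue": {
        "kind": "batch",
        "sql": "select * from {{ ref('orders_enriched') }}",
    },
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def build_order(models):
    """Return model names ordered so that every dependency comes first."""
    graph = {
        name: set(REF_PATTERN.findall(spec["sql"]))
        for name, spec in models.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
for name in order:
    print(f"{MODELS[name]['kind']:>9}  {name}")
```

The point of the sketch is that dependency ordering does not care whether a node is batch or streaming; what changes, as Amy says, is how you think about running and testing each kind of node.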

Amy Chen (23:05):

And I want to live in the world where I don't have to care. I want to just make sure everything is up-to-date and is ready to go. So I think it's very compelling to have all your data transformations in one place and then execute and you're good to go. And then on top of that, being able to test automatically in a very modular way. From what I've learned from speaking to other folks who've been in the streaming space, tests are a little bit brittle. They're not easily shareable, but I'm very excited about what Materialize has written about, which is unit testing using dbt tests, which is just using the same out-of-the-box configurations, but being able to constantly check to make sure, "Hey, something's going on." You can just send an alert on this table, this materialized view. But that doesn't actually require a lot of mindset shift. It just is a reminder like, "Hey, you need to know how to do this," and then you're still good to go. It's the same concepts.

Kris Jenkins (24:04):

What does that actually involve?

Amy Chen (24:06):

Yeah, so, to break it down, they're using two of our core functionalities: obviously, being able to create the materialized view that has the transformation in it, as well as materializing the test results. And then they use a configuration called store failures. That, in a batch-based world, is when you execute a test, it will create a table inside of the data platform that has all the failed rows. They have changed it a little bit for their adapter and essentially made it so that it materializes as a materialized view. So it's always up-to-date and always running. So that test is actually just running concurrently with the materialized view itself. And then because they have alerting on top of the materialized view, you're able to then make sure that, "Okay, anytime there's a new row added in, there's a difference, there's something going wrong with your data assumptions."
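
The pattern Amy describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not dbt or Materialize code: SQLite stands in for the data platform, a plain view (recomputed on each read) stands in for a materialized view, and the table, view, and function names are hypothetical. The test is simply a query that returns the rows violating an assumption, and because it lives in a view, new failures show up as soon as new data arrives.

```python
import sqlite3

# Stand-in for the data platform. SQLite has no materialized views, so a plain
# view (recomputed on every read) plays the role of the always-fresh object.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, amount REAL);

    -- The "model": a transformation exposed as a view.
    CREATE VIEW order_totals AS
        SELECT order_id, SUM(amount) AS total
        FROM orders
        GROUP BY order_id;

    -- The "test": a query returning only rows that violate an assumption
    -- (here, that no order total is negative). Zero rows means passing.
    CREATE VIEW order_totals_failures AS
        SELECT order_id, total
        FROM order_totals
        WHERE total < 0;
    """
)


def failing_rows(connection):
    """Return the rows currently failing the test, ordered by order_id."""
    return connection.execute(
        "SELECT order_id, total FROM order_totals_failures ORDER BY order_id"
    ).fetchall()


conn.execute("INSERT INTO orders VALUES (1, 20.0), (2, 5.0)")
before = failing_rows(conn)  # nothing violates the assumption yet

conn.execute("INSERT INTO orders VALUES (3, -7.5)")  # a bad event arrives
after = failing_rows(conn)  # the failure appears without re-running the test

print(before, after)
```

An alert in this picture would be anything that watches the failures view and fires when it stops being empty, which is the role alerting on the materialized view plays in Amy's description.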

Kris Jenkins (25:07):

Okay, so you've got a real time stream on the succeeding or failing tests.

Amy Chen (25:12):

Exactly. With only one additional configuration.

Kris Jenkins (25:17):

Yes. That's interesting. I'd not really, I think we're only just beginning to think about testing in the streaming world, and I'd not thought of doing that. But it's also, I think we might need to rethink the testing model because there's something inherent about the way we do unit testing that says you've got this bunch of data, you run this code over it, and you get another bunch of data out. And it's inherently batched the way we do unit testing, small batch.

Amy Chen (25:43):

Because you're essentially comparing, this is the assumption, but that's not as powerful because, especially with streaming, your data's always changing. There's always new ones to test. And I think this is why the simplicity of this approach is so exciting to stare at because you can now do data quality tests in a way that is actually scalable without constantly, "Hey, take a piece of this."

Kris Jenkins (26:12):

Yeah, I could see that. So do you think, you've mentioned the whole coexistence of batch and streaming, right? And that makes me think about the Lambda architecture, which we've talked about before, which kind of assumed that batch and streaming would always be together. And is that the future you think we're heading towards where they are naturally always part of the solution?

Amy Chen (26:43):

I think the long, I don't know what the middle long term is, right? I think where we're stated into, we have to land right there, and that is kind of more of the make it or break it moment. I don't have better phrasing for that, but streaming needs to become more accessible and almost just as off the shelf as batch is. And I think to do that, we need to essentially share our... There's a lot of great principles that come from streaming that batch has not been so great at because it's been a little bit too easy, almost. And I think we have performance optimization and cost optimization in streaming because it's already so cost sensitive. I think we can definitely take principles from there and how we handle incremental models, how we think about what to update, and that's really powerful and we can definitely learn from that.

Amy Chen (27:46):

I'm not opposed to the world, which I think is a little bit far down given the technologies that we have today, to having a streaming one. Because as we've all iterated, if you can have everything up to date and it's fresh and it's easy, why wouldn't you? And I want to live in that world. I also want to care less about, like if I was an analyst or a data engineer, my performance reviews aren't going to be how well I orchestrate something, right? It's not going to be like, "Oh, I managed to do this pipeline." But if my streams are always up-to-date and it's actually tested by my business stakeholders, well, why would you not be there?

Kris Jenkins (28:28):

Yeah, yeah. Ultimately the people you're serving care more about the data being fast and high quality. But it's always this thing where you can't dismiss the way it's done under the hood. The users can, it's our job to worry about that for them, right?

Amy Chen (28:44):

Exactly. And that's kind of where we come full circle to the modern data stack. Streaming should be 100% part of the modern data stack. The pain points that we've just had is, it's still so early to call, and I fully admit there are ones in the diagrams and I agree, but I want to push that. I want to push to see can we have more, can we have it better and cheaper?

Kris Jenkins (29:10):

Right. So if I'm not putting words in your mouth here, you can see us getting to an all streaming world or a predominantly streaming world. And your bigger question is, what's stopping us from getting there?

Amy Chen (29:23):

Yeah. I think, obviously, there are still more best practices to be developed. There's going to be a duel to see what tool comes out on top. And my hope is the tools that come out on top are going to be the ones that really think about their user experience and not... my straw man is Bob, who, at my old company, owned everything. When he retired, we hired him as a consultant. I don't want tools to subscribe to that persona, saying, "Oh, you just have to be really smart to do this thing." You have to be smart, but you also have to be efficient with your time.

Kris Jenkins (30:05):

Yes. So I totally agree with that and I suspect the most user-friendly tools will win 'cause they tend to.

Amy Chen (30:12):

Yeah. Hopefully.

Kris Jenkins (30:15):

But, hopefully, yeah. But there's always this tension, and I'm going to avoid mentioning specific technologies here, but there's always this tension where you get some open source projects that launch and they are clearly more user-friendly and they are clearly broken and they win the race by being more user-friendly and then try and patch themselves back up as they bought some runway. And that always feels icky. Can we not just get it user-friendly and fundamentally correct? Do we have any hope of being that sensible?

Amy Chen (30:47):

I am avoiding also just calling anyone out, but I agree. Like...

Kris Jenkins (30:52):

Wonder if we're thinking of the same people.

Amy Chen (30:54):

We probably are and that's okay, but we just won't say it on this podcast. In Twitter DMs. But yes, that's kind of why I'm extremely hesitant in some ways because in... It's all about the right dominos falling because at the end of the day, we live in a capitalist society. There are people who are trying to expand and try to get more of the surface area and they have the money to back it, and that can be a little bit intense. But I also do believe you shouldn't discount that there is still a better quality of life if we all go towards the same principles, even if someone's winning. At the end of the day, if you become the standard, people are going to be adopting your principles as well to say, "Hey, we have that too." So it's not necessarily the worst having just a few choices. It's good to have competitors, but at the end of the day, as long as we all agree, this is what our users want, I think we'll be fine. Maybe I'm reading a fairytale here.

Kris Jenkins (32:06):

Oh yeah. I think the moment you said getting our users to agree, I think you lost me.

Amy Chen (32:12):

Not so much getting... I think it's more, users know how they want to work, but this is not to say we're only subscribing to one user pool. There are potential users that will have fundamentally different feedback, and those are important to take into account, and I think that's why I often cheer for the open source tools a lot specifically because they have that very quick user feedback loop and that trust.

Kris Jenkins (32:43):

Yeah. And they know that they have a direct relationship with the people who are going to use their tools, usually. And that usually helps the feedback loop.

Amy Chen (32:51):

Exactly. And if they hate it, they'll open a PR.

Kris Jenkins (32:54):

Yeah. And sometimes expect you to work for free to solve their particular problem. But let's not get into that either. But okay, so coming from another perspective, as you work in a predominantly batch world, I work in a predominantly streaming world. Give me your critical eye from that modern data stack value perspective. What are we in the Kafka world doing wrong or could do better or should be doing?

Amy Chen (33:23):

Yeah, I think modularity is one of my pain points, at least from what I've seen. Obviously I have a limited view. I have not worked on many streaming use cases in my consulting days, but I find a lot of streaming pipelines extremely brittle because it's like, "Oh, you shaped this for just this particular use case," but it's hard to pull code from that to reuse it for the bigger picture. I think the custom code aspect is where I want to push folks in the streaming world to start thinking about...

Kris Jenkins (34:07):

Okay, I think I could level that criticism at a lot of SQL queries in the batch world. So what's the difference? What am I missing?

Amy Chen (34:18):

I think at the end of the day it's that we're aware of it. Then again, that might be a mistake, a ding against me, where I'm like, "You folks are probably aware of this problem as well, as your pipelines also break." I think in some ways we have the tooling to assist with that and make it a lot easier. Obviously, dbt being one of them, but jumping in with other analytics engineering tooling, there's just more growth in terms of how to be mindful and how to take care of that tech debt.

Kris Jenkins (34:56):

Well, tell me how it works, because doesn't dbt's approach to SQL have something of modularity built into it?

Amy Chen (35:03):

We do. We believe a dbt model, which is also referred to as a data model in this case, is just a single select statement. And because we can take care of that lineage for you, specifically from just using a ref function, we are able to then build that out. You write it in one place; if you need to change it, you change it in one place, but you can reference it down the line. When you don't accidentally lose sight of what has already been built, especially when it's very easily visualized, it makes it a lot easier to understand when you're going down the DAG, when you're looking for technical debt and performance, to know, "Hey, I can fix that one key point, rather than having to look at that 800-line SQL statement."
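
To make the idea of "one select statement per model, wired together by a ref function" concrete, here is a minimal Python sketch. It is an illustration only, not how dbt itself is implemented: the model names, columns, and the helper functions `find_refs`, `build_order`, and `downstream_of` are all hypothetical. It scans each model's SQL for `{{ ref('...') }}` calls, derives the lineage graph from them, and works out a build order plus the set of models affected by a change.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: each one is a single SELECT statement.
# Names and columns are invented for illustration.
MODELS = {
    "stg_orders": "select id, customer_id, amount from raw_orders",
    "stg_customers": "select id, name from raw_customers",
    "customer_totals": (
        "select c.name, sum(o.amount) as total "
        "from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id "
        "group by c.name"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def find_refs(sql):
    """Return the set of model names that a SQL string references."""
    return set(REF_PATTERN.findall(sql))


def build_order(models):
    """Return model names ordered so that dependencies come first."""
    graph = {name: find_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


def downstream_of(models, changed):
    """Return every model that depends, directly or not, on `changed`."""
    graph = {name: find_refs(sql) for name, sql in models.items()}
    affected = set()
    frontier = {changed}
    while frontier:
        # Models that reference anything in the current frontier.
        frontier = {
            name
            for name, deps in graph.items()
            if deps & frontier and name not in affected
        }
        affected |= frontier
    return affected


if __name__ == "__main__":
    print(build_order(MODELS))
    print(downstream_of(MODELS, "stg_orders"))
```

Because the dependencies are read from the SQL itself, changing `stg_orders` in one place is enough: every model that refs it picks up the change, and the graph shows which ones those are.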

Kris Jenkins (35:53):

Right. So it's looking at the... You can break it into sub queries and sub steps and see the whole pipeline.

Amy Chen (36:00):

Yeah. Yeah, exactly. And then if that code's already been written, you can reuse it for another pipeline, and I think that's one of my pain points of folks who only work in Notebooks. They only see a subset of the pipelines. I can see the entire global pipeline, essentially. All the pipelines together. Yeah, global.

Kris Jenkins (36:22):

That's something we have been working on more and more over here. It's interesting that we're both arriving at the same kinds of problems from what feel like very different approaches to data.

Amy Chen (36:33):

I actually don't think it's surprising because the joke here is data is always messy and it's a lot, right? At the end of the day, it's just as much that data is hard, but also it's a workflow issue and having the right tooling, but also keeping in mind what you actually care about. This is why I say tools are only part of your solution. They're not going to take care of everything, they're part of the solution. Because at the end of the day, if you care about that, if you care about the fact that your pipelines are disparate and you can't tell what that team over there is doing, then you can solve it with a tool, but you can solve it with communication as well.

Kris Jenkins (37:19):

Yeah. That sort of relates into this idea that we talk a lot about of data mesh where it's partly worrying about your data quality, it's partly a social change of mindset as well as the tooling. Yeah. Yeah. So last chance to pitch dbt. Is there anything we can learn about data testing that we should be picking up from you?

Amy Chen (37:44):

Yeah. In terms of data testing, I kind of break it down into two things. You should be able to test at every single step of your process: while you're developing, while you're in a CI/CD process, especially for analytics, and even in production all the time. But what actually makes that accessible is knowing how to write a test one time and then being able to apply it very easily. That's why our tests are just as modular and also written in SQL, which means that...

Kris Jenkins (38:13):

Oh, a test written in SQL.

Amy Chen (38:14):

Yeah.

Kris Jenkins (38:16):

Very cool.

Amy Chen (38:19):

It is, I know. The SQL train continues, but it's the idea that your analysts know what can go wrong with the data. They know when there is a bad number, so they're able to actually write the right test to test those assumptions to make sure that that is always up-to-date.

Kris Jenkins (38:39):

Oh, okay. So you're sort of doing property-based testing where you say where value is less than zero and that's selecting an error.

Amy Chen (38:49):

Exactly. We have where statements, we also check to make sure that there's no nulls, your primary keys are aligned, you're not having the dreaded fan out, but we also have folks who are using tests to do other things to test before you build so you're not snowballing the failure and also to apply constraints. So making sure that like, "Hey, this should meet this particular primary key constraint," or something like that, and we have packages there if you don't want to write it yourself either.
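
The "test written in SQL" idea from this exchange can be sketched as follows: each test is a select statement that returns the rows violating an assumption, so a test passes when it returns no rows. This is a minimal sketch using Python's built-in `sqlite3` module rather than dbt itself; the `orders` table, its sample rows, the test names, and the helpers `make_demo_db` and `run_tests` are all assumptions made up for illustration.

```python
import sqlite3

# Each hypothetical test is a SELECT that returns the *failing* rows.
# Zero rows returned means the assumption holds.
TESTS = {
    # A "where" check: amounts should never be negative.
    "amount_not_negative": "select * from orders where amount < 0",
    # No nulls in the key column.
    "id_not_null": "select * from orders where id is null",
    # Primary-key style uniqueness: no id should appear more than once.
    "id_unique": (
        "select id, count(*) as n from orders "
        "where id is not null group by id having count(*) > 1"
    ),
}


def make_demo_db():
    """Build an in-memory table with deliberately bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table orders (id integer, amount real)")
    conn.executemany(
        "insert into orders values (?, ?)",
        [(1, 10.0), (2, -5.0), (2, 7.5), (None, 3.0)],
    )
    return conn


def run_tests(conn, tests):
    """Return a dict mapping each test name to its count of failing rows."""
    return {
        name: len(conn.execute(sql).fetchall())
        for name, sql in tests.items()
    }


if __name__ == "__main__":
    for name, failures in run_tests(make_demo_db(), TESTS).items():
        status = "PASS" if failures == 0 else f"FAIL ({failures} rows)"
        print(f"{name}: {status}")
```

Since every test is just a query, the analyst who knows what a bad number looks like can write the check directly, and the same query can be reused against any table with the same shape.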

Kris Jenkins (39:25):

Okay. Let's check this out. Does make me think that there's some kind of better relationship for testing. Unit testing feels very batchy, SQL, and maybe property testing feels somehow naturally more akin to streaming. That's something to look into another day.

Amy Chen (39:43):

I would say both are very relevant. Obviously there's assumptions you can't easily write, and so I want to live in a world where both are just as accessible and you can just pick things off the shelf to use as you like. I think that's why my concern also with just being very vendory is I want us to get away from this idea of, tools dominate the best practices. It should really be the best practices and the tools follow after.

Kris Jenkins (40:14):

Yeah. The trouble with that though, is we're...

Amy Chen (40:17):

It doesn't really work.

Kris Jenkins (40:18):

We're in a place of exploration where we've got to figure out the best practices while we're building the tools at the same time.

Amy Chen (40:24):

Exactly. Yeah. It's not how it works out, but I think as long as people keep in mind that that is how it should work, it's still meaningful.

Kris Jenkins (40:34):

Yeah. Yeah. Okay. I think I'm going to make this the last question I have for you.

Amy Chen (40:40):

Go for it.

Kris Jenkins (40:41):

Again, looking to the future, what should we be watching? What's on your radar? What are you watching for in the industry?

Amy Chen (40:48):

Yeah, I'm definitely seeing where Python fits into our story because, obviously, we just launched it at Coalesce. So I'm interested in seeing how the world decides to push what you can do in dbt with Python. Obviously, the semantic layer and metrics layer really is top of mind for me, but at the end of the day, I'm just excited to see where dbt can go in streaming.

Kris Jenkins (41:17):

Okay. Well, we'll have to keep watching the skies for that.

Amy Chen (41:21):

Yes. We'll look over there and then as we do that, we'll debate about Marvel.

Kris Jenkins (41:27):

Yeah. Yeah. And when we finally get there, we'll say that we could see the inevitable tracks in the snow when we were here.

Amy Chen (41:33):

Yes.

Kris Jenkins (41:33):

Right.

Amy Chen (41:33):

Yeah, exactly.

Kris Jenkins (41:34):

We'll feel prescient by the time we finally figure out what's happening.

Amy Chen (41:38):

Hopefully it's in our lifetime.

Kris Jenkins (41:41):

Yeah, hope so. Hopefully we get some solutions in this industry in our lifetime.

Amy Chen (41:48):

Oh goodness.

Kris Jenkins (41:48):

Amy, pleasure to talk to you. We'll catch you again. Cheers.

Amy Chen (41:49):

Yeah. Cheers. It was lovely.

Kris Jenkins (41:52):

Well, that was a fun and semi-hopeful note to end on. And now I think Amy and I will return to our side conversation debating Marvel Comics. If you're interested, the topic currently under discussion is, could Squirrel Girl kill Deadpool? There's probably another podcast out there for that kind of conversation. Maybe we should go and find it. In the meantime, if you want to hear more from Amy, they were part of a panel discussion we recorded in Austin, Texas not too long ago. We'll put a link to that episode in the show notes, or you can just scroll back a few podcasts to the episode called If Streaming is The Answer, Why Are We Still Doing Batch?

Kris Jenkins (42:30):

If you want to find all the back episodes of this podcast and vastly more about the world of event streaming and Apache Kafka, then check out our developer site developer.confluent.io for free courses, blog posts, and guides on how to build a modern streaming pipeline effectively. And if you've built your effective modern streaming pipeline and you need a place to run it, head to Confluent Cloud where you can spin up a professionally managed Kafka cluster in minutes. And if you add the code PODCAST100 to your account, you'll get $100 of extra free credit to run with. And with that, it remains for me to thank Amy Chen for joining us and you for listening. I've been your host, Kris Jenkins, and I will catch you next time.