Streaming Audio: Apache Kafka® & Real-Time Data
Streaming Audio features all things Apache Kafka®, Confluent, real-time data, and the cloud. We cover frequently asked questions, best practices, and use cases from the Kafka community—from Kafka connectors and distributed systems, to data mesh, data integration, modern data architectures, and data mesh built with Confluent and cloud Kafka as a service. Join our hosts as they stream through a series of interviews, stories, and use cases with guests from the data streaming industry. Apache®️, Apache Kafka, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Streaming Audio: Apache Kafka® & Real-Time Data
Security for Real-Time Data Stream Processing with Confluent Cloud
Streaming real-time data at scale and processing it efficiently is critical to cybersecurity organizations like SecurityScorecard. Jared Smith, Senior Director of Threat Intelligence, and Brandon Brown, Senior Staff Software Engineer, Data Platform at SecurityScorecard, discuss their journey from using RabbitMQ to open-source Apache Kafka® for stream processing. As well as why turning to fully-managed Kafka on Confluent Cloud is the right choice for building real-time data pipelines at scale.
SecurityScorecard mines data from dozens of digital sources to discover security risks and flaws with the potential to expose their client’ data. This includes scanning and ingesting data from a large number of ports to identify suspicious IP addresses, exposed servers, out-of-date endpoints, malware-infected devices, and other potential cyber threats for more than 12 million companies worldwide.
To allow real-time stream processing for the organization, the team moved away from using RabbitMQ to open-source Kafka for processing a massive amount of data in a matter of milliseconds, instead of weeks or months. This makes the detection of a website’s security posture risk happen quickly for constantly evolving security threats. The team relied on batch pipelines to push data to and from Amazon S3 as well as expensive REST API based communication carrying data between systems. They also spent significant time and resources on open-source Kafka upgrades on Amazon MSK.
Self-maintaining the Kafka infrastructure increased operational overhead with escalating costs. In order to scale faster, govern data better, and ultimately lower the total cost of ownership (TOC), Brandon, lead of the organization’s Pipeline team, pivoted towards a fully-managed, cloud-native approach for more scalable streaming data pipelines, and for the development of a new Automatic Vendor Detection (AVD) product.
Jared and Brandon continue to leverage the Cloud for use cases including using PostgreSQL and pushing data to downstream systems using CSC connectors, increasing data governance and security for streaming scalability, and more.
EPISODE LINKS
- SecurityScorecard Case Study
- Building Data Pipelines with Apache Kafka and Confluent
- Watch the video version of this podcast
- Kris Jenkins’ Twitter
- Streaming Audio Playlist
- Join the Confluent Community
- Learn more with Kafka tutorials, resources, and guides at Confluent Developer
- Live demo: Intro to Event-Driven Microservices with Confluent
- Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
Kris Jenkins: (00:00)
Hello, you're listening to Streaming Audio, and I have a challenge for you. Scan the whole of the internet, save the interesting bits as regularly as you possibly can. That's a lot of data. And this week we have a story that is genuinely web scale. We're talking to Jared Smith and Brandon Brown of SecurityScorecard. They do web security and security threat analysis and scanning every IP address on the internet is just one of their data sources. So you can imagine the capacity they need to stream data in at scale, and to process it efficiently. We end up talking about their journey from using RabbitMQ in the early days to the way they're using Apache Kafka today. We talk about the upsides of being able to ingest that much data and some of the problems it brings, and we also explore how they're helping the teams around them to both create and consume new feeds of data now that they have a platform that can actually cope with it.
Kris Jenkins: (01:01)
So Streaming Audio is brought to you by our education site, Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it.
Kris Jenkins: (01:18)
I'm joined today by Jared Smith and Brandon Brown. Gentlemen, welcome to Streaming Audio.
Jared Smith: (01:22)
Hi, good to be here.
Brandon Brown: (01:24)
Hi, thanks for having us.
Kris Jenkins: (01:26)
Good to have you. So let's see, Jared, you are the... this almost sounds like an FBI title. You are the Senior Director of Threat Intelligence at SecurityScorecard.
Jared Smith: (01:37)
Yes.
Kris Jenkins: (01:37)
And Brandon, you're the Senior Staff Software Engineer, right?
Brandon Brown: (01:40)
Yep.
Kris Jenkins: (01:42)
Okay. So Jared, I think we should start with you being a senior director. What's SecurityScorecard? What do you guys actually do?
Jared Smith: (01:52)
That's a good question. It depends on who you ask, but just for the sake of how we usually describe ourselves for the people that aren't familiar with security ratings. You can think of it most similar to credit ratings, but for cyber security. And that's a horrible way to explain any company, my startup being Uber for food delivery or whatever that may be. But that's a great way to think of us. So the world has lots of security problems, it has lots of things that we need to put into place to protect people from threats, protect people, from getting phished, all of those sorts of things. And we're really good at identifying that risk and bubbling it to the surface for the people that need to know how to address it, where to address it, how to improve, how to protect their network.
Kris Jenkins: (02:39)
So detection and advice on how to solve it afterwards?
Jared Smith: (02:42)
Absolutely. The remediation is a huge key as well. Yes.
Kris Jenkins: (02:45)
Yeah, I can imagine. Give me one example. What's one kind of threat you deal with?
Jared Smith: (02:51)
One simple example would just be simply people leaving ports exposed on their network. Imagine that you have a login for a cloud server, and that's on your network. And when you are testing it in development, you forget to bring it down, but it's still connected to a production database. That's a risk, because if somebody finds that or there's a zero day in that box, or it's a reuse password, that can then lead to bad things inside your network.
Kris Jenkins: (03:21)
Yeah, absolutely. Okay. In that case, I'm going to head over to the senior staff software engineer. Give me a spicier example, Brandon.
Brandon Brown: (03:29)
A spicier example. So one of the key things is that all of the data aggregation we do is all from the outside in. So, we don't actually go into your network and scan, all of our scanning and data collection is all from publicly available information. So that's a key differentiator as well. So for example, one of the products I work on, AVD, Automatic Vendor Detection. We actually take the data that Jared's team produces and we automatically make connections saying... Because, for a simple example, you use Google and we know that because you used a Google tracker on your webpage, is that kind of super simple, easy to follow example, Right? And so we can pick these things up automatically. And so what's really exciting is larger companies that are trying things out expose themselves unintentionally.
Brandon Brown: (04:21)
We actually had a customer who we said they were using HubSpot, and they said, "We stopped using HubSpot years ago." And we look at the raw data, we go, "Well, on this date we saw linked to HubSpot." And sure enough, an internal team was experimenting with HubSpot. A small group within that company was aware of it, but the larger company was not. And so this was a potential risk to them because they didn't know, and we were able to point it out for them automatically.
Kris Jenkins: (04:52)
Because this is a thing that a lot of hackers do. They start, they are auditing what their target does and doesn't have.
Brandon Brown: (04:57)
Yep.
Jared Smith: (04:58)
Yeah.
Kris Jenkins: (04:59)
Right. So you've got to get the jump on it. Okay, I get the idea. Jared, what kind of scale are you doing this at?
Jared Smith: (05:08)
What kind of scale?
Kris Jenkins: (05:10)
Are you dealing with dozens of customers or what's your...?
Jared Smith: (05:15)
Thousands of customers, but even on my team, we don't even think about the number of customers we have, we think about the number of possible digital assets, digital indicators that we can track. And that's where Confluent starts to come in. I mean, there's 4.1 billion IPv4 addresses, about 3.9 billion are actually routable. That's not even thinking about IPv6. And one of the most fundamental things we need to do is constantly be aware of what's on all 3.9 billion of those routable IP addresses in a week to week and a half basis, if not faster.
Jared Smith: (05:49)
And so we need to talk to every single one of those, and we need to spend a few minutes assessing the security risks of each of those. And that's a large number, but we also deal with billions of accounts online that we found the data breaches, and we've got to process those. We're talking about hundreds and millions of hits a day in our honey pots that are in 63 countries that are just pulling in attacks that people are running out there that has to be streamed back to Brandon's teams to actually put into our platform. So the scale is essentially limitless. It starts to get pretty large, pretty fast.
Kris Jenkins: (06:31)
You're attempting to scan the whole internet plus other stuff?
Jared Smith: (06:34)
Yes.
Kris Jenkins: (06:36)
That will do it for scale. So maybe we should bounce back to Brandon? Give me an idea of the basic architecture of what you're doing and where Kafka comes into this.
Brandon Brown: (06:46)
Yeah, So in regards to the product that I work on, we actually take a feed from Jared's Kafka cluster, which is just raw JSON, and we only care about a few specific entities in that topic. So what we actually do is we fan it out into a different cluster and we standardize all those messages as Protobuf. And so what's great is we get very compact messages in our topic, and then we also get the speed and efficiency of, we can just subscribe to whatever feed of entity that we care about. So when we actually went GA, we had a single entity we cared about, we very quickly were able to add on two additional entities just by saying, "Start consuming these topics." And just for an example of scale, we had an issue where the message sizes increased. And so our fan out process was working fine, but our consumers actually were out of memory. And so we ended up building up, over the span of about four days, a 6 million message backlog.
Brandon Brown: (07:50)
And we were able to, just by increasing memory, go ahead and process those 6 million messages in under eight hours. And if we wanted to, we could've done it faster, we just would've had to do some more tuning and tweaking but for our use case, eight hours was perfectly acceptable. And so that was pretty exciting actually, to see that happen, to see us be able to get caught up from that in less than a day and it didn't even choke our system. And we've seen other batch based approaches, when that scale of data comes in, they just fall over, they have to reconfigure a bunch of things just to get caught back up. So this was a really powerful use case for one, using Kafka, but also using specific message types and kind of splitting out your data in ways that work.
Kris Jenkins: (08:43)
In those circumstances, a lot of companies wouldn't be able to catch up at all. They'd just throw away that missing data in the meantime and just have to pick themselves up from scratch.
Brandon Brown: (08:50)
It was actually a really funny thing that happened because I saw the alerts happening and I was like, "Oh, it's just the infrastructure being weird. There's not a problem here." And then come in to work Monday, I'm like, "Actually, something seems really off." And we were able to quickly triage the process, which basically consumes those messages, does some processing to make connections, and then saves to a Postgres database. And what was great about it is, we got a delay in data, but we still were able to serve up data because we kind of segmented our services in such a way that that delay in processing didn't affect the actual serving up of the data.
Kris Jenkins: (09:32)
So you're essentially using an architecture where Jared's team is pumping into one cluster, just gathering raw data and then handing over to you guys for analysis?
Brandon Brown: (09:43)
Yep, and what we do is we kind of do the standardization of that data, so making sure that the fields follow specific naming conventions so that you can just read those messages and not have every consumer do the pricey thing of, "Okay, well how do I map this field into this entity? Oh no, the casing of it changed. Now I got to update all my code." We handle all that standardization, so you can trust everything in our cluster is already standardized.
Kris Jenkins: (10:12)
Okay. And you're doing that so that you can just slurp everything from the internet in as fast as possible and then worrying about tidying it up later, and then the downstream people don't worry about tidying it up at all?
Brandon Brown: (10:25)
Yeah, and what it lets you do is kind of create this approach of, let's take in all the things and then let's be really focused on what are the things we're sucking in that we actually care about, and let's give time and attention to making sure those are good. And so you can work in this iterative approach where you continually add on things without having it all perfect up front.
Kris Jenkins: (10:48)
Yeah. Because in a perfect world for a system like that, you'd say, "Okay, load the entire internet into memory and then start filtering it." So you're kind of doing something like that?
Brandon Brown: (10:58)
Yep.
Kris Jenkins: (11:00)
Yeah, that makes sense. How long have you been working on this system? Or how long has the system been alive, Jared?
Jared Smith: (11:08)
Oh yeah. So AVD in particular, I mean, we've been working on it since last fall, but it went GA in the spring of this year. The early winter.
Brandon Brown: (11:17)
Yeah, actually it went GA early January.
Jared Smith: (11:21)
Early January, yeah.
Brandon Brown: (11:22)
It was January, the first iteration. And then the spring we had additional product data feed that we onboarded.
Jared Smith: (11:30)
And then we've been using Confluent, and just Kafka more broadly, to do the internet scanning since January, 2021. So it was a whole story of moving off of RabbitMQ, which we had for years, to Kafka, which immediately scaled almost seamlessly. I mean, basically seamlessly. The only non seamless piece was just managing the on-prem infrastructure with Confluent. But I mean, that was still better than running our own cluster in something like MSK. And that really ties into our story because as soon as we started sucking in data, we weren't going to limit the speed at which we would scan the internet or the number of ports we would scan, which we had to do with RabbitMQ because it wouldn't scale. But that was just fundamentally not going to be the thing that blocked us. So if Kafka wasn't able to handle that amount of data, then that wasn't going to be the right solution.
Jared Smith: (12:21)
But it works. I mean, that's what it's designed for. It's for a massive amount of data. It's designed to be able to track it, replay it, and process it, all of that. And we've been able to just scale in terms of new data we collect for our customers just by adding new topics and new collectors. And so a lot of our collectors are Python, basically Python programs that we distribute around the world in different boxes, or we run crawlers and they all just produce a Kafka in different topics. And that's the super simple way to scale up all the different types of data we bring in that we then surface to customers upstream.
Kris Jenkins: (12:58)
Yeah, it makes sense that you are... I mean, I can tell just from talking to the two of you, you've separated out reads and writes...
Kris Jenkins: (13:03)
I mean, I can tell just from talking to the two of you, you've separated out reads and writes very nicely.
Brandon Brown: (13:04)
Yeah, and-
Kris Jenkins: (13:05)
... Conway's Law seems to be working for you.
Brandon Brown: (13:08)
... The beauty of it too is that because they are separated out, it's very easy to improve upon one of those systems without having this kind of monolithic, "Oh, well, this change is held up by this other team's system." We can work in parallel independent streams. And the other thing when I came on about a little over a year ago, was I did not want to manage on-prem Kafka at all. So we actually had MSK and it was not meeting our needs.
Brandon Brown: (13:38)
And so Jared got me connected with Confluent and we started using Cloud from the get go. And as soon as we were able to do an annual commit, we were able to use Kafka Connect, and we were actually able to take, in our service, a lot of expensive rest API calls and eliminate them completely because we could use just Kafka connectors to stream from upstream databases and save off the information we wanted, so we could do lookups. So what's really cool about our ABD data set is now, officially, any piece of data that constructs it lives in a single database, in a single schema. You can query it and play with it and interact with it without having to go hunt down various sources of data. And so it's allowed us to actually fix bugs really fast as well because of that.
Kris Jenkins: (14:26)
So is... You use a fair bit of Postgres as the last loading layer before the presentation layer, is that what you're saying?
Brandon Brown: (14:33)
Yep. We actually do some really, really cool, I like to call it using and abusing Postgres. We have a materialized view that kind of, we're using that because of indexes that we could put on it. But that basically gives you, for all of the things that we rate and score, all of those companies' third parties. And that materialized view is about 42 million rows. We can concurrently refresh the materialized view every hour. It takes about 20 minutes to concurrently refresh. So every hour, we're updating the data, we are able to then use that to calculate your fourth parties as well, which then obviously that 41 billion rows expands out much higher, and we can serve up that data in a matter of seconds. So for example, a company like Google, we can get their third and fourth parties in about two seconds, which is-
Kris Jenkins: (15:35)
Because this is an important part about security, right? You trust Google, so you include their tag. But who are they using, and who are they using, and what security issues have they had? And nobody really tracks that firsthand right?
Brandon Brown: (15:48)
Right. And that's actually, if you look at a lot of the big breaches that have happened in the last few years, they're not directly because of your third parties. They're because of your fourth parties, so your vendor's vendors. And something that just, large company acquires another company, and that other company previously acquired some company that doesn't have good security, and it opens as a tax service. And so, that's part of the work that Jared's doing and that I'm doing is helping surface those in a much more easy to digest and track manner.
Kris Jenkins: (16:19)
Yeah, yeah. As they say, the friends of my friends are also my friends, but it doesn't mean that they're good at security.
Brandon Brown: (16:25)
Yep.
Jared Smith: (16:26)
Yeah. I like that. I'm going to have to reuse that.
Kris Jenkins: (16:31)
But I want to jump back a bit because I really want to dig into this. Jared, maybe you could take this part. So you started off with RabbitMQ, and I always like to know at what point people say. "This is painful, we need to move."
Jared Smith: (16:44)
Yeah. Absolutely. So the RabbitMQ was used for a while before I joined in 2020, almost two years ago in November. But the point that became painful was asking to add a single port to our scanning. So we were doing, at that time, something like, I think, 80 ports internally, and we had some other ports we brought in from external providers, other scan providers, and I wanted to add ports. So can we scan this port, this port, this port? And there was no consensus on whether that would bring down the system or not. The actual streaming's-
Kris Jenkins: (17:25)
Just through sheer capacity?
Jared Smith: (17:26)
... Just through adding, just through scanning another 3. 9 billion IPs, and if they have that port open. Yes, it all happens and at the same time all the ports get checked at the same time, but if that brought too much data back, no one is really sure whether it would knock the system over. And so, that was the leading technical limitation. And at the same time as well though, we were being asked to move all of our scanning in-house. So moving off third party providers and bringing it all in-house.
Jared Smith: (17:57)
So there's a few companies out there called, there's one called Show You and one called Censys that you can type in, "All webcams," and you get back all webcam IPs. And so, that was what we were tasked with essentially building in-house because we weren't going to rely on someone else. And at that point, it was immediately clear that if I needed to be able to scan more than 1,400 ports, which is what we do now, and is pretty standard, then we need it to be on something more scalable. Because, I mean, we just can't, that scanning process in all the other collections, it is the root of our company's livelihood.
Jared Smith: (18:37)
If we don't have the data, if we don't have the insights and we don't have the research, we have nothing. We can't visualize things to customers, we can't attribute it to their network, we can't alert them about breaches if we don't have the data. So we really just, we were in a scramble to be like, "What is the right choice?" And I remember back from, I spent six years in the US government doing intelligence research with DOE, power grid security, I was in-
Kris Jenkins: (19:03)
You were an FBI agent. I knew it.
Jared Smith: (19:04)
... I actually work with the FBI more here than I did there. I worked with the NSA a lot more there. It was NSA, it was the CIA when I was in the US DOE. But what I was getting to is we played around with Kafka back in 2017, 2018, before I joined here in 2020. And we loved it. It wasn't nascent then, but it was something we were using the open source version of. So I had this bug in the back of my head back in... jump forward to 2020 or 2021 when I was building this here, and I was like, "Maybe I should just try Confluent, or Confluent on-prem in our platform." And the reason we did not platform is because, at the time, we were running scanners in really sketchy places, which we can talk about in a bit. And I didn't know about streaming data from mainland China back to Confluent Cloud's clusters, or from Russia or Iran. So I'll talk about that later.
Jared Smith: (19:59)
But I literally, I downloaded it and when I hit the license limit, I was like, "Man, I need to sort this out." So I asked my boss Ryan, and then he was like, "Yeah, dude, whatever we need to do, we need to make this work." And it was just me and him at the time before we hired the whole team to really take over this system and bring it where it's now. And so I reach out, I mean, I kid you not, on LinkedIn to a BDR at Confluent, and that led to us meeting Adam DeLand, who was our Account Manager. And that relationship's been almost two years in the making.
Jared Smith: (20:31)
How often does a BDR get just dropped an opportunity in their hand like this where I was the one that reached out? And so I... That's given me so much more respect because I'm like, "Man, I want to do everything I can to help the people that have made our lives so much easier." I mean, whether it be you all or other vendors that we partner with directly. That's kind of my long story, but it's been fun, and it all comes back to can we trust the technology we're building our livelihood on? And I mean, we can trust Kafka. There's a reason it is the leading streaming platform. And we love Confluent because it makes all the other hard stuff easy.
Brandon Brown: (21:13)
Yeah. Actually, I've been using Kafka since 2016, and I've always had the managing ourselves, and after my last job, I was at an oil and gas company, and after some terrible snafus we had with managing our own cluster, I was like, "I will never do that again," because it is a nightmare. And then also, we were heavy users of Kafka Connect and I was like, "And that's its own nightmare, additionally, if you want to manage." And so I was like, "I don't want to handle those two things."
Kris Jenkins: (21:41)
Yeah. I think I've reached the point where I want to know how to handle it, but I don't actually want to do it-
Brandon Brown: (21:47)
Yep.
Kris Jenkins: (21:47)
... And I certainly don't want to be doing it at 3:00 in the morning.
Brandon Brown: (21:49)
Yeah. That's not the time you want to do it.
Jared Smith: (21:50)
The Connect piece is really, it is a huge game changer for us because at least on our side, we have to produce the systems directly to clusters like Brandon's, but there's still people in-house that aren't fully Kafka-fied yet. And so for them, they read from our S3 data lakes. So it's as simple as, just, if we make the topics, then what happens is we dump the S3 for those teams that read from there, and then Brandon just pulls directly out of ours for ABD and other things. And then we also, inside our threat intelligence group, we use Splunk to do our internal visualizations, so we dump output to there as well. And recently, we've been building our own custom connectors that will do farther processing outside. So we'll dump to those as at that point. And it's really, it makes it a lot easier than having to write custom hooks to get this data to just various systems we have.
Kris Jenkins: (22:49)
Yeah. Connect is such a great connector. It's a great way to just join the dots without too much work, right?
Brandon Brown: (22:56)
It's interesting, too, because you observe in data engineering there's this mythical golden perfect representation of data processing, where you can just take inputs, do something and produce some outputs, and you see a lot of people try and implement that. And it's very hard to home grow and get it right because there's so many edge cases that you have to consider. And Connect has just had, it's had that history and been around long enough that it's very stable. And you don't have to worry about it. It pretty much is that golden system of, " I want to take some input, transform it, and result in an output." And the non-Kafka-kied teams in-house, that's a hard [inaudible 00:23:43]. I know, I liked it. It was a good one. They use a lot of snapshotting of databases to S3 to do analysis, and those database snapshots aren't incremental. They are full database snapshots.
Brandon Brown: (23:56)
And so that's a lot of cost. And the teams end up having to do logic to say, "What are the things that changed?" And you get that for free using Debezium. You get that for free in a sense, even just using the standard Postgres connector, because the data is incremental in change. And what's really exciting is you can look at those topics and you can actually start to get a sense for what is the rate of change on these tables? How much activity actually is happening? And it's actually surprising, in some cases, you go, "Actually, this data set updates, we think it updates every minute, and really it updates maybe every 30 minutes." And being able to actually see that by visualizing just topic traffic is really powerful, because you can also start to glean more insights into what is actually useful in your product and what people are using.
Kris Jenkins: (24:50)
Oh, yeah. Do you also do things like, "We're not sampling, it's clear that we're not sampling this source fast enough," or, "We need to sample it less often because it's not worth the effort."
Brandon Brown: (25:00)
Yeah. So we've had a little bit of that on our side. The one thing I'll say that's always important to consider is what is the point at which your customers would realize that something is missing?
Kris Jenkins: (25:12)
Yeah.
Brandon Brown: (25:13)
And that kind of gives you your wiggle room of how fast you have to be. And generally speaking, you can be a day behind with our data and it's okay. It's going to take a day for the customer to notice. So that gives you some wiggle room if you find issues. Kafka, because we can kind of infinitely scale it and be reactive, we can catch up to and replay and reprocess bugs and code so much faster. So instead of in a traditional batch world where you're like, "Oh, I've identified this problem, now it's going to take me a full eight hours to reprocess everything. Oh, and now the customers will notice." You can shrink that, let's just say, to four hours, and now there's actually a chance that, yeah, you know what, you probably have a couple customers that might notice it, but it's going to be-
Brandon Brown: (26:03)
You probably have a couple customers that might notice it, but it's going to be a handful. The majority of your customers will never even know that you had a delay. And that is really, I think, another powerful part of using Kafka as part of your data processing kind of infrastructure.
Kris Jenkins: (26:15)
Yeah, 'cause these kinds of scale jobs, you want something that has built into the heart of it the idea of incremental catch up. Right?
Brandon Brown: (26:24)
And I'm able to right now... We consume a feed from Jared's topic, when we fan out for our production. We're testing out using service accounts, so actually locking down what consumers and producers can read and write to. And I'm able to just go ahead, read his topic into our QA environment and test out that change in parallel. Not breaking anything with production, know that it's good and then roll it to production super quickly.
Kris Jenkins: (26:54)
Yeah, that's another thing that's one of my favorite features of Kafka. You've got this immutable store of data that you can read in all sorts of different ways without affecting anyone else.
Brandon Brown: (27:03)
Yep. You're just like, "Oh, here's my new consumer group ID. Let's go to reprocess and see how it goes."
Kris Jenkins: (27:12)
So this is leading into topic retention and compaction stuff, but how much data are you actually dealing with as a working set?
Jared Smith: (27:22)
All my topics are set to unlimited, and I think we're at hundreds of terabytes of data already just from leaving them on unlimited. But our actual data lake the other day was 11 petabytes or something, in S3. Kafka doesn't have nearly that much right now because we've implemented it in the last two years. And again, it doesn't have to deal with all those snapshots and things that are just pretty much unnecessary.
Jared Smith: (27:51)
I can give a concrete example, and this kind of plays into how multiple systems interacted, is we sucked in 4.5 billion database records that we downloaded from the dark web and parsed out. So think of usernames, passwords, credit cards, all these things that we have to track for people that are posted by threat actors. And so we sucked all of them into a topic, or we produced them into a topic and then dumped them out to an S3 connector. And all of that stream, that 4.5 billion record streamed and was saved infinitely, but it's streamed out to S3 in eight hours or something ridiculous because we just let the default settings be there. And the problem wasn't that it went so fast. The problem was the next day when that upstream batch system picked up that 4.5 billion records they were like, "What the heck is this? We have never [inaudible 00:28:41] a single day." And it broke stuff. And so it immediately told us like, "Oh man, we can't push data as fast as we want to push data, even though we can push data that fast." Because we are running now at the point where we're mature enough in our data collection and scale all the way down at the bottom, where now it's forcing us to improve and reconsider how we can take systems like AVD that are fully streaming, but get those same benefits in systems that are more batch, that will break when there's this much information. So that was a lot of data.
Kris Jenkins: (29:15)
You need some kind of back pressure on that, right?
Jared Smith: (29:18)
Yeah. So what we actually adjusted the topic to stream a lot slower.
Kris Jenkins: (29:22)
I have to say you're going to win the prize there for humble brag. So fast downstream people can't cope with it.
Jared Smith: (29:29)
Yeah. I mean, they do great stuff. But it was revealing about, as a company, in the last couple years we've really opened the doors on how much data we can bring in compared to what we used to. So it's revealed to us a lot of assumptions that were being made implicitly, simply by not having that much data. So stuff wouldn't break because it was pretty much stable, where stable is a range of millions to hundreds of millions. But when you just say, pick an example, start to do billions or hundreds of gigabytes or terabytes in one S3 bucket a day that has to get picked up it starts to slow down and impact all the way up to the front end, the actual SQL databases the front end's reading from starts to fill up itself, out past the other processing. And then our actual latency in the platform slows down simply because we push so many measurements to show the customers.
Jared Smith: (30:26)
And so that's where Brandon can speak more to just, what does the future look like to us? To me at least, and I hope to Brandon, the future looks like a world in which everything is that very much like, push data in, a bunch of things take it out and there's no implied, "This will break if there's too much data," constraints on any piece in the platform.
Brandon Brown: (30:49)
Yeah, I was going to say as a comparison, we fan out the data that Jared produces and put it into Protobuf. We only do a seven day retention, just because that's all we really care about for now. We're toying with the ideas of having some longer retentions on things. And then all of our topics are six partitions. And we're at I think a terabyte as our cluster. So that's one giant feed of data from Jared plus our Kafka connectors. And one internally produced data set is about a terabyte of data, and it holds at about that with seven day retention.
Brandon Brown: (31:29)
And the power too of the Connect stuff is we actually had an upstream database, they did a change to add a column and that was 12 million rows that got updated because they had to set defaults. And so we had our sync connector failed because we weren't dropping off the new field 'cause we did not know about it. I was able to recreate that connector with the new transform setting, go ahead, set it to six tasks, run that for about 10 minutes, be caught up, set it back down to one. And that was a minute of work to scale that up and down. It was super fast.
Brandon Brown: (32:11)
And I think that's, in my mind when I think about the future, us using more of Connect for transforming storing database data, rather than us writing a lot of custom code if we don't need to. Let's only write services to transform data if we actually have to have business processing logic that's important in there. But if you are just literally ferrying data from A to B you can use a connector to do that. And the connectors are cheap, they're going to be stable and you don't have to worry about the bugginess. You can focus on your business logic and delivering value.
Kris Jenkins: (32:52)
So is it fair to say that your only real stream processing is going from Jared's cluster to yours?
Brandon Brown: (32:59)
Yeah, for now that's a fair assessment.
Jared Smith: (33:04)
Yeah. Upstream of our cluster, yeah, that's the case. In ours we are starting to use more KSQL to do processing in line, but a lot of the processing we have to do has to hit other APIs and has to pull in other data sources that might not be already in a topic or might change depending on the actual day. And so we haven't put them into a streaming topic. So what we'll then do is suck it off of a... Or consume it, I should say consume versus suck it off the topic, I don't know where I [inaudible 00:33:37]
Kris Jenkins: (33:37)
Move on, move on, move on.
Jared Smith: (33:40)
Consume it. Consume it and then actually say... And process that information and then produce it back, because that consumer and producer that lives in its own application will then do some kind of enrichment on it. And that enrichment could query an API, it could run another scan. But what's really cool about that scanning system Horace is that every time an IP is detected with an open port, that application that's consuming and producing to Kafka will additionally produce that same exposed port to another input topic into Kafka that gets consumed from another application.
Jared Smith: (34:19)
And then there's this chain of scanning and collection that we build, and a good example is if we start by scanning the entire internet just with a tool... There's a tool called MASSCAN that just says, "Here's all the open ports out there." And so what that does is says, "There's port 22 open on 1.1.1.1." That's CloudFlare, it doesn't have that open but let's just say it does. Then that will then get produced to a topic that does Inmap. And what Inmap does is it actually collects the data off that port.
Jared Smith: (34:49)
And so once Inmap gets it it produces its raw output to Kafka, but it will also... Say another service that has an HDP port. So if it finds HDP in another one, Inmap will push that to a crawler topic where that crawler will then actually web crawl versus just scanning the port. And then we keep going farther, another one is if we find an SSL cert. When we do SSL scanning there's actually two steps you have to do to get the full information, because the world now uses something called SNI. And SNI basically ties a domain to an IP in a bit more complex way. So once we see that SSL service open we'll push that to another input topic that is going to produce a full SSL record.
Jared Smith: (35:36)
Sub domains is another one, if we find a domain but we also find a sub domain we push that. All of this chain of tasks of things is all controlled via Kafka input and output topics. And that's in that system called Horace. So that's what makes it so flexible, because anytime we want to add a new thing to do based on the input of whatever we can find in the internet, we simply write a new Python application.
Jared Smith: (36:01)
We have a bunch of bootstrapping scripts that deploy it wherever we want in the world, depending what providers we're in, which is more than the clouds. We've got 40 or so, something hosting providers that we run our boxes in because, since we do threat research, we need to be in sketchy places but we also need to be in places that everyone else is in like the cloud. Or by the cloud, I mean Amazon. But Amazon won't let us do scanning, so we use other places to do that.
Jared Smith: (36:26)
So what I'm getting at there is it makes it so flexible, and once we deploy the application it looks for an input topic in our Confluent Cloud cluster. And then when it gets information it does its work based on a key in the message and then it produces it back to a topic that it knows to produce too.
Brandon Brown: (36:45)
You can almost think of it like a much more scalable and manageable actor system, basically.
Kris Jenkins: (36:54)
Yeah, I suppose so.
Brandon Brown: (36:54)
Which is the part that I-
Kris Jenkins: (36:55)
[inaudible 00:36:55] picture I have in my head.
Brandon Brown: (36:56)
I really love about it.
Kris Jenkins: (36:57)
Yeah. The picture I have in my head is sending out lots of UFOs from the mothership and beaming data back about the planet they've scanned.
Brandon Brown: (37:06)
Yep.
Kris Jenkins: (37:08)
Maybe I've watched too much sci-fi. There was another thing I wanted to ask about, 'cause you've got a lot of data, all of which is public, some of which is arguably owned. And you've said you've got passwords, right? So is there a whole governance issue for all this data? Does that come into it, Jared?
Jared Smith: (37:30)
Yeah, absolutely. So for the open data that comes off of... Those passwords are coming off of data breach records. So think about Have I Been Pwned, where I can go type in my email and I can see that my old Gmail comes up in my fitness file from years ago because everybody was breaching that.
Jared Smith: (37:47)
But to give an example of the governance use case, there's actually another system that's not the password side, like the leaked credential, but it's something called Internal Security Suite. And that is where we are actively bringing in information from customer's security vendors and tools. So think of things like CrowdStrike, Tenable, Scanning, Duo. So we'll bring in customer data that says, "This user failed a two-factor authentication." And then we'll map that to the scanning that shows that that two-factor fail was on an end point that also has a CV on it based on outside scanning, and start to build a kill chain of how an attacker... Maybe they got through that CVE and then they tried to log in with a leaked credential, but they failed a two-factor auth. And we can actually build that because we connect with the internal systems that have your proprietary security logs and user logins, all of that.
Jared Smith: (38:40)
And so that is where the governance comes in. And this is something that we've done really cool with Confluent, where every time a customer joins ISS, which right now is in early access, if they sign up and they connect their Tenable scanner keys, which does internal network scanning, and maybe they connect Duo, what we do is we provision an entirely separate, essentially, container for them...
Jared Smith: (39:04)
An entirely separate, essentially container for them inside AWS, but at the same time we create a topic in Kafka and a full end-to-end streaming setup just for that customer within our cluster. So since the data is is within that topic, it's encrypted before it even goes in. Plus it's all using TLS back and forth with the cloud. The topics will never, they're never going to share be in the same spot as another customer's data that's sitting in another topic, even another cluster endeavor QA. And that all happens actually whenever a customer joins.
Jared Smith: (39:38)
I've written boot strapping code that calls the Confluent Cloud API and sets up this whole end to end, this whole process. And then also creates the container in ECS and Amazon. And then once that container starts, it knows what topic that customer domain belongs to and then it just starts pulling data from their environment and then pushing it into Kafka via the producer to that topic. And then the front end reads that and shows it just like we do with other things. So that's our coolest example of governance because we do have to actually keep customer data separate from another customer when it's their internal data.
Kris Jenkins: (40:12)
And so they have their own dedicated front end as well?
Jared Smith: (40:17)
Yeah, so in our platform for every customer we have these things called scorecards. And so the ISS piece, it's their own scorecard. So it's their own backend there. But when we actually bring in their data, it's [inaudible 00:40:32] three buckets which is produced through from the connector from their own separate topic and then their own separate container. So all along the way it's just for that one domain.
Kris Jenkins: (40:42)
You're wearing an awful lot of hats here, from a security professional to DevOps.
Jared Smith: (40:47)
Yeah, my PhD is in computer science and I am more of a software engineer than most people kind of treat me as initially. But I write a lot of code and I build a lot of stuff. But no, I've been doing more sales recently too for some of the other products we're launching. But all I'll say is that that's what makes it so fun to work here is because you can do anything that needs to be solved. If there's someone to solve it, then they're happy to work on it. But I will say that I'd much rather be coding than doing a lot of the other stuff. So that's why I [inaudible 00:41:25].
Kris Jenkins: (41:25)
I know exactly how you feel. The two things I want to do with my life are coding and talking about code. So let's just take a peek into the crystal ball to wrap up. Brandon, where do you want to go next?
Brandon Brown: (41:41)
I want to see more of our internal systems taking a hybrid approach. So it's more micro batching, but the micro batching is driven by Kafka so that these kind of surges in ingestive data don't cripple systems. Teams can kind of scale up is where I see things ultimately going.
Kris Jenkins: (42:05)
But you want to move them to micro batching rather than streaming. Why is that?
Brandon Brown: (42:09)
Some of the workloads that they're doing inherently need to be batched because of the types of aggregations they're doing. So it's easier to reason about if it's in a micro batch fashion where maybe say it consumes from the topic for say 15 minutes, does some processing, and then cycles back. That's much more efficient and easier to debug than the full on always on streaming, using streaming windows. And I think would be a lot easier for non-engineers to pick up as well, so it kind of increases the surface of people that can contribute and work on these things.
Kris Jenkins: (42:50)
Yeah, that makes sense. You've always got to not just consider the technology but the people involved in implementing it.
Brandon Brown: (42:57)
Especially when it comes to monitoring and alerting and kind of traceability, it's a little bit easier with the type of work they're doing for it to be a smaller batch. Because you could put alerts around very easily and you could kind of put kill switches in if something's wrong. Because you don't always want to stream out the changes constantly. Sometimes you need to have a little bit of a gating and vetting process.
Kris Jenkins: (43:22)
Yes. Yeah, I can see that. Jared, what about you? Where do you want to take it next?
Jared Smith: (43:28)
Where do I want to take it? Next is a bit more people focus and I talked about managing. So my team is a team of hardcore security researchers. Been working in security longer than I've been alive. And so I'm a little bit younger, but these people are just rock solid. They've been in McAfee for 20 years before this. They've been in the Canadian CIA, they've done investigations. I worked in intelligence but these people are amazing too. But when I'm getting at though is when we all started here, there was a mix of engineers and security researchers within the threat intelligence group because of the way our teams were broken down. And I have made it my mission to turn every one of my security researchers into not only a competent engineer that's good enough, not Brandon level, but I can say take Python, go build it, come back and let's do it.
Jared Smith: (44:27)
But also making them data pipeline people. So one by one they've been learning Confluent and Kafka and to see it click. And for me, I remember when that clicked for me, I was like this is incredible. To see that start to bleed out among our security people has been amazing. So I guess what I'm trying to do is work backwards from the people that aren't super DevOpsy but are very much SMEs and security and teach them how they can basically themselves start to produce new data that we can use in our platform.
Jared Smith: (45:05)
And that's what we're actually doing now. Because when you give the person that knows where to find stuff in the dark web the ability to code and crawl but also push it to Kafka and then I've already specified how to push it, then everything else just works because usually that would be its own team that does that. So that's my mission right now is every last person in our group, whether they're technical or not, should know how to at least interact with the data we have coming through Kafka. At least be able to set up a connector to a new system they're doing. At least be able to go reference the source data. And it's getting there. We're getting a lot closer.
Brandon Brown: (45:42)
And I would say that the thing that really opens up for is improved data quality. Because you can reach the quantity but then you can also make sure that the quantity has quality behind it, because the people actually producing it are knowledgeable about what they're producing and they can do it so much faster because they have all those years of experience.
Jared Smith: (46:01)
And it's empowering for them too because some of the people that have been in my team a lot longer, they dealt with the RabbitMQ's of the past where the answer, you could not add more data even if you wanted to because stuff would not work. So changing that mindset that like, hey, you're not going to break it, go set that topic to unlimited retention. I don't care. The cost is going to be a little bit more, but we'll deal with that later. Right now we don't want to get rid of information that we might be able to build signals out of. So that is freeing, and going forward it's very much continuing to ingrain that mentality of you can be the master of your own destiny when it comes to producing novel insights for customers.
Kris Jenkins: (46:46)
That is a nice note to end on. I should write that down and put it on a t-shirt.
Jared Smith: (46:50)
Absolutely. Well, I tried.
Kris Jenkins: (46:55)
Jared, Brandon, thanks very much for joining us on Streaming Audio. That was fascinating.
Brandon Brown: (46:58)
Thanks for having us.
Kris Jenkins: (46:58)
See you around.
Jared Smith: (46:59)
Thank you so much.
Kris Jenkins: (47:01)
Well there we go. Not only are they doing some really interesting white hat security stuff, but now I know who I need to speak to if I want to get in touch with the CIA. Contacts in the security services, they're a bit like karate lessons. We learn them in the hope that we'll never need them. Before we go, Streaming Audio is brought to you by Confluent developer, which will teach you how to make your own webscale data processing system. We've got tutorials, we've got architectural guides, we don't have a shop for t-shirts yet, but I'm going to try and fix that.
Kris Jenkins: (47:33)
And in the meantime you can find it at developer.confluent.io. Quick side story for you. This morning on Slack, someone said to me that six months ago they were listening to this podcast and now this week they've just joined Confluent. So Adam, welcome to the party and yes, we are hiring. If you'd also like to join us, check out careers.confluent.io. And with that, it remains for me to thank Jared Smith and the excellently alliterative Brandon Brown for joining us, and you for listening. I've been your host Kris Jenkins, and I will catch you next time.