Streaming Audio: Apache Kafka® & Real-Time Data

Collecting Data with a Custom SIEM System Built on Apache Kafka and Kafka Connect ft. Vitalii Rudenskyi

July 27, 2021 · Confluent, original creators of Apache Kafka® · Season 1, Episode 169

The best-informed business insights, and the better decisions they support, begin with data collection, ahead of data processing and analytics. Enterprises today are engulfed by data floods, with sources ranging from cloud services and applications to thousands of internal servers. The massive volume of data that organizations must process presents ingestion challenges for many large companies. In this episode, data security engineer Vitalii Rudenskyi discusses the decision to replace a vendor security information and event management (SIEM) system with a custom solution built on Apache Kafka® and Kafka Connect for a better data collection strategy.

Having a data collection infrastructure layer is mission critical for Vitalii and the team in helping enterprises protect data and detect security events. Built on Kafka, their custom SIEM infrastructure is configurable and designed to ingest and analyze huge amounts of data, including personally identifiable information (PII) and healthcare data.

When it comes to collecting data, there are two fundamental choices: push or pull. But how about both? Vitalii shares that Kafka Connect API extensions are integral to data ingestion in Kafka. Three key components allow their SIEM system to collect and record data daily, by both pushing and pulling:

  1. NettySource Connector: A connector developed to receive data from different network devices and write it to Apache Kafka. It receives data over both the TCP and UDP transport protocols and can be adapted to receive anything from Syslog to SNMP and NetFlow.
  2. PollableAPI Connector: A connector made to receive data from remote systems, pulling data from different remote APIs and services.
  3. Transformations Library: Useful extensions to the existing out-of-the-box transformations, built around a “tag and apply” approach: transformations that put collected data in the right place, in the right format (a minimal sketch of one such transformation follows this list).
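
To make the “tag and apply” idea concrete, here is a minimal sketch of a custom single message transform (SMT) written against Kafka Connect's standard Transformation API. The class name, configuration keys, and routing rule are illustrative assumptions, not code from the actual transformations library discussed in the episode.

```java
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical "tag and apply" SMT: if a raw value contains a configured
// marker, the record is re-routed to a topic dedicated to that kind of data.
public class TagAndRoute<R extends ConnectRecord<R>> implements Transformation<R> {

    private String marker;      // substring that identifies the data source
    private String targetTopic; // topic that tagged records are routed to

    @Override
    public void configure(Map<String, ?> configs) {
        marker = String.valueOf(configs.get("tag.marker"));
        targetTopic = String.valueOf(configs.get("tag.topic"));
    }

    @Override
    public R apply(R record) {
        Object value = record.value();
        // Only raw string payloads are inspected; everything else passes through untouched.
        if (value instanceof String && ((String) value).contains(marker)) {
            return record.newRecord(targetTopic, record.kafkaPartition(),
                    record.keySchema(), record.key(),
                    record.valueSchema(), record.value(),
                    record.timestamp());
        }
        return record;
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef()
                .define("tag.marker", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                        "Substring that tags a record")
                .define("tag.topic", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH,
                        "Topic to route tagged records to");
    }

    @Override
    public void close() {
        // nothing to clean up
    }
}
```

Such a transform would be wired into a connector with the standard transforms properties, for example transforms=tag and transforms.tag.type pointing at the class above.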

Listen to learn more as Vitalii shares the importance of data collection and the building of a custom solution to address multi-source data management requirements. 

EPISODE LINKS

Tim Berglund:
If you're building a security information and event management system, a SIEM system, for a large enterprise, you're probably going to use Kafka. And you're also going to collect a lot of data in a lot of different formats from a lot of different systems, which makes it an interesting Kafka Connect application. I talked to Vitalii Rudenskyi, a once and future Kafka Summit presenter, about how he solved a problem like this recently. It's all on today's episode of Streaming Audio, a podcast about Kafka, Confluent, and the cloud.

Tim Berglund:
Hello, and welcome to another episode of Streaming Audio. I am your host, Tim Berglund, and I'm joined in the virtual studio today by Vitalii Rudenskyi. Vitalii is a Principal Engineer at Oracle. And today we're going to talk about a few things. We're going to talk about data collection. We're going to talk about SIEM. We're going to talk about that SIEM, the security thing, not like the seams on shirts, and we're going to talk about Kafka Connect. So Vitalii, welcome to the show.

Vitalii Rudenskyi:
Hi Tim. Thank you for offering me this opportunity to talk about what I did. Just to note, we're going to talk about the time when I was working at another company, McKesson Corporation, and all the security work I did, all the stuff that I did, relates to McKesson Corporation.

Tim Berglund:
Very, very important note, principal engineer at Oracle. What are we talking about today? Not Oracle things. So thank you for that.

Vitalii Rudenskyi:
Correct.

Tim Berglund:
For getting that out of the way, that is important. And what else... Oh, what got us into this? You gave a talk on this at Kafka Summit 2020, the first virtual Kafka Summit in August of last year. And I think you also had a blog post on the Confluent Blog that followed up on that. Is that right?

Vitalii Rudenskyi:
Yes. Yes. This is correct. So my conference talk was "Feed Your SIEM Smart with Kafka Connect." It was about how to feed your SIEM with data, how to ingest data for your SIEM, for your security.

Tim Berglund:
Yeah. Those will be linked in the show notes, obviously. So if you guys want to go follow up and watch that Summit talk and read that blog post, I think we're going to cover a lot of that in this episode.

Vitalii Rudenskyi:
And the post on the Confluent Blog was more of a technical explanation of what I did. It was related to Kafka Connect and the connectors that were used to ingest all the data.

Tim Berglund:
The data. So, not to bury the lede: a lot of this comes down to being a Kafka Connect game, but I want to set the context first, like what's the problem you were trying to solve? SIEM is a fairly well-known acronym at this point, that's S-I-E-M, but we probably want to define it for anybody who doesn't know what it means. So let's start with that. What's SIEM and what was the problem you were trying to solve?

Vitalii Rudenskyi:
Okay. SIEM is an acronym for Security Information and Event Management system. So it's there to collect all the data and analyze the data. And what we're talking about today is data ingestion. We all know that before we start analyzing any data, we need to collect the data, and that's the thing that always comes first. How to collect data, how to clean up data, how to filter data, how to get rid of the data that we don't need. And [inaudible 00:03:50] the collection. And as you know, companies sometimes build the pricing for their products around the amount of data you collect. And for big companies, with lots of data, it may be really important to reduce the amount of data to collect, and that's what this has to do with Kafka Connect.

Tim Berglund:
Okay. You do that reduction in Connect itself. Sorry, I know we're talking about ingestion, but can you give some context about the particular security problems the system was trying to solve and the kinds of data sources that were there in the business? What was the business doing? What was the basic domain and what were you trying to prevent?

Vitalii Rudenskyi:
So it was enterprise security for a big company that has a lot of PII or personal data, healthcare data, and it's really important to protect that data. And this was the infrastructure for enterprise security, to collect the data and to analyze the data. We had a SIEM product, and that SIEM product needed to be fed with data. We needed to collect data from thousands of servers that we were running inside the company. We needed to collect data from multiple cloud services. We needed to collect data from many, many important applications. It's all security data, and we needed to ingest that data, to feed our SIEM with it, to do our analytics, to do correlations, to do searches, just to be aware of the company's security.

Tim Berglund:
And so the outputs of the analytics could be something as simple as this is potentially unauthorized access, or there's a pattern where maybe somebody has escalated privileges, compromised some system, or something like that.

Vitalii Rudenskyi:
Yeah. Yeah. That kind of thing.

Tim Berglund:
Okay.

Vitalii Rudenskyi:
And maybe like fraud detection, if we're talking about analytics, but again, to make a fraud detection, we need to compare data.

Tim Berglund:
You do. So there are logs from the company's own applications, there are cloud services that are emitting, I guess, logging events mostly. And you have to collect those. So the analytics we're not going to talk about today. I just wanted that context on everybody's mind so that they know what you're trying to do, but tell us what data collection looks like in a system like that. It just sounds unpleasant. It sounds like there are a lot of things to connect to, and everything's going to be a mess and nobody's going to agree on a schema or anything. So what's it like?

Vitalii Rudenskyi:
Yeah. So since you mentioned schema, in most of the cases, data is schemaless. Data is raw. Even if we collect a simple thing like Syslog, it always differs: many companies, many vendors all say they can send Syslog, but it's not necessarily Syslog. So usually it's just a lot of raw data. And in terms of the way to collect that data, we have two ways to collect the data, push and pull. So one thing is push. We listen for the data. The most common example is Syslog: we listen for the data over the network, TCP or UDP, and we receive data from a remote side. It can be HTTP, it can be NetFlow, it can be Syslog, it can be whatever can be pushed to our side.
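
To make the push model concrete, here is a minimal, standalone sketch of the kind of UDP listener a push connector wraps. It is not the actual Netty-based connector discussed here; the port number is arbitrary and the println stands in for turning each datagram into a Kafka record.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.charset.StandardCharsets;

// Minimal illustration of "push" ingestion: accept syslog-style UDP datagrams.
public class UdpPushListener {
    public static void main(String[] args) throws Exception {
        try (DatagramChannel channel = DatagramChannel.open()) {
            channel.bind(new InetSocketAddress(5514)); // example port, not the real deployment
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            while (true) {
                buf.clear();
                channel.receive(buf);   // block until a datagram arrives
                buf.flip();
                String raw = StandardCharsets.UTF_8.decode(buf).toString();
                // A real source connector would turn this into a SourceRecord
                // and let the Connect framework write it to Kafka.
                System.out.println("received: " + raw);
            }
        }
    }
}
```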

Tim Berglund:
Got you.

Vitalii Rudenskyi:
And the other option to receive data or to collect data is to pull. So we have some APIs, we have some remote files, we have some remote databases and then we need to pull data from those services.
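
And a correspondingly minimal sketch of the pull model, using Kafka Connect's standard SourceTask API and Java's built-in HTTP client. The configuration keys, endpoint handling, and offset logic are illustrative placeholders, not the actual PollableAPI connector.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical "pull" source task: poll a remote HTTP API and hand the raw
// response body to Kafka Connect as a string record.
public class HttpPullTask extends SourceTask {

    private final HttpClient client = HttpClient.newHttpClient();
    private String endpoint;
    private String topic;

    @Override
    public void start(Map<String, String> props) {
        endpoint = props.get("api.endpoint"); // placeholder config keys
        topic = props.get("kafka.topic");
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            SourceRecord record = new SourceRecord(
                    Collections.singletonMap("endpoint", endpoint),                   // source partition
                    Collections.singletonMap("position", System.currentTimeMillis()), // naive offset
                    topic, Schema.STRING_SCHEMA, response.body());
            Thread.sleep(10_000); // naive poll interval; a real connector tracks offsets properly
            return Collections.singletonList(record);
        } catch (java.io.IOException e) {
            throw new org.apache.kafka.connect.errors.ConnectException(e);
        }
    }

    @Override
    public void stop() { }

    @Override
    public String version() {
        return "0.0.1-sketch";
    }
}
```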

Tim Berglund:
Oh, sure. So there's some other security monitoring system. I mean, the SIEM solution is not monolithic, but there's a database of events that that system cares about. And well, there's a table and rows get inserted, right?

Vitalii Rudenskyi:
Yeah. We're all in clouds now. So a lot of security data is produced by cloud services, and we also need to collect that, ingest it, and deliver it to the SIEM to correlate with internal [crosstalk 00:08:34].

Tim Berglund:
And that's the API side of things. So there's a cloud service and hey, you can go ask and it'll tell you, right? Some query, and suddenly JSON is streaming back at you with something in it that might be of interest.

Vitalii Rudenskyi:
Yeah.

Tim Berglund:
And the rest, you said Syslog and just sort of usual suspects like that. Those are the push. Okay. So certain services are pushing data at you and you have to go pull from others, which are going to be HTTP queries.

Vitalii Rudenskyi:
This is absolutely right.

Tim Berglund:
Change data capture. Okay. All right. And it's schema-less or, I mean, when I say schema, there's always a schema. There's a format and nobody agrees.

Vitalii Rudenskyi:
Eventually, it becomes data with a schema. When we finally deliver it to the SIEM, the SIEM usually expects some kind of structured data, so it can be indexed, analyzed, et cetera, but we receive the data in raw format.

Tim Berglund:
Okay.

Vitalii Rudenskyi:
So that's one of the challenges, this pain, it's really hard sometimes to make a schema for the data.

Tim Berglund:
All the glory is in the analytics, right? It's the super cool analytics you can do to detect a privilege escalation attack, and wow, that's amazing machine learning. But the investment, and I'll use the nice word, the investment, the work, is in collection. I mean, take us through it. Did you have off-the-shelf connectors? Did you have to write connectors? How did that go?

Vitalii Rudenskyi:
So a little bit about the products. Usually, a SIEM does come with its own collection layer; they're built that way. At some point, we decided to move off the existing, the provided, data collection layer to our own, built on top of Kafka Connect.

Tim Berglund:
Okay. Backing up a step. This is not a custom SIEM solution. This is a vendor SIEM solution.

Vitalii Rudenskyi:
No. So it was originally a vendor SIEM solution, which we removed and replaced with our custom solution.

Tim Berglund:
Nice. Okay. Which was Kafka-based.

Vitalii Rudenskyi:
Which is Kafka based.

Tim Berglund:
Sorry, for me asking that question. I mean, we would assume.

Vitalii Rudenskyi:
Yeah. And so it was challenging. When you work in a big enterprise, if you've ever worked in a big enterprise, you know it's not an easy thing to make changes like this. And I just want to give some kudos to my boss at the time, [inaudible 00:11:38], if you hear me, you're the best. He was smart enough, brave enough to move this forward and just replace it, and, to jump a little bit ahead, it worked. And it worked really well.

Tim Berglund:
Excellent. Hey, so let's just walk through the connectors. I'd love to know which ones were easy and Connect got to connect and make it all happen for you. And when was that not true, where you had to go and write your own and were those on the pull or the push side. Just take us through the story.

Vitalii Rudenskyi:
Yeah. So as I said in the conference presentation, we had three key things that allowed us to make all this stuff successful: we have two connectors, one connector I created for pull data and another connector for push data. And we made some custom transformations that really allowed us to turn unstructured data into structured data. So let's talk a little bit first about the push connector. It's built on top of Netty. You all know that the Netty framework is really great for networking and the issues that it solves.

Vitalii Rudenskyi:
So the first thing is it's configurable, and it's designed, developed, to be able to collect a huge amount of data, to be configured in a highly available manner, to work behind load balancers, all these fancy things for enterprise data collection and for collecting a large amount of data. You can check my blog post on this; it has a lot of details on how to implement a custom connector, or customizations for the connector, or how to configure the connector. It supports TCP, it supports UDP, and you can simply plug in your own implementation to turn network data into your data.

Tim Berglund:
Yes.

Vitalii Rudenskyi:
It comes with built-in converters, it comes with Syslog support, but if you have your own data that comes from the network, you can easily extend the connector; it has plugins.

Tim Berglund:
Are they plugins custom to the connector or are they Connect converters? Are they Kafka Connect converters?

Vitalii Rudenskyi:
They're not Kafka Connect converters, it's custom, something you have to implement. It's more specific to Netty than to Kafka Connect.

Tim Berglund:
Got it. Okay. Okay.

Vitalii Rudenskyi:
If you're familiar with Netty, it will be easier for you. As for Kafka Connect, you don't have to worry about it at all. It just works. So if you need some additional transformations, I mean, transformations not in terms of Kafka transformations, though you can plug those in obviously, but plugins in the sense of more Netty stuff, Connect is just a wrapper that allows you not to worry. The key thing, the amazing thing, is that you can start building your connectors without any knowledge of Kafka Connect. So if you know what you're doing, if you know, in terms of networking, what kind of data you receive, you can do it without deep knowledge-
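
As a rough illustration of the plugin idea described here, a decoder contract like the one below would let someone add support for a new network format without touching the Kafka Connect APIs at all. The interface name and method signatures are hypothetical; they only sketch the shape of such a plugin point, not the connector's real one.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Hypothetical plugin contract: "bytes off the wire in, events out".
// The wrapping connector handles Netty and Kafka Connect; the plugin author
// only has to know the network format they are decoding.
public interface NetworkMessageDecoder {

    /** Configure the decoder from the connector's own configuration keys. */
    void configure(Map<String, String> config);

    /**
     * Turn one network payload (a TCP frame or a UDP datagram) into zero or
     * more event strings; the connector turns each string into a Kafka record.
     */
    List<String> decode(byte[] payload);
}

// Trivial example implementation: treat each UTF-8 payload as one syslog-style line.
class PlainLineDecoder implements NetworkMessageDecoder {
    @Override
    public void configure(Map<String, String> config) { }

    @Override
    public List<String> decode(byte[] payload) {
        return List.of(new String(payload, StandardCharsets.UTF_8).trim());
    }
}
```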

Tim Berglund:
Of knowing Connect. So here's a question and it's funny. I'm assuming there's a Syslog connector for Connect. I don't think I've ever used it or directly seen it, but the internet can fact check me now, several thousand people can now tell me whether that's true or not. Because I can't Google it right now, but assuming there's a Syslog connector for Connect. What guided you in the direction of implementing something custom?

Tim Berglund:
Because you had reasons. When I'm explaining Connect to people, or just talking about the ecosystem, the reason I usually give is: well, if it's a protocol that's not covered, you've got some old mainframe application and it's not any queuing system that's already represented as a connector, whatever. But if the protocols are covered, you still had motivations. So what was your thought process for that? And I'm asking the question so that we can help explain to other people who are using Connect in interesting ways how to reason about that themselves.

Vitalii Rudenskyi:
Really good question. So yes, indeed, there are some; I believe there were even at that point. I'm pretty sure there are some connectors to collect Syslog data. But at that point, when I developed ours, the existing one was super simplified, super simple. You couldn't run it behind a firewall, or not a firewall, behind a load balancer. So it was just a single instance. And you remember the time in Kafka Connect when a rebalance started: stop the world, we're rebalancing. It was crazy. At that point, all the connectors shut down, and working with that was really painful at the time.

Vitalii Rudenskyi:
So when we mostly got rid of this rebalancing thing it was a relief for us, but we still had to make it work. We cannot stop collecting data just because Connect is doing a rebalance. So we had to put it behind a proxy, behind a TCP proxy, a UDP proxy. And that was the first problem. The second problem: we receive data from everywhere, and you never know; some old system can still only do TCP Syslog, or we need UDP Syslog. I hadn't found one at that point, I couldn't find a UDP Syslog connector. But we required TCP and UDP, and we required some kind of customization, because of different data and different formats.

Tim Berglund:
Got it.

Vitalii Rudenskyi:
So the first thing is functionality. The second thing is high availability. We have to run it behind a proxy and that's probably the key situation [crosstalk 00:18:51].

Tim Berglund:
Did the proxy do any kind of queuing? Because if Connect is doing a rebalance for 10 seconds or whatever-

Vitalii Rudenskyi:
It depends.

Tim Berglund:
Okay.

Vitalii Rudenskyi:
It depends. For UDP, not really, not any queuing.

Tim Berglund:
Okay.

Vitalii Rudenskyi:
Some of them can. If it's TCP, it's way easier; even on the [inaudible 00:19:11] side, they usually have some queue for the data they send. But in the case of UDP, they just push and forget.

Tim Berglund:
Okay.

Vitalii Rudenskyi:
And we have to be really highly available in terms of when we consume UDP data.

Tim Berglund:
Yeah. Because you don't want to lose events that are going to make you miss a... I mean, if it's UDP, the assumption is you could lose events, but you don't want to design a situation in which you're losing them. Now incremental connection-

Vitalii Rudenskyi:
And the last but not least reason why we decided to make our own connector is the configuration standpoint. When you have lots of different connectors, you can find different connectors out there, but they all have to be maintained in a different way. So just from a maintenance perspective, for the people who will maintain the infrastructure, it's way easier to have a similar kind of configuration than to have very different configuration options, very different possibilities or features, for each and every connector. That's the other thing. Not really important, but it's good to have.

Tim Berglund:
Got it. What's the hardest problem you remember about the work with this one?

Vitalii Rudenskyi:
With this push connector, it was the rebalancing and to make it really highly available. So-

Tim Berglund:
If you had incremental connect rebalancing at the time, would you still have done that?

Vitalii Rudenskyi:
Yeah. So we resolved this with the help of Kafka Connect, we resolved these issues. Our connector supported working behind the proxy, and what we did, we just created two dedicated Kafka Connect clusters behind a single proxy. And we never made changes to both clusters at the same time. So we make a change, one of the clusters rebalances, stops connectors, starts connectors, et cetera. And when it's done and back online, we do maintenance on the other cluster. You could fit more than two, but two worked for us.
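
A sketch of what that dual-cluster setup might look like at the worker-configuration level: both Connect clusters point at the same Kafka brokers, and the connectors in both produce to the same data topics, but each Connect cluster needs its own group.id and its own internal topics. All names and addresses below are placeholders, not the actual deployment.

```java
import java.util.Map;

// Two Kafka Connect worker clusters ("a" and "b") sharing one Kafka cluster.
// A TCP/UDP load balancer in front of them keeps data flowing while one
// cluster is restarted or rebalancing.
public class DualClusterWorkerConfigs {

    static Map<String, String> workerConfig(String clusterId) {
        return Map.of(
                "bootstrap.servers", "kafka-1:9092,kafka-2:9092",     // same Kafka for both
                "group.id", "siem-connect-" + clusterId,              // distinct per Connect cluster
                "config.storage.topic", "connect-configs-" + clusterId,
                "offset.storage.topic", "connect-offsets-" + clusterId,
                "status.storage.topic", "connect-status-" + clusterId,
                "key.converter", "org.apache.kafka.connect.storage.StringConverter",
                "value.converter", "org.apache.kafka.connect.storage.StringConverter");
    }

    public static void main(String[] args) {
        // Do maintenance on cluster "a" while "b" keeps receiving behind the proxy, then swap.
        System.out.println(workerConfig("a"));
        System.out.println(workerConfig("b"));
    }
}
```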

Tim Berglund:
Okay. Was there-

Vitalii Rudenskyi:
That's the way we resolved the issue.

Tim Berglund:
Did you then combine the events from those two clusters or was there like some-

Vitalii Rudenskyi:
They still write, right. They're still configured with the same connectors, still configured to use the same Kafka underneath. They use the same Kafka, they use the same topics, they use exactly the same stuff, with a single load balancer in front of them. But the idea is just that if one of the clusters is down for whatever reason, be it maintenance or rebalancing, then the other will send the data.

Tim Berglund:
Got it. Got it.

Vitalii Rudenskyi:
And will receive the data.

Tim Berglund:
It sounds like this was a pretty successful implementation. I mean, the way you're talking about it seems there was a courageous leadership decision that kicked it off and you built things and they were the right things. And it worked. Number one, tell me if I'm wrong. I don't think I'm wrong, but tell me if I'm wrong. And if you had it to do over again, is there anything you'd do differently?

Vitalii Rudenskyi:
Not really. Maybe some small things. Let's say, I would use Netty 4 instead of Netty 3. That's what I would do differently.

Tim Berglund:
Yeah.

Vitalii Rudenskyi:
I'm still working on migrating to a new version, but at that point, I decided to use 3 because it was the default for Kafka at that time. If I did it like today, I would use a newer version.

Tim Berglund:
Cool.

Vitalii Rudenskyi:
As for the rest, I think it's really good. I wouldn't change it; maybe some internals, but your code will never be perfect, you always have something to improve. But in general, I would say it was really good and I wouldn't change anything globally.

Tim Berglund:
My guest today has been Vitalii Rudenskyi. Vitalii, thanks for being a part of Streaming Audio.

Vitalii Rudenskyi:
Thank you, Tim. Thank you so much, and thank you for giving me this opportunity to talk about Kafka Connect and all this data collection.

Tim Berglund:
And there you have it. Hey, you know what you get for listening to the end, some free Confluent cloud. Use the promo code 60PDCAST, that's six, zero, P-D-C-A-S-T to get an additional $60 of free Confluent cloud usage. Be sure to activate it by December 31st, 2021, and use it within 90 days after activation. Any unused promo value after the expiration date is forfeit and there are a limited number of codes available. So don't miss out. Anyway, as always, I hope this podcast was useful to you. If you want to discuss it or ask a question, you can always reach out to me on Twitter @tlberglund, that's T-L-B-E-R-G-L-U-N-D. Or you can leave a comment on a YouTube video or reach out on Community Slack or on the community forum.

Tim Berglund:
There are sign up links for those things in the show notes if you'd like to sign up. And while you're at it, please subscribe to our YouTube channel and to this podcast, wherever Fine Podcasts are sold. And if you subscribe through Apple Podcasts, be sure to leave us a review there that helps other people discover it, especially if it's a five-star review and we think that's a good thing. So thanks for your support. And we'll see you next time.