Streaming Audio: Apache Kafka® & Real-Time Data

Real-time Threat Detection Using Machine Learning and Apache Kafka

November 29, 2022 Confluent, founded by the original creators of Apache Kafka® Season 1 Episode 245

Can we use machine learning to detect security threats in real time? As organizations increasingly rely on distributed systems, it is becoming more important to quickly analyze the traffic that passes through those systems. Confluent Hackathon ’22 finalist Géraud Dugé de Bernonville (Data Consultant, Zenika Bordeaux) shares how his team used TensorFlow (machine learning) and Neo4j (graph database) to analyze network traffic data and detect threats in real time. What started as a research and development exercise turned into ZIEM, a full-blown internal project using ksqlDB to manipulate, export, and visualize data from Apache Kafka®.

Géraud and his team noticed that large amounts of data passed through their network, and they were curious to see if they could detect threats as they happened. As a hackathon project, they built ZIEM, a network mapping and intrusion detection platform that quickly generates network diagrams. Using Kafka, the system captures network packets, processes the data in ksqlDB, and uses a Neo4j Sink Connector to send it to a Neo4j instance. Using the Neo4j browser, users can see instant network diagrams showing who's on the network, allowing them to detect anomalies quickly in real time.

The Ziem project was initially conceived as an experiment to explore the potential of using Kafka for data processing and manipulation. However, it soon became apparent that there was great potential for broader applications (banking, security, etc.). As a result, the focus shifted to developing a tool for exporting data from Kafka, which is helpful in transforming data for deeper analysis, moving it from one database to another, or creating powerful visualizations.

Géraud goes on to talk about how the success of this project has helped them better understand the potential of using Kafka for data processing. Zenika plans to continue working to build a pipeline that can handle more robust visualizations, expose more learning opportunities, and detect patterns.

Kris Jenkins (00:00):
In this week's Streaming Audio, we head over to Bordeaux to talk about hacking, both kinds of hacking, actually, the television meaning of hacking, where people worry about other people breaking into our networks, and then the kind of hacking that's much closer to my heart, which is the idea of building interesting things with computers just to learn, just for the joy of building stuff and seeing where it leads. Hacking, as in playing with technology to feed your brain and see if it grows into something much larger. 

Kris Jenkins (00:33):
So my guest today is Géraud Dugé de Bernonville, who's been building out a network mapping and intrusion detection system using Apache Kafka, the graph database, Neo4j, and a bit of Google's TensorFlow for machine learning. And they're doing that to see if they can tap into the fire hose of network traffic data and just see what they can learn, see what they can build. Before we get started, Streaming Audio is brought to you by our education site, Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins. This is Streaming Audio. Let's get into it. Joining me today is Géraud Dugé de Bernonville. Géraud, how you doing?

Géraud Dugé (01:20):
Hey, fine.

Kris Jenkins (01:22):
Good to have you here. I have to say, as an Englishman, that is an epic name, very stately. I hope I'm not mangling it too badly.

Géraud Dugé (01:31):
In French, too, it's quite hard sometimes.

Kris Jenkins (01:35):
It sounds like you ought to own a lot of land.

Géraud Dugé (01:38):
No.

Kris Jenkins (01:41):
But you're actually ... you're not a rich French landowner. You're actually a software data wrangler for a company called Zenika, right?

Géraud Dugé (01:50):
Yeah, that's it.

Kris Jenkins (01:52):
Tell us a little bit about what you do.

Géraud Dugé (01:54):
Yeah. So I'm based in the Bordeaux area. At Zenika, we have a bunch of consultants who deliver technological, organizational, and managerial expertise to our customers. So we do a lot of stuff. We do data stuff, data science, data processing and so on, and also development from back. We do agile, so it's quite large. But me, I'm more dedicated to data subjects, so many data projects in the Bordeaux area. So I do Kafka Connect integration and so on, but I also work a lot with Elasticsearch at the moment. And so we deliver expertise to our clients, but when we are not working for customers, we give training. So I give training on Kafka topics, Kafka deployment, architecture, administration, and development, and also on Elasticsearch. And I also give training related to machine learning and TensorFlow. A wide range of [inaudible 00:03:26].

Kris Jenkins (03:25):
A full suite of services. Funnily enough, I used to work for that kind of software consultancy agency that was based out of Paris. Is it just a coincidence or are there a lot of those sorts of houses in France?

Géraud Dugé (03:44):
Yeah. So Zenika, we have about 10 agencies in France. So we have one in Paris and we are also present in Canada, Singapore, and Morocco.

Kris Jenkins (03:59):
Oh, okay. Interesting.

Géraud Dugé (04:00):
More wide enterprise.

Kris Jenkins (04:04):
With a slight leaning towards places that speak French, by the sound of it.

Géraud Dugé (04:11):
Yes. Singapore is in English.

Kris Jenkins (04:14):
Yeah.

Géraud Dugé (04:15):
Other place, yes, more [inaudible 00:04:17].

Kris Jenkins (04:16):
Isn't there some French in Morocco? Let's not get into the geography. That's really not my strongest subject. Do you have a typical kind of client or is it just anyone that needs software?

Géraud Dugé (04:28):
No, we address different clients with different business cases. So we work with healthcare, ecommerce, and so on. So we have a wide range of customer types.

Kris Jenkins (04:47):
Okay, fair enough. Sometimes those agencies specialize and sometimes they're just who needs software, which is just about everybody these days, right?

Géraud Dugé (04:57):
Yeah. We work with who needs software.

Kris Jenkins (05:03):
Good target market because it's everyone. So getting more specific, so one of the reasons I wanted to have you on is you were one of our finalists in the Confluent Hackathon and you had an interesting project that I'm hoping you'll take me through. So give me the high-level view and then we'll get into some nitty gritty detail.

Géraud Dugé (05:23):
Yeah. So our project is called Ziem in French. So it's a pun with Ziem and Zenika. And it's mainly an internal project, more or less research and development. So we have a few people who work, few consultants who work on this. And the goal of this project is to provide real time intrusion detection. So in the process, we will collect data from network traffic, process those data, and try to detect if we can detect a potential attack from a hacker or bad person.

Kris Jenkins (06:17):
That instantly screams real time, large data to me because, network traffic, that's a lot of data to come in and you have to find out if someone's intruded soon rather than at the end of the month, right?

Géraud Dugé (06:31):
Yeah, real time. So that was the main objective of the project. This is still a work in progress, so we still have a lot of work to do on it, but for the moment, we managed to build a simple infrastructure that we can iterate on and improve over time.

Kris Jenkins (07:07):
So take us through that. What is the structure?

Géraud Dugé (07:10):
So the structure, in fact, this project comes from an old idea. So three or four years ago, we gave a conference on ksqlDB.

Kris Jenkins (07:32):
Oh, okay.

Géraud Dugé (07:33):
Sorry. The context was network traffic analysis. And last year, we decided to go further in this topic. So we decided to create a project around this conference, around this topic, and it started at the end of last year. So we started by reproducing what we had done for the first presentation. So for the infrastructure, the first big part is to create a sandbox environment where we can try to perform some attacks to collect the data, and then to create a dataset to train machine learning models on it and so on.

Kris Jenkins (08:29):
Oh, okay.

Géraud Dugé (08:31):
So this part was a big part to create this sandbox and this sandbox is... Can I go into technical detail?

Kris Jenkins (08:47):
Oh, yeah. Anything that doesn't need a diagram, you go for it.

Géraud Dugé (08:50):
Okay. So this part, the sandbox, is deployed on GCP, in Google Cloud.

Kris Jenkins (08:57):
Okay.

Géraud Dugé (08:58):
We use Terraform to deploy this stuff. And so the infrastructure can be made of several dozen virtual machines that share the same network, and we can then connect to one of those VMs and try to perform attacks.

Kris Jenkins (09:20):
Right, yeah.

Géraud Dugé (09:22):
So this is a playground where we, consultants, can play and have-

Kris Jenkins (09:29):
Safely try and break things.

Géraud Dugé (09:31):
Yeah, yeah. That's it. And on this sandbox we have plugged in an agent that collects the network traffic, and then that data is sent to a Kafka cluster.

Kris Jenkins (09:46):
Is that like a custom agent or is it just something built into...

Géraud Dugé (09:52):
For the moment, it's mainly... It's based on TShark, which is Wireshark but in client, command line mode. And-

Kris Jenkins (10:03):
So just a very standard network sniffer?

Géraud Dugé (10:08):
Yeah, yeah.

Kris Jenkins (10:09):
Yeah.

Géraud Dugé (10:09):
That's it. So it's based on TShark, and we use kcat to send the output to Kafka.

Kris Jenkins (10:20):
And you're just dumping all that raw data straight into a topic?

Géraud Dugé (10:24):
Yeah. At the moment, we take all the data, we take all.

Kris Jenkins (10:26):
Okay.

Géraud Dugé (10:30):
So this will be in the next part where we'll try to improve the data to see if we need everything or not.

Kris Jenkins (10:44):
But that's a very standard approach with Kafka. "I've got this hose of data, I just dump it into a topic and then worry about it," right?

Géraud Dugé (10:52):
Yeah, yeah. That's it. That's why Kafka was the good solution for us because it can ingest a large amount of data.

Géraud Dugé (11:04):
And the other point that made us use Kafka is ksqlDB, which is a way to easily implement a processing task.

Géraud Dugé (11:25):
And also with the integration of Kafka Connect, we can easily extract data to other systems. So that was the point of the Hackathon.

Kris Jenkins (11:40):
Right. So your pipeline is command line sniffing network packets, sending it to a Kafka topic. And then I did have a quick look at your source code, actually. So you are dumping just JSON packets as events into the topic and then you're using KSQL to reach into that JSON and massage it for later processing, right?

Géraud Dugé (12:00):
Yep, that's it.
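
As an illustration of the "reach into that JSON" step described above, here is a minimal Python sketch. The project does this step in ksqlDB; Python is used here only to show the idea. The packet layout (`_source.layers.ip`, in the style of TShark's `-T json` output), the chosen fields, and the function name `flatten_packet` are assumptions for illustration and are not taken from the ZIEM source code.

```python
import json
from typing import Optional


def flatten_packet(raw: str) -> Optional[dict]:
    """Extract a few flat fields from one TShark-style JSON packet.

    Returns None for packets without an IP layer (for example ARP).
    The nested key names are an assumption about the capture format.
    """
    packet = json.loads(raw)
    layers = packet.get("_source", {}).get("layers", {})
    ip = layers.get("ip")
    if ip is None:
        return None
    frame = layers.get("frame", {})
    return {
        "src": ip.get("ip.src"),
        "dst": ip.get("ip.dst"),
        # TShark emits numbers as strings, so convert explicitly.
        "length": int(frame.get("frame.len", 0)),
    }


# Hypothetical sample event, shaped like one message in the raw topic.
sample = json.dumps({
    "_source": {
        "layers": {
            "frame": {"frame.len": "74"},
            "ip": {"ip.src": "10.0.0.5", "ip.dst": "10.0.0.9"},
        }
    }
})

print(flatten_packet(sample))
```

Keeping the raw events untouched in one topic and writing the flattened records to another means the extraction can be changed later without recapturing any traffic.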

Kris Jenkins (12:02):
Yeah. Okay. That makes sense. And it also makes a lot of sense that you're using that because you're grabbing everything because you don't yet know what patterns you're looking for, so just grab all the data and process it.

Kris Jenkins (12:15):
So what do you do with it once it's in the topic, once you've... How do you massage it with KSQL and what are you trying to get to?

Géraud Dugé (12:22):
Yeah, so for the moment, before the Hackathon, there was only one use case for ksqlDB. It was mainly to prepare data and export it for training some machine learning models.

Géraud Dugé (12:44):
And then we plan to reiterate on this stream processing to improve, to better format the data for machine learning.

Kris Jenkins (12:55):
Right.

Géraud Dugé (12:55):
And then we saw the contest, the Hackathon, and we thought about how we could implement something that looks sexy, and quite quickly.

Géraud Dugé (13:17):
So we had the data, we know how to collect the data, we have data coming into Kafka, and so what is the best way to represent data that represents network traffic? It is a graph. So the natural solution for us was to use Neo4j.

Kris Jenkins (13:41):
Neo4j, yeah. The graph database. Yeah.

Géraud Dugé (13:44):
Yeah, in the first step. So with ksqlDB, it was very easy to create a sink connector to Neo4j that takes data from a transformed topic to a graph database.

Kris Jenkins (13:59):
Okay.

Géraud Dugé (14:00):
Then the first idea was to only present the data from the default Neo4j UI. So this was the first idea to participate in the Hackathon.

Géraud Dugé (14:24):
But after looking for a while, we found the visualization proposed by Neo4j at the time was not quite satisfying, so we decided to implement a client based on an already existing JavaScript library, which is Neovis.js.

Kris Jenkins (14:44):
Oh, okay. NeoVis? So that's a dedicated JavaScript library for visualizing Neo4j data.

Géraud Dugé (14:53):
Yeah, that's it.

Kris Jenkins (14:54):
I've not used that one. Okay.

Géraud Dugé (14:57):
So that's the screenshot you see on the GitHub repository; it shows the result of our UI.

Kris Jenkins (15:11):
I saw that and it's almost an instant network diagram of who's actually on your network and who's internal and external. And I thought before we even get into the hacking and machine learning, that would be useful for a lot of organizations that don't know their own internet intranet infrastructure, right?

Géraud Dugé (15:31):
Yeah, sure. That was the point when we implemented this because we thought, for ourselves, this is already a good point to have this visualization. And also, yes, it was in fact already on the project roadmap to have better visualization. In fact, we implemented this UI very quickly; it took only a few days.

Géraud Dugé (16:00):
Once we had the whole infrastructure and the data coming into Kafka, it was very easy to implement the step to extract to Neo4j and then plug an application into it.

Kris Jenkins (16:18):
Do you get things like a sense from your visualization of how much traffic is flowing between which nodes? Is it weighted at this stage?

Géraud Dugé (16:30):
It's in the V2 of ... Because during the summer, we continued to work on it and we improved the UI. And now, we can display the traffic flow, the volume, the count of packets that go from one point to another point.
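
The weighted graph described here, with a packet count for each pair of hosts, can be sketched in Python as follows. In the project the data reaches Neo4j through the Neo4j sink connector; this sketch only shows how per-edge counts could be computed and what an upsert in Cypher might look like. The label `Host`, the relationship type `TALKS_TO`, the property `packets`, and the sample addresses are all hypothetical.

```python
from collections import Counter

# Hypothetical flattened packets as (source IP, destination IP) pairs.
packets = [
    ("10.0.0.5", "10.0.0.9"),
    ("10.0.0.5", "10.0.0.9"),
    ("10.0.0.9", "10.0.0.5"),
    ("10.0.0.7", "8.8.8.8"),
]


def edge_weights(pairs):
    """Count packets per directed (src, dst) edge."""
    return Counter(pairs)


# Cypher statement that upserts both hosts and a weighted relationship.
# All names in it (Host, TALKS_TO, packets) are made up for this sketch.
UPSERT_EDGE = (
    "MERGE (a:Host {ip: $src}) "
    "MERGE (b:Host {ip: $dst}) "
    "MERGE (a)-[r:TALKS_TO]->(b) "
    "SET r.packets = coalesce(r.packets, 0) + $count"
)


def edge_params(weights):
    """Turn edge counts into parameter dicts for UPSERT_EDGE."""
    return [
        {"src": src, "dst": dst, "count": count}
        for (src, dst), count in sorted(weights.items())
    ]


weights = edge_weights(packets)
for params in edge_params(weights):
    print(params)
```

Using `MERGE` rather than `CREATE` keeps one node per IP address and one relationship per direction, so the diagram stays a map of hosts instead of a pile of individual packets.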

Kris Jenkins (16:52):
Nice. Because that's ... I've worked at some banks where their network diagram is a Word document that's two months at best, two months out of date. They don't know what they actually have on their network. And if you could just go from a script to that kind of visualization that you knew was live and real, that's actually really cool just in itself.

Géraud Dugé (17:13):
Yeah, sure. But yeah, there are some limitations.

Kris Jenkins (17:17):
Oh, tell me.

Géraud Dugé (17:18):
Yeah. Under clearly data-

Kris Jenkins (17:19):
Terms and conditions.

Géraud Dugé (17:20):
Yeah. Because at the moment, currently in our sandbox, there is only one network, so there is no VLAN and so on. So in fact, if you want to scrape the data from different VLANs or different networks, you'll have to deploy at least one agent on each.

Kris Jenkins (17:44):
Oh, right. Yeah. So one agent is enough to collect the local LAN traffic.

Géraud Dugé (17:50):
Yeah.

Kris Jenkins (17:50):
That doesn't seem too onerous. That doesn't seem like too high a burden, to make something that's really quite large. Yeah, that would be very cool. So getting beyond that, tell me something about your machine learning plans with this, because that sounds really cool.

Géraud Dugé (18:12):
Yeah, the really cool part. So for the moment, the first step was to perform non-supervised machine learning, mainly to perform anomaly detection. So it is to train some models and then to see if the predictions are too different from what we collect at the moment, to see if the traffic corresponds to anomalies. So it's mainly based on packet count.

Kris Jenkins (18:57):
Okay.

Géraud Dugé (18:57):
If you have a very high count of packets during a short period, it's suspicious.

Kris Jenkins (19:09):
Suspicious. Yeah. So it could be hacking, it could be some new process creating a lot of network load. But it's probably something you want to know about either way.
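
The "very high packet count in a short period" idea can be sketched in a few lines of Python. To be clear, this is a plain z-score threshold against a baseline, swapped in for illustration; it is not the trained model the team describes. The window size, the threshold, and the sample counts are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def packets_per_window(timestamps, window_seconds=10):
    """Count packets in consecutive fixed-size time windows."""
    counts = Counter(int(ts // window_seconds) for ts in timestamps)
    first, last = min(counts), max(counts)
    # Include empty windows so that quiet periods count as zero.
    return [counts.get(w, 0) for w in range(first, last + 1)]


def anomalous_windows(counts, baseline, threshold=3.0):
    """Return indexes of windows far above the baseline packet rate."""
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1.0  # avoid dividing by zero
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]


# Hypothetical packet counts per window during normal traffic.
baseline = [10, 12, 9, 11, 10, 12, 11, 9]
# Hypothetical live counts; the third window contains a burst.
live = [11, 10, 95, 12]

print(anomalous_windows(live, baseline))
```

As noted in the conversation, a flagged window is not necessarily an attack. It could equally be a new process creating load, so the output is a prompt to investigate rather than a verdict.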

Géraud Dugé (19:21):
Yeah. That's one first point of the project, to start with non-supervised. And then the next step, we are not there yet, is to use non-supervised detection to help us better label our dataset for training. And the goal after that is to go to supervised machine learning, to better categorize the different types of attacks. So to detect if an attack is a port scan or if it's an attempted DDoS in [inaudible 00:20:07] service.

Kris Jenkins (20:07):
Right, so let me just check I remember this correctly. Supervised machine learning is where you give it some examples that are labeled with, "This is an example of something that's an attack. This is an example of something that's not."

Géraud Dugé (20:21):
Yeah.

Kris Jenkins (20:22):
And unsupervised is where you give it no clues. You just give it data and say what patterns can you spot?

Géraud Dugé (20:27):
Yeah. That's it.

Kris Jenkins (20:28):
Yeah. Okay. Cool. So I learned something on that machine learning course I took. So do you have in-house security experts who are performing new attacks or are you not at that stage yet?

Géraud Dugé (20:45):
So in fact at Zenika, we have a team specializing in security. And for the first steps of the project, in fact, we use a training that is provided by this team, to try to reproduce some attacks by ourselves, on our sandbox.

Kris Jenkins (21:14):
Oh, okay. I have to ask: You're running this on GCP. Do you risk running foul of Google detecting your tests as actual attacks and shutting you down, because presumably they're doing this proactive monitoring?

Géraud Dugé (21:33):
Yeah. So in fact, we are not directly using the network layer provided by GCP. We have in fact implemented a virtual layer on top of the network. In fact, all the VMs share the same virtual network, where we authorize some stuff that GCP doesn't.

Kris Jenkins (22:04):
Okay. That sounds like you'll be safe.

Géraud Dugé (22:08):
So for the moment, we haven't had any issues with Google.

Kris Jenkins (22:15):
If anyone from the GCP team is listening to this, please leave them alone. They're safe.

Géraud Dugé (22:23):
It's for research and development.

Kris Jenkins (22:25):
Yes. It's all white hat hacking?

Géraud Dugé (22:28):
Yeah.

Kris Jenkins (22:30):
So is this something that's going to lead directly into client work, or is it just purely research and learning?

Géraud Dugé (22:38):
That is the purpose of the project too. It is to ... first, internally, to try to build a solution around this problem of network security, and to have consultants who can work on it and gain experience. And when we are more experienced, it is to industrialize this solution. And why not propose the solution to customers?

Kris Jenkins (23:18):
Yeah, because I can see you could just go straight in with that as network visualization and security detection. But it's also a kind of pipeline that's got to be reusable in other industries? We've got a ton of data coming in and we need some pipeline that can handle visualizing it, learning from it, detecting patterns from it. It must be the word Bordeaux in my head, but I instantly think how this could change the wine making industry. They have a lot of things to monitor and they need to know quickly when something unusual is happening. I guess sometimes you can just look out the window, but not always.

Géraud Dugé (24:01):
Yeah, sure. I think network security is just one use case. When we are more experienced with how we implement the pipeline and so on, collecting data, processing, training models and so on, this is a method that can be reused in other use cases.

Kris Jenkins (24:29):
Yeah. One other thing, one last thing I wanted to ask you was, I've not actually used TensorFlow in [inaudible 00:24:35], and I really want to. So give me a crash course in getting started. If I had some network connection packet traffic data, what would I actually do?

Géraud Dugé (24:47):
The purpose is to use TensorFlow to train a model on the network data. TensorFlow is a very famous library for training neural networks. A neural network is made of several layers of cells that try to learn from the incoming data. There are a lot of hyperparameters to tune: the number of cells, the number of layers, how each layer connects to the others. That's a big job for a data scientist. And the output of training is a model that we can reuse to perform predictions on fresh data coming into Kafka. That's the goal of the product too. How can we call the model? In fact, in ksqlDB, you can use user-defined functions.

Kris Jenkins (26:05):
Ah, yeah.

Géraud Dugé (26:06):
To enrich the KSQL language. The idea is to define a user-defined function that will call the model to perform prediction.

Kris Jenkins (26:16):
Oh, so you're going to inline some kind of predict function into your SQL query?

Géraud Dugé (26:24):
Yeah.

Kris Jenkins (26:25):
Okay. Because you define user defined functions in Java? So are there Java hooks into TensorFlow?

Géraud Dugé (26:31):
Yeah.

Kris Jenkins (26:32):
Okay. I thought TensorFlow was mostly Python. Am I wrong there?

Géraud Dugé (26:37):
Yeah, but you can try to load models.

Kris Jenkins (26:39):
Okay. Okay.

Géraud Dugé (26:43):
Or maybe an API. You can call an API on a remote TensorFlow server.

Kris Jenkins (26:48):
Oh, that would be cool. So that's how you get back to the real-time, per-event basis prediction.

Géraud Dugé (26:55):
Yeah.
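
The option of calling an API on a remote TensorFlow server, once per event, might look like the Python sketch below. The episode does not say which server or API the team would use. This sketch assumes TensorFlow Serving's REST predict endpoint (`/v1/models/<name>:predict` with an `instances` payload, on its default REST port 8501). The host, the model name `ziem_anomaly`, and the feature values are hypothetical, and in the project the call would sit inside a ksqlDB user-defined function written in Java rather than in Python.

```python
import json
import urllib.request


def build_predict_request(host, model, features):
    """Build a POST request for a TensorFlow Serving style predict call."""
    url = f"http://{host}:8501/v1/models/{model}:predict"
    body = json.dumps({"instances": [features]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(response_body):
    """Return the prediction for the single instance that was sent."""
    return json.loads(response_body)["predictions"][0]


def predict(host, model, features, timeout=2.0):
    """Score one event; requires a reachable model server."""
    request = build_predict_request(host, model, features)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_prediction(response.read())


# Building the request needs no network access, so it can be inspected.
req = build_predict_request("localhost", "ziem_anomaly", [42.0, 3.0])
print(req.full_url)
```

A remote call adds a network round trip to every event, which is the price paid for keeping the model outside the stream processor. Loading the model inside the function avoids that hop but couples model updates to redeploying the function.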

Kris Jenkins (26:56):
Nice. Okay. I'm going to have to try that out. I'm going to have to find time soon to play with that kind of thing.

Géraud Dugé (27:05):
Yeah. But it's still in progress, so we broke a lot to improve the rule.

Kris Jenkins (27:12):
Well, I don't think we have any rules against you resubmitting to next year's Confluent hackathon, so maybe we'll see how the project evolved by then. Right?

Géraud Dugé (27:20):
Yeah.

Kris Jenkins (27:20):
Géraud, thank you very much for coming and talking to us. That's a really interesting project.

Géraud Dugé (27:25):
Yeah. Thank you for your invitation.

Kris Jenkins (27:30):
Cheers. We'll see you again. Bye.

Géraud Dugé (27:32):
Yeah, bye.

Kris Jenkins (27:33):
Thank you, Géraud. I'm going to confess to you here, I'm a little bit jealous of him. That combination of big data and machine learning stuff is something I've never really found time to play with, and I would love to. If my employers are listening, please give me time to play with these things. In return, I'll remind our listeners that Streaming Audio is brought to you by Confluent Developer, which is our technical site that teaches you everything you need to know about Apache Kafka, realtime systems, and event systems in general. We've got tutorials, courses, architectural guides, and of course the back catalog of this very podcast. So take a look at developer.confluent.io. In the meantime, if you want to run your own Apache Kafka cluster, get it up and running easily, and leave all the management to us, take a look at our cloud service at Confluent Cloud. You can sign up in minutes, have Kafka running reliably in no time at all. If you add the code PODCAST100 to your account, you'll get a bit of extra free credit to run with.

Kris Jenkins (28:37):
And with that, it just remains for me to thank Géraud Dugé de Bernonville for joining us, to make one last apology for mangling his name, as I'm sure I have with my British accent, and to thank you for listening. I've been your host, Kris Jenkins, and I will catch you next time.