Confluent Developer ft. Tim Berglund, Adi Polak & Viktor Gamov

Hi, we’re Tim Berglund, Adi Polak, and Viktor Gamov and we’re excited to bring you the Confluent Developer podcast (formerly “Streaming Audio.”) Our hand-crafted weekly episodes feature in-depth interviews with our community of software developers (actual human beings - not AI) talking about some of the most interesting challenges they’ve faced in their careers. We aim to explore the conditions that gave rise to each person’s technical hurdles, as well as how their experiences transformed their understanding and approach to building systems.

Whether you’re a seasoned open source data streaming engineer, or just someone who’s interested in learning more about Apache Kafka®, Apache Flink® and real-time data, we hope you’ll appreciate the stories, the discussion, and our effort to bring you a high-quality show worth your time.

How to use OpenTelemetry to Trace and Monitor Apache Kafka Systems

February 01, 2023 • Confluent, founded by the original creators of Apache Kafka® • Season 1 • Episode 255

How can you use OpenTelemetry to gain insight into your Apache Kafka® event systems? Roman Kolesnev, Staff Customer Innovation Engineer at Confluent, is a member of the Customer Solutions & Innovation Division Labs team working to build business-critical OpenTelemetry applications so companies can see what’s happening inside their data pipelines. In this episode, Roman joins Kris to discuss tracing and monitoring in distributed systems using OpenTelemetry. He talks about how monitoring each step of the process individually is critical to discovering potential delays or bottlenecks before they happen; including keeping track of timestamps, latency information, exceptions, and other data points that could help with troubleshooting.

Tracing each request and its journey to completion in Kafka gives companies access to invaluable data that provides insight into system performance and reliability. Furthermore, using this data allows engineers to quickly identify errors or anticipate potential issues before they become significant problems. With greater visibility comes better control over application health - all made possible by OpenTelemetry's unified APIs and services.

As described on the OpenTelemetry.io website, "OpenTelemetry is a Cloud Native Computing Foundation incubating project. Formed through a merger of the OpenTracing and OpenCensus projects." It provides a vendor-agnostic way for developers to instrument their applications across different platforms and programming languages while adhering to standard semantic conventions so the traces/information can be streamed to compatible systems following similar specs.

By leveraging OpenTelemetry, organizations can check that their applications and systems are healthy and performing well. It is likely to become an essential tool for large-scale organizations that need to efficiently process massive amounts of real-time data. With vendor-neutral tracing, metrics, and logging, OpenTelemetry is well placed to become a go-to framework for observing stream processing systems.

Roman explains that the OpenTelemetry APIs for Kafka are still in development and not yet available as open source. The code is complete and tested but has never run in production. If you want to learn more about the nuts and bolts, he invites you to connect with him on the Confluent Community Slack channel. You can also check out Monitoring Kafka without instrumentation with eBPF - Antón Rodríguez to learn more about a similar approach for domain monitoring.

SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo

Kris Jenkins (00:00):

Tracing and monitoring. It's a problem that we often talk about on this podcast because it's a thorny one in distributed systems, and we often talk about it from the performance angle. How do you ensure that everything's being processed fast enough? How do you deal with bottlenecks? But this week we're going to look at it in a much more fine-grained way: at the level of individual events. What if you have, say, a stock trade coming in over REST, getting processed through various systems, and eventually it has to go out over a connector to meet a reporting deadline?

Kris Jenkins (00:35):

Maybe every trade has to be reported to the regulator within the hour. That's fairly common these days. How do you debug that processing stream at the level of individual events? How do you track a single event through all those steps? Well, joining me this week with one answer is Roman Kolesnev, who's here to talk to us all about OpenTelemetry, what it can do for you, how it works, how you get started with it, and what it does for Kafka and event systems specifically.

Kris Jenkins (01:05):

He also covers what he found it lacked, the holes that he's been trying to fill, specifically around Kafka Streams and Kafka Connect support. He's made a lot of progress and he's got a lot to teach us, and there is more than just event-level tracing on offer here, but if you need event-level tracing, this is a particularly good podcast to catch. So before we begin, this podcast is brought to you by Confluent Developer. More about that at the end. But for now, I'm your host, Kris Jenkins, and this is Streaming Audio. Let's get into it.

Kris Jenkins (01:42):

On the podcast today, we have Roman Kolesnev, who's joining us from Ireland, I believe.

Roman Kolesnev (01:48):

Yep. Northern Ireland.

Kris Jenkins (01:50):

Yeah. Yeah. Now you work for, let me get this right, it's CSID. The Customer Solutions and Innovation Division. Is that right?

Roman Kolesnev (02:02):

Yep.

Kris Jenkins (02:02):

Which gets pronounced CSID. So this is the first thing I have to ask you. Why is it not just called CSI Confluent? That feels like it would be cooler.

Roman Kolesnev (02:15):

CSI. Yeah, maybe. Maybe.

Kris Jenkins (02:19):

Okay.

Roman Kolesnev (02:20):

Looking for a,

Kris Jenkins (02:21):

Nevermind. Nevermind. So you guys get to do loads of cool practical things, right? Your department's been doing things like Parallel Consumer and end-to-end encryption and stuff?

Roman Kolesnev (02:35):

Yes. We have kind of two strains of work. One is experimentation and being the crazy scientists/A-team, trying to work out things: how they could be implemented, how they can be done, and what's feasible. And part of it is finding professional services engagements and code that can be generalized and made more reusable. So the things that get implemented multiple times for clients in different engagements, that generally could be made a bit more robust, more usable, and a bit more generic, kind of quickly. And-

Kris Jenkins (03:19):

Right. Pulling out those bespoke patterns into something more useful.

Roman Kolesnev (03:24):

Yes.

Kris Jenkins (03:25):

Yeah, that makes sense. Yeah. So that brings us to what you've been working on lately because there is a very, very common pattern in building large distributed systems, right?

Roman Kolesnev (03:38):

Yes. I'm experimenting with OpenTelemetry and distributed tracing and how it works with Kafka clients: what's available and what's not available, where the gaps are, what we can fix. It all started as more of a general question: how can we do event-level tracing on Kafka and the Kafka ecosystem? And then, after doing different evaluations and exploring what's available already out of the box in different frameworks, we landed on OpenTelemetry as the new up-and-coming rage for distributed tracing.

Kris Jenkins (04:14):

Okay. And tell me something about OpenTelemetry.

Roman Kolesnev (04:18):

So OpenTelemetry is an evolution of OpenTracing and OpenCensus, which originated from Uber and was one of the driving forces. So OpenTracing was doing the tracing side of it. OpenCensus was doing the metrics side of it. And OpenTelemetry is kind of a new evolution that tries to cater for both. And its goal is to be vendor agnostic, so that you can instrument your applications on different platforms and in different programming languages, but then have common semantic conventions so that those traces and the information can be streamed into a common system. And they would all conform to similar specifications and would understand each other, would conform to the same standards.

Kris Jenkins (05:12):

Okay. And so if you've got your custom microservices around your organization made with all different kinds of technology, you could use OpenTelemetry?

Roman Kolesnev (05:22):

Yes. That's a -

Kris Jenkins (05:26):

Or if you've got a straightforward Kafka cluster, you could still use it?

Roman Kolesnev (05:29):

Yes. But even with a Kafka cluster, for example, some of your clients may be written in Java and some may be written in Go. So even when you're using the same platform for messaging, because you're instrumenting your clients, you're not instrumenting just the messaging platform, and sometimes you're not instrumenting the platform itself at all. You still need to have that conformity across your different clients so that the data that they stream, the traces that they stream, look similar to the backend, so it can process and understand them.

Roman Kolesnev (06:03):

And then a different part of it is that it's vendor agnostic. So you're not streaming traces that only, say, Elasticsearch can consume, or Datadog or whoever. You have your collection system that is slightly separated from your processing system, and you can switch the backend much more easily because the exporter layer can be configured to point to different systems and move from one backend to another, without changing how you collect the data on your application side.

Kris Jenkins (06:43):

Right. Okay, so that sounds very nice and reusable. Let me just check I've got this right. So you're saying it's both for kind of metric-level aggregate statistics, and taking the Uber example, I could trace the events related to a particular trip through the system?

Roman Kolesnev (07:05):

Yes. So OpenTelemetry has three subsystems: tracing, metrics, and logging. Logging is still very experimental, even more experimental than tracing and metrics. It's not marked as stable on their site, if I'm correct, unless it changed very recently. But the idea is that you generally need all three for your observability.

Roman Kolesnev (07:34):

From a trace, you want to jump to logs, to look at the logs that correspond to the trace. And then maybe to look at the metrics that were collected, either based on the timing or again if you are tagging metrics with trace information because sometimes it's not enough to just have one side of it.

Kris Jenkins (07:52):

What's the distinction between tracing and logs? I guess they kind of merged in my head.

Roman Kolesnev (07:58):

So the trace records the operations that are happening and some metadata. And as a developer, you can record business-level traces, with metadata around the business logic and around the message. But then the logs are your standard logs that you write from the system. They may be message related, but they may be more system related, like consumer polling logs or whatever is happening in the system logs.

Roman Kolesnev (08:30):

Generally, again, I guess if you are writing a completely new system from scratch, for all your message-level processing you could write all the logs as events on the spans, on the traces, and so on. You can probably get away without writing any logs at all, logs as such, traditional logs. But if you already have a big system with all the logging implemented, and libraries that do the logging and so on, then you need to stitch them together, traces and logs, and potentially you're collecting them in different places as well.

Kris Jenkins (09:07):

So logs are just kind of the general traditional logging line, dumping ground of everything we know.

Roman Kolesnev (09:13):

Yes.

Kris Jenkins (09:13):

And tracing is more kind of business logic above it.

Roman Kolesnev (09:19):

It's kind of separate, a little bit. I mean, there is a very thin line because you can do tracing through logs. You can just dump the logs that contain, "Oh, I received the message with such and such metadata. I processed it, I sent it somewhere." And you can do it purely through logging. And then, say you're using Elastic or Splunk or whatever parsing and you're just parsing out the logs that are specific to message flow, for example, ignoring everything else, here's your trace. And you can visualize it, you can reduce metrics out of it, use the log time stamps to work out the performance timings and things like that. And do that all purely on logs.

Roman Kolesnev (09:58):

But then it's more work. It's easier if you have a library that has an SDK, an API, that does it in a more understandable fashion, where it's talking about, "Okay, let's record an operation," and the operation spans from here to here. And then the next operation, it depends on it [inaudible 00:10:22] spans from this point to the next point, and together they make up a trace that captures the flow of this message through services.

Kris Jenkins (10:30):

Oh, okay. Yeah.

Roman Kolesnev (10:31):

That kind of records it in the natural semantic. And both for the developer and then for architect and for operational people looking it up, it's all talking about the operations and spans and traces rather than it's just a dump of logs.

Kris Jenkins (10:51):

Yeah, yeah. Okay, I'm with you now. Okay, great. So that kind of -

Roman Kolesnev (10:57):

But then in that context, sometimes logs are still useful, because obviously it's sometimes hard to map the information that you want to capture into the semantics of span and trace, and even events happening there. You maybe want to add additional kinds of data that you can look up in the logs. Once you've traced to the point at which you want to look: here's additional information I want to lift up from the logs now that is related to it.

Kris Jenkins (11:26):

Okay. Yeah, that makes sense to me. So if you want to do that kind of application-level tracing, what do you do currently with OpenTelemetry? Does it have a Kafka plugin that you just wrap into, or what's the deal?

Roman Kolesnev (11:42):

It depends.

Kris Jenkins (11:46):

My favorite answer.

Roman Kolesnev (11:48):

So if you're talking about Java. Let's take Java, because Java and JVM clients are probably the most prolific, the most popular in the Kafka ecosystem. So if you're using consumers and producers, the standard Java consumer and producer, you have a few ways to use OpenTelemetry with them. You can add a dependency to your project, to your application, and write a bunch of code manually, add a span here, add a span there, on your application side just before sending or just after consuming. That is a very kind of basic thing. But then you have interceptors you can use, so you can attach an interceptor, a consumer interceptor or producer interceptor, which would inject the trace points, the spans for send and for receive, automatically.

Kris Jenkins (12:42):

Okay.

Roman Kolesnev (12:44):

And the other approach -

Kris Jenkins (12:48):

So you can potentially just drop something in and get automatic producer consumer tracing into OpenTelemetry without changing your code.

Roman Kolesnev (12:57):

Yes, you configure the interceptors and they would be loaded for you and do that. Then, another approach is they provide a wrapping producer and consumer. So you can set up an OpenTelemetry object in your code and say, "Give me a consumer." Now the wrapped consumer is a tracing consumer, and all its operations will be traced.
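A minimal sketch of the interceptor approach described here, using only `java.util.Properties` so it runs without the Kafka client on the classpath. `interceptor.classes` is the standard Kafka client setting; the interceptor class names are an assumption based on the OpenTelemetry kafka-clients library instrumentation, and the package name has changed between releases, so check the version you actually depend on.

```java
import java.util.Properties;

// Sketch: enabling tracing interceptors purely through client configuration.
class InterceptorConfigSketch {

    // ASSUMPTION: class names from the OpenTelemetry kafka-clients library
    // instrumentation; the package differs between releases.
    static final String PRODUCER_INTERCEPTOR =
            "io.opentelemetry.instrumentation.kafkaclients.v2_6.TracingProducerInterceptor";
    static final String CONSUMER_INTERCEPTOR =
            "io.opentelemetry.instrumentation.kafkaclients.v2_6.TracingConsumerInterceptor";

    // Properties you would pass to a KafkaProducer constructor.
    static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Standard Kafka client setting listing interceptor classes to load.
        props.setProperty("interceptor.classes", PRODUCER_INTERCEPTOR);
        return props;
    }

    // Properties you would pass to a KafkaConsumer constructor.
    static Properties consumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        props.setProperty("interceptor.classes", CONSUMER_INTERCEPTOR);
        return props;
    }

    public static void main(String[] args) {
        System.out.println(producerProps("localhost:9092"));
        System.out.println(consumerProps("localhost:9092", "demo-group"));
    }
}
```

The application code that sends and polls stays unchanged; only the configuration differs, which is what makes this the lightest option when Kafka is the only thing being traced.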

Roman Kolesnev (13:16):

The third way is to use Java auto-instrumentation. So OpenTelemetry has a Java agent that is written basically like an APM library. So you can load it alongside your JAR through the -javaagent command-line parameter, and it just loads a JAR in addition to your app. And then that auto-instrumentation has a lot of different implementations, so it would add automatic tracing for, say [inaudible 00:13:53] if it's a web service that you're hosting. Same for Kafka producer and consumer clients, and things like that.

Kris Jenkins (13:59):

And that's just using the usual JVM-level agent tracing type API?

Roman Kolesnev (14:09):

Yes. So the way it works under the hood is that it uses Byte Buddy, which is a wrapper for the Java ASM library. So at class loading it injects additional bytecode, and it has advices marked for specific libraries and SDKs and frameworks to wrap them in this tracing behavior.

Kris Jenkins (14:31):

Ah, okay. Right. Okay, I'm with you. Let's get a recommendation out of you, which of those three would you generally recommend people use?

Roman Kolesnev (14:45):

I guess the auto-instrumentation agent is the go-to. If you only want to trace Kafka, and it's only Kafka, then interceptors are easier to configure and put in place. But if you want to add additional things, like you want to add a business logic span in the middle, or you are doing a web service call, for example, things like that, and you want the additional instrumentation to be there, then the APM-style agent is the go-to.

Kris Jenkins (15:20):

Yeah, that makes sense to me, I think. If I'm going to the effort of installing it anyway, I might as well get things like REST calls thrown in for free, right?

Roman Kolesnev (15:28):

And you can mix and match. So you can have an agent running that traces the libraries and SDKs you use, but then you want to add a span inside your code to create a business-logic-specific span, specific to your process. You can add it in, they don't clash. You can use both at the same time.

Kris Jenkins (15:47):

Okay. So I'm leading in to work that I know you've been doing, but what's it like if OpenTelemetry's agent doesn't support a particular API you're interested in? What's it like to add in some extra support?

Roman Kolesnev (16:04):

It supports loading extensions. So in the same fashion as all the instrumentation is written inside the instrumentation agent, you can write your own additional instrumentation as an extension, and then you can instrument the application by using the -javaagent command-line parameter and all that. And then as an additional parameter to it, you can specify, "Okay, in addition to the main agent, load those extensions," one or more, up to you. And then you write the same kind of aspect-oriented interceptors, advice classes, using Byte Buddy and so on to add the additional behaviors that you need.

Kris Jenkins (16:41):

That's written in Java?

Roman Kolesnev (16:43):

Yes, that's for Java.

Kris Jenkins (16:45):

Okay. So you write some Java code that says, "Look out for call to this class function method," and instrument it this way?

Roman Kolesnev (16:55):

Yep. Exactly, yes. In your advice, you're saying: for this namespace, this class, this kind of method, so for example, a public send that takes two parameters where the first parameter is an integer and the second parameter is a string, apply this advice. And then you have the capability to do OnMethodEnter and OnMethodExit to specify what additional code you want to run. And you can intercept the parameter values that are passed into the method. You can replace the value that's being returned from the method with a rewritten value if you need to. It gives you a lot of flexibility.
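The enter/exit idea can be modelled with nothing but the JDK. The sketch below deliberately swaps in a `java.lang.reflect.Proxy` instead of Byte Buddy advice, so it is an analogy rather than how the agent works: it matches a `send(int, String)` method, runs code on entry and on exit, reads the arguments, and rewrites the return value. The `Sender` interface and the `EVENTS` list are hypothetical names introduced only for illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// JDK dynamic-proxy analogy for advice-style interception (not Byte Buddy).
class AdviceStyleProxySketch {

    // Hypothetical interface standing in for the class being instrumented.
    interface Sender {
        String send(int partition, String message);
    }

    // Collects what the "advice" observed, in place of real spans.
    static final List<String> EVENTS = new ArrayList<>();

    static Sender instrument(final Sender target) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Matcher: only a method named send taking (int, String).
            boolean matches = method.getName().equals("send")
                    && Arrays.equals(method.getParameterTypes(),
                            new Class<?>[] {int.class, String.class});
            if (!matches) {
                return method.invoke(target, args); // pass everything else through
            }
            EVENTS.add("enter send partition=" + args[0]); // on method enter
            Object result = method.invoke(target, args);
            EVENTS.add("exit send");                        // on method exit
            return result + " [traced]";                    // rewritten return value
        };
        return (Sender) Proxy.newProxyInstance(
                Sender.class.getClassLoader(), new Class<?>[] {Sender.class}, handler);
    }

    public static void main(String[] args) {
        Sender traced = instrument((partition, message) -> "sent:" + message);
        System.out.println(traced.send(3, "hello"));
        System.out.println(EVENTS);
    }
}
```

Byte Buddy achieves the same effect by rewriting bytecode at class-load time, so it also works on concrete classes and needs no interface or wrapper object.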

Kris Jenkins (17:36):

Yeah. It's been a long time since I've done any aspect oriented programming, but it's very useful for that kind of generic rewriting. It always reminded me of database triggers. Now, before you insert this row, do this stuff, right?

Roman Kolesnev (17:53):

Yep. But actually, yes, it's quite similar in that aspect.

Kris Jenkins (17:57):

Cool. Same concept in a different domain. So you have been writing these, I know. What have you been doing? What's missing in the current state of OpenTelemetry's native Kafka support?

Roman Kolesnev (18:16):

So for Java applications specifically, again, it covers the consumer and producer tracing quite well. But if you venture a bit outside, into Kafka Streams or Kafka Connect, then the support has gaps in it. So there is an instrumentation for Kafka Streams, but it only targets the handover from the consumer into the topology and then tracing the send of the producer. So there is nothing to help with tracing inside the topology.

Roman Kolesnev (18:52):

Now, if your Kafka Streams topology is not using any stateful operations, if you're doing everything in flight, like some mapping and then sending out, that works fine. The problems start if you're using stateful operations and you want to write some data to the state store, then look it up later for aggregations, for joins, for things like that. Then your tracing context gets lost.

Kris Jenkins (19:20):

Okay. So as soon as you do something like a KTable or a join, the support falls apart?

Roman Kolesnev (19:25):

Yeah.

Kris Jenkins (19:27):

Okay. So you've been fixing that?

Roman Kolesnev (19:29):

Yes, I was playing about with that. And the main problems: there are two things that apply not just to Kafka Streams, but in general to frameworks that do those things. One is that the processing has to be kind of one-to-one, even with consumer and producer tracing. For example, the way Kafka works, when you poll data, it polls a batch. So it polls a batch of messages, yeah?

Kris Jenkins (20:02):

Yeah.

Roman Kolesnev (20:03):

And then you have your ConsumerRecords object. Normally, you would have a for-each loop, for example, where you do something and then you produce out, if you have, say, a consume-produce kind of flow, like Kafka Streams would be. And that works fine as long as your produce is within that iteration loop. Because the way that the interceptors, the tracing code, work is that the ConsumerRecords object is augmented to return a tracing iterator. So when you iterate, it's not the standard iterator, it's actually a tracing iterator that has hooks, so that on the next call it takes the tracing context, puts it into a thread local, basically makes it the current context. And then when you do producer.send, it looks: "Do I have a tracing context in memory? Do I have it active?"

Kris Jenkins (20:57):

Right.

Roman Kolesnev (20:58):

It uses it as a parent and links them together. And now the send is linked to the previous consumer process, basically.

Kris Jenkins (21:06):

Right. Yep.

Roman Kolesnev (21:08):

Now where it falls apart is if, instead of doing the for-each and doing the send directly within it, you, for example, store the messages in a list, do something, and then iterate that list. Because that list no longer returns the tracing iterator, it has no idea.

Kris Jenkins (21:25):

Oh, right. Yep.

Roman Kolesnev (21:27):

And the context propagation is lost. So where I'm going with this-
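To make the mechanism concrete, here is a plain-Java model of a tracing iterator. It is a sketch under stated assumptions, not the OpenTelemetry implementation: `TracedRecord`, `TracingRecords`, `CURRENT_TRACE`, and `send` are hypothetical stand-ins for the consumed records, the augmented records object, the thread-local current context, and `producer.send`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of a tracing iterator; not the OpenTelemetry implementation.
class TracingIteratorSketch {

    // Hypothetical record type: a value plus the trace ID carried in its headers.
    static final class TracedRecord {
        final String value;
        final String traceId;

        TracedRecord(String value, String traceId) {
            this.value = value;
            this.traceId = traceId;
        }
    }

    // Stand-in for the per-thread "current context" kept by real instrumentation.
    static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

    // Batch whose iterator makes each record's trace context current on next().
    static final class TracingRecords implements Iterable<String> {
        private final List<TracedRecord> records;

        TracingRecords(List<TracedRecord> records) {
            this.records = records;
        }

        @Override
        public Iterator<String> iterator() {
            final Iterator<TracedRecord> inner = records.iterator();
            return new Iterator<String>() {
                @Override
                public boolean hasNext() {
                    boolean more = inner.hasNext();
                    if (!more) {
                        CURRENT_TRACE.remove(); // batch finished: no context stays active
                    }
                    return more;
                }

                @Override
                public String next() {
                    TracedRecord record = inner.next();
                    CURRENT_TRACE.set(record.traceId); // activate this record's context
                    return record.value;
                }
            };
        }
    }

    // Stand-in for producer.send(): uses the current context as parent, if any.
    static String send(String value) {
        String parent = CURRENT_TRACE.get();
        return value + " parent=" + (parent == null ? "none" : parent);
    }

    public static void main(String[] args) {
        TracingRecords batch = new TracingRecords(Arrays.asList(
                new TracedRecord("a", "trace-1"),
                new TracedRecord("b", "trace-2")));

        // Sending inside the loop: each send sees the context of its record.
        for (String value : batch) {
            System.out.println(send(value));
        }

        // Buffering first: the plain list's iterator sets no context, so the link is lost.
        List<String> buffer = new ArrayList<>();
        for (String value : batch) {
            buffer.add(value);
        }
        for (String value : buffer) {
            System.out.println(send(value));
        }
    }
}
```

Running `main` prints `a parent=trace-1` and `b parent=trace-2` for the in-loop sends, then `a parent=none` and `b parent=none` for the buffered ones, which is the silent break in the chain being discussed.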

Kris Jenkins (21:32):

It's really thorny because you wouldn't know as a programmer that you've just broken that chain, right?

Roman Kolesnev (21:38):

Yes. Again, I guess it tries to cater to the 90%, 95%, whatever, of the standard patterns, the ones that you can anticipate. Because the problem that I found as well, writing that extension: when you're working as a developer on a platform or on an application, you know what you're doing, you know what your processing is, you know what your domain is, you know how you're using it. Slightly less so if you are working on a platform or a library that you give to your internal developers, but even then you can kind of give some constraints or capture some usage scenarios because you're within the company, within the departments.

Roman Kolesnev (22:27):

But, then when you're writing that instrumentation for general use, you can anticipate only so much. Someone is doing something crazy, someone is doing something completely different that you haven't thought about. And sometimes you can't really just do anything about it. Or you can say, "Okay, this way it works. This way you need to think how to solve it in your code additionally."

Kris Jenkins (22:53):

Yeah.

Roman Kolesnev (22:54):

And it's not hard in a sense that, if you need to do that handover, if you're writing to some intermediate buffer or list or whatever before you do processing and send or before you do the send, you can always read the trace context that's available while you're iterating over consumer records.

Kris Jenkins (23:16):

Okay, yeah.

Roman Kolesnev (23:17):

Store it somewhere, and with a message, store a pair, here's a message, here's a trace context. Now, when you're passing it to produce, you can say, "Here's a message, use this trace context for your send as a parent."
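The handover just described, storing a pair of message and trace context and passing that context explicitly at send time, can be sketched in plain Java as follows. `Buffered`, `capture`, `sendWithParent`, and `CURRENT_TRACE` are hypothetical names for illustration. With the OpenTelemetry Java API, the corresponding calls would be along the lines of `Context.current()` to capture and `SpanBuilder.setParent(context)` to restore, though the sketch avoids that dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of handing a trace context across an intermediate buffer.
class ContextHandoverSketch {

    // Stand-in for the per-thread current context set by a tracing iterator.
    static final ThreadLocal<String> CURRENT_TRACE = new ThreadLocal<>();

    // A buffered message paired with the context that was current when it was consumed.
    static final class Buffered {
        final String message;
        final String traceContext;

        Buffered(String message, String traceContext) {
            this.message = message;
            this.traceContext = traceContext;
        }
    }

    // Call while iterating the consumed batch, when the context is still active.
    static Buffered capture(String message) {
        return new Buffered(message, CURRENT_TRACE.get());
    }

    // Stand-in for a send that takes its parent explicitly instead of from the thread.
    static String sendWithParent(Buffered buffered) {
        return buffered.message + " parent=" + buffered.traceContext;
    }

    public static void main(String[] args) {
        List<Buffered> buffer = new ArrayList<>();

        // Simulated consume loop: the instrumentation would set the context per record.
        String[][] consumed = {{"a", "trace-1"}, {"b", "trace-2"}};
        for (String[] record : consumed) {
            CURRENT_TRACE.set(record[1]);
            buffer.add(capture(record[0]));
        }
        CURRENT_TRACE.remove(); // loop over: nothing is current any more

        // Later, outside the loop, each send still knows its parent.
        for (Buffered buffered : buffer) {
            System.out.println(sendWithParent(buffered));
        }
    }
}
```

The cost is that the buffering code has to know about tracing, which is why generic instrumentation cannot do this for you automatically.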

Kris Jenkins (23:32):

But you have an insight into how Kafka Streams tends to work, so you can pick up some of those common patterns.

Roman Kolesnev (23:41):

Yes. So the same thing with Kafka Streams. So Kafka Streams, tracing propagation and context. So it had kind of two areas that we looked at. One is getting the trace to propagate, not losing it when we are doing stateful operations. And the second part of it is to actually trace the stateful operations themselves. So whenever we are doing a store [inaudible 00:24:07] in the state store, whenever we do a read from it, it records spans for them, it records those operations as part of the chain so that you can see it in the trace.

Roman Kolesnev (24:17):

And the trace propagation issue we are fixing exactly in the way I described. So the problem is if you have caching enabled for state stores. Kafka Streams has a caching layer, where the state store operations don't go directly to the state store but are written to [inaudible 00:24:39], and then the cache is flushed at an interval, either on a size trigger or during commit, and you have the same problem. You're iterating over messages, you're hitting an operation that does, for example, a state store put, writing to the state store, and instead of processing the follow-on operations, they're just sitting in the cache until the flush happens. So now your context flow is kind of broken because you're returning to the next message, you're closing this context of the message process, and you're picking up the next message and starting its own process span. So what we're doing is storing the context along with the cached message so that on flush we can read it back, rehydrate it, and continue on the chain.

Roman Kolesnev (25:31):

The second part of it is that state stores don't store headers. They only store the message key and payload, which sometimes causes that issue, and the tracing context is typically stored as a Kafka header. So we needed to devise a way to store the header data alongside the actual payload of the message. So I'm doing merged serialization: I'm taking the bytes of the payload, taking the bytes of the trace context, merging them together, and adding magic bytes so I can detect whether it was a traced message or not when I write to the state store. And then on read, again, I check for the magic bytes and split them out again.
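One possible shape for that merged serialization is sketched below. The byte layout, the marker value in `MAGIC`, and the class and method names are all assumptions made for illustration; the episode does not specify the actual format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of merging a trace context into a stored value behind a magic-byte marker.
class TracedValueCodec {

    // ASSUMPTION: an arbitrary marker; the real marker bytes are not given in the episode.
    static final byte[] MAGIC = {(byte) 0x54, (byte) 0x52, (byte) 0x43, (byte) 0x01};

    // Assumed layout: MAGIC | int traceLength | trace bytes | payload bytes
    static byte[] merge(byte[] traceContext, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(
                MAGIC.length + Integer.BYTES + traceContext.length + payload.length);
        buf.put(MAGIC).putInt(traceContext.length).put(traceContext).put(payload);
        return buf.array();
    }

    // True if the stored bytes start with the marker and can hold the length field.
    static boolean isTraced(byte[] stored) {
        if (stored.length < MAGIC.length + Integer.BYTES) {
            return false;
        }
        return Arrays.equals(Arrays.copyOfRange(stored, 0, MAGIC.length), MAGIC);
    }

    // Returns {traceContext, payload}; traceContext is empty for untraced values.
    static byte[][] split(byte[] stored) {
        if (!isTraced(stored)) {
            return new byte[][] {new byte[0], stored};
        }
        ByteBuffer buf = ByteBuffer.wrap(stored);
        buf.position(MAGIC.length); // skip the marker
        byte[] trace = new byte[buf.getInt()];
        buf.get(trace);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new byte[][] {trace, payload};
    }

    public static void main(String[] args) {
        byte[] trace = "00-abc-def-01".getBytes(StandardCharsets.UTF_8);
        byte[] payload = "{\"balance\":42}".getBytes(StandardCharsets.UTF_8);

        byte[][] parts = split(merge(trace, payload));
        System.out.println(new String(parts[0], StandardCharsets.UTF_8));
        System.out.println(new String(parts[1], StandardCharsets.UTF_8));
    }
}
```

A marker like this is only a heuristic: an untraced payload that happens to begin with the same bytes would be misread, and a reader that does not know the format sees the merged bytes as an opaque value.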

Kris Jenkins (26:23):

Oh, right, okay. Does that mean if I shut down the JVM and remove the agent and then started it back up, it wouldn't be able to read the state store?

Roman Kolesnev (26:35):

Yes, that's a problem.

Kris Jenkins (26:36):

Right, that's a bit of a project.

Roman Kolesnev (26:37):

It would rebuild the state store. Yes.

Kris Jenkins (26:39):

Okay.

Roman Kolesnev (26:39):

And the changelog at the moment: the stuff that it writes to the changelog has those magic bytes as well. So you would need to replay the messages from the original topic. But I have a ticket in the backlog to address it, to move the trace data from the payload into headers when writing to the changelog. And then when rereading, rebuilding the state from the changelog, if it understands those headers, it can add the trace context as needed. If not, if the tracing library is removed, it'll just ignore it because it doesn't care in the normal flow about those headers.

Kris Jenkins (27:17):

Okay. Yeah, that makes sense. That'd be a lot faster. And so you've done that for Kafka Streams. You've added a lot in for Kafka Streams. You've also said you've been doing work with Connect as well, is that ...

Roman Kolesnev (27:33):

Yeah, Connect has similar and different issues. Two kinds of gaps. So there is no specific instrumentation for Connect in OpenTelemetry at all. You can use the same tracing interceptors or agent, but it just targets the consumer and producer within the Connect workers and connectors. So depending on whether you're using a sink or source connector, for the sink connector your consumer tracing would record the span for receiving the message. But that's basically the last thing it'll record. What happens after that in your connector would be disconnected or lost.

Kris Jenkins (28:14):

Oh, okay. Yeah.

Roman Kolesnev (28:15):

And the same for the source connector. The producer send will be recorded, but where it came from would be lost. So mainly I was looking at Replicator, which is a combination of consumer and producer, but the propagation between the consumer and the producer was broken in it because it has similar logic. When it consumes the messages, whatever the source is, whether it's a database, or Replicator, which is a Kafka consumer, or MQ, first it writes them to an intermediate buffer, then converts them to internal Connect record objects, and then iterates over the buffer and produces them all. And as I was explaining, the tracing context is lost because you're using that intermediate buffer, so you're no longer using that tracing iterator.

Roman Kolesnev (29:06):

So again, similarly, we are recording the context, we are storing it and rehydrating it when we are going into the produce. We added a few extra features, like tracing for Single Message Transforms, so that if you have a chain of transformations, you can pinpoint if some of them are slow or breaking, to have that correlation.

Kris Jenkins (29:28):

Or corrupting things, simply?

Roman Kolesnev (29:31):

Yeah.

Kris Jenkins (29:32):

Yeah.

Roman Kolesnev (29:34):

And then a second part of it that is not very well kind of addressed yet in OpenTelemetry I find, is that it's more targeting individual services like microservices. Yeah?

Kris Jenkins (29:50):

Right.

Roman Kolesnev (29:50):

You have a resource attribute that says my service is, I don't know, let's say the place order service or add to cart service or something. But Connect is a platform, so your single instance of Connect can run 20 different Connect workers, and all of them are actually individual services that belong to different flows. So you don't want to name that instance "Connect platform one" because -

Kris Jenkins (30:23):

That's not very helpful.

Roman Kolesnev (30:24):

... what is it like? Yes. All the workers would be recorded as a single service. So what I had to add is some overrides, so that I extract the connector name from the config of the running connector instance, and then override that resource attribute based on the connector name, so that the traces recorded by all the different connector worker instances have names that correspond to the connector, not to the overall platform instance that runs it.

Kris Jenkins (30:55):

Right. Yeah.

Roman Kolesnev (30:58):

And that's something that would be applicable to a lot of things: say, tracing Flink or tracing KSQL, or any platform that runs jobs, where you want the name of the service to correspond to the job itself, not the platform instance.

Kris Jenkins (31:13):

Yeah. Yeah, that makes sense. That makes me wonder, have you done any work with KSQL on this yet?

Roman Kolesnev (31:20):

Not yet. I haven't looked into it deeper. I know that it uses Kafka Streams as its runtime. So my hope is that most of the Kafka Streams work would be applicable to fix, again, the same issues: propagation, the state store operations, and things like that. But it'll probably be a mix of Connect issues and Kafka Streams issues, because then again, there's the platform side of it, and extracting individual workflows, individual jobs and their metadata, and capturing it accurately as part of the trace.

Kris Jenkins (31:50):

You must spend a lot of time diving into the Kafka Streams code base and just figuring out exactly how it works and writing new rules.

Roman Kolesnev (31:59):

Yes, yes. I did. Actually, on a kind of separate topic, we did a hackathon with a colleague on geospatial Kafka Streams usage.

Kris Jenkins (32:12):

Oh yeah.

Roman Kolesnev (32:12):

And as part of it we were substituting [inaudible 00:32:17] and figuring out how that would work. The previous work with OpenTelemetry, figuring out the whole flows of where the writes are, where the reads are, how it all hangs together, helps you a lot.

Kris Jenkins (32:29):

Yeah, I can see how you build on that. That's cool. So what's the actual final experience at the moment with this? As a user, as a debugger who's got nicely traced code, what am I actually going to see? Is there an OpenTelemetry GUI that now shows me my Kafka Streams topology or something?

Roman Kolesnev (32:53):

Not topology as such, but the flow of the message. So you have your spans ... so I have a demo application, a test application that I'm using, that is like a transactional bank or something like that. It processes things like a withdrawal or a deposit, a really simple flow. So it has an account KTable that stores account details (is a bank account open or closed), and then a transaction, which is a deposit or a withdrawal. So take a typical transaction, say a withdrawal, that goes in, and the way it gets traced: you have your producer send recorded, then the Kafka Streams application process picking it up and processing it.

Roman Kolesnev (33:32):

Then the lookup, so the accounts KTable fetch, reading the account details. Next, a fetch from the state store for the account balance, the current aggregate value of it. Then the update of it as the next operation, and then the send of the output to the consumer, and then the consumer picking it up. So it's that kind of flow of operations that touch either a state store or external systems. So we're not tracing each node in the topology. If there is a flat map or a filter, that doesn't get traced at the moment as an individual span, but anything that stores data or fetches data gets traced: all the kind of storage points, I guess.

Kris Jenkins (34:23):

You would see the data as it goes from topic to topic, maybe not internal topics?

Roman Kolesnev (34:30):

It would show internal topics: repartitions, changelogs, sends and reads. Yes. And the good thing about it is OpenTelemetry has a concept of linked traces, or linked spans. So when I'm recording, say, a read from a state store, the account state store, it records a link to the previous update of it. So you can navigate from this trace straight to the trace that shows the flow that updated the value I just read. So if there is an issue with that data, you can go and look at the last update to it and its flow, and find its logs and trace and all that.
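The state store linking idea can be modelled as follows: every write remembers which span performed it, and every read attaches a link back to that span. This is a standard-library sketch under stated assumptions, not the OpenTelemetry span link API or the Kafka Streams instrumentation discussed here; `Span` and `TracedStore` are hypothetical types.

```python
import itertools
from dataclasses import dataclass, field

_span_ids = itertools.count(1)  # simple increasing span IDs for the demo


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: int = field(default_factory=lambda: next(_span_ids))
    links: list = field(default_factory=list)  # (trace_id, span_id) pairs


class TracedStore:
    """Key-value store that keeps the writer's span context next to each value."""

    def __init__(self):
        self._data = {}  # key -> (value, (trace_id, span_id) of the last writer)

    def put(self, key, value, trace_id):
        span = Span("store put", trace_id)
        self._data[key] = (value, (span.trace_id, span.span_id))
        return span

    def get(self, key, trace_id):
        span = Span("store get", trace_id)
        value, writer = self._data.get(key, (None, None))
        if writer is not None:
            span.links.append(writer)  # link the read to the previous update
        return value, span
```

A read of `acct-1` inside a withdrawal trace would then carry a link to the deposit trace that last wrote the balance, which is what lets you jump from one flow to the other.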

Kris Jenkins (35:09):

Oh. So you could track back through someone's account history in your example?

Roman Kolesnev (35:13):

Yes. Yeah.

Kris Jenkins (35:14):

Oh, cool. Okay. That's groovy. And this is all some OpenTelemetry front-end GUI, or what's the actual tool?

Roman Kolesnev (35:23):

So OpenTelemetry doesn't have a GUI or backend as such. It terminates at the collector and exporter, kind of on the data agent side. So you have agents running in your apps that send traces to the collector set. And in the collectors, you can specify additional batching, pre-processing, and filtering, and then export to whatever the storage and GUI is. So there is the open-source Jaeger, which is again Uber's open-source product, but then you could be using Elasticsearch with Kibana, or Splunk, or Datadog, New Relic, Lightstep.

Kris Jenkins (36:05):

Right.

Roman Kolesnev (36:05):

Lots and lots of [inaudible 00:36:07] systems.

Kris Jenkins (36:08):

[inaudible 00:36:08] Analytics.

Roman Kolesnev (36:09):

Yeah.

Kris Jenkins (36:10):

Yeah. Okay, cool. So what's the state of this at the moment? Is it open-source? Alpha, beta? If someone wants to get their hands on it and try it.

Roman Kolesnev (36:22):

It's not open-source yet. We haven't trialed it with a customer yet, so we want to find a customer within Confluent, an engagement, I guess, that needs it and wants to partner a bit to battle test it, basically. Because it's all written, it has integration tests and things like that, but it has never run on a production platform, or in pre-production or UAT, with a proper level of flow and with use cases that maybe we haven't thought about. We want to iron out corner cases in something that is a bit more real than our demo application or integration tests.

Kris Jenkins (37:05):

Right. So you're at the, "We need to battle test this" stage?

Roman Kolesnev (37:10):

Yes, yes.

Kris Jenkins (37:12):

Cool. Okay. It sounds like if someone wanted to get started with this, then the first thing to do is grab OpenTelemetry and maybe start writing their own interceptors if they're feeling spicy.

Roman Kolesnev (37:26):

Yes. It depends. I find interceptors, for the agent itself, are best for tracing libraries, because you don't want to fork Kafka Streams, you don't want to fork the library you're using. So the way to trace it is usually to write aspect-oriented code that you can run at the right places within the library. But then, if you have your own application side, your business logic and so on, it's easier and probably better most of the time just to add a span annotation, or create a span here, create a span there.

Kris Jenkins (38:05):

Right.

Roman Kolesnev (38:05):

Those are attributes that are relevant to my domain that I want captured along with the span from my message, and things like that.
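For application code you control, "create a span here" with domain attributes might look roughly like the sketch below. It is a minimal stand-in built on the Python standard library, not the OpenTelemetry SDK; the `span` helper, the `FINISHED_SPANS` list, the attribute names, and the discount rule are all hypothetical.

```python
import contextlib
import time

FINISHED_SPANS = []  # hypothetical in-memory "exporter" for the demo


@contextlib.contextmanager
def span(name, **attributes):
    """Record a named span with domain attributes, duration, and any error."""
    record = {"name": name, "attributes": dict(attributes), "error": None}
    started = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = type(exc).__name__  # keep the exception type on the span
        raise
    finally:
        record["duration_s"] = time.perf_counter() - started
        FINISHED_SPANS.append(record)


def apply_discount(order):
    """Business logic wrapped in a span that captures domain-specific attributes."""
    with span("apply-discount", country=order["country"]) as current:
        discount = 0.1 if order["total"] >= 100 else 0.0
        current["attributes"]["discount.rate"] = discount
        return round(order["total"] * (1 - discount), 2)
```

Because the span lives next to the business logic, it can capture attributes (country, discount rate) that a generic library interceptor would have no way of knowing about.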

Kris Jenkins (38:11):

Oh, okay. Okay, I see. Yeah. So the difference is do I control the source code, happily?

Roman Kolesnev (38:18):

The source code, and how generic it is. If it's my app that is only doing my thing, and it's my business logic, then there is no point in writing interceptors for it, because it's singular, basically, or not very widely reusable.

Kris Jenkins (38:35):

Yeah. There's no value to the overhead of writing Byte Buddy code.

Roman Kolesnev (38:41):

Yeah. Now, if you are writing a library, if you're a big corporation, Netflix-size, say, or Uber, whoever, and you have platform teams that are writing a library to be used by your application teams, then you can either bake in the tracing code, because again you control the source code, or use interceptors to separate the code that does the legwork of messaging from the tracing code. And it's distributed separately, potentially.

Kris Jenkins (39:14):

Cool. I think I've got that. Yeah, there was one other question I wanted to ask you which kind of relates to all this. It's the distinction between synchronous processes and asynchronous processes. I got the sense from reading some of the OpenTelemetry documentation that it had come from a background of dealing with traditional synchronous processing. And the -

Roman Kolesnev (39:45):

Yes, that's correct.

Kris Jenkins (39:49):

... the [inaudible 00:39:48] of Kafka is something that's been less well-thought through. Is that fair?

Roman Kolesnev (39:53):

That is correct. The semantic convention for messaging, for asynchronous messaging, is still being developed. It's very recent, and the main use case was hundreds of microservices, but all dealing in the request-response pattern, with a clear request and response. Maybe not necessarily blocking, not necessarily always synchronous, but always more or less going through a chain of calls. You still have that, "Here's a request, I need to build up the response in whatever way, hitting whatever number of services."

Kris Jenkins (40:31):

Right.

Roman Kolesnev (40:31):

With Kafka, and messaging in general, it's a bit different. Sometimes you have that request-response, but more often than not, it's "here's a signal of something."

Kris Jenkins (40:42):

Yeah, yeah.

Roman Kolesnev (40:44):

The event is a signal that then flows, and it's aggregated, processed, routed, split, and enriched, and it can go different ways in different systems. There is no direct correlation between the request and the response pattern there.

Kris Jenkins (40:59):

Yeah.

Roman Kolesnev (41:00):

And on top of that, there are the transport versus logical flow considerations, because the producer accumulates a batch of messages and sends them in a block. Now, do you want the trace that records that operation, the send of the batch, or do you want the trace that follows the individual message inside the batch on its own?

Kris Jenkins (41:30):

Oh. Yeah.

Roman Kolesnev (41:30):

Or both, and then how do you link them together? And the same on the consumer side. You polled 20 messages, and it took, I don't know, 30 milliseconds, but that 30 milliseconds is for the operation that polled those 20 messages. Do you want to apply that to each message and split it and record per message, or do you record both the poll and then the processing of each message individually? Because depending on what you're doing, if you're doing performance analytics, then sometimes it's not fair to say it took 50 milliseconds to consume this message, because alongside this message you actually consumed 50 more messages in this case.

Kris Jenkins (42:11):

Yeah. But it's also not true to say it took a fiftieth, like one millisecond, because you can't straight divide by 50. Yeah.

Roman Kolesnev (42:18):

Exactly. And the thing is, it's like what you're tracing, what you're trying to look at, are you tracing a flow of the message through the system? Are you trying to find performance bottlenecks and you are more interested in kind of metrics? But then should you be looking at the metrics then instead rather than the trace timings?

Kris Jenkins (42:39):

Yeah.

Roman Kolesnev (42:39):

Or you're trying to A/B test stuff. So say, when it's taking this path, it takes this long; when it's taking that path, it takes that long. And then you are interested in performance, but you want it to be filterable, so you're looking only at some subset that's grouped in some way. Say you want to test the performance improvement of a potential config change, so you change it on some application, on some instances, and then you're comparing one against the other.

Kris Jenkins (43:14):

Right.

Roman Kolesnev (43:14):

So there is a lot in it.

Kris Jenkins (43:18):

Yeah. So it sounds like the tools are all there for doing it, but the asynchronous world brings up some new questions you have to answer.

Roman Kolesnev (43:25):

Yes. And the specification, the semantic convention, is discussing that. There is a lot of movement in that regard. I think it's getting ironed out so that you have the best of both. So you record the per-message flow, but then, using trace links, you link it to the transport-level traces, and they're kind of separate but alongside each other, so you can jump from one to the other.
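One way to picture "both, linked together" is a single transport-level span for the poll, plus one processing span per message, with links in both directions. The sketch below is a standard-library toy and makes no claim about what the messaging semantic conventions finally specify; `Span`, `trace_poll`, and the `transport-trace` ID are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    trace_id: str
    duration_ms: float
    links: list = field(default_factory=list)  # trace IDs this span links to


def trace_poll(batch, poll_ms):
    """Trace one consumer poll.

    batch is a list of (message_trace_id, processing_ms) pairs.
    Returns the transport-level poll span and the per-message spans.
    """
    poll_span = Span("poll", "transport-trace", poll_ms)
    message_spans = []
    for message_trace_id, processing_ms in batch:
        message_span = Span("process", message_trace_id, processing_ms)
        message_span.links.append(poll_span.trace_id)  # message -> transport
        poll_span.links.append(message_trace_id)       # transport -> message
        message_spans.append(message_span)
    return poll_span, message_spans
```

The poll duration stays on the transport span rather than being copied onto, or divided across, each message, so per-message timings only describe work done for that message.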

Kris Jenkins (43:51):

Okay. I could see people also using this to trace things like Akka Streams, like actor-level models, which is very similar, right?

Roman Kolesnev (44:05):

Yeah. Yeah. It supports Akka, to an extent. I haven't dug deep enough to say how much it supports it, but it supports a lot of those things. A nice thing that I found is that it automatically correlates the logs for you as well. So for the trace and span that you generate, you can just enable, using MDC, like in-context logging, injecting those IDs into the logs that your code generates while it's inside that span.
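MDC is a Java logging concept; a comparable effect in Python can be sketched with `contextvars` and a `logging.Filter` that stamps every log record with the active trace and span IDs. This is an analogy under stated assumptions, not how the OpenTelemetry Java agent implements it; `current_span`, `TraceContextFilter`, `build_logger`, and `handle_withdrawal` are hypothetical names.

```python
import contextvars
import io
import logging

# Hypothetical "active span" holder; ("-", "-") means no span is active.
current_span = contextvars.ContextVar("current_span", default=("-", "-"))


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record):
        record.trace_id, record.span_id = current_span.get()
        return True


def build_logger(stream):
    """Create a logger whose output lines include trace and span IDs."""
    logger = logging.getLogger("traced-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("trace=%(trace_id)s span=%(span_id)s %(message)s")
    )
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


def handle_withdrawal(logger, trace_id, span_id, amount):
    """Log inside a span: the IDs appear on the line without being passed in."""
    token = current_span.set((trace_id, span_id))
    try:
        logger.info("withdrawing %s", amount)
    finally:
        current_span.reset(token)
```

Lines logged inside `handle_withdrawal` carry the trace and span IDs, and lines logged outside any span show `-` placeholders, which is what makes it possible to search logs by trace ID.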

Kris Jenkins (44:37):

Oh, right.

Roman Kolesnev (44:37):

So then you can jump from the trace to the logs. But I read an article on Medium recently from Wix. They wrote their own kind of Kafka library, Greyhound I think, and what they do is go further. They inject trace IDs or message correlation IDs into the metrics that they generate. So then you can correlate the metrics back to individual message flows as well, not just the logs, I guess, where it makes sense.

Kris Jenkins (45:07):

Okay. Yeah. Is that something ... Sorry. Is Greyhound the level above OpenTelemetry or is it completely separate?

Roman Kolesnev (45:15):

Greyhound is Wix's Kafka client, kind of a richer Kafka client, so it's a wrapper around the Kafka consumer and producer, and they added the capability to write traces and metrics and so on into it.

Kris Jenkins (45:26):

Right.

Roman Kolesnev (45:28):

But it's just an interesting idea that I read about there, one that says, "Yeah, let's not just correlate logs. Let's correlate metrics on top of correlating logs and traces, to stitch it all into one kind of cross-correlated set of data."

Kris Jenkins (45:41):

An idea you could potentially steal for OpenTelemetry?

Roman Kolesnev (45:45):

I need to have a look, yes, I guess. But then again, it depends on which metrics you care about being captured per message versus ones that are more aggregatable, per instance, per thread, or whatever, because again, it's horses for courses. What's the use case, what fits where, when you should do it, when you shouldn't.

Kris Jenkins (46:09):

Yeah.

Roman Kolesnev (46:10):

Because a lot of aggregate metrics are perfectly fine for performance analytics. You don't need per-trace granularity to see that your IO is slow, for example, that you're hitting disk too much. It doesn't matter which message you're doing it for. It's just that for that instance, for that node, you're doing it.

Roman Kolesnev (46:31):

Now, the things that I guess are more relevant to per-message performance analytics are cases where something in those messages is different from other messages. For example, a checkout process that calculates some discounts, that uses some specific heuristic, and this combination of parameters going into it is much slower than everything else. Now we're trying to look: is there a different process working? Is it taking time because there is a bug that's only hit by people ordering from Canada, or by some combination, or something?

Kris Jenkins (47:09):

Yeah. Those kinds of weird combinations come up, and you've got to have some way of finding that that's the problem, right?

Roman Kolesnev (47:15):

Yes. Yeah. Okay. That's where that correlatable metrics data comes in.

Kris Jenkins (47:21):

It sounds very useful. It sounds like we have to wait for you to publish more, but that will come. If someone wants to get in touch with you, especially if they want to battle test it, can we leave your contact details in the show notes?

Roman Kolesnev (47:39):

Sure. And there is Confluent, I think, Open Slack, community Slack.

Kris Jenkins (47:45):

Oh yeah. Confluent Community Slack. If you're on there, that's probably the best place to find you, right?

Roman Kolesnev (47:49):

I'm on there, yes, and I think there is a tracing channel or observability channel as well that I check once in a while, and other people will ping me as they notice.

Kris Jenkins (48:01):

Perfect. If you're interested, head there. And we're going to get some more people from your department on, because you're doing all the fun stuff. As I said. Thank you very much for joining us.

Roman Kolesnev (48:11):

Thank you for having me.

Kris Jenkins (48:13):

Pleasure. Thank you, Roman. As we said, if you want to go and talk to Roman some more, maybe pick his brains, check out the Confluent Community Slack channel. Link in the show notes. It's easy to get in and there are always plenty of us here from Confluent, chatting with the rest of the community.

Kris Jenkins (48:30):

In the meantime, if you've got a hankering for more monitoring-related information, you might want to go back and check out an old episode we recorded from Streaming Audio with Anton Rodriguez talking about eBPF, which is interesting because it approaches exactly the same domain, monitoring, but from the level of hooking into the Linux kernel. I could definitely see a large enough organization wanting to use both. Monitoring's a large enough field to encompass both angles. So go and check it out. Link in the show notes.

Kris Jenkins (49:01):

With that said, if you want to monitor more of what we've been doing here in the developer experience corner of Confluent, head to developer.confluent.io, where we have an ever-increasing library of courses, tutorials, blog posts, videos, long and short, teaching you everything we know about event systems, from design and implementation to monitoring and performance.

Kris Jenkins (49:28):

It remains for me to thank Roman Kolesnev for joining us, and you for listening. I've been your host, Kris Jenkins, and I'll catch you next time.