Tracing in eDreams ODIGEO Lodging with OpenTelemetry and Grafana Tempo

In this post you will learn the basics of tracing in software and how we are introducing it into our technology stack. Let's start!

To put it simply, tracing is the process of creating traces, and depending on the field you work in, a trace is understood slightly differently:

In technical drawing, the trace of a surface is defined as the intersection of the surface with a coordinate plane (the x-y or y-z plane, for example), which ends up looking like a line in that coordinate plane.

A trace intersecting Y

In linear algebra, the trace of a matrix is defined to be the sum of elements on the main diagonal from upper left to lower right.

Example with the 3 × 3 identity matrix: trace = 1 + 1 + 1 = 3

In the 3D graphics world, there is a technique called ray tracing, which creates traces that follow the path of light and its interactions with the objects in a scene, and then helps render images with photorealism.

In software development, tracing is usually linked to debugging the execution flow and progression of an algorithm, application and/or service.

When you profile your software, traces are generated in a way that you can follow the data through the call stack. Even some logs can be considered traces!

I am not sure which definition inspired the software development one, but the concept you probably already have in mind is the linearity of a trace. I don't think that is a coincidence and, so far, it is good news, as humans have a linear way of thinking.

At eDO, we have more than one service (some are micro, some are just services) that we deploy independently and, like any software product, they can fail, no matter how well you program or how good your infrastructure is.

An error in an inner service of the distributed call stack can affect the outer services that serve our customers. Those failures generally produce error logs and alerts that the team can look into to debug and solve the problem.

We are used to receiving or generating a unique identifier on each request, using it in all of our logs and then passing it (via HTTP headers, for example) to the next service, which uses the same identifier in its logs too, and so on. This identifier is best known as a CorrelationID, and it gives us a way to group all error logs and to debug with or without third parties (if we share it with them).

The general definition of a CorrelationID is a unique identifier that follows an execution flow even when it is distributed, synchronously or asynchronously (which means being able to propagate it through your event bus, if you have one).

Example of a service architecture that propagates correlation IDs from outer services to inner ones and even to asynchronous message consumers. In case of any failure, correlation IDs are sent together with every log message to the log backend, and developers can use them to debug and solve the problem.
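To make this concrete, here is a minimal sketch (not our production code) of how a service can reuse or create a CorrelationID, attach it to every log line and propagate it downstream. The header name, the ramsey/uuid dependency and the downstream URL are illustrative choices, not necessarily what we use:

```php
<?php
// Minimal CorrelationID sketch: reuse the incoming ID if present, otherwise
// generate one, append it to every log record, and pass it downstream.
use Monolog\Logger;
use Monolog\Handler\StreamHandler;
use Ramsey\Uuid\Uuid;

// 1. Reuse the CorrelationID received via the (illustrative) X-Correlation-Id
//    header, or generate a fresh one for this execution flow.
$correlationId = $_SERVER['HTTP_X_CORRELATION_ID'] ?? Uuid::uuid4()->toString();

// 2. Append it to every log record so the log backend can group them.
$logger = new Logger('outer-service');
$logger->pushHandler(new StreamHandler('php://stdout'));
$logger->pushProcessor(function (array $record) use ($correlationId) {
    $record['extra']['correlation_id'] = $correlationId;
    return $record;
});

$logger->info('Calling an inner service');

// 3. Propagate it to the next service (synchronous HTTP here; the same value
//    would travel in message metadata on an event bus).
$ch = curl_init('https://inner-service.example/endpoint');
curl_setopt($ch, CURLOPT_HTTPHEADER, ['X-Correlation-Id: ' . $correlationId]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
```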

Some failures are not so obvious, as they may come up as a consequence of network bottlenecks, timeout and deadline issues, provider latency degradation and so on.

To give development teams powerful tools to solve these kinds of headaches, we have introduced a new technology stack that complements our previous CorrelationID strategy with a more precise internal tracing capability.

From now on we will be speaking about Traces and not CorrelationIDs.

Before we deep dive into the technicalities, let me enumerate some of the goals we wanted to achieve in order to find a suitable solution:

Any software piece that we deploy must be observable with logs and metrics.

Since we already have Prometheus, software that ships Prometheus exporters and exposes /metrics endpoints will be valued.

We also value portability: since we use different programming languages and love being able to evaluate and evolve our technology stack when needed, we will always prefer vendor-neutral solutions, which give us flexibility.

Our company has a good balance between open source and closed source software. We value software that has been designed with and for the community and whose internals we can understand.

There are tons of tracing solutions out there, and using standardized, well-specified solutions enables collaboration, evolution and robustness, which are values that we love.

Tracing being auxiliary software that is not the core of our business, we expect it to be smooth in terms of performance and able to scale. In the end, this is a tool to locate bottlenecks too; we do not want it to become one :)

This tooling is going to be used mainly by developers. Developers are the customers of tracing for debugging, and they must be able to understand it, instrument their applications and be productive with the new capabilities.

Operations are the ones who will configure, provision and keep an eye on the tracing infrastructure, so the tracing solution must be friendly to them too, allowing easy deployment, configuration, automation, scaling and so on.

Well, now that you understand what we tried to accomplish, let's dive into the tracing solution ecosystem.

Tracing strategies usually consist of several layers:

There are some vendor solutions like:

And vendor neutral / open source solutions like:

As you can see there are tons of them; some are interoperable, some are quite old and some are pretty recent.

We tried out Jaeger because it was well documented, looked nice, was a fairly complete solution, Docker-ready and open source.

It was an interesting starting point to play around with, without having to dedicate much time to it.

This is the Jaeger trace discovery interface, where you can see all the traces for a service and even filter them, because we had a Cassandra backend that allowed us to search by ID, by service or by other criteria.
This is the interface of a specific trace, which shows a frontend calling a backend service and the associated latencies.
And this is the result of running a Spark job on the Cassandra database to plot the service relations and the number of traces stored.

Now we are about to start with the technicalities so we fully understand what is happening behind the scenes.

On the left, the DAG with propagated context data. On the right, a representation of a trace and its spans.

As you can see, the moments at which spans start and end depend on the execution of your application (in the example, span A contains B, C, D and E; in other words, parent span A only finishes when its child spans are finished).

A trace and its spans might be part of a single application (e.g. a modular monolith architecture) and/or distributed across several deployed applications (e.g. a microservice architecture).
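As a rough illustration of that parent/child relationship, here is a hedged sketch using the Zipkin PHP SDK that we adopt later in this post; $tracer is assumed to be an already-built Zipkin tracer and the span names are made up:

```php
<?php
// Sketch of a parent span (A) with a child span (B).
// $tracer is assumed to be a Zipkin\Tracer built elsewhere (see the SDK section).
$spanA = $tracer->newTrace();              // root span of the trace
$spanA->setName('handle_search_request');
$spanA->start();

$spanB = $tracer->newChild($spanA->getContext()); // child span, part of A
$spanB->setName('fetch_availability');
$spanB->start();
// ... call the inner service / do the actual work ...
$spanB->finish();

// Parent span A is only finished once its children are done.
$spanA->finish();
```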

The Cassandra backend was powerful and the solution conceptually provided everything we needed, plus some extras, but we wanted something simpler to operate.

That was the moment in which we started to evaluate OpenTelemetry and Grafana Tempo.

I have to admit that I fell in love with OpenTracing when I first read about it. It was a nice attempt to standardize the tracing industry, which Jaeger follows, and it was the predecessor of OpenTelemetry, together with OpenCensus.

If you feel dizzy at this point, you can ignore OpenTracing, OpenCensus and Jaeger for now, because that was all a bit of the history of how we got to OpenTelemetry and Tempo.

After some evaluation, we found out that OpenTelemetry was becoming a stable specification, and by stable I mean that in March 2021 they released the first stable 1.0.0 version!

Using their own words:

All OTEL components

Wow, it is just beautiful to read and it matches the goals we pursued; plus, 1.0.0 means:

The key definitions to remember are:

No more theory! Let’s dive into some implementations and the problems that we had to solve so far, because theory is one thing but practice is another :)

Most of the services that we need to instrument right now are written in PHP 7.4, but we might need to instrument Golang and Java services in the future.

I am going to bring back the list of tracing responsibilities mentioned before and explain how we are solving each of them:

For each programming language we need a specific Software Development Kit (SDK) to create the spans, propagate them between applications and emit the data to a trace collector.

One of our to-do tasks for the near future is to replace the Zipkin SDK with the official OTEL SDK once we validate that everything works fine.
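For PHP, here is a hedged sketch of what building such a tracer can look like with the openzipkin/zipkin-php library; the service name and sampling policy are illustrative, and the reporter's default target happens to match the Zipkin receiver that the local OTEL collector agent exposes:

```php
<?php
// Rough sketch of a Zipkin PHP tracer that reports spans to the local
// otel-collector agent over the Zipkin v2 HTTP endpoint.
use Zipkin\Endpoint;
use Zipkin\TracingBuilder;
use Zipkin\Samplers\BinarySampler;
use Zipkin\Reporters\Http;

$tracing = TracingBuilder::create()
    ->havingLocalEndpoint(Endpoint::create('lodging-api'))   // illustrative service name
    ->havingSampler(BinarySampler::createAsAlwaysSample())   // sample everything (fine for a demo)
    // The HTTP reporter defaults to http://localhost:9411/api/v2/spans,
    // which is where the otel-collector agent's Zipkin receiver listens by default.
    ->havingReporter(new Http())
    ->build();

$tracer = $tracing->getTracer();

// ... create and finish spans while handling the request ...

// Make sure buffered spans are actually sent before the PHP process ends.
register_shutdown_function(function () use ($tracer) {
    $tracer->flush();
});
```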

For this we use OpenTelemetry Collector, which is the main software piece that OTEL provides.

The OTEL collector architecture starts by accepting data on receivers (tons of them, like Zipkin, OpenCensus, the OpenTelemetry protocol or Jaeger, over any transport needed, like HTTP or gRPC), which send the data into a pipeline of processors (again, totally extensible and evolvable), which then fans the data out to different exporters (and again, tons of them, over any transport required, to any supported backend).

OTEL collector component design

At this point, you should already have done this:

For PHP:

For Java/GoLang:

The most performant option is the OTLP gRPC protocol, since it is a persistent HTTP/2 connection behind the scenes. And don't worry: the collector is not a single piece but an agent-gateway deployment architecture:

Representation of a multi-language microservice OTEL-Tempo-Grafana architecture.

It is not plotted in the previous diagram, but remember that between services there is communication that propagates the state of the trace in progress.

The terminology for this is “in-band data” and “out-of-band data”, which comes from networking: whatever is propagated over the same communication channel as the requests is in-band data, while the data sent from the SDKs to the OTEL agents is out-of-band data, because it travels over a different communication channel (for instance, the PHP and Java services might communicate over an HTTP contract, while a Java service and its OTEL agent communicate using the OTLP gRPC contract).
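With the Zipkin SDK, the in-band part is just a handful of B3 headers injected into the outgoing request and extracted on the receiving side. A hedged sketch of both sides (API names per our reading of openzipkin/zipkin-php; $tracing, $tracer, $span and $incomingHeaders are assumed to exist):

```php
<?php
// Sketch of in-band propagation using B3 headers.
use Zipkin\Propagation\Map;

// Caller side: inject the current trace context into the outgoing HTTP headers.
$headers  = [];
$injector = $tracing->getPropagation()->getInjector(new Map());
$injector($span->getContext(), $headers);
// $headers now carries X-B3-TraceId, X-B3-SpanId, X-B3-Sampled, ... and travels
// in the same channel as the request itself (in-band data).

// Callee side: extract the incoming context and continue the same trace.
$extractor        = $tracing->getPropagation()->getExtractor(new Map());
$extractedContext = $extractor($incomingHeaders);

$serverSpan = $tracer->nextSpan($extractedContext);
$serverSpan->setName('handle_request');
$serverSpan->start();
// ... handle the request, then $serverSpan->finish();
```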

To connect the OTEL agents with the OTEL gateway collectors, you have to configure an OTLP gRPC exporter (OTLP is the OpenTelemetry Protocol, which is based on Protocol Buffers) in the agents and an OTLP gRPC receiver on the gateway collectors (remember, agents and gateways both run the otel-collector software).

Example of otel-collector configuration deployed as agent for PHP application using Zipkin SDK
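Since the image may be hard to read, here is a simplified, hypothetical sketch of what such an agent configuration can look like (the gateway hostname is made up): a Zipkin receiver for the PHP SDK, a batch processor, and an OTLP gRPC exporter towards the gateway collectors. The gateway configuration mirrors it, with an OTLP receiver on the way in and an OTLP exporter pointing at Tempo on the way out.

```yaml
# Hypothetical otel-collector *agent* configuration: accept Zipkin v2 spans
# from the PHP services and forward them over OTLP gRPC to the gateway collectors.
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411                 # where the Zipkin PHP SDK posts its spans

processors:
  batch:                                   # batch spans before exporting

exporters:
  otlp:
    endpoint: otel-gateway.internal:4317   # made-up gateway address, default OTLP gRPC port
    tls:
      insecure: true                       # plain gRPC inside the private network

service:
  pipelines:
    traces:
      receivers: [zipkin]
      processors: [batch]
      exporters: [otlp]
```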

If you are wondering what Tempo is, don’t worry, it is coming now!

So, we have collected the spans of the trace in the OTEL collector gateway (which can be a cluster) from each OTEL agent, and now we are going to use another OTLP gRPC exporter, but this time pointing to a Grafana Tempo backend.

Tempo is, as defined by its creators:

As you can imagine, it looks like a very good match for our goals as a tracing backend.

This is where some limitations start to appear.

Tempo, as of version 1.0.0, has no trace discovery like the one we saw with the Jaeger + Cassandra backend, but this might improve in the future.

Trace discovery is necessary in order to know which traces have been generated and stored in the backend and be able to query them.

Luckily, a very simple solution for trace discovery is to log the traceID. That way, whenever an error occurs and is logged (or even without errors, with any debug or info message), you can take the traceID and query Tempo with it.

For this, we had to program a specific binding between our PHP Monolog-based logging and the Zipkin SDK trace generation (if you are curious: we store the trace per request using a PSR-15 middleware on a PSR-11 container, and Monolog then appends it to every log).
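Conceptually, that binding boils down to a Monolog processor that reads the current span's context and appends its IDs to every record. A simplified sketch (the real code lives behind the PSR-15 middleware and PSR-11 container mentioned above; field names are illustrative and $span is assumed to be the current Zipkin span):

```php
<?php
// Simplified sketch: append the current traceId/spanId to every log record so
// a developer can copy the traceId from any log line and query Tempo with it.
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$logger = new Logger('lodging-api');
$logger->pushHandler(new StreamHandler('php://stdout'));

$logger->pushProcessor(function (array $record) use ($span) {
    $context = $span->getContext();
    $record['extra']['traceId'] = $context->getTraceId();
    $record['extra']['spanId']  = $context->getSpanId();
    return $record;
});

$logger->error('Provider timed out'); // this line now carries the traceId
```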

Tempo is also written in Go and implements part of the OpenTelemetry collector (the receivers), which allows us to configure it in more or less the same way to receive OTLP gRPC data from our OTEL collector gateway.
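Because of that shared receiver code, the relevant part of a Tempo configuration looks familiar; a hypothetical fragment enabling OTLP gRPC ingestion:

```yaml
# Hypothetical fragment of a Tempo configuration: the distributor reuses the
# OpenTelemetry collector receivers, so enabling OTLP gRPC ingestion reads just
# like the collector's own receiver section.
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
```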

With Cassandra and the Spark job we could visualize all the service traces as a DAG (graph), but with Tempo being just an object storage queryable by traceID, this is not available out of the box.

In any case, this is a cosmetic feature for us and was not part of our goals, so we will live without it for now (Tempo might add some cool features in this area in the future).

And this is the result and the experience that you get as a user of the tracing system.

Either by clicking the “tempo” button to the right of the traceID on the expanded log, or by directly querying the Tempo datasource, you get a full plot of the trace and the spans within it, showing latencies and execution flows.

Each span has a state and can contain a list of attributes and a list of events (timestamped logs inside a span) that give you more context about the specific operation.

Example of the span details of a client span and its contiguous server span between two services.
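In the Zipkin SDK we currently use, those attributes and events map roughly to tags and annotations; a small hedged sketch (keys and values are illustrative):

```php
<?php
// Sketch of adding extra context to a span: key/value tags (roughly OTEL
// attributes) and timestamped annotations (roughly OTEL events).
$span->tag('http.method', 'GET');
$span->tag('http.status_code', '200');
$span->annotate('cache_miss', \Zipkin\Timestamp\now());
```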

And here you can see how spans are being accepted by the agents and the gateway collectors, as described before. This is just a Prometheus query over the data it scrapes from the /metrics endpoint of both the agent and the collector.

Real Prometheus query showing spans accepted over the OTLP protocols by agents and collectors during a few minutes.

This is just the beginning of tracing for us. We are still working on observability and on giving teams more powerful and useful tools, to ensure that we deploy robust, scalable and performant services and that we are able to quickly spot and solve any error that might happen in our distributed architecture.

We are still in the middle of implementing the MVP, and we have the capability to switch trace backends (we could add Jaeger with Cassandra, or a Google/Amazon vendor solution, to try them out) as long as they are compatible with OpenTelemetry, which gives us a lot of flexibility.

We are also planning to deep dive into Exemplars for Prometheus and to connect them with traces, which will close the circle of observability: we will have logs, metrics, traces and a way to connect all of them.

The ecosystem is huge and the industry moves really fast, but I hope this is useful for those of you who are willing to deep dive into this amazing topic, which as of today is becoming more and more stable and standardized and allows many different setups.
