Implementation of Distributed Tracing with OpenTelemetry at Telefónica Kernel

When operating and evolving a complex system, observability is a must. Usually, lots of components and software layers are intertwined in complex flows. As a result, it is difficult to draw valuable insights to diagnose and tackle major performance bottlenecks or to reduce e2e delay. In the same direction, being able to track specific requests through the e2e system is key for proper debugging of errors or corner-case scenarios.

Traditionally, systems observability is built on three main pillars: Logs, Metrics and Tracing (some extra pillars can be considered, but let's keep those out of the scope of this post). In this observability context, OpenTelemetry has gained a lot of traction and presents itself as a tempting option that is worth trying. We would like to share with you how we are using it in our platform.

In Telefónica CDO Engineering we develop Telefónica Kernel, a service enabler and Big Data platform that exposes telco APIs from Telefónica local operations. The Telefónica Kernel platform provides 3 main functionalities:

  • OAuth- and OpenID-based API exposure of a wide set of developer-friendly APIs. These APIs expose access to several telco capabilities and data that can be used to provide new services or enhance the experience of existing ones.
  • GDPR-compliant access to user Personal Information, with explicit consent provided by the user regarding the authorized use (purpose) of their data by every requesting service.
  • Big Data capabilities and SDK, e.g. to build and execute customized algorithms over a wide existing dataset portfolio from telco networks and systems, also enforcing GDPR user consent when needed, as the Telefónica Kernel platform mantra dictates.

You can find detailed info about the Telefónica Kernel platform in the following link. Telefónica Kernel is also the underlying platform supporting the GSMA Open Gateway initiative across the Telefónica footprint, as recently shown during MWC 2023 in Barcelona. You can read about the APIs exposed by Open Gateway here.

Telefónica Kernel is composed of several subsystems (no further details are needed at this point) and it is integrated both northbound, with services that consume Telefónica Kernel APIs and SDKs, and southbound, with local telco networks and systems, by means of native telco APIs and interfaces provided locally by every Telefónica operation. In this context, as in any other complex system scenario, e2e observability has great value.

In this post we will focus on the distributed tracing pillar. The Telefónica Kernel platform already has solid, state-of-the-art logs and metrics subsystems, built on the ElasticSearch and Prometheus/Grafana stacks respectively. But there was no support for tracing across the platform, so we worked hard to solve that.

From the Datadog website, distributed tracing can be defined as:

“Distributed tracing is a method of tracking application requests as they flow from frontend devices to backend services and databases.”

A trace is composed of a trace-id, which identifies it uniquely across the system, and a set of spans. Each span represents the request passing through one of the components involved in the e2e processing and contains a set of attributes with useful information (duration, ids, etc.).
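
To make the concept more tangible, here is a purely illustrative sketch (invented values and hypothetical component names) of what a trace with two spans could look like once collected:

```yaml
# Illustrative only: a single trace identified by its trace-id,
# with one span per component crossed by the request
trace_id: 4bf92f3577b34da6a3ce929d0e0e4736
spans:
  - span_id: 00f067aa0ba902b7
    name: ingress-gateway           # hypothetical component
    parent_span_id: null            # root span of the trace
    duration_ms: 42
    attributes:
      http.method: GET
      http.status_code: 200
  - span_id: 53995c3f42cd8ad8
    name: backend-service           # hypothetical component
    parent_span_id: 00f067aa0ba902b7
    duration_ms: 17
    attributes:
      http.method: GET
      http.status_code: 200
```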

Trace and span concept, from the Jaeger website

As part of our internal analysis in Telefónica CDO Engineering, we reached the conclusion that the open approach from OpenTelemetry, with an open tracing format and protocol (OTLP, the OpenTelemetry Protocol), was our best shot to cover our initial set of distributed tracing requirements, which can be summarized in the following list:

  • Ability to instrument all development technologies currently used in Telefónica CDO Engineering, so that a single distributed tracing approach can be used in all the software pieces provided by our team (both Kernel and services built on Telefónica Kernel APIs).
  • Ability to transparently trace those components that could not be, or were not meant to be, instrumented from day 0. Specifically, tracing inside Telefónica Kernel must not require any code impact during the first stages.
  • Ability to integrate with the existing tracing solutions currently used by Telefónica local operations teams, so Telefónica Kernel internal tracing can be exported to local tracing systems.

It is important to clarify at this point that Telefónica Kernel is a multi-local platform deployed on the Azure cloud, based on Azure Kubernetes Service (AKS) as the underlying compute infra. It is 'global' because it is available across the whole Telefónica footprint, but it achieves this by having a dedicated deployment per country where Telefónica operates (in the corresponding Azure region).

This way, services built on top of Telefónica Kernel APIs can be accessible to any Telefónica customer (so they are global across the Telefónica footprint), but this is done through a dedicated deployment per local operation, which is operated as any other local system or network by that operation. There are several reasons for this multi-local approach, not only technical ones, but that is also out of the scope of this post.

The approach we followed to enable distributed tracing transparently inside Telefónica Kernel was to leverage the tracing capability provided by a service mesh. A service mesh was being integrated into Telefónica Kernel at the time distributed tracing was being evaluated. The initial goal of the service mesh was to provide the Traffic Management capabilities required in the interconnection of the different platform pieces. We 'hijacked' that ongoing activity and extended its scope to include distributed tracing as one of the first features to be applied.

Sidecar concept in a service mesh, from the Istio website

We decided to leverage that effort and reuse the sidecar injected by the service mesh to provide tracing of how service traffic progresses through the platform, from component to component.

Admittedly, this is a black-box tracing approach, as it provides tracing at the edge of platform components, not inside them. But that simple approach gives us immediate and valuable tracing information we did not have before, without impacting the code base of the platform. Extra effort can be invested later to instrument Telefónica Kernel components internally if deemed valuable or necessary. You can check here, as a reference, how tracing is enabled in Istio service mesh sidecars.
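
As a hedged reference of what that mesh-level wiring can look like (not our exact setup; provider name, namespace and collector address are placeholders), tracing for Istio sidecars can be enabled by declaring an OpenTelemetry extension provider in the MeshConfig and activating it with a Telemetry resource:

```yaml
# MeshConfig excerpt (e.g. under the IstioOperator 'meshConfig' field):
# declare an OpenTelemetry tracing provider (placeholder values)
meshConfig:
  extensionProviders:
    - name: otel-tracing
      opentelemetry:
        service: otel-collector.observability.svc.cluster.local
        port: 4317
---
# Enable tracing mesh-wide with that provider and a low sampling rate
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel-tracing
      randomSamplingPercentage: 1.0   # 1% sampling, as used later in our tests
```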

We discarded the transparent (automatic) instrumentation feature provided by OpenTelemetry because Telefónica Kernel includes components built on several different technologies, and the transparent tracing provided was not homogeneous across all of them, so we decided to stick with the mesh-provided tracing as the bare minimum base to start building distributed tracing on.

So the first thing we did was analyze which mesh we should use for this tracing purpose, paying special attention to the impact that the service mesh has on the overall platform. We built a short list of service mesh candidates: Istio, Linkerd, Cilium and Open Service Mesh (OSM, the Microsoft-backed alternative to Istio, currently integrated and optionally deployed as part of AKS). The criterion for building this list was to validate the performance and resource footprint of service meshes with different implementation approaches.

For the performance analysis of this short list, we used the same performance test the Linkerd project uses to compare itself against Istio (the emojivoto application), so we had something in common to compare them with.

Simple components of the Emojivoto application, from GitHub

Resource consumption and impact on platform performance are key because of the high volume of requests served by the Telefónica Kernel platform, so we wanted to pay special attention to the amount of resources dedicated to the HTTP handling processes. That is important from the point of view of cost and of the pod limits/requests needed to ensure that the sidecar proxy has enough vCPU and memory to avoid becoming itself a bottleneck for the proxied component.

In our tests, we used an input load of 1K requests per second generated with the Azure Load Testing tool (based on JMeter) to load the emojivoto service deployed on an AKS Kubernetes cluster, and focused our analysis on the performance observed on the web pod of the emojivoto app (the one in the middle). We chose that load figure (1K req/s) because it is currently the dimensioning unit we use in several service-facing components of the platform.

Actually, we cannot set the exact load generated, only the number of traffic generators and 'users' per generator. As JMeter generated different rates per service mesh (because of their different performance), we decided to normalize the results assuming vCPU use is linear with input load (e.g. 1.4K req/s using 0.8 vCPU is equivalent to 1K req/s using 0.57 vCPU), so they are easy to compare.
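
For clarity, the normalization is just a linear rescaling of the observed vCPU to the 1K req/s reference rate:

$$
\mathrm{vCPU}_{1\mathrm{K/s}} = \mathrm{vCPU}_{\mathrm{observed}} \times \frac{1\ \mathrm{K\ req/s}}{\mathrm{rate}_{\mathrm{observed}}}
\qquad \text{e.g.}\quad 0.8 \times \frac{1.0}{1.4} \approx 0.57\ \mathrm{vCPU}
$$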

The main conclusions drawn from the emojivoto testing over the service mesh short list are summarized below:

  • Without tracing enabled, the Istio web pod sidecar required 0.63 vCPU (remember, normalized to 1K req/s) and 50 MB RAM. The e2e delay added by the Istio and Linkerd service meshes was 2 ms (compared to emojivoto running alone, without a service mesh).
  • In the same configuration scenario, Linkerd provided similar delay, but used only 0.46 vCPU (-27%) and 17 MB RAM (-66%) compared to Istio.
vCPU used by sidecar proxy without tracing

  • Regarding OSM, it presented +50% extra delay, +40% vCPU and +40% memory compared to Istio in the tests without tracing, so we decided to take it out of the short list at this point and focus on Istio as the Envoy-based mesh.
  • When tracing is enabled in the service mesh (with 1% sampling), the delay added by Istio increases from 2 ms to 4 ms and the (normalized) vCPU increases to 0.81 (+27%). In the case of Linkerd, the delay increase is lower than with Istio (from 2 ms to 3 ms) and the normalized vCPU stays constant (which does not mean vCPU use does not increase; removing the normalization effect, Linkerd throughput is actually reduced by 20% when tracing is enabled).
(1K/s normalized) vCPU used by sidecar proxy with tracing enabled and 1% sampling

  • RAM used by the sidecar does not seem to be affected by enabling tracing. Additionally, memory consumption does not depend on the input rate, but mostly on the number of endpoints in the mesh. The following picture presents the results with the minimum set of endpoints; this value can double or even triple when tested in a real deployment.
Memory used by sidecar proxy, in MB

  • Regarding Cilium, the mesh impact on delay and resources was negligible. As no sidecar exists, Cilium's eBPF overhead is accounted directly to the proxied container processes. We could not observe a measurable increase in the test pod container resources compared to the no-service-mesh scenario.
  • The performance impact of enabling tracing on the service mesh depends on where sampling is actually done. For sampling to actually reduce resource consumption on the service mesh sidecar, it has to be done before reaching OpenTelemetry, e.g. in the sidecar proxy or even before. Sampling done inside OpenTelemetry does not actually save resources at the service mesh or the OTEL Collector, as traces still have to be sent to the OpenTelemetry Collector to be discarded (so the performance 'damage' is already done). In our tests, 1% sampling was done in Istio's Envoy, or in the Kubernetes Ingress Controller for the Linkerd testing (as no sampling is supported by the Linkerd sidecar proxy).
  • Finally, the impact of mTLS on Istio and Linkerd is minimal, so its activation does not require extra resource allocation.

Taking into account the results from the Istio tests, for those components that act as the destination of requests, the mesh sidecar would require 0.4 vCPU (50% of 0.81 vCPU) and 50 MB RAM to handle 1K req/s, compared to 0.23 vCPU and 17 MB if Linkerd is used. For those components that forward requests, the Istio mesh sidecar would need the full 0.8 vCPU observed for the web pod in the emojivoto app (which sits between the load tool and the emojivoto 'back-end' pods). The results obtained are aligned with those published on the Istio and Linkerd websites.

As the resource consumption of the service mesh is significant, Cilium's no-sidecar approach is very compelling in terms of resources. But it has some drawbacks we cannot easily overlook:

  • Cilium's high performance comes from its eBPF-based implementation (no sidecar is actually used), but some Traffic Management capabilities are not yet implemented in eBPF, so an Envoy sidecar is still required when they are needed. Therefore, the results observed in our tests are a best-case scenario (our emojivoto testing did not use any traffic management feature). Real use of the service mesh in our deployment will eventually require some Traffic Management features currently implemented with Envoy, breaking the no-sidecar 'promise'.
  • Currently Azure does not include Cilium in the list of officially supported CNIs, so it has to be used under the Azure Bring-Your-Own-CNI mode, which limits the support provided by Microsoft. The other option is using Cilium via CNI chaining, so the eBPF logic is applied on top of the connectivity provided by the officially supported AKS CNIs.
  • Finally, Cilium's observability subsystem, Hubble, does not include native support for OTLP and relies on an OpenTelemetry adapter not yet officially supported by OpenTelemetry (available in an independent GitHub repo).

Under these circumstances, we preferred to keep Cilium as a promising option to keep tracking (the potential benefits in terms of delay and resource consumption are huge), but to focus initially on Linkerd and Istio as the lower-risk approach.

Regarding Linkerd, during test preparation we found a blocking limitation: the Linkerd proxy cannot include a configurable list of HTTP headers as tracing attributes (which Envoy does support). We need that feature to add the Telefónica Kernel e2e Correlator-Id header to the tracing attributes, so we can use it as the key to match the traces and logs generated by the platform for the same request. This limitation prevented us from taking advantage of the lower resource consumption of the Linkerd proxy compared to Istio's Envoy.
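
For reference, this is roughly how such a header-to-attribute mapping can be expressed with Istio's Telemetry API (a sketch with a hypothetical tag and header name, not our exact configuration):

```yaml
# Telemetry resource excerpt: copy an HTTP header into every span as a custom tag
spec:
  tracing:
    - providers:
        - name: otel-tracing            # placeholder provider name
      customTags:
        correlator_id:                  # tag name is our choice
          header:
            name: x-correlator-id       # hypothetical name for the Kernel Correlator-Id header
            defaultValue: "none"
```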

Finally, Istio was already being used by other teams in Telefónica CDO Engineering. Taking everything into account, we decided to go ahead with Istio as the approach with the best risk/performance ratio. In any case, this is a short-term decision that could be revisited in case of any relevant change in Linkerd, Cilium or any other relevant mesh.

Once we had decided to use Istio as the service mesh, we had to actually build the tracing infra to be fed with the mesh-generated traces. That is where the OpenTelemetry Collector comes into the equation. We need to deploy the OTEL Collector in our platform, but we have to do it in a way that covers some requirements:

  • We need to combine tracing info in different formats/protocols: Zipkin, OpenCensus and OTLP are our initial short list.
  • We need to provide e2e 'consistent' tracing sampling.
  • Additionally, we must export Telefónica Kernel tracing to the Telefónica local tracing back-end, so the local operations team can operate and trace Kernel as any other local platform.
  • We have to host a tracing back-end deployed inside Telefónica Kernel that, in the future, could gather traces generated by Telefónica Kernel, by Telefónica services on top of Kernel APIs, and by any other relevant tracing source from local operation systems.
  • We must do our best to guarantee that sampled traces are persisted in the tracing back-ends and avoid any span loss caused by OpenTelemetry Collector overload that could result in partial tracing of a request (missing spans).

The first requirement is met by the OpenTelemetry Collector because it provides an extensive list of receivers and exporters for different formats and protocols, including those mentioned explicitly in the requirement.

For the purposes of the 'consistent' sampling requirement, trace context propagation guarantees that the sampling decision can be propagated between services built on Telefónica Kernel, Telefónica Kernel internal components, and local platforms and systems. Telefónica Kernel components forward this context (initially B3 headers, and W3C Trace Context in the short-term future) when traffic crosses the platform, so a sampling decision taken at some point in the flow is propagated downstream.
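
As an illustration, these are the kind of B3 headers that are forwarded hop by hop (values are invented):

```yaml
# B3 propagation headers attached to an HTTP request crossing the platform
x-b3-traceid: "463ac35c9f6413ad48485a3953bb6124"
x-b3-spanid: "a2fb4a1d1a96d312"
x-b3-parentspanid: "0020000000000001"
x-b3-sampled: "1"     # the sampling decision travels with the request
```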

Tracing context propagation by means of B3 headers, from GitHub

The major effort required is in guaranteeing that sampled traces are properly persisted. We must ensure that the tracing subsystem is able to scale properly with the trace volume, and do it in a way that no traces are lost in the process.

As we are integrating traces from local and remote sources, and also exporting them to local and remote tracing back-ends, the first thing we decided to do was to decouple collection and export as much as possible, so that any received or locally generated trace is sampled, stored AND forwarded even during incoming tracing peaks. In simpler words, we do not want incoming traffic processing to affect trace processing and distribution to the destination tracing back-ends.

Taking that into consideration, we decided to fine-tune the deployment of the OpenTelemetry Collector. As you can see in the OpenTelemetry Collector docs, the OTEL Collector itself can be described as a pipeline of 3 stages: receiving, processing and exporting.

  • Receivers are responsible for accepting traces in different formats and protocols and transforming them into the common OTLP format.
  • Once in the common OTLP format, processors execute actions on the trace spans, such as modifying/enriching tracing info or defining how spans are grouped/batched when sent to destination systems.
  • Finally, exporters are responsible for exporting the resulting spans, adapting (if required) to the format/protocol of the receiving system. At this stage, a retry policy can be applied in case trace delivery fails (destination system overload, communication issue, cloud infra incident, etc.).
OpenTelemetry Collector pipeline model, from GitHub

The actions performed in these pipeline stages (which receivers, processors and exporters are used) are defined in the configuration of the OTEL deployment, by defining the pipeline that the OTEL Collector deployment will execute.
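
As a sketch of that configuration (assuming the contrib distribution of the collector, which ships the Zipkin and OpenCensus receivers; endpoints are placeholders), a single traces pipeline could look like this:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
  zipkin:
  opencensus:

processors:
  batch:

exporters:
  otlp:
    endpoint: tracing-backend.example.internal:4317   # placeholder destination

service:
  pipelines:
    traces:
      receivers: [otlp, zipkin, opencensus]
      processors: [batch]
      exporters: [otlp]
```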

Every OpenTelemetry Collector instance is a complete, standalone pipeline. Increasing the number of instances of the OpenTelemetry Collector deployment will increase throughput only if we load-balance incoming spans across the pool of available instances. But this span balancing has to be done taking into account how it affects the way sampling is done.

If no special action is taken, spans generated by different components while handling the same request may be processed by different instances of the OpenTelemetry Collector (if no trace-id-aware load balancing is used). One specific scenario that deserves special attention is the use of 'tail sampling' policies, which delay the sampling decision until all pieces of the trace (its spans) have been received.

An example of a 'tail sampling' policy is sampling only failed requests. That requires all spans of a trace to be handled by the same collector instance: that instance waits to see the result of the request before deciding whether it has to be sampled or not. Trace load balancing must therefore take the trace-id into account, so that all spans with the same trace-id end up in the same collector instance.
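
Although, as explained just below, we do not use tail sampling ourselves, this is roughly how the pattern is typically sketched with the collector: a first tier load-balances spans by trace-id towards a second tier that applies the tail-sampling policy (names and hostnames are placeholders):

```yaml
# First tier: route all spans of a trace to the same downstream instance
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-sampling-tier.observability.svc.cluster.local
---
# Second tier: keep only traces containing an error span
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-only
        type: status_code
        status_code:
          status_codes: [ERROR]
```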

In the Telefónica Kernel scenario we currently have no plan for 'tail sampling'. We do not rely on sampling inside OpenTelemetry because of the impact it has on both the OpenTelemetry Collector and the service mesh sidecars (as explained previously). Sampling is performed by the external invoking services or, if that external sampling is not applied, by the Telefónica Kernel ingress controller, Traefik.

Because of the traffic volume handled by the ingress controller, as a design criterion, we decided to avoid sidecar injection for Traefik and to rely instead on the tracing capabilities natively provided by Traefik, including OpenCensus (an OTLP precursor) and OTLP in the latest major version (v3.0, currently in beta).

Combining all of this, we decided to split the OpenTelemetry pipeline into two different sub-pipelines, each executed by a different OpenTelemetry deployment (with its own pipeline definition and instance pool), decoupled by a Kafka-like queue between them. We defined 2 independent OTEL deployments:

  • Collector deployment: an OTEL Collector deployment that handles incoming spans and temporarily stores them in a streaming queue (Kafka-like) in the common OTLP format.
  • Exporter deployment: an OTEL Collector deployment responsible for processing and distributing traces to the destination systems (both the Telefónica Kernel and the Telefónica local operation tracing back-ends).
2 pipelines OTEL deployment approach
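
A hedged sketch of the two pipeline definitions (topic, broker and endpoints are placeholders; the SASL/TLS authentication required by Azure Event Hubs is omitted for brevity):

```yaml
# 1) Collector deployment: receive spans and park them in the Kafka-like queue
receivers:
  otlp:
    protocols:
      grpc:
  zipkin:
exporters:
  kafka:
    brokers: ["<eventhubs-namespace>.servicebus.windows.net:9093"]
    topic: otlp-spans
    encoding: otlp_proto
    protocol_version: 2.0.0
service:
  pipelines:
    traces:
      receivers: [otlp, zipkin]
      exporters: [kafka]
---
# 2) Exporter deployment: read from the queue, process and distribute
receivers:
  kafka:
    brokers: ["<eventhubs-namespace>.servicebus.windows.net:9093"]
    topic: otlp-spans
    encoding: otlp_proto
    protocol_version: 2.0.0
processors:
  batch:
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability:4317        # internal Jaeger, placeholder
  otlp/local-operation:
    endpoint: tracing.local-operation.example:4317       # remote back-end, placeholder
service:
  pipelines:
    traces:
      receivers: [kafka]
      processors: [batch]
      exporters: [otlp/jaeger, otlp/local-operation]
```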

Additionally, the streaming capability (Kafka-like queue) used between both deployments is a SaaS streaming service (Azure Event Hubs) provided by the cloud infra provider. This decision (SaaS streaming) was made to be able to scale up properly, with no restrictions, at the early stages. Once experience is gained in production, other options could be considered to implement this.

In our case, HPA (Horizontal Pod Autoscaling) for the collector pipeline is based on the CPU and memory used by the containers, but HPA for the distribution pipeline also includes streaming metrics (the offset lag of the Kafka queue) as scaling criteria, by means of HPA solutions like KEDA. With KEDA we can define scaling rules based on Prometheus-gathered metrics, including the ones generated by the OpenTelemetry Collector itself. Scaling of the two OTEL Collector deployments is independent from each other.
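
A minimal sketch of what such a KEDA rule could look like for the exporter deployment (names, the Prometheus address and the lag query are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: otel-exporter-scaler
  namespace: observability
spec:
  scaleTargetRef:
    name: otel-collector-exporter          # the exporter deployment (placeholder name)
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(kafka_consumergroup_lag{consumergroup="otel-exporter"})   # placeholder lag metric
        threshold: "5000"
```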

This decoupling of export from collection allows us to keep the retry policy enabled in our exporter pipeline without worrying about losing incoming traces because of the overhead caused by retries. This retry policy helps us meet the requirement to minimize the loss of tracing info due to delivery failures, without causing overhead that could affect the handling of incoming traces (back-pressure protection).

Telefónica Kernel deployments have (initially) 2 different destinations for the traces: a local Jaeger deployment that is part of the Telefónica Kernel platform, and the remote tracing back-end provided by the Telefónica local operation. For the latter, it was agreed to use the OTLP format, as it was verified that the commercial tracing solutions deployed across the Telefónica footprint support this common, non-vendor-locked format. For this integration, we are currently using the gRPC OTLP binding, protected with TLS.
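
As a sketch, the exporter towards the local operation back-end combines OTLP over gRPC, TLS and the retry/queueing options of the collector exporters (endpoint and certificate path are placeholders):

```yaml
exporters:
  otlp/local-operation:
    endpoint: tracing.local-operation.example:4317
    tls:
      ca_file: /etc/otel/certs/local-operation-ca.pem
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      queue_size: 5000
```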

Complementary to the OTEL subsystem, we also deploy a tracing back-end solution as part of the Telefónica Kernel platform. We are currently using the Jaeger tracing back-end, persisted on the same ElasticSearch infra that also supports our Telefónica Kernel logging subsystem. It was actually the current implementation of the Telefónica Kernel Log subsystem that inspired this two-pipeline approach for the tracing subsystem.

As mentioned at the beginning of the post, we currently plan to use OpenTelemetry only for tracing purposes. Logs and metrics are handled by their corresponding platform subsystems. Regarding logs, the ability to include HTTP headers as tracing attributes is the key that enables cross-checking logs and traces of the same request. Additionally, at the time of writing this post, only the tracing capability was stated as stable in the OpenTelemetry Collector, with logs and metrics marked as beta in many OTEL receivers, processors and exporters.

In the Telefónica Kernel deployment of Jaeger, as the internal platform tracing back-end, we got an extra benefit derived from the two-pipeline approach for the OpenTelemetry deployment. As we already have a streaming step in the tracing pipeline, there is no need to deploy Jaeger with its internal Kafka. Since the OTEL exporter pipeline implements retries in case of failure, it is possible to deploy Jaeger without internal Kafka and let the Jaeger Ingester HPA scale properly, without worrying about span loss in the Jaeger internal pipeline.

Jaeger internal components, from the Jaeger website

In a last effort to minimize the impact on the resources used, we tried to make the Jaeger Ingester component consume spans directly from the Kafka-like queue in Azure Event Hubs, so there would be no need to use OTLP between Jaeger and the OpenTelemetry Collector and the Jaeger Collector could be skipped. But sadly we could not complete that integration successfully (we are still investigating why), so we still need to rely on the Jaeger Collector's OTLP support to forward traces from OTEL to Jaeger (by means of the OTLP protocol).

One last comment: the Istio service mesh, the OTEL Collector and Jaeger have been deployed in our platform by using their corresponding operators and CRDs (Custom Resource Definitions), which you can check here, here and here.
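
As a hedged illustration of the operator-based approach, this is the kind of OpenTelemetryCollector resource the OTEL operator expects (mode, replicas and the embedded config are placeholders, not our production values):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
  namespace: observability
spec:
  mode: deployment      # the operator can also run it as a sidecar, daemonset or statefulset
  replicas: 2
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      batch:
    exporters:
      otlp:
        endpoint: tracing-backend.example.internal:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
```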

We hope our experiences are useful for you. We wish you happy tracing!!