As systems grow more complex and distributed, the ability to monitor, understand, and debug these systems in real-time has transitioned from a luxury to a necessity. Collecting and analysing logs, metrics, and traces provides a holistic view of system health, performance, and behaviour.
However, the granularity that tracing provides comes with its own set of challenges, notably in data volume and the complexity of integration across a varied landscape of technologies and platforms.
Implementing efficient, scalable tracing requires careful consideration of data collection, storage, and analysis techniques, with sampling strategies playing a major role in balancing detail with cost and performance.
Whether you are looking to refine your existing tracing strategy or implement tracing in your infrastructure, this guide will help you optimise your trace data management and storage costs.
A trace is the journey of a request across a system. Think about a web server: a trace would follow the request/response lifecycle of a single web request. As we have moved to cloud-native and distributed systems, we now have distributed traces that follow a request through multiple, often asynchronous, systems. This provides insight into the sequence of, and relationships between, distributed events. In structure it resembles an event log (created by a neurotic librarian), offering a comprehensive view of a request's path through different services and the effects of asynchronous execution points. We tend to think of logs as chatty; traces are much more so, quickly growing to terabytes of data depending on your application’s throughput.
In practice, traces mark specific points in the application and its ecosystem—such as function calls, RPC boundaries, and concurrency mechanisms—to map out execution forks, network hops, or process interactions. These points form a directed acyclic graph (DAG) of spans, linked by references that maintain causality with "happens-before" semantics, thereby illustrating the workload at each layer of the request's journey through the system.
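To make that structure concrete, here is a minimal, dependency-free sketch of a trace as a DAG of spans linked by parent references. The field names are illustrative, not a real wire format:

```python
# A toy trace: spans linked by parent references form a DAG with
# "happens-before" semantics. Field names are illustrative only.
spans = [
    {"span_id": "a", "parent": None, "name": "GET /orders"},
    {"span_id": "b", "parent": "a",  "name": "auth.check"},
    {"span_id": "c", "parent": "a",  "name": "db.query"},
    {"span_id": "d", "parent": "c",  "name": "cache.fill"},  # async follow-up
]

def descendants(spans, span_id):
    """Walk the causal graph: everything that happens after `span_id`."""
    direct = [s["span_id"] for s in spans if s["parent"] == span_id]
    return direct + [c for d in direct for c in descendants(spans, d)]
```

Walking the graph from the root span recovers the full causal picture of the request, which is exactly what a trace viewer renders as a waterfall.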
Tracing is the hardest part of observability to retrofit into an existing infrastructure, because every component in the request path must be instrumented before the data becomes usable. If you do collect traces, implementing sampling is a must, as the quantity and density of trace data can lead to expensive ingest and storage costs.
The most common sampling strategies are:

- Head-based sampling: the keep/drop decision is made when the trace starts, typically at a fixed probability, and propagated to every downstream service.
- Tail-based sampling: the decision is deferred until the trace completes, so you can keep the interesting traces (errors, high latency) and drop the routine ones.
- Rate-limiting sampling: at most a fixed number of traces are kept per unit of time, capping ingest volume regardless of throughput.
Sampling can also be done dynamically. Below are two conceptual examples of how you might implement selective tracing with configurable or adaptive sampling rates.
Selective Tracing with Configurable Sampling Rates
The example below demonstrates how to selectively trace critical paths in your application and adjust sampling rates based on conditions such as system load or time of day.
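Here is a dependency-free Python sketch of that decision logic. The route prefixes and the time-of-day rate schedule are hypothetical; in a real deployment you would express this as an implementation of OpenTelemetry's `Sampler` interface. The ratio check mirrors the idea behind OpenTelemetry's `TraceIdRatioBased` sampler, which compares the low bits of the trace ID against a threshold:

```python
from datetime import datetime

# Hypothetical policy: always trace these critical paths.
CRITICAL_PREFIXES = ("/checkout", "/payment")

def current_base_rate(now=None):
    """Hypothetical schedule: sample less during peak hours (UTC)."""
    hour = (now or datetime.utcnow()).hour
    return 0.01 if 9 <= hour < 18 else 0.10

def should_sample(trace_id: int, route: str, now=None) -> bool:
    # Critical paths are always traced at 100%.
    if any(route.startswith(p) for p in CRITICAL_PREFIXES):
        return True
    # Deterministic ratio sampling on the low 64 bits of the trace ID:
    # the same trace ID always gets the same decision at a given rate.
    bound = int(current_base_rate(now) * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound
```

Because the decision is a pure function of the trace ID, every service that sees the same trace makes the same call, keeping traces complete rather than partially sampled.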
Adaptive Sampling
The following example shows how you might implement adaptive sampling, focusing on traces that are likely to offer valuable insights, such as those with errors or high latencies.
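As a sketch, assuming completed spans are simplified dicts carrying a status and a duration, a tail decision function might look like the following. The thresholds and keep-rates are hypothetical, and a production version would run inside a collector or pipeline that buffers spans until the trace completes:

```python
import random

SLOW_MS = 500          # latency threshold in ms (hypothetical)
BASELINE_KEEP = 0.02   # keep 2% of ordinary traces (hypothetical)

def keep_trace(spans, rng=random.random):
    """Decide after the trace completes whether to keep it.

    `spans` is a list of dicts with "status" and "duration_ms" keys,
    a simplified stand-in for real span data.
    """
    # Errored traces are always valuable: keep them all.
    if any(s["status"] == "ERROR" for s in spans):
        return True
    # Unusually slow traces are likely to be interesting: keep them too.
    if max(s["duration_ms"] for s in spans) > SLOW_MS:
        return True
    # Everything else is kept at a small baseline rate.
    return rng() < BASELINE_KEEP
```

The `rng` parameter is injected so the baseline decision is testable; in practice you would also tune `BASELINE_KEEP` against your ingest budget.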
Depending on the specifics of your application and observability infrastructure, you may need to adjust configurations, sampling strategies, and tool integrations. Observability pipeline tools like Datable can perform sampling after your data has been collected, making this kind of sampling significantly less time-consuming to integrate. The OpenTelemetry Collector also offers tail-based sampling out of the box, which is worth exploring.
Datable is an observability pipeline tool that offers a flexible approach to managing telemetry traffic, enabling detailed routing, filtering, and transformations of data to optimise observability costs and improve operational efficiency.
In practice, Datable is essentially a real-time event processing engine, where data processing and manipulation is done with arbitrary JavaScript. This allows for granular control over telemetry data and opens up a wide range of possibilities, from filtering and sampling to routing and transforming events in flight.
If you’re interested in checking out Datable and taking control of your telemetry data, sign up to our beta waitlist here.
Observability is an evolving field, and while I can’t promise anything, I think these suggestions will take you a long way for a long while. If you have any questions, reach out to julian@datable.io.
Until next time!