As systems grow more complex and distributed, the ability to monitor, understand, and debug these systems in real-time has transitioned from a luxury to a necessity. Collecting and analysing logs, metrics, and traces provides a holistic view of system health, performance, and behaviour. 

However, the granularity that tracing provides comes with its own set of challenges, notably in data volume and the complexity of integration across a varied landscape of technologies and platforms. 

Implementing efficient, scalable tracing requires careful consideration of data collection, storage, and analysis techniques, with sampling strategies playing a major role in balancing detail with cost and performance.

Whether you are looking to refine your existing tracing strategy or implement tracing in your infrastructure, this guide will help you optimise your trace data management and storage costs.

What is a trace?

A trace is the journey of a request across a system. Think about a web server - a trace would follow the request-response lifecycle of a single web request. As we have moved to cloud-native and distributed systems, we now have distributed traces that follow a request through multiple, often asynchronous, systems. This provides insight into the sequence of and relationships between distributed events. In structure, a trace resembles an event log (created by a neurotic librarian), offering a comprehensive view of a request's path through different services and the effects of asynchronous execution points. We tend to think of logs as chatty - traces are much more so, quickly growing to terabytes of data depending on your application's throughput.

In practice, traces mark specific points in the application and its ecosystem—such as function calls, RPC boundaries, and concurrency mechanisms—to map out execution forks, network hops, or process interactions. These points form a directed acyclic graph (DAG) of spans, linked by references that maintain causality with "happens-before" semantics, thereby illustrating the workload at each layer of the request's journey through the system.
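To make that concrete, here is a minimal sketch of how a parent span and its children express those happens-before relationships using the OpenTelemetry API for Node.js. The span and service names are hypothetical, and a registered tracer provider (shown later in this post) is assumed:

const { trace } = require('@opentelemetry/api');

// Hypothetical service handling a checkout request
const tracer = trace.getTracer('checkout-service');

// The parent span covers the whole request; each child span it encloses is
// linked back to it, forming the edges of the trace's DAG.
tracer.startActiveSpan('handle-checkout', (parent) => {
  tracer.startActiveSpan('load-cart', (child) => {
    // ... database call ...
    child.end();
  });

  tracer.startActiveSpan('charge-card', (child) => {
    // ... RPC to the payment provider ...
    child.end();
  });

  parent.end();
});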

Tracing is the hardest of the three signals to retrofit into an existing infrastructure, because every component in the path of a request has to be modified for the information to be usable. If you do collect traces, implementing sampling is a must, as the quantity and density of trace data can lead to expensive ingest and storage costs.

The most common sampling strategies are:

  • Head-based sampling: the sampling decision is made at the start of a request, before any spans have been generated
  • Tail-based sampling: the decision is made at the end, after all participating systems have recorded spans for the entire course of the request's execution

Sampling can also be done dynamically. Below are two conceptual examples of how you might implement selective tracing with configurable or adaptive sampling rates.

Selective Tracing with Configurable Sampling Rates

The example below demonstrates how to configure the OpenTelemetry Node SDK to selectively trace critical paths in your application and to set the sampling rate at startup based on conditions like system load or time of day.

const { trace } = require('@opentelemetry/api');
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');

// Example of a function to determine the sampling rate based on time of day
// (substitute your own signal here, such as current system load)
function determineSamplingRate() {
  const hour = new Date().getHours();
  if (hour >= 8 && hour < 20) { // Assuming higher load during the daytime
    return 0.1; // Sample 10% of requests
  }
  return 0.01; // Sample 1% of requests during off-peak hours
}

// Initialise the tracer provider with a sampler chosen at startup.
// TraceIdRatioBasedSampler keeps the given fraction of trace IDs; the rate is
// fixed when the provider is created, so use a custom sampler (or restart) if
// it needs to change while the process is running.
const provider = new NodeTracerProvider({
  resource: new Resource({ 'service.name': 'your-service-name' }),
  sampler: new TraceIdRatioBasedSampler(determineSamplingRate()),
});

// Configure the Jaeger exporter and attach it via a span processor
const jaegerExporter = new JaegerExporter({
  endpoint: 'http://localhost:14268/api/traces',
});
provider.addSpanProcessor(new SimpleSpanProcessor(jaegerExporter));

// Register the provider as the global tracer provider
provider.register();

// Use the tracer
const tracer = trace.getTracer('example-tracer');

// Now you can create spans for critical paths
const span = tracer.startSpan('critical-operation');
try {
  // Critical operation here
} finally {
  span.end();
}

Adaptive Sampling

The following example shows how you might implement adaptive sampling with a custom sampler, keeping traces that are likely to offer valuable insights, such as those with errors or high latencies.

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
  SamplingDecision,
} = require('@opentelemetry/sdk-trace-base');

// Default sampler for ordinary traces: keep 1%
const defaultSampler = new TraceIdRatioBasedSampler(0.01);

// Custom sampler that inspects the attributes supplied at span creation and
// always keeps spans flagged as errors or carrying a high recorded latency.
// Note that a span's duration is only known after it ends, so latency-based
// decisions generally belong in tail-based sampling; this check only works
// for latency values recorded upstream and passed in as start-time attributes.
const adaptiveSampler = {
  shouldSample(context, traceId, spanName, spanKind, attributes, links) {
    const latencyThreshold = 1000; // 1000 milliseconds threshold

    if (attributes['error'] === true) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED }; // Always keep errors
    }
    if (attributes['http.response_time'] > latencyThreshold) {
      return { decision: SamplingDecision.RECORD_AND_SAMPLED }; // Keep high-latency requests
    }
    // Otherwise fall back to the 1% ratio sampler
    return defaultSampler.shouldSample(context, traceId, spanName, spanKind, attributes, links);
  },
  toString() {
    return 'AdaptiveSampler';
  },
};

// Apply the adaptive logic at the root of each trace and respect the parent's
// decision for child spans, so each trace is sampled consistently end to end.
const sampler = new ParentBasedSampler({ root: adaptiveSampler });

// Provider setup as in the previous example
const provider = new NodeTracerProvider({ sampler });
provider.register();

// Continue with tracer setup and usage

Depending on the specifics of your application and observability infrastructure, you may need to adjust configurations, sampling strategies, and tool integrations. Observability pipeline tools like Datable can perform sampling after your data has been collected, which makes integrating this kind of sampling significantly less time-consuming. The OpenTelemetry Collector also offers tail-based sampling out of the box via its tail sampling processor, which is worth exploring.

Reducing your observability costs with Datable

Datable is an observability pipeline tool that offers a flexible approach to managing telemetry traffic, enabling detailed routing, filtering, and transformations of data to optimise observability costs and improve operational efficiency. 

In practice, Datable is essentially a real-time event processing engine, where data processing and manipulation are done with arbitrary JavaScript. This allows for granular control over telemetry data and opens up a wide range of possibilities. Here are just a few, followed by a sketch of what such a transform might look like:

  1. Categorise and route telemetry based on its significance - for example, errors go to SSD-backed storage for quick access, warm data to EBS magnetic volumes, and less relevant logs and traces to S3.
  2. Normalise telemetry data into a format like Parquet for easy analysis with tools like Spark, Presto, or Amazon Athena.
  3. Filter PII and irrelevant information out of logs to reduce costs and ensure regulatory compliance.
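As an illustration of points 1 and 3, here is a minimal sketch of the kind of JavaScript transform you might run in an observability pipeline. The function signature and record shape are hypothetical stand-ins for illustration, not Datable's actual API:

// Hypothetical transform: drop noisy debug logs, scrub common PII fields, and
// tag each record so a later routing step can pick a storage tier.
// The record shape here is an assumption for illustration only.
const PII_FIELDS = ['email', 'phone_number', 'ip_address', 'credit_card'];

function transform(record) {
  // Drop low-value debug logs entirely to cut ingest volume
  if (record.severity === 'DEBUG') {
    return null;
  }

  // Redact PII attributes in place for regulatory compliance
  for (const field of PII_FIELDS) {
    if (record.attributes && field in record.attributes) {
      record.attributes[field] = '[REDACTED]';
    }
  }

  // Tag errors for fast (SSD) storage and everything else for colder storage
  record.route = record.severity === 'ERROR' ? 'hot' : 'cold';

  return record;
}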

If you’re interested in checking out Datable and taking control of your telemetry data, sign up today.

Conclusion

Observability is an evolving field, and while I can’t promise anything, I think these suggestions will take you a long way for a long while. If you have any questions, reach out to julian@datable.io.

Until next time!