Observability pipelines are a fairly new product category, and there’s a lot of terminology floating around that may seem contradictory or confusing. To help you navigate this emerging space, we’re going to break down what observability pipelines are - what they do, why you might need one, and how to choose the right tools for the job.
Key Takeaways
- An observability pipeline is a real-time, extract-transform-load (ETL) pipeline for telemetry data (logs, metrics and traces). It enables the user to pre-process data before it lands in a destination (for example, S3, Elasticsearch, or Datadog).
- Vendors like Mezmo, Cribl and Datable.io sell tools to make building these pipelines easier, with varying levels of operational complexity and capabilities.
- At their best, observability pipelines give insight into your data and help you take actions to reduce cloud costs, improve data quality and governance, and boost developer productivity.
- Not every observability pipeline is created equal. What kinds of telemetry data does it support? Is it a control plane, or a streaming pipeline? Does it support stateful processing? These questions will help you pick the right tool for your use case.
For a quick refresher on observability terminology, check out our blogs on logs, metrics, and traces.
How we got here
To better understand the what and why of observability pipelines, we’ll start with an overview of a typical observability setup - without observability pipelines.
Imagine the architectural components of a simplified monolithic e-commerce platform, circa 2010. You have a client for users to interact with your site, a web server for handling traffic, a core backend service where business logic happens, and a relational database. It’s all deployed on the same rack, with interprocess communication happening locally.
To start collecting and leveraging telemetry data, we might begin with logs. Back in the before times, you’d have to page the person working at your data centre, and they would log into the physical machine to surface application logs. Next came SSHing into your hardware - but once your app starts running on multiple machines, SSHing alone isn’t sufficient.
That brings us to today’s centralised log forwarders. Assuming the web server and core service application are both emitting logs at runtime, we would add a log forwarder to run next to our app, ingesting the logs as they are generated and sending them to some destination for processing and analysis. That’s one forwarder per deployed app. This is the world of Fluentd and Logstash (the L in the ELK stack).
In a modern setup, you’d use a lightweight log forwarder like Vector or Fluent Bit. Alternatively, you could use a more general-purpose telemetry data forwarder - in OpenTelemetry parlance, this is called a collector. OTel collectors don’t stop at ingesting logs - they also support ingesting and forwarding metrics and traces.
It’s important to note that while collectors and log forwarders may support some light transformations on your data, they are both designed for high throughput and low latency, not for heavy processing.
Now we can see our logs from outside our application. Next, let’s say we want to collect metrics. For metrics (and traces), agents are generally required to extract the information you need from the application at runtime. Agents act from within your application or infrastructure to collect and forward internal information - in the case of metrics, this could be CPU usage, response times, or requests per second. They can be open source (e.g. Prometheus, Nagios, OpenTelemetry) or proprietary (e.g. New Relic, Datadog).
For simplicity, our agents will expose metrics on a /metrics endpoint, which our Prometheus deployment scrapes at a regular interval.
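As a rough illustration, here’s what exposing that endpoint could look like in a Node.js service using the prom-client and express packages (the metric and route names are just examples):

```javascript
// Minimal sketch: expose a /metrics endpoint for Prometheus to scrape.
// Assumes the express and prom-client packages; metric names are illustrative.
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// Collect default process metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ register });

// A custom counter for tracking requests per route
const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests received',
  labelNames: ['route'],
  registers: [register],
});

app.get('/checkout', (req, res) => {
  httpRequests.inc({ route: '/checkout' });
  res.send('ok');
});

// Prometheus pulls from this endpoint on its configured scrape interval
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
```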
Between logs and metrics, we should have a solid grasp of the health of our system. If we wanted traces, we’d again have to instrument our app with language-native agents to generate that data and forward it to a collector, or straight to a vendor like New Relic.
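In a Node.js app, that instrumentation might look roughly like the sketch below, using the OpenTelemetry Node SDK and assuming a collector listening on the default OTLP/HTTP port (4318):

```javascript
// Minimal sketch: auto-instrument a Node.js app and export traces to a local
// OpenTelemetry Collector over OTLP/HTTP. The endpoint URL is an assumption.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // default collector OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```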
As you may have noticed, our sample stack does not have a particularly modern architecture. Many orgs reach for an event-driven microservices architecture, with containerized apps communicating asynchronously. With each container generating its own metrics, logs and traces, observability for cloud-native deployments is particularly complex to implement and maintain.
Adoption of the OTel collector is growing, as it simplifies the collection and forwarding of all telemetry data (logs, metrics and traces). This centralises the collection process, allowing for simplified config management.
Collectors are good for parsing and redirecting data, but they are generally difficult to operate, limited in their processing capabilities, and give no visibility into the data you’re collecting and forwarding. In most orgs, simple adjustments require complicated configuration management that involves multiple stakeholders. Something as simple as dropping info logs or configuring sampling can take days to sort out. Ultimately, these tools are not designed to support cross-functional data exploration, or to perform complex (read: meaningful) transformations.
That’s where observability pipelines come in.
What is an Observability Pipeline?
Observability pipelines are real-time data processing pipelines that ingest, transform and route telemetry data. Observability pipeline vendors like Mezmo, Cribl or Datable.io offer platforms for building and managing these pipelines.
Without an observability pipeline, telemetry data is collected and forwarded to downstream consumers in (roughly) the same form as it was emitted. As mentioned above, collectors and forwarders are designed to be lightweight, with limited processing capabilities. Without a layer for standardisation or normalisation, data quality suffers, making it harder to extract meaningful insights and respond to outages.
With an observability pipeline tool in place, data in your pipeline can be aggregated, sampled, transformed, filtered, and routed easily.
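As a rough sketch of what that processing can look like, the snippet below (plain JavaScript, with a made-up event shape rather than any vendor’s API) drops debug logs, samples info logs, and normalises timestamps before events are forwarded:

```javascript
// Illustrative only: the event shape and function name here are hypothetical,
// not any specific pipeline tool's API.
function processLogEvent(event) {
  // Filter: drop noisy debug logs entirely
  if (event.level === 'debug') return null;

  // Sample: keep roughly 10% of info logs, but every warning and error
  if (event.level === 'info' && Math.random() > 0.1) return null;

  // Transform: normalise the timestamp field before forwarding
  return { ...event, timestamp: new Date(event.timestamp).toISOString() };
}
```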
Why are Observability Pipelines Necessary?
When applications are instrumented for telemetry, even small applications can produce massive amounts of real-time event data. To get the most out of that telemetry data and minimise costs, it’s necessary to process that data before it’s sent to downstream consumers (e.g. data warehouses, Elasticsearch clusters, or vendors like Datadog and New Relic).
Observability pipelines let you quickly understand the data you're collecting, and make decisions to manage the complexity, volume, and quality of telemetry data. But systems that can handle the bandwidth, throughput, and processing that telemetry data requires are typically expensive to build and maintain in-house.
Observability pipeline tools sit between your application and the destination, providing a singular interface for pre-processing data. They can be used for a wide range of data transformation tasks, including removing PII, enforcing data models, alerting on data spikes, and directing traffic between different downstream applications to reduce redundant data storage and processing.
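To make one of those tasks concrete, a PII-scrubbing and schema-enforcement step might look something like the sketch below (the event fields are hypothetical, and the exact mechanism depends on the pipeline tool):

```javascript
// Illustrative sketch: redact email addresses and enforce a minimal schema
// before an event leaves the pipeline. Field names are hypothetical.
const EMAIL_PATTERN = /[\w.+-]+@[\w-]+\.[\w.-]+/g;

function scrubAndValidate(event) {
  // Redact PII: replace anything that looks like an email address
  const message = String(event.message || '').replace(EMAIL_PATTERN, '[REDACTED]');

  // Enforce a minimal data model: required fields with sane defaults
  return {
    service: event.service || 'unknown',
    level: event.level || 'info',
    timestamp: event.timestamp || new Date().toISOString(),
    message,
  };
}
```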
Benefits of Observability Pipelines
The biggest reason devs turn to observability pipelines is to make their telemetry data more useful, with the aim of improving operational efficiency and minimising costs. After all, less data is easier to manage and faster to search.
Data Quality & Governance
One way to improve operational efficiency is to improve the quality and richness of your telemetry data. This, in turn, improves the usability of that data, which can lead to tangible differences in MTTR. In practice, this looks like schema enforcement, data normalisation and enrichment, and more. Additionally, many observability pipelines support filtering for PII, helping to ensure regulatory compliance.
Cost Reduction
Observability pipelines can lower observability costs in a few ways. They decouple the source systems that generate data from downstream observability vendors like New Relic or Datadog, which makes switching vendors simple - and that alone can provide significant leverage in pricing negotiations. They also let you direct and split traffic, which can help organisations with many downstream consumers reduce the cost of data ingestion and storage.
For example, lower-value data can be redirected to S3, while high-impact data is forwarded to Datadog, or to multiple vendors simultaneously, such as Splunk and Snowflake. Observability pipelines can also help downsample and filter irrelevant data, further minimising the footprint of your telemetry data.
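A routing rule along those lines might look like the following sketch (the destination names and the notion of “value” are illustrative):

```javascript
// Illustrative sketch: split traffic so that low-value events go to cheap
// object storage while high-signal events also go to an analytics vendor.
// Destination names and the routing criteria here are assumptions.
function routeEvent(event) {
  if (event.level === 'error' || event.service === 'payments') {
    return ['datadog', 's3-archive']; // high-impact data: analyse and archive
  }
  return ['s3-archive']; // everything else: cheap long-term storage only
}
```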
Reduced Tool Sprawl
Without a global view of your telemetry data, tracking down what is going where gets complicated fast, and many orgs end up paying for redundant feature sets and data storage. With an observability pipeline in place, managing data streams across different downstream consumers is faster and easier: adopting or dropping a tool is as simple as adding or removing a pipeline. This helps consolidate tool sprawl and optimise storage costs.
Choosing an Observability Pipeline
Telemetry Data Support
Most observability pipelines support only one or two of the three major telemetry data types (logs, metrics, and traces), not all three. Datable.io can ingest all three, offering a unified interface for insight into your observability data.
Transformation Capabilities
Another important factor to consider when choosing an observability pipeline platform is what kind of transformations can be applied to your telemetry data, and how those transformations are implemented.
Companies like Mezmo and Cribl, or tools like Vector, offer rigid, pre-defined functions you can run against your data for simple transformations. Others, notably Datadog, use the term “observability pipelines” to describe their control plane product. A control plane differs from an observability pipeline in that it acts as a central repository for configurations that are applied to your collectors and forwarders. This enables efficient config management, but does not unlock advanced processing features.
Datable.io is the only observability pipeline tool that lets you write arbitrary transformations against your data in JavaScript. This means you can perform the exact transformations you need through a few lines of JS.
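For a flavour of what that can look like, here’s a hypothetical transform that tags slow checkout requests and drops a bulky field (the function signature and event fields are illustrative, not Datable.io’s documented API):

```javascript
// Hypothetical example of an arbitrary JavaScript transformation:
// tag slow checkout requests and trim a large field before forwarding.
// The signature and field names are illustrative, not a documented API.
function transform(event) {
  if (event.route === '/checkout' && event.duration_ms > 2000) {
    event.tags = [...(event.tags || []), 'slow-checkout'];
  }
  delete event.request_body; // drop a large field we never query downstream
  return event;
}
```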
Routing Options
Proprietary pipelines often require vendor-specific tooling and destinations. Datable.io supports destinations like S3, Datadog, and New Relic, as well as any HTTP endpoint, which makes testing different vendors simple. By decoupling your observability stack from any single vendor, you also open up the opportunity to negotiate better rates - we’ve seen reductions of 30-50% on annual vendor bills.
Wrapping Up
Observing your system’s health is table stakes for building a successful product. But modern tooling introduces a lot of complexity, redundancy, and costs. Observability pipelines act as an intermediary processing layer, helping you take control over your observability ecosystem and maximise the value of your telemetry data.
If you’re interested in giving Datable.io a try, sign up today.
Thanks for reading, until next time!