As software systems grow more complex and dynamic, observability has become a critical aspect of operational excellence. It enables teams to understand not just what is happening within their systems, but why, facilitating rapid debugging, optimisation, and decision-making.
A key part of enhancing observability is the efficient management of telemetry data: logs, metrics, and traces. Each of these "pillars" of observability presents its own challenges and opportunities for optimisation.
In this guide, I dive into metrics, focusing on their role in observability, the challenge of managing high-cardinality data, and strategies for efficiently handling that data without compromising insights.
What is a metric?
Metrics are numeric representations of data measured over intervals of time. Combined with mathematical modelling, they surface insights into how a system has behaved over time and how it is likely to behave in the future.
The role of cardinality
Cardinality refers to the uniqueness of data values in a set. In the context of databases, low cardinality means that a column has many duplicate values. Low-cardinality sets have some efficiency advantages: they compress well, and in columnar (OLAP) storage systems, low-cardinality columns are often cheaper to run queries against.
High cardinality, then, is when a column contains many unique values; common examples include unique ID columns and timestamp columns. Cardinality is an important concept in observability because high-cardinality information is often the most useful for debugging and understanding systems. Fields like user ID, request ID, container, and hostname are valuable to aggregate and sort by: they allow for pinpoint accuracy, identifying needles in the haystack of your telemetry data.
But high-cardinality data comes with inherent challenges. Cardinality multiplies across label combinations: a metric with 1,000 user IDs, 100 pages, and 10 action types can produce up to 1,000,000 distinct time series.
Metrics-based tooling systems typically have limits on the cardinality of any given dimension, particularly at scale. Most monitoring systems use some form of indexing to speed up data retrieval. High cardinality dimensions can overwhelm these indexes, making them less effective and slowing down lookups. Effective indexing is necessary for real-time monitoring and analysis, so maintaining manageable cardinality levels is essential for performance.
Storage and computing resources cost money. High cardinality data requires more of both, as it increases the volume of data stored and the computational power needed to process queries. By imposing cardinality limits, tooling systems help control costs and ensure that resource usage remains sustainable.
Limiting cardinality in your metrics backend
The simplest way to reduce the cardinality of your data is to filter labels out of it. While this is a quick and easy change, it comes at the expense of reducing the richness of the data you collect. Another quick approach, one that doesn't reduce visibility into your system, is to extract a label (like the name of the microservice) and embed it in the metric name. For example, this metric:
user_action_event_count{
  user_id: user-56789
  session_id: session-12345
  action_type: purchase
  page_id: page-987
  timestamp: 2023-03-17T12:34:56.789Z
  service_name: service_name_a
  value: 1
}
Can have its cardinality reduced by embedding the service name in the metric name:
service_name_a_user_action_event_count{
  user_id: user-56789
  session_id: session-12345
  action_type: purchase
  page_id: page-987
  timestamp: 2023-03-17T12:34:56.789Z
  value: 1
}
Grafana’s backend can use this information architecture to improve query performance: you’re essentially removing a portion of the aggregation and filtering required in every service-specific query.
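To make this concrete, here’s a sketch of the difference in PromQL, using the hypothetical metric names from the example above:

# Before: every service-specific query has to filter on the service_name label
sum(rate(user_action_event_count{service_name="service_name_a"}[5m]))

# After: the service is encoded in the metric name, so no label filter is needed
sum(rate(service_name_a_user_action_event_count[5m]))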
You can also filter by value: for example, if your metric looks like the user_action_event_count example above, you might drop specific action_type values that don’t carry information you’re interested in.
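In Prometheus, for instance, both kinds of filtering can be applied at scrape time via metric_relabel_configs. The following is a minimal sketch; the session_id label comes from the example above, and the scroll action type is a hypothetical value you might not care about:

metric_relabel_configs:
  # Drop the high-cardinality session_id label from every series
  - regex: session_id
    action: labeldrop
  # Drop entire series whose action_type carries no useful signal
  - source_labels: [action_type]
    regex: scroll
    action: drop

One caveat: dropping a label merges any series that differed only by that label, so make sure the remaining labels still uniquely identify the series you care about.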
These are effective ways to reduce cardinality, and most metrics backends support them out of the box. Next, we’ll look at client-side techniques to reduce cardinality.
Limiting cardinality at the source
For fields with very high cardinality, like user IDs, session IDs, or transaction IDs, hashing the values or bucketing them into broader categories or ranges can be an effective way to curtail runaway cardinality.
Here’s how to do both in Node.js, using the prom-client Prometheus library.
Hashing
const crypto = require('crypto');
const client = require('prom-client');

// Counter labelled with the hashed user ID and the activity type
const userActivityCounter = new client.Counter({
  name: 'user_activity_total',
  help: 'Count of user activity events, keyed by hashed user ID',
  labelNames: ['user_id', 'activity'],
});

// Function to hash user IDs
function hashUserID(userID) {
  return crypto.createHash('sha256').update(userID).digest('hex').substring(0, 16); // Shorten hash for brevity
}

// Function to report user activity metrics
function reportUserActivityMetric(userID, activity) {
  userActivityCounter.inc({ user_id: hashUserID(userID), activity: activity }, 1);
}

// Example usage
reportUserActivityMetric('user-1234567890', 'login');
reportUserActivityMetric('user-0987654321', 'purchase');
Bucketing
const client = require('prom-client');

// Histogram of session durations, labelled with a coarse duration bucket
const sessionDurationHistogram = new client.Histogram({
  name: 'session_duration_seconds',
  help: 'Session duration in seconds',
  labelNames: ['duration_bucket'],
});

// Function to bucket session durations into a handful of coarse ranges
function bucketSessionDuration(durationInSeconds) {
  if (durationInSeconds < 300) return '0-5 minutes';
  if (durationInSeconds < 600) return '5-10 minutes';
  if (durationInSeconds < 1800) return '10-30 minutes';
  return '30+ minutes';
}

// Report a session duration; the raw userID is deliberately not attached
// as a label, which is what keeps this metric's cardinality low
function reportSessionDurationMetric(userID, durationInSeconds) {
  const durationBucket = bucketSessionDuration(durationInSeconds);
  sessionDurationHistogram.observe({ duration_bucket: durationBucket }, durationInSeconds);
}

// Example usage
reportSessionDurationMetric('user-1234567890', 450); // 7.5 minutes
reportSessionDurationMetric('user-0987654321', 1200); // 20 minutes
The benefits of these approaches include:
- Hashing the user IDs anonymises the user data while still allowing for aggregated metrics per user activity.
- Bucketing durations drastically reduces the number of unique values that would otherwise be reported for each possible session length.
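A couple of design notes: truncating the digest to 16 hex characters keeps label values short, but it admits a small chance of collisions between users; keep more of the digest if exact per-user distinction matters. Conversely, hashing as shown anonymises values without reducing how many unique values there are, so if you also want fewer series, shorten the hash further or take it modulo a fixed number of buckets, trading identifiability for lower cardinality.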
Aggregations
Another way to limit cardinality and decrease the storage requirements of your metrics is to apply some pre-aggregation. Here’s an example of aggregating before ingestion, followed by one of calculating rolling aggregates.
Aggregate Before Ingestion
In this scenario, we collect metrics data over a specified interval and then aggregate this data before reporting it to the metrics system. This can significantly reduce the amount of data sent over the network and stored in the observability system.
const metricsClient = require('some-metrics-client'); // Hypothetical metrics client library
let activityCounts = {};
// Increment activity count
function incrementActivityCount(activity) {
activityCounts[activity] = (activityCounts[activity] || 0) + 1;
}
// Aggregate and report metrics every minute
setInterval(() => {
Object.keys(activityCounts).forEach(activity => {
const metricName = `user_activity.${activity}`; // e.g., user_activity.login
const count = activityCounts[activity];
// Report aggregated count to the monitoring system
metricsClient.gauge(metricName, count);
// Reset count for the next interval
activityCounts[activity] = 0;
});
}, 60000); // 60 seconds
// Example usage
incrementActivityCount('login');
incrementActivityCount('purchase');
Rolling Aggregates
For high-frequency data, maintaining rolling aggregates allows continuous summarisation of data points into aggregates like sums, averages, or counts, so that even frequently updated metrics are reported at a steady, bounded rate.
const metricsClient = require('some-metrics-client'); // Hypothetical metrics client library
let rollingSum = 0;
let rollingCount = 0;
// Function to add a new value to the rolling aggregate
function addValueToRollingAggregate(value) {
rollingSum += value;
rollingCount += 1;
}
// Report rolling average every minute and reset
setInterval(() => {
if (rollingCount > 0) {
const rollingAverage = rollingSum / rollingCount;
metricsClient.gauge('rolling_average', rollingAverage);
// Reset for the next interval
rollingSum = 0;
rollingCount = 0;
}
}, 60000); // 60 seconds
// Example usage
addValueToRollingAggregate(10);
addValueToRollingAggregate(20);
addValueToRollingAggregate(30);
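The trade-off with both forms of pre-aggregation is fidelity: once events have been rolled up into a sum or an average, you can no longer recover the individual data points or derive new statistics (like percentiles) from them, so it’s worth deciding up front which aggregates you’ll actually need.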
Conclusion
The strategies outlined here, from filtering and restructuring labels in the backend to hashing, bucketing, and pre-aggregating at the source, provide a framework for handling high-cardinality metric data without incurring excessive costs or compromising data richness.
Observability pipeline tools like Datable.io can be used to apply unified aggregation and filtering logic on your metrics data before it lands with a vendor. To check out what we’re building, sign up for the private beta waitlist here.
Until next time!