It’s easy for observability to fall by the wayside until you’ve hit a critical mass of technical debt or infra spend, or received a surprisingly costly vendor bill. But there are a number of straightforward, practical tactics you can use to improve the quality of your telemetry data and minimise your observability spending.
We are going to focus on three techniques, one for each of the “pillars” of observability:
- Migrating from unstructured logs to structured logs
- Effectively managing metrics and cardinality, and
- Selectively sampling traces
And today’s focus is on logging. What fun! Let’s get started.
What is a log?
A log (aka “event log” for the educated, or “that wall of text that comes out of the computer” for the rest of us) is an immutable, timestamped record of a discrete event that happened at some point in time. Logs are great for debugging at a fine level of granularity. They provide insights into the long tail that averages and percentiles don’t. Logs are also the easiest to generate. Logs can be plaintext, structured, or binary.
Plaintext logs (aka unstructured logs) are the easiest to implement, but to get real value out of them you’ll have to write custom regex scripts or grok rules to parse those logs into structured data. Plaintext is wildly inefficient to query, and the lack of standardisation makes debugging harder. Over time, convention has brought us to semi-structured logs like the nginx and Apache access log formats, where each position in the line has a specific meaning.
As an example, here’s a semi-structured log:
127.0.0.1 - - [26/Mar/2016:19:09:19 -0400] "GET / HTTP/1.1" 401 194 "" "Mozilla/5.0 Gecko" "-"
And here’s a parsing rule:
NGINX_ACCESS %{IPORHOST:clientip} (?:-|(%{WORD}.%{WORD})) %{USER:ident} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent} %{QS:forwarder}
The final form of log data is structured logs, where logs are emitted as objects. These are the easiest to query, visualise, and track, and will otherwise help keep your sanity. Structured logs aren’t a silver bullet - they aren’t very fun for local development - but they’re well worth the upgrade in production.
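For contrast, here’s roughly what the nginx access line above could look like as a structured log (the field names here are illustrative, not a standard):
{
  "timestamp": "2016-03-26T19:09:19-04:00",
  "level": "info",
  "message": "handled request",
  "client_ip": "127.0.0.1",
  "method": "GET",
  "path": "/",
  "http_version": "1.1",
  "status": 401,
  "bytes": 194,
  "user_agent": "Mozilla/5.0 Gecko"
}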
Migrating from Unstructured Logs to Structured Logs
1. Define a Log Format
Decide on a structured format that all logs should adhere to. JSON is the most common choice and is compatible with virtually every tool and vendor. Every language has logging frameworks to help with this transition.
2. Standardise Log Content
Easy mode: a single wide JSON object per event - just make sure to enforce standards on its shape.
Determine the shape of your log data and enforce it. At a minimum, all logs should include:
- Timestamp: The exact time the event occurred.
- Log Level: Severity of the log (e.g., DEBUG, INFO, WARN, ERROR).
- Message: A human-readable message describing the event.
- Service Name: The name of the application or service generating the log.
- Host / container name: The name of the host or container where the log was emitted.
- Unique identifiers: trace_id and span_id help correlate related logs, but request ID, user ID, or transaction ID are suitable options in lieu of trace_id.
- All your other business-specific keys (see the example below)
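Put together, a minimal structured log carrying these fields might look something like this (the values and service names are illustrative):
{
  "timestamp": "2016-03-26T19:09:19-04:00",
  "level": "ERROR",
  "message": "Failed to authorise request",
  "service": "checkout-api",
  "host": "web-7f9c",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "user_id": "12345",
  "http_status": 401
}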
For advanced use cases, I recommend following the OTel standard for your logging data model. In the OTel standard, keys have a specific meaning, and will activate various views within vendor environments. This makes instrumenting and setting up dashboards much faster and more reliable.
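As a rough sketch of what that looks like (check the current OTel spec for the authoritative field names and semantic conventions), an OTel-style log record separates resource attributes from the log body and event attributes:
{
  "Timestamp": "2016-03-26T23:09:19Z",
  "SeverityText": "ERROR",
  "Body": "Failed to authorise request",
  "TraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "SpanId": "00f067aa0ba902b7",
  "Resource": { "service.name": "checkout-api", "host.name": "web-7f9c" },
  "Attributes": { "url.path": "/", "http.response.status_code": 401 }
}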
3. Implement a Logging Library or Framework
Once you’ve aligned on a data model and standards, it’s time to actually structure the logs. While every language/framework/library comes with logging built in, these print methods often introduce unwanted overhead that can compromise the performance of your system. They also lack a consistent method of correlating log lines. I recommend adopting a framework specifically for structured logging in your given language.
Popular choices include Pino for Node.js, structlog for Python, and zap for Go, and most other languages have an equivalent. Below is a code snippet to demonstrate using the logging library Pino in Node.js:
const express = require('express');
const pino = require('pino');
const expressPino = require('express-pino-logger');

// Create a Pino logger instance
const logger = pino({ level: process.env.LOG_LEVEL || 'info' });

// Create an Express logger middleware
const expressLogger = expressPino({ logger });

const app = express();
const port = 3000;

// Use the Express logger middleware to automatically log all incoming requests
app.use(expressLogger);

app.get('/', (req, res) => {
  // Example of a manual log entry
  logger.info('Handling request for the root route');
  res.send('Hello, world!');
});

app.get('/about', (req, res) => {
  // This could be a more detailed log, including dynamic data
  logger.info({ route: req.url }, 'Visited the about page');
  res.send('This is the about page.');
});

// Error handling middleware
app.use((err, req, res, next) => {
  logger.error(err, 'Something went wrong!');
  res.status(500).send('Internal Server Error');
});

app.listen(port, () => {
  logger.info(`Server running on http://localhost:${port}`);
});
4. Log at the Source
Replace old logging statements with your new library. Most structured logging libraries will make it easy to swap out the standard logger, and give you timestamps, messages, and pid decoration out-of-the-box. Where possible, you want to enrich your logs to include trace_id, span_id, and any other business information like account_id or user_id. Beware, though - you’ll want some standard in place across your teams (some of the most successful log schemas I’ve seen are managed out of a spreadsheet!).
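As a sketch of what that enrichment can look like with Pino (continuing the Express example above, which provides `app` and `logger`; extractTraceContext is a hypothetical helper standing in for however you propagate trace context, and the field names are placeholders for your own schema):

// Hypothetical helper: pull trace/span IDs from an incoming
// W3C traceparent header, or from your tracing SDK.
const extractTraceContext = (req) => {
  const traceparent = req.headers['traceparent'] || '';
  const [, trace_id, span_id] = traceparent.split('-');
  return { trace_id, span_id };
};

app.use((req, res, next) => {
  const { trace_id, span_id } = extractTraceContext(req);
  // A child logger binds these fields to every log line emitted for this request
  req.log = logger.child({
    trace_id,
    span_id,
    user_id: req.headers['x-user-id'], // placeholder for however you identify the caller
  });
  next();
});

app.get('/checkout', (req, res) => {
  // trace_id, span_id, and user_id are included automatically
  req.log.info({ cart_items: 3 }, 'Starting checkout');
  res.send('ok');
});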
5. Implement a Centralised Log Forwarder
Don’t ship your logs over the open internet directly from your application. The latency of sending logs to a vendor endpoint (say, logs.datadog.com) could bring down your app while a worker thread waits on an HTTP connection. Instead, opt for a log forwarder. Logstash is great if you were born before 1980; otherwise, reach for Vector or Fluent Bit. All of the major forwarders are interoperable with a wide range of backends, so they’ll work whether you’re shipping to Elasticsearch or Splunk, and they let you parse, filter, and route your logs to different destinations. The OpenTelemetry Collector is also an option - it’s great for OTLP data, but for tailing a log file it’s not yet at the same state as the traditional forwarders, so stick with the classics there.
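To make the forwarder step concrete, here’s a rough sketch of a Vector config that tails a log file, parses each line as JSON, and ships it to an HTTP endpoint (the path and endpoint URL are placeholders, and exact option names can vary by Vector version - check the docs before copying this):

# vector.toml - tail structured app logs, parse them, and forward them
[sources.app_logs]
type = "file"
include = ["/var/log/myapp/*.log"]        # placeholder path

[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)  # each line is expected to be a JSON object
'''

[sinks.log_backend]
type = "http"
inputs = ["parse_json"]
uri = "https://logs.example.com/ingest"   # placeholder endpoint
encoding.codec = "json"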
6. Monitoring and Alerting
Set up monitoring and alerting based on the structured log data. Structured logs make it easier to create more precise alerts, as you can query specific fields and values rather than relying on text pattern matching.
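For example (purely illustrative - query syntax varies by vendor), an alert built on structured fields can target exactly the events you care about instead of pattern-matching message text:

# Unstructured: brittle text matching
message matches "payment *failed*" OR message matches "*timed out*"

# Structured: precise field-based query
service:payments level:error http_status:>=500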
The real superpower of structured logs comes from troubleshooting and all the other fun things they unlock, like:
- Detecting DDoS attacks
- Running business reports
- Querying and visualising your data by different dimensions on the fly
Best Practices
Here are some best practices to keep in mind when implementing your new logging regime:
- Consistency is key. All parts of the app or system should use the same logging structure and conventions.
- Follow strict guidelines on what kind of sensitive information can be logged. If sensitive information is logged, filtering may be necessary during the processing phase. This can be done with in-house event processing solutions, or with observability pipeline tools like Datable.
- Use nested JSON structures sparingly. Flatten whenever possible.
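For example, prefer the flat (or dotted-key) form over deep nesting - the key names here are just illustrative:
Nested (harder to index and query consistently):
{ "http": { "request": { "method": "GET", "path": "/" } } }
Flattened:
{ "http.request.method": "GET", "http.request.path": "/" }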
Conclusion
Migrating from unstructured to structured logs unlocks efficient querying, monitoring, and readability, making it easier to understand your application’s behaviour and performance. Adopting the above practices not only streamlines debugging and discovery of operational insights, but positions you to more effectively manage metrics and traces in the future.
Datable.io makes it easy to structure logs, enforce schemas and more. If you're ready to take your observability to the next level, sign up for our private beta waitlist.
In the next part of this series, I’ll take a look at metrics, and some techniques to reduce the cost of storing metric data.
Until next time!