Observability for DevOps — Logs, Metrics, Traces, and Beyond
In modern DevOps, observability is no longer optional—it’s critical. But it’s more than just throwing in a logging library or checking a Grafana dashboard. Observability is about understanding your system from the inside out, even when things go wrong.
In this guide, we’ll cover practical approaches to implement robust observability using tools like Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and ELK Stack.
What Is Observability (And What It’s Not)
Observability isn’t the same as monitoring.
- Monitoring is knowing something is wrong.
- Observability is understanding why it’s wrong.
A fully observable system allows you to:
- Ask new questions without deploying new code
- Debug issues across distributed systems
- Measure SLOs, detect anomalies, and trace dependencies
The Three Pillars: Logs, Metrics, Traces
1. Logs — The Narrative
Logs tell you what happened.
✅ Use structured logging (JSON):
{
  "timestamp": "2025-06-23T09:34:56Z",
  "level": "error",
  "message": "DB connection failed",
  "context": {
    "user_id": 42,
    "retry_count": 3
  }
}
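In a Node.js service, one way to produce this shape is a structured logger such as pino; a minimal sketch (the field names simply mirror the example above):

import pino from "pino";

// pino emits one JSON object per line (level, time, msg, plus extra fields)
const logger = pino({ level: "info" });

// Extra fields passed as the first argument become part of the log entry,
// so Loki or Elasticsearch can index and filter on them later.
logger.error(
  { context: { user_id: 42, retry_count: 3 } },
  "DB connection failed"
);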
Use a centralized logging stack:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Loki + Grafana (lighter weight, a good fit for Kubernetes)
Index logs by:
- Service name
- Environment
- Correlation ID (important for tracing)
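Once logs land in Loki with those labels, queries can slice on them directly. A hypothetical LogQL query (label and field names are illustrative, assuming the JSON shape above plus a top-level correlation_id field):

{service="checkout", env="prod"} | json | level="error" | correlation_id="abc-123"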
2. Metrics — The Pulse
Metrics give you quantitative insight.
Use Prometheus to collect and store metrics:
# Scrape pods annotated with prometheus.io/scrape: "true"
- job_name: "kubernetes-pods"
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
Use cases:
- Requests per second (RPS)
- CPU/memory usage
- Queue lengths
- Error rate, latency (via histogram or summary)
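These only show up if the service exposes them. A minimal sketch using the prom-client library in a Node.js/Express service (the endpoint, metric names, and buckets are illustrative choices, not part of the original stack):

import express from "express";
import client from "prom-client";

// Built-in process metrics: CPU, memory, event loop lag, GC, etc.
client.collectDefaultMetrics();

// Counter for request/error rates, histogram for latency percentiles
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

const app = express();

app.get("/work", (_req, res) => {
  const stopTimer = httpDuration.startTimer();
  res.send("ok");
  httpRequests.inc({ method: "GET", route: "/work", status: "200" });
  stopTimer();
});

// Prometheus scrapes this endpoint (matching the scrape config above)
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);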
Visualize with Grafana and set up alerts:
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return 5xx, sustained for 2 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in production"
3. Traces — The Journey
Traces show how requests flow across microservices.
Use OpenTelemetry to instrument services:
npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc
Example:
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  serviceName: "auth-service",
  // Tempo's OTLP gRPC endpoint
  traceExporter: new OTLPTraceExporter({ url: "http://tempo:4317" }),
});

sdk.start();
Send traces to:
- Tempo (for Grafana integration)
- Jaeger (great for visualizing spans)
- Honeycomb, Datadog, etc.
Use correlation IDs to tie logs, metrics, and traces together.
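In practice, the trace ID makes a good correlation ID. A sketch that pulls it from the active OpenTelemetry span and attaches it to a log line (the pino logger is an assumption carried over from the logging section):

import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino();

export function logWithTrace(message: string): void {
  // Read the current span's IDs so a log line in Loki can be joined
  // to the matching trace in Tempo.
  const spanContext = trace.getActiveSpan()?.spanContext();
  logger.info(
    { trace_id: spanContext?.traceId, span_id: spanContext?.spanId },
    message
  );
}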
Beyond the Pillars: Events & Continuous Profiling
- Events (e.g., Git deploys, config changes) provide timeline context
- Continuous Profiling (e.g., Pyroscope, Parca) shows CPU/memory usage over time
🎯 These help diagnose performance issues or regressions that metrics alone can't explain.
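One lightweight way to capture deploy events is to push a Grafana annotation from CI. A sketch against Grafana's annotations HTTP API (the Grafana URL and API token are placeholders):

// Record a deploy marker that Grafana can overlay on dashboard panels.
async function recordDeploy(version: string): Promise<void> {
  await fetch("http://grafana:3000/api/annotations", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}`,
    },
    body: JSON.stringify({
      time: Date.now(), // epoch milliseconds
      tags: ["deploy", "production"],
      text: `Deployed ${version}`,
    }),
  });
}

recordDeploy("v1.2.3").catch(console.error);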
Building an Observability Stack (Example)
OSS Stack:
- Prometheus — metrics collector
- Loki — log aggregation
- Tempo — tracing backend
- Grafana — unified dashboard & alerting
Architecture:
[ App ]
  │
  ├─> OpenTelemetry SDK → Tempo (tracing)
  ├─> Structured logs → Loki
  └─> Prometheus exporters → Prometheus
                                ↘ Grafana
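For local experimentation, a stripped-down version of this stack can be described in Docker Compose. This sketch assumes default images and ports, and that prometheus.yml and tempo.yml exist alongside it; it is not a production setup:

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # scrape config from the metrics section
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yml"]
    volumes:
      - ./tempo.yml:/etc/tempo/tempo.yml
    ports:
      - "4317:4317"  # OTLP gRPC endpoint used by the OpenTelemetry SDK
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"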
Alerting Examples (PromQL sketches after the list):
- CPU > 85% for 5 minutes
- Request latency > 1s P95
- Error rate > 5%
- Missing heartbeat (dead service)
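Rough PromQL shapes for these alerts (metric names assume node_exporter and the HTTP histogram sketched earlier; thresholds and windows are illustrative):

# CPU > 85% for 5 minutes (node_exporter)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85

# P95 request latency > 1s
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1

# Error rate > 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

# Missing heartbeat (target stopped reporting)
absent(up{job="auth-service"} == 1)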
Best Practices
- Use unique request IDs and propagate them
- Standardize labels/tags across telemetry
- Alert on SLOs, not just raw metrics
- Use dashboards for context, not just charts
- Store long-term logs in S3 or GCS buckets
- Automate alert routing (e.g., PagerDuty, Slack, Email)
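Routing itself usually lives in Alertmanager. A minimal sketch that pages on critical alerts and sends everything else to Slack (receiver names, webhook URL, and integration key are placeholders):

route:
  receiver: slack-default
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>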
Conclusion
Observability empowers DevOps teams to move fast and sleep at night. With the right stack and discipline, you can trace every issue to its root cause and prove reliability over time.
If you’ve already implemented an observability stack in your projects, how was your experience? Share your story on LinkedIn.