Observability for DevOps — Logs, Metrics, Traces, and Beyond
In modern DevOps, observability is no longer optional—it’s critical. But it’s more than just throwing in a logging library or checking a Grafana dashboard. Observability is about understanding your system from the inside out, even when things go wrong.
In this guide, we’ll cover practical approaches to implement robust observability using tools like Prometheus, Grafana, Loki, Tempo, OpenTelemetry, and ELK Stack.
What Is Observability (And What It’s Not)
Observability isn’t the same as monitoring.
- Monitoring is knowing something is wrong.
- Observability is understanding why it’s wrong.
A fully observable system allows you to:
- Ask new questions without deploying new code
- Debug issues across distributed systems
- Measure SLOs, detect anomalies, and trace dependencies
The Three Pillars: Logs, Metrics, Traces
1. Logs — The Narrative
Logs tell you what happened.
✅ Use structured logging (JSON):
{
  "timestamp": "2025-06-23T09:34:56Z",
  "level": "error",
  "message": "DB connection failed",
  "context": {
    "user_id": 42,
    "retry_count": 3
  }
}
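In a Node.js service, one way to produce this shape is a structured logger such as pino; a minimal sketch (the field names simply mirror the example above):

import pino from "pino";

// pino emits one JSON object per line (level, time, msg, plus extra fields)
const logger = pino({ level: "info" });

// Extra fields passed as the first argument become part of the log entry,
// so Loki or Elasticsearch can index and filter on them later.
logger.error(
  { context: { user_id: 42, retry_count: 3 } },
  "DB connection failed"
);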
Use a centralized logging stack:
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Loki + Grafana (lighter weight, a good fit for Kubernetes)
Index logs by:
- Service name
- Environment
- Correlation ID (important for tracing)
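Once logs land in Loki with those labels, queries can slice on them directly. A hypothetical LogQL query (label and field names are illustrative, assuming the JSON shape above plus a top-level correlation_id field):

{service="checkout", env="prod"} | json | level="error" | correlation_id="abc-123"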
2. Metrics — The Pulse
Metrics give you quantitative insight.
Use Prometheus to collect and store metrics:
# Scrape pods annotated with prometheus.io/scrape: "true"
- job_name: "kubernetes-pods"
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
Use cases:
- Requests per second (RPS)
- CPU/memory usage
- Queue lengths
- Error rate, latency (via histogram or summary)
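These only show up if the service exposes them. A minimal sketch using the prom-client library in a Node.js/Express service (the endpoint, metric names, and buckets are illustrative choices, not part of the original stack):

import express from "express";
import client from "prom-client";

// Built-in process metrics: CPU, memory, event loop lag, GC, etc.
client.collectDefaultMetrics();

// Counter for request/error rates, histogram for latency percentiles
const httpRequests = new client.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});
const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

const app = express();

app.get("/work", (_req, res) => {
  const stopTimer = httpDuration.startTimer();
  res.send("ok");
  httpRequests.inc({ method: "GET", route: "/work", status: "200" });
  stopTimer();
});

// Prometheus scrapes this endpoint (matching the scrape config above)
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);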
Visualize with Grafana and set up alerts:
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests return 5xx, sustained for 2 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in production"
3. Traces — The Journey
Traces show how requests flow across microservices.
Use OpenTelemetry to instrument services:
npm install @opentelemetry/sdk-node @opentelemetry/exporter-trace-otlp-grpc
Example:
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";

const sdk = new NodeSDK({
  serviceName: "auth-service",
  // Tempo's OTLP gRPC endpoint
  traceExporter: new OTLPTraceExporter({ url: "http://tempo:4317" }),
});

sdk.start();
Send traces to:
- Tempo (for Grafana integration)
- Jaeger (great for visualizing spans)
- Honeycomb, Datadog, etc.
Use correlation IDs to tie logs, metrics, and traces together.
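In practice, the trace ID makes a good correlation ID. A sketch that pulls it from the active OpenTelemetry span and attaches it to a log line (the pino logger is an assumption carried over from the logging section):

import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino();

export function logWithTrace(message: string): void {
  // Read the current span's IDs so a log line in Loki can be joined
  // to the matching trace in Tempo.
  const spanContext = trace.getActiveSpan()?.spanContext();
  logger.info(
    { trace_id: spanContext?.traceId, span_id: spanContext?.spanId },
    message
  );
}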
Beyond the Pillars: Events & Continuous Profiling
- Events (e.g., Git deploys, config changes) provide timeline context
- Continuous Profiling (e.g., Pyroscope, Parca) shows CPU/memory usage over time
🎯 These help diagnose performance issues or regressions that metrics alone can't explain.
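One lightweight way to capture deploy events is to push a Grafana annotation from CI. A sketch against Grafana's annotations HTTP API (the Grafana URL and API token are placeholders):

// Record a deploy marker that Grafana can overlay on dashboard panels.
async function recordDeploy(version: string): Promise<void> {
  await fetch("http://grafana:3000/api/annotations", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GRAFANA_API_TOKEN}`,
    },
    body: JSON.stringify({
      time: Date.now(), // epoch milliseconds
      tags: ["deploy", "production"],
      text: `Deployed ${version}`,
    }),
  });
}

recordDeploy("v1.2.3").catch(console.error);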
Building an Observability Stack (Example)
OSS Stack:
- Prometheus — metrics collector
- Loki — log aggregation
- Tempo — tracing backend
- Grafana — unified dashboard & alerting
Architecture:
[ App ]
  │
  ├─> OpenTelemetry SDK → Tempo (tracing)
  ├─> Structured logs → Loki
  └─> Prometheus exporters → Prometheus
                                ↘ Grafana
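For local experimentation, a stripped-down version of this stack can be described in Docker Compose. This sketch assumes default images and ports, and that prometheus.yml and tempo.yml exist alongside it; it is not a production setup:

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml  # scrape config from the metrics section
    ports:
      - "9090:9090"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yml"]
    volumes:
      - ./tempo.yml:/etc/tempo/tempo.yml
    ports:
      - "4317:4317"  # OTLP gRPC endpoint used by the OpenTelemetry SDK
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"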
Alerting Examples (PromQL sketches after the list):
- CPU > 85% for 5 minutes
- Request latency > 1s P95
- Error rate > 5%
- Missing heartbeat (dead service)
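Rough PromQL shapes for these alerts (metric names assume node_exporter and the HTTP histogram sketched earlier; thresholds and windows are illustrative):

# CPU > 85% for 5 minutes (node_exporter)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 85

# P95 request latency > 1s
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1

# Error rate > 5%
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05

# Missing heartbeat (target stopped reporting)
absent(up{job="auth-service"} == 1)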
Best Practices
- Use unique request IDs and propagate them
- Standardize labels/tags across telemetry
- Alert on SLOs, not just raw metrics
- Use dashboards for context, not just charts
- Store long-term logs in S3 or GCS buckets
- Automate alert routing (e.g., PagerDuty, Slack, Email)
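Routing itself usually lives in Alertmanager. A minimal sketch that pages on critical alerts and sends everything else to Slack (receiver names, webhook URL, and integration key are placeholders):

route:
  receiver: slack-default
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: pagerduty-oncall

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>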
Conclusion
Observability empowers DevOps teams to move fast and sleep at night. With the right stack and discipline, you can trace every issue to its root cause and prove reliability over time.
If you’ve already implemented an observability stack in your projects, how was your experience? Share your story on LinkedIn.