Observability: Logging, Metrics, and Tracing for Production Systems

Building comprehensive observability into web and mobile backends using OpenTelemetry, Datadog, and structured logging practices.

Observability is the degree to which you can understand the internal state of your system from its external outputs. Systems with high observability let engineers answer questions they didn't anticipate in advance: they can debug novel failures by exploring data rather than relying on pre-defined dashboards. As systems become more distributed (microservices, edge functions, external APIs), observability is not optional; it's the only way to understand why things fail.

## The Three Pillars of Observability

**Logs** are timestamped records of events. They provide fine-grained context for specific events: "user 123 attempted checkout at 14:32:01, failed with payment error CARD_DECLINED, session_id: abc".

**Metrics** are aggregated numerical measurements over time: request rate, error rate, p50/p95/p99 latency, database query time, cache hit rate. Metrics are efficient for alerting and dashboards because they're pre-aggregated.

**Traces** record the path of a request through a distributed system. A trace for a checkout request might show: Edge Function (2ms) → API Server (45ms) → Payment Service (380ms) → Database (12ms) → Response (440ms total). Traces reveal which service is the bottleneck and how services interact.

The three pillars are most valuable when correlated: a spike in p99 latency (metric) triggers an alert, you drill into the traces from that time window to find the slow path, and then read the logs for those specific trace IDs to understand the root cause.

## OpenTelemetry: The Universal Standard

OpenTelemetry (OTel) is the vendor-neutral, open-source standard for instrumentation. Instead of choosing between Datadog, New Relic, and Honeycomb instrumentation libraries, you instrument once with OTel and route telemetry to any backend (or multiple backends) via the OTel Collector.
```typescript
// Initializing OpenTelemetry in Node.js.
// Load this file before any other application module so libraries can be patched.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  // With no explicit `url`, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT and
  // appends the /v1/traces path itself. An explicit `url` is used as-is, so it
  // would have to include that path.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Auto-instrumentation traces HTTP requests, database queries (pg, MySQL), Redis operations, and external HTTP calls without manual spans for every operation. Prisma is traced through its own `@prisma/instrumentation` package rather than this bundle.

## Structured Logging

Logs are most valuable when they're machine-readable and queryable. Structured JSON logs allow log aggregation platforms to index and search by field values:

```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino();
const span = trace.getActiveSpan(); // undefined when no span is active

// Unstructured (bad): hard to parse, search, or aggregate
console.log(`User 123 checkout failed: CARD_DECLINED`);

// Structured (good): machine-readable, searchable by field
logger.error({
  event: 'checkout_failed',
  user_id: '123',
  session_id: 'abc',
  error_code: 'CARD_DECLINED',
  payment_provider: 'stripe',
  amount_cents: 9999,
  trace_id: span?.spanContext().traceId,
});
```

Include the trace ID in every log line so you can correlate logs with traces. Use Pino (Node.js) or Zap (Go) for high-performance structured logging.

**Log levels**: Use levels consistently.

- `debug` for development troubleshooting (disabled in production)
- `info` for normal operations (request received, job completed)
- `warn` for unexpected but recoverable situations (cache miss fallback, retry succeeded)
- `error` for failures that require attention

Never log sensitive data (passwords, full credit card numbers, SSNs).
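To make the level and sensitive-data rules concrete, here is a minimal dependency-free sketch. It is not how Pino or Zap work internally; `formatLog`, `redact`, and the `SENSITIVE_KEYS` list are hypothetical names chosen for illustration, and only top-level fields are redacted.

```typescript
// Hypothetical minimal logger core; a real service would use Pino or similar.
type Level = 'debug' | 'info' | 'warn' | 'error';

const LEVEL_ORDER: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };

// Field names treated as sensitive in this sketch (an assumption; adjust per system).
const SENSITIVE_KEYS = new Set(['password', 'card_number', 'ssn']);

// Replace the value of any sensitive top-level field with a placeholder.
function redact(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return out;
}

// Returns the JSON line to emit, or null when `level` is below `minLevel`.
function formatLog(
  minLevel: Level,
  level: Level,
  fields: Record<string, unknown>,
): string | null {
  if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return null;
  return JSON.stringify({ level, time: new Date().toISOString(), ...redact(fields) });
}

// Production-style configuration: `info` and above are emitted, `debug` is dropped.
const line = formatLog('info', 'error', {
  event: 'checkout_failed',
  user_id: '123',
  card_number: '4242424242424242',
});
if (line !== null) console.log(line);
```

Redacting inside the logger, rather than at each call site, means a field that is accidentally passed in still never reaches the log pipeline.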
## Metrics and SLOs

Define Service Level Objectives (SLOs), which are measurable targets for system reliability and performance:

- **Availability SLO**: 99.9% of requests return a non-5xx response
- **Latency SLO**: 95th percentile API response time below 500ms
- **Error rate SLO**: Error rate below 0.1% of requests

Monitor SLOs with burn rate alerts: alert when you're consuming your error budget faster than is sustainable. A 99.9% availability SLO over a 30-day month allows about 43 minutes of downtime. A burn rate of 1x spends that budget in exactly 30 days; a burn rate of 10x exhausts it in about 3 days. High burn rate → immediate page; moderate burn rate → non-urgent alert during business hours.

**Key metrics to instrument**:

- Request rate by endpoint and status code
- Latency percentiles (p50, p95, p99) by endpoint
- Database query time and connection pool utilization
- Cache hit/miss rate
- External API call latency and error rates
- Background job queue depth and processing time

## Distributed Tracing Best Practices

**Propagate context**: Trace context (trace ID, span ID) must be propagated across service boundaries via HTTP headers (W3C Trace Context standard: `traceparent` header). OTel handles this automatically for instrumented HTTP clients and servers.

**Add business context to spans**: Auto-instrumentation captures technical context (HTTP method, status code, SQL query). Add business context manually:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// `userId`, `cart`, and `processPayment` come from the surrounding checkout handler.
// The tracer name 'checkout' is illustrative.
const tracer = trace.getTracer('checkout');

const span = tracer.startSpan('process_checkout');
span.setAttributes({
  'checkout.user_id': userId,
  'checkout.cart_item_count': cart.items.length,
  'checkout.total_cents': cart.totalCents,
});

try {
  await processPayment(cart);
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  // `error` is `unknown` in TypeScript, so narrow it before recording.
  span.recordException(error as Error);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw error;
} finally {
  // Always end the span, on success or failure, so it gets exported.
  span.end();
}
```

**Sampling**: Tracing every request at high traffic volumes is expensive. Implement head-based sampling (decide at the start of a request, keeping a fixed percentage of all traces) for baseline visibility.
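As a rough illustration of head-based sampling, the sketch below derives the keep-or-drop decision from the trace ID alone, so every service that sees the same trace reaches the same answer. `shouldSample` is a hypothetical helper, not the OpenTelemetry algorithm, and it assumes the low bits of the trace ID are uniformly random.

```typescript
// Hypothetical helper: keep roughly `ratio` of all traces, deterministically per trace ID.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Treat the last 8 hex characters (32 bits) of the 32-character trace ID
  // as a number in [0, 2^32) and keep the trace if it falls below the threshold.
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

// Example trace ID in W3C Trace Context format (16 bytes, hex-encoded).
const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(shouldSample(traceId, 0.1));
```

In a real deployment you would normally not hand-roll this: the OTel SDK provides `TraceIdRatioBasedSampler`, usually wrapped in `ParentBasedSampler` so that child services follow the decision made at the root of the trace.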
Add tail-based sampling (keep 100% of traces with errors or high latency) to ensure failures are always captured. Tail-based decisions are made after a trace completes, so they typically run in the OTel Collector rather than in the application.

## Alerting Philosophy

Alert on symptoms, not causes. Alert when users are affected, not when internal metrics deviate. "Error rate above 1%" is a symptom alert: something is wrong for users right now. "CPU above 80%" is a cause alert that may or may not affect users.

Keep alert volume low and actionable. If engineers start ignoring alerts because too many are false positives or unactionable, alerts lose their value. Every alert that fires at 3am should lead to one of two outcomes: fix it now (it's impacting users), or tune the alert (it's not worth waking someone up for). Nothing else.
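The burn-rate alerts described under Metrics and SLOs are one way to turn a symptom into a paging decision. The arithmetic can be sketched as follows; the function names are hypothetical, and the 30-day window and 1% observed error rate are assumptions chosen for illustration.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Allowed "bad" minutes in the window for a given SLO target (e.g. 0.999).
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * MINUTES_PER_DAY;
}

// Burn rate = observed error rate divided by the error rate the SLO allows.
function burnRate(observedErrorRate: number, sloTarget: number): number {
  return observedErrorRate / (1 - sloTarget);
}

// Days until the whole budget is spent if the current burn rate holds.
function daysToExhaustion(rate: number, windowDays: number): number {
  return windowDays / rate;
}

const budget = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
const rate = burnRate(0.01, 0.999);           // 1% errors vs 0.1% allowed: about 10x
const days = daysToExhaustion(rate, 30);      // about 3 days

console.log(budget.toFixed(1), rate.toFixed(1), days.toFixed(1));
```

A burn rate of 1x means the budget lasts exactly the SLO window; anything well above that is a sign users are being affected faster than the SLO tolerates, which is what justifies a page.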
