When Monitoring Isn't Enough: Understanding Observability

Monitoring a single server is simple: CPU, memory, disk, network — a few commands, a few charts. When a search request fans out across a thousand machines and passes through dozens of services, latency jitter at any single hop can make things feel slow to the user — and no single engineer can point to where the problem is.

That's the problem observability addresses. It doesn't try to assume in advance what might go wrong. It ensures that after something breaks, you can find the root cause from the data the system itself emits. This isn't a toolset; it's a capability.

1. What Is Observability

The term "observability" comes from control theory, the mathematical core of cybernetics. In 1960, Rudolf E. Kálmán defined it mathematically: a system is observable if its internal state can be inferred from its external outputs. In this setting it is a yes-or-no property of the system's dynamics and its measured outputs: a system either is observable or it isn't, and taking more readings of the same outputs doesn't change the answer.

When software engineering borrowed the term, the meaning shifted. Starting around 2016, Honeycomb's Charity Majors and Christine Yen pushed it into industry awareness. Their logic: distributed system failure modes can't be enumerated exhaustively, so traditional monitoring, which predefines "what counts as abnormal" and watches dashboards, falls short. You need more than dashboards of known metrics; you need the ability to slice data along arbitrary dimensions to find root causes after problems appear.

Thus, the distinction between "monitoring" and "observability":

Monitoring handles known unknowns: you already know to watch for CPU spikes, error rate surges, and response time degradation. You've set thresholds and alert rules, and the system watches them.
Observability handles unknown unknowns: a problem occurs that you hadn't predefined. You need to compose ad-hoc queries over high-cardinality, high-dimensional telemetry data to find the root cause — a root cause whose shape you couldn't have predicted before the problem happened.

The two require different data types and storage models. Monitoring data is low-cardinality time series (CPU, memory, QPS) that can be efficiently aggregated and compressed. Observability data is high-cardinality (user IDs, request IDs, trace IDs, pod names, build versions) — when slicing across arbitrary dimensions, traditional aggregated storage isn't sufficient.
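To make the cardinality gap concrete, here is a minimal arithmetic sketch. It assumes the worst case, in which every label combination actually occurs; the label names and counts are hypothetical, not measurements from any real system.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric: the product of
    the number of distinct values each label can take."""
    return prod(label_cardinalities.values())

# Hypothetical label sets; the numbers are illustrative only.
low = {"service": 20, "status_code": 5, "region": 4}
high = {**low, "user_id": 100_000}

print(series_count(low))   # 400
print(series_count(high))  # 40000000
```

A few hundred series aggregate and compress well; tens of millions do not, which is why high-cardinality fields usually live in event or trace storage rather than in metric labels.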

In August 2024, Charity Majors proposed "Observability 2.0" to distinguish these forms:

Observability 1.0: The three-pillar model. Metrics, logs, and traces each stored and queried separately. Engineers manually stitch clues across three tools.
Observability 2.0: A single data source — arbitrarily wide structured log events, from which metrics and traces are derived as views rather than independently stored data types.

Whether called 1.0 or 2.0, the core question remains: when the system has a problem you didn't know could happen, how fast and how freely can you explore the data to find the answer?

2. A Brief Technical History

1990s: Manual operations. SNMP (Simple Network Management Protocol) was the standard for network device monitoring, polling for CPU, memory, and interface traffic. Nagios launched in 1999, introducing threshold-based alerting — trigger a notification when a metric crosses a line. Logs were local files on servers, viewed with grep and tail -f, with no centralized storage; cross-machine correlation was manual.

2003-2005: SRE, log platforms, and APM. In 2003, Splunk was founded, the first to centralize machine-generated log data into a searchable platform. That same year, Google's internal Production Team under Ben Treynor Sloss began practicing Site Reliability Engineering — applying software engineering methods to operations, including defining SLOs (Service Level Objectives), error budgets, and blameless postmortems. SRE isn't observability technology itself, but it defines the practice framework for "why observe" and "how much observation is enough." In 2016, Google published Site Reliability Engineering, spreading this methodology industry-wide.

Concurrently, APM (Application Performance Monitoring) emerged. New Relic (2008), AppDynamics (2008), and Dynatrace (2005) provided code-level real-time performance monitoring, primarily targeting monolithic or three-tier applications — fixed structures with predictable failure modes.

2010: The Dapper paper. In April 2010, Google published "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." The paper described a distributed tracing system covering nearly all of Google's production services, already running in production for over two years. Dapper's core model — using trace_id to link all services a request passes through, with each service's operation as a span, parent-child relationships forming a tree — became the prototype for all subsequent distributed tracing systems.

Several Dapper design decisions shaped the field: adaptive sampling (high-throughput services sampled as low as 1/1024) kept overhead negligible; instrumentation was confined to common libraries so application developers barely noticed the tracing system; spans supported key-value annotations for application-specific custom data.

2012-2015: Open source explosion. After the Dapper paper, the open source community began building similar systems. Twitter released Zipkin in 2012, the first open source Dapper-style tracing system. Uber began building Jaeger in 2015 and open sourced it in 2017 (donated to CNCF the same year, graduating in 2019). Apache SkyWalking was open sourced by Wu Sheng in 2015, entering the Apache incubator in 2017.

2012 also saw another independent event: engineers at SoundCloud began building Prometheus, which was publicly announced in 2015. Prometheus's design was inspired by Google's internal Borgmon monitoring system, with key features including a multi-dimensional data model (key-value labels), the PromQL query language, and a pull-based collection model. Prometheus later became the de facto monitoring standard for the Kubernetes ecosystem.

On the logging side, Elasticsearch (2010) + Logstash + Kibana formed the ELK Stack, becoming the mainstream open source solution for centralized log management. Grafana, starting in 2014, became the cross-data-source visualization standard — it doesn't store data but can pull from Prometheus, Elasticsearch, InfluxDB, and dozens of other sources.

2017: Three pillars proposed. In 2017, Peter Bourgon published "Metrics, Tracing, and Logging," and Cindy Sridharan further elaborated on the model in Distributed Systems Observability (2018), naming metrics, logs, and traces as the three pillars of observability. The concept was hugely influential: it gave the industry a shared vocabulary, but it also implied the three were independent and parallel, which in practice encouraged data silos.

2019-2023: OpenTelemetry unifies standards. In the late 2010s, two competing open standards existed for distributed tracing: CNCF's OpenTracing (2015) and Google/Microsoft's OpenCensus (2017). They were incompatible.

In 2019, OpenTracing and OpenCensus announced a merger into OpenTelemetry (OTel), as a CNCF sandbox project. OTel's goal: become the unified collection standard for traces, metrics, and logs — one API, one SDK, one Collector, with data exported to any backend.

Key OTel milestones, roughly:

2021: Tracing API/SDK reached stable status
2022: OpenTracing archived and retired; OpenTelemetry became the migration target
2023: Metrics stabilized across multiple language ecosystems; Logs data model and collection pipeline gradually stabilized; OpenCensus archived

Thirteen years after the Dapper paper, distributed tracing went from one company's internal experiment to an industry-wide open standard.

2024-2025: eBPF and continuous profiling. Traditional approaches require applications to integrate SDKs or deploy agents in clusters for telemetry collection. eBPF (extended Berkeley Packet Filter) changed this paradigm: it allows running sandboxed programs safely in the kernel, directly capturing data at the syscall, network, and scheduler layers without modifying application code, with overhead typically below 1%.

This enabled several directions:

Zero-code distributed tracing: Projects like DeepFlow and Pixie (CNCF) automatically capture inter-service call relationships via eBPF, without application-level instrumentation
Continuous profiling: Grafana Pyroscope and Parca leverage eBPF for always-on CPU, memory, and I/O performance analysis, with overhead low enough for production
Kernel-level container visibility: Netflix uses eBPF to monitor scheduler latency for "noisy neighbor" detection; Cilium/Tetragon use eBPF for network and security observability

In 2025, Coralogix released a continuous profiling product based on eBPF + OpenTelemetry standards. The convergence direction of eBPF and OTel: eBPF for zero-code collection, OTel Collector for processing, transformation, and export.

I personally only followed the latter half of this history; most of the early parts came from reading papers and blogs retroactively. Omissions and misunderstandings are likely.

3. Three Core Data Types

The Observability 2.0 wide-events model is theoretically more unified — a single set of high-cardinality structured logs supporting all query patterns. But in current engineering practice, the industry still widely organizes and understands telemetry data by the three categories of metrics / logs / traces, each with its own storage engine, query patterns, and applicable scenarios. From a user's perspective, these three distinctions remain useful.

Metrics are numeric time series. CPU utilization, request counts, error counts, latency percentiles — aggregated numbers with low storage cost, fast queries, suitable for dashboards and alerting. Prometheus represents the multi-dimensional metrics model. The limitation: metrics lose detail — you know p99 latency is high, but not which user or which request triggered it.

Logs are immutable event records. A request arrives, a function is called, an error is thrown: each generates a log line. Logs have the finest granularity but the highest storage and query cost. Structured logging (JSON format, consistent field names) is a baseline requirement for modern practice. Unstructured logs can't be efficiently parsed by machines, which makes ad-hoc querying across services impractical.

Traces are the complete path of a single request. A trace consists of multiple spans, each representing the request's processing segment at a particular service. Parent-child and sibling relationships between spans describe the request's topology through the distributed system. Traces are especially valuable for locating performance bottlenecks and understanding service dependencies — you learn it's not that service B is slow, it's the 3rd database query in service C that B calls that's slow.
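One way to see how a trace localizes a bottleneck is to compute each span's self time, meaning its own duration minus the time spent in its direct children. The sketch below assumes children run sequentially without overlap; the span data is hypothetical and chosen to mirror the example in the paragraph above.

```python
from collections import defaultdict

# (span_id, parent_id, name, duration_ms); hypothetical trace data
spans = [
    ("1", None, "service A: handle request", 480),
    ("2", "1", "service B: get profile", 450),
    ("3", "2", "service C: load data", 430),
    ("4", "3", "service C: db query 1", 10),
    ("5", "3", "service C: db query 2", 12),
    ("6", "3", "service C: db query 3", 390),
]

def self_times(spans):
    """Map span name -> duration minus the summed duration of direct children."""
    child_total = defaultdict(int)
    for _, parent_id, _, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {name: duration - child_total[span_id]
            for span_id, _, name, duration in spans}

times = self_times(spans)
slowest = max(times, key=times.get)
print(slowest, times[slowest])  # service C: db query 3 390
```

Services A, B, and C all look slow by total duration, but their self times are small; nearly all of the latency sits in one leaf span.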

Correlation between the three is what turns data islands into a network:

Inject trace_id and span_id into logs: from a slow span, jump directly to corresponding logs for details
Use exemplars: correlate specific traces from metrics, bridging aggregated metrics with individual requests
OpenTelemetry's unified data model: metrics, logs, and traces share the same resource and attribute semantic conventions
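As an illustration of the first technique, the sketch below stamps every log record with the active trace context using only Python's standard library. The two context variables are hypothetical stand-ins; in a real service, a tracing SDK would supply the ids.

```python
import contextvars
import io
import logging

# Hypothetical context variables; a tracing SDK would normally provide these ids.
current_trace_id = contextvars.ContextVar("trace_id", default="-")
current_span_id = contextvars.ContextVar("span_id", default="-")

class TraceContextFilter(logging.Filter):
    """Copy the active trace context onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Simulate being inside a span, then log as usual.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
logger.info("payment provider timed out")
print(stream.getvalue(), end="")
```

With the ids on every line, a log backend can filter by trace_id and show exactly the lines emitted while a given slow span was active.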

Regarding logs, one direction worth expanding: wide structured log events. The idea is to have applications output arbitrarily wide, high-cardinality structured logs (user_id, request_id, feature_flag, build_version, region, etc.), then use these logs to simultaneously generate metrics (aggregate by a field) and traces (link by request_id). Essentially, one data set supporting three views. Honeycomb was an early advocate; OpenTelemetry's Logs data model also supports this usage.
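A toy sketch of "one data set, several views" follows. The event fields and values are hypothetical, and a production system would run these queries in a columnar event store rather than in application memory.

```python
from collections import defaultdict

# Hypothetical wide events: one record per unit of work, many fields each.
events = [
    {"request_id": "r1", "service": "api", "region": "eu", "build_version": "v41", "duration_ms": 120, "error": False},
    {"request_id": "r1", "service": "db", "region": "eu", "build_version": "v41", "duration_ms": 90, "error": False},
    {"request_id": "r2", "service": "api", "region": "us", "build_version": "v42", "duration_ms": 900, "error": True},
    {"request_id": "r2", "service": "db", "region": "us", "build_version": "v42", "duration_ms": 850, "error": True},
    {"request_id": "r3", "service": "api", "region": "us", "build_version": "v41", "duration_ms": 110, "error": False},
]

def error_rate_by(events, field):
    """Metric view: aggregate the error flag along any field."""
    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        totals[e[field]] += 1
        errors[e[field]] += e["error"]
    return {key: errors[key] / totals[key] for key in totals}

def trace_view(events, request_id):
    """Trace view: the services touched by one request, in emit order."""
    return [e["service"] for e in events if e["request_id"] == request_id]

print(error_rate_by(events, "build_version"))  # {'v41': 0.0, 'v42': 1.0}
print(trace_view(events, "r2"))                # ['api', 'db']
```

The same function works for any field, so slicing by region instead of build_version needs no new instrumentation, only a different query.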

4. How Data Is Collected: From SDK to eBPF

The evolution of the collection layer is the most critical infrastructure shift in the observability tech stack.

Phase 1: SDK + Agent. Applications produce telemetry data via OpenTelemetry SDKs (or language-specific Prometheus clients and logging libraries), sending it to a local agent or directly to a Collector. SDKs offer high precision — you know the exact meaning and business semantics of variables in code — at the cost of manual instrumentation.

Phase 2: OpenTelemetry Collector. The OTel Collector is telemetry's unified gateway. It receives OTLP protocol data, processes it through a pipeline (filtering, redaction, batching, sampling, attribute editing), and exports to different backends — traces to Jaeger, metrics to Prometheus, logs to Loki, all through the same Collector. The Collector can be deployed as sidecar, daemonset, or standalone cluster, depending on traffic volume.
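The receive, process, export flow can be modeled with a toy pipeline. This is a conceptual sketch only: it is not the Collector's configuration format or API, the record shapes and function names are hypothetical, and batching and sampling are left out for brevity.

```python
from collections import defaultdict

def drop_debug(records):
    """Filtering processor: discard debug-level records."""
    return [r for r in records if r.get("level") != "debug"]

def redact_email(records):
    """Redaction processor: mask a sensitive attribute when present."""
    return [{**r, "email": "[redacted]"} if "email" in r else r for r in records]

def run_pipeline(records, processors, exporters):
    """Apply processors in order, then hand each record to the exporter
    registered for its signal type."""
    for process in processors:
        records = process(records)
    for r in records:
        exporters[r["signal"]](r)

# In-memory lists stand in for real backends such as Jaeger, Prometheus, and Loki.
backends = defaultdict(list)
exporters = {
    "trace": backends["traces"].append,
    "metric": backends["metrics"].append,
    "log": backends["logs"].append,
}

incoming = [
    {"signal": "log", "level": "debug", "body": "cache probe"},
    {"signal": "log", "level": "error", "body": "login failed", "email": "a@example.com"},
    {"signal": "metric", "name": "requests_total", "value": 42},
    {"signal": "trace", "name": "GET /login", "duration_ms": 420},
]
run_pipeline(incoming, [drop_debug, redact_email], exporters)
print(dict(backends))
```

The point of the design is that applications emit once, while filtering, redaction, and routing decisions are made centrally and can change without redeploying services.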

Phase 3: eBPF. eBPF programs run in the kernel, hooking into syscalls, network packets, scheduling events, and other kernel paths, directly exposing system and network behavior. For several scenarios, eBPF provides value that traditional approaches can't easily match:

Applications that can't be instrumented (third-party software, legacy systems)
Kernel-level visibility scenarios (inter-container network latency, filesystem I/O hotspots, CPU scheduling jitter)
Security observability (detecting anomalous syscalls, container escapes, reverse shells)
Service mesh and network-layer automatic topology discovery

eBPF's limitation: it operates at the kernel layer and can't see application-level business logic or variables. So eBPF won't replace SDK instrumentation — the two are complementary. SDKs provide business semantics; eBPF provides infrastructure visibility.

Continuous Profiling is another important direction in the collection layer. Traditional profiling uses tools like pprof and JFR against specific processes in development or staging environments, with significant overhead unsuitable for continuous production use. eBPF-based profiling (Grafana Pyroscope, Parca, Coralogix) reduces overhead to under 1%, making it feasible to continuously collect CPU flame graphs, memory allocation hotspots, and I/O wait distributions in production. When a 3 AM alert wakes you up, you can directly look at the past 24 hours of CPU profiles to determine which function is burning CPU, instead of trying to reproduce locally.

5. Who Provides Observability

Observability is an ecosystem, not a product. The following outlines the main options in the current landscape, aiming not to recommend specific solutions but to give readers a structured map.

Open source stack centers on the Prometheus + Grafana combination:

Prometheus: Metric collection and storage. The default monitoring solution in Kubernetes environments
Grafana: Cross-data-source visualization. The same dashboard can query Prometheus (metrics), Loki (logs), Tempo (traces), and Pyroscope (profiles) — Grafana Labs calls this open source stack LGTM (Loki, Grafana, Tempo, Mimir)
Grafana Loki: Log aggregation system inspired by Prometheus's label model. Doesn't index full text; indexes metadata by label and streams log content, with lower storage cost than ELK
Grafana Tempo: Focused on high-cardinality trace storage and search; it doesn't index span-level attributes
Grafana Pyroscope / Parca: Continuous profiling
OpenTelemetry Collector: Unified collection pipeline

This open source stack suits teams that require data sovereignty. All components can be self-hosted; the cost is operations burden.

Commercial observability platforms provide hosting and integration:

Datadog: 500+ integrations covering infrastructure monitoring, APM, logs, RUM, and security monitoring. Significant investment in AI-driven anomaly detection in 2024-2025
Grafana Cloud: Managed LGTM stack, lower onboarding cost for teams already in the Prometheus ecosystem
Honeycomb: Emphasizes high-cardinality event-driven query model, oriented toward developer debugging scenarios. In the ETR 2024 survey, 75% of respondents endorsed its observability innovation
New Relic, Dynatrace, Splunk (Cisco): Each with strong APM and AIOps heritage, holding stable shares in large enterprises

Cloud-native services have the lowest onboarding cost for workloads already on the corresponding cloud:

Google Cloud Operations Suite (formerly Stackdriver) provides:

Cloud Monitoring: Metric dashboards, alerting, SLO monitoring, multi-project workspaces
Cloud Logging: Centralized log aggregation with SQL-style queries (Log Analytics), routeable via Log Router to BigQuery / Pub/Sub / Cloud Storage
Cloud Trace: Distributed tracing with OpenTelemetry support, propagating trace context via X-Cloud-Trace-Context header
Cloud Profiler: Continuous CPU and heap profiling
Error Reporting: Automatic aggregation and grouping of similar exceptions
Cross-cloud ingestion (GCP + AWS + on-prem) via Ops Agent

Cloudflare Observability, built on the CDN and Workers platform, focuses on observability for traffic flowing through the Cloudflare network:

Log Explorer (June 2025 GA): Native log storage and querying, covering HTTP events, security events, Zero Trust datasets (Access, Gateway DNS/HTTP/Network, CASB, Device Posture), 30-day retention
Workers Observability (2025): Logs, metrics, and console for Workers, including a Query Builder with aggregation/filtering/grouping
Observatory (2025 Open Beta): Combining RUM, backend telemetry, error rates, cache hit ratios, synthetic testing (browser + network), with actionable recommendations
Workers Automatic Tracing (November 2025 Open Beta): Zero-code export of OpenTelemetry-compatible traces
Cloudflare Radar: Internet-scale observability — BGP routing, AS traffic, certificate transparency, AI crawler activity
According to Cloudflare's official blog, in June 2024 their internal logging pipeline migrated from syslog-ng to OpenTelemetry Collector; the acquired Baselime was migrated from AWS to their own platform, reducing costs by over 80%

AWS CloudWatch and Azure Monitor provide similar functionality, with differences mainly in integration depth with their respective cloud product lines.

6. Putting It Into Practice

Data coverage sequence. You don't need to cover everything at once. For monolithic applications or teams with fewer than five services, getting structured logging right first, plus core metrics like latency, error rate, and traffic, covers most scenarios. For microservice architectures with more than ten services, distributed tracing's priority rises significantly: without traces, following a cross-service latency spike back to its root cause is very difficult.

Structured logging is the foundation. If logs are unstructured, or JSON but with inconsistent field names across services — the same concept with different keys — any downstream query and analysis becomes painful. Invest time in defining a shared logging field convention across services, including at minimum: timestamp, service_name, trace_id, span_id, level, message.
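A minimal sketch of such a convention, using only Python's standard logging module, is shown below. The JsonFormatter class and the REQUIRED_FIELDS tuple are illustrative names, and the trace and span ids are assumed to have been attached to the record elsewhere (for example by a logging filter).

```python
import json
import logging
from datetime import datetime, timezone

# The shared convention: every service emits at least these keys.
REQUIRED_FIELDS = ("timestamp", "service_name", "trace_id", "span_id", "level", "message")

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the agreed field names."""
    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service_name": self.service_name,
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)

# Build a record by hand so the example runs without any logger setup.
record = logging.LogRecord(
    name="orders", level=logging.ERROR, pathname="example.py", lineno=0,
    msg="inventory lookup failed for sku %s", args=("A-17",), exc_info=None)
record.trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
record.span_id = "00f067aa0ba902b7"

line = JsonFormatter("orders").format(record)
print(line)
```

Putting the formatter in a shared library is one way to keep the key names identical across services instead of relying on each team to remember them.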

SLO-driven alerting. Alerts shouldn't be based on arbitrarily picked thresholds ("CPU > 90%"); they should be defined from the user's perspective: how fast the service should be (latency SLO), how available (availability SLO), and how often requests should succeed (success rate SLO). Configure alerts based on SLOs and error budgets, triggering when error budget consumption outpaces expectations rather than on every metric wobble, to significantly reduce late-night alert noise.
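The idea of alerting on budget consumption can be sketched as a burn-rate calculation. The function names and request counts are hypothetical; the 14.4 threshold is one commonly cited fast-burn value (2% of a 30-day budget spent in one hour, since 0.02 x 720 hours = 14.4) and should be treated here as an assumption rather than a recommendation.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target        # allowed error ratio, e.g. 0.001 for 99.9%
    observed = errors / requests     # error ratio in the measurement window
    return observed / budget

def should_page(errors: int, requests: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning much faster than planned."""
    return burn_rate(errors, requests, slo_target) >= threshold

# Hypothetical one-hour windows against a 99.9% success-rate SLO.
print(round(burn_rate(5, 10_000, 0.999), 2))    # 0.5
print(round(burn_rate(200, 10_000, 0.999), 2))  # 20.0
print(should_page(5, 10_000, 0.999))            # False
print(should_page(200, 10_000, 0.999))          # True
```

A window with five failures in ten thousand requests spends the budget at half the sustainable pace and stays quiet; two hundred failures burns it twenty times too fast and pages.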

Cost control. Observability data volume typically grows faster than the business. Watch for:

More logs aren't always better. Debug-level logs are usually only valuable in development; in production, keep them only where they are actually needed
Metric label cardinality — each unique label combination in Prometheus creates a new time series; using user_id or request_id as a label is a common mistake
Trace sampling strategy: 100% sampling is usually unnecessary. Tail-based sampling (deciding which complete traces to keep after they finish, based on latency, errors, and other attributes) suits most scenarios better than head-based sampling (a random decision made when the request starts)
Hot-cold tiering — infrequently accessed historical data in cold storage, frequently queried recent data in hot storage
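The tail-based sampling decision from the list above can be sketched as a small policy function. The thresholds, field names, and the keep_trace function itself are assumptions for illustration; real deployments usually express such policies in their collector or backend configuration.

```python
import random

def keep_trace(spans, latency_threshold_ms=500, baseline_rate=0.01,
               rng=random.random):
    """Tail-based decision, made after all spans of a trace have arrived:
    keep every trace with an error or a slow span, plus a small random
    share of healthy traces as a baseline."""
    if any(s.get("error") for s in spans):
        return True
    if max(s["duration_ms"] for s in spans) >= latency_threshold_ms:
        return True
    return rng() < baseline_rate

# Hypothetical completed traces.
healthy = [{"duration_ms": 40}, {"duration_ms": 25}]
slow = [{"duration_ms": 40}, {"duration_ms": 1200}]
failed = [{"duration_ms": 30, "error": True}]

print(keep_trace(slow))    # True
print(keep_trace(failed))  # True
```

A head-based sampler would have to decide before knowing the outcome, so at a 1% rate it would discard most of the slow and failed traces that are the interesting ones.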

Long-term value of unified data models. When every team adopts OpenTelemetry's semantic conventions, an attribute such as http.request.method (named http.method in older versions of the conventions) means the same thing everywhere, regardless of whether the backend is Prometheus, Datadog, or a cloud provider's monitoring. This standardization reduces cross-team interpretation costs and lowers migration costs when switching backends.

Closing

Observability ultimately solves not a technical problem but a cognitive one.

The trend of software systems becoming increasingly complex is irreversible. From monolith to SOA to microservices to serverless, from single data center to multi-region to edge computing — each step increases the system's combinatorial state space, and with it, the number of failure modes. At some point, no one can enumerate all possible "what could go wrong" scenarios — not from lack of diligence, but because the combinatorial space is too large.

Observability is the response to this. It shifts thinking from "predefine failures" to "collect data upfront, explore afterward." Its concept originated in 1960s cybernetics; its technology was ignited by the 2010 Dapper paper, implemented and spread by open source communities, standardized by OpenTelemetry, and advanced to the kernel level by eBPF and continuous profiling. But its purpose remains singular: when the system has a problem you didn't know could happen, how fast can you find the answer?

References

Rudolf E. Kálmán, "On the General Theory of Control Systems", 1960.
Google, "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", 2010.
Google, Site Reliability Engineering.
Peter Bourgon, Metrics, Tracing, and Logging.
Cindy Sridharan, Distributed Systems Observability.
OpenTelemetry, Versioning and stability for OpenTelemetry clients.
Cloudflare, Log Explorer is GA.
Cloudflare, Workers automatic tracing, now in open beta.
