Monitoring tells you something is wrong. Logging tells you why it's wrong.
Monitoring is about numbers. You collect metrics like CPU usage, error rates, and response times, and you watch them continuously. When something crosses a threshold, you get an alert. Logging is about events. A failed request, an unauthorized login attempt, a timed-out database query. Each one gets recorded with a timestamp and enough context to be useful later. Think of metrics as vital signs and logs as the detailed medical history.
Most teams learn this distinction the hard way. A PagerDuty alert wakes you up because error rates jumped to 5%. Monitoring did its job. But now you need to figure out which errors, in which service, and what changed. That's when you open the logs. You really can't run a production system well without both.
What Is Monitoring?
Monitoring means continuously observing your systems through metrics, dashboards, and alerts. At its core, it answers three questions:
- Is the system up and responding?
- Is performance within acceptable limits?
- Are any trends heading somewhere bad?
You're not looking at individual requests here. You're looking at rates, averages, percentiles, and counts. Aggregated views that tell you whether things are generally healthy or starting to degrade.
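Those aggregates are straightforward to compute. Here's a minimal sketch in Python (the `percentile` helper and the sample data are illustrative, not from any particular monitoring library):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of numeric samples (p in 0..100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Index p% of the way through the sorted list.
    idx = min(len(ordered) - 1, max(0, round(p / 100 * (len(ordered) - 1))))
    return ordered[idx]

# One window of request latencies, in milliseconds.
latencies_ms = [12, 15, 14, 13, 250, 16, 14, 15, 13, 900]

rate = len(latencies_ms)                    # requests in this window
avg = sum(latencies_ms) / len(latencies_ms) # mean hides the outliers
p99 = percentile(latencies_ms, 99)          # tail latency exposes them: 900
```

Note how the average (126 ms) and the p99 (900 ms) tell very different stories about the same window, which is why percentiles show up on dashboards so often.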
Types of Monitoring
Infrastructure monitoring covers the basics: servers, VMs, containers, storage, and networks. You're tracking CPU utilization, memory usage, disk I/O, and network throughput. If the infrastructure is struggling, everything running on top of it will suffer too.
Application performance monitoring (APM) goes deeper into how your applications behave from the inside. Response times, transaction volumes, error rates, throughput. APM tools can spot things that infrastructure metrics miss entirely, like a slow database query or a memory leak that's growing over days.
Network monitoring focuses on traffic patterns and connectivity between services. Latency, packet loss, bandwidth utilization, connection failures. In distributed systems where services talk to each other over the network, these issues can quietly degrade the user experience without any single service looking unhealthy.
There's also real user monitoring (RUM), which measures what actual users experience in real browsers and devices, and synthetic monitoring, which runs automated scripts to test your application from multiple locations even when no real users are active.
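A synthetic check is conceptually tiny: fetch an endpoint on a schedule, time it, and classify the result. A hedged stdlib-only sketch (the function names and thresholds are made up for illustration):

```python
import time
import urllib.request

def classify_probe(status, elapsed_s, slow_after_s=1.0):
    """Turn one probe result into a simple health verdict."""
    if status is None or status >= 500:
        return "down"
    if status >= 400 or elapsed_s > slow_after_s:
        return "degraded"
    return "healthy"

def probe(url, timeout=5.0):
    """Fetch a URL once and classify it, as a synthetic monitor would."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_probe(resp.status, time.monotonic() - start)
    except Exception:
        return "down"  # DNS failure, refused connection, timeout, etc.
```

Real synthetic monitoring products run checks like this from many geographic locations and alert on consecutive failures rather than a single bad probe.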
How Monitoring Systems Work
The mechanics are pretty consistent regardless of which tool you pick:
- Instrumentation. Your application exposes metrics through an endpoint (like /metrics for Prometheus), a client library, or an agent running on the host.
- Collection. The monitoring system scrapes or receives those numbers on a schedule. Every 15 seconds, every minute, whatever you configure.
- Storage. Metrics land in a time-series database optimized for high write throughput and time-range queries.
- Visualization. Dashboards turn the numbers into line charts, heatmaps, and gauges so a human can spot patterns at a glance.
- Alerting. When a metric crosses a threshold (error rate above 1%, disk above 90%, latency above 500ms), the system fires a notification to Slack, PagerDuty, or wherever your team pays attention.
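The alerting step usually requires the metric to stay over the threshold for several consecutive samples before firing, so one noisy scrape doesn't page anyone. A minimal sketch of that idea (modeled loosely on the `for:` clause in Prometheus alert rules; the function and data here are illustrative):

```python
def should_alert(samples, threshold, for_points):
    """Fire only if the metric exceeded the threshold for the last
    `for_points` consecutive samples, not just momentarily."""
    if len(samples) < for_points:
        return False
    return all(v > threshold for v in samples[-for_points:])

# Error rate in percent, one sample per scrape interval.
error_rate = [0.2, 0.4, 1.5, 1.8, 2.1]

should_alert(error_rate, threshold=1.0, for_points=3)  # → True
should_alert(error_rate, threshold=1.0, for_points=5)  # → False (first two samples are fine)
```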
What to Monitor
Two popular frameworks help you decide what to track. The RED method covers request-facing services: Rate (requests per second), Errors (failed requests per second), and Duration (response time distribution). The USE method covers infrastructure resources: Utilization, Saturation, and Errors. Either framework works. The point is to measure both what users experience and what's happening under the hood.
Don't overlook business metrics either: signups per minute, completed checkouts, active sessions. Your servers can look perfectly healthy while your checkout flow is silently broken.
What Is Logging?
Logging captures discrete, timestamped records of events that happen within your applications and infrastructure. A single log entry might look like: At 14:23:01, user 12345 tried to connect to db-primary on port 5432, and the connection was refused.
If monitoring is the cockpit dashboard, logging is the black box flight recorder. Logs can be verbose and expensive to store. But when something goes wrong, they're often the only way to piece together what actually happened.
What Gets Logged
Application logs are what developers reach for first during an incident. Errors, warnings, stack traces, and business events like "payment failed" or "order shipped."
Access logs record every HTTP request hitting your web server: method, URL path, status code, response time, client IP. Nginx and Apache generate these out of the box.
System logs capture OS-level events like service restarts, SSH logins, kernel panics, and cron job output. On Linux, these typically flow through syslog or journald.
Audit logs track who did what, when, and from where. These matter a lot in regulated industries. If you're in healthcare (HIPAA), finance (PCI DSS), or government (FedRAMP), you need audit logs to demonstrate compliance.
Security logs cover failed authentication attempts, firewall denials, and permission errors. These are what feed into SIEM systems for threat detection and incident response.
Structured vs. Unstructured Logs
Old-school logs are just lines of text:
```
[2026-03-18 14:23:01] ERROR: Connection refused to db-primary:5432
```
That's easy to read in a terminal. It's painful to search across a million of those for a specific trace ID.
Structured logging fixes this by outputting events as JSON with consistent fields:
```json
{
  "timestamp": "2026-03-18T14:23:01Z",
  "level": "error",
  "message": "Connection refused",
  "host": "db-primary",
  "port": 5432,
  "service": "api-gateway",
  "trace_id": "abc123"
}
```
Every field becomes searchable, filterable, and aggregatable. If you're starting a new project, do structured logging from the beginning. Seriously. Retrofitting it onto an existing codebase sounds easy but takes forever.
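With Python's stdlib `logging`, structured output is just a custom formatter. A minimal sketch (the field list and logger name are illustrative; real projects usually reach for a library like `structlog` or `python-json-logger` instead):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("host", "port", "service", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("api-gateway")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Connection refused",
             extra={"host": "db-primary", "port": 5432,
                    "service": "api-gateway", "trace_id": "abc123"})
```

Every `logger.error(...)` call now emits a line shaped like the JSON example above, ready for ingestion by whatever log backend you run.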
Log Levels
Almost every logging framework uses the same severity hierarchy:
- DEBUG is detailed diagnostic output for development. You almost never want this on in production.
- INFO covers routine events: "Server started on port 8080," "Batch job completed."
- WARN means something unexpected happened, but it's not a failure yet. "Disk at 87%," "Retried request to payments API."
- ERROR means something broke. "Database connection refused," "Stripe returned a 500."
- FATAL / CRITICAL means the process is crashing or already has. "Out of memory," "Corrupted write-ahead log."
Getting the right level for production is a balancing act. Log too much and you'll drown in noise (and storage costs). Log too little and you're debugging blind when the 2 AM alert fires.
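The level threshold is what does the filtering: anything below the configured level is dropped before it ever reaches a handler. A quick stdlib demonstration (the messages are the examples from the list above):

```python
import io
import logging

buf = io.StringIO()
# force=True resets any handlers configured earlier in the process.
logging.basicConfig(stream=buf, level=logging.WARNING,
                    format="%(levelname)s %(message)s", force=True)

logging.debug("cache hit for key user:12345")   # dropped: below WARNING
logging.info("Server started on port 8080")     # dropped: below WARNING
logging.warning("Disk at 87%")                  # recorded
logging.error("Database connection refused")    # recorded

print(buf.getvalue())
# WARNING Disk at 87%
# ERROR Database connection refused
```

Flipping `level=logging.WARNING` to `logging.DEBUG` at runtime (many frameworks expose this as a config flag) is the usual escape hatch when you need the verbose output during an incident.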
Monitoring vs. Logging: Key Differences
| | Monitoring | Logging |
|---|---|---|
| Data type | Numeric metrics (counters, gauges, histograms) | Timestamped event records (structured or unstructured) |
| Primary purpose | Detect problems, surface trends, trigger alerts | Diagnose root causes, maintain audit trails |
| Granularity | Aggregated (e.g., p99 latency over 5 minutes) | Individual events (e.g., this specific failed request) |
| Timeframe | Real-time and recent trends | Historical records, sometimes spanning years |
| Storage cost | Relatively compact since numbers compress well | Can grow enormous at high request volumes |
| Typical query | "What's the error rate over the last hour?" | "Show all errors from service X with trace ID abc123" |
| Answers | "Is something wrong?" | "Why did it go wrong?" |
The short version: monitoring spots the problem, logging explains it. You need the alert to know something broke, and you need the logs to figure out the fix.
The Three Pillars of Observability
Once you're running distributed systems, monitoring and logging alone aren't enough. The industry has converged on three complementary types of telemetry data, often called the three pillars of observability.
Metrics are numerical measurements sampled at regular intervals. They're compact, fast to query, and great for dashboards and alerting. They tell you how the system is performing overall.
Logs are discrete event records with full contextual detail. They're verbose, but they're what you need for root cause analysis. They tell you what exactly happened.
Traces are the piece that ties it all together in a microservices world. A trace follows a single request as it flows through multiple services, showing which services were called, in what order, and how long each step took. When a request is slow and it touched six services along the way, tracing tells you which one is the bottleneck.
The workflow in practice: metrics alert you to a problem, traces show you which service in the chain is responsible, and logs give you the detail to actually fix it.
Monitoring and Logging for Security
You can't defend systems you can't observe. Monitoring and logging are foundational to security operations.
SIEM and Threat Detection
A Security Information and Event Management (SIEM) system aggregates logs from across your infrastructure (firewalls, authentication systems, applications, cloud services) and correlates events to detect threats. Modern SIEMs also use machine learning to establish behavioral baselines and flag anomalies that rule-based systems would miss.
Platforms like Splunk (now part of Cisco), CrowdStrike Falcon, Microsoft Sentinel, and IBM QRadar sit at the center of Security Operations Centers (SOCs). They provide centralized visibility, automated alerting, and incident response workflows.
Compliance and Audit Trails
Regulatory frameworks like PCI DSS, HIPAA, ISO 27001, GDPR, and NIS2 all require organizations to maintain detailed audit logs and demonstrate access controls. Logging provides the traceability you need to prove compliance: who accessed what data, when, and from where.
For compliance-grade logging, you'll need tamper-resistant storage, defined retention policies (often years, not days), encryption of sensitive data within logs, and role-based access controls over who can view log data.
AWS Security Monitoring and Logging
If you're on AWS, the platform provides purpose-built services for security observability:
- AWS CloudTrail records every API call in your account: who made it, from which IP, when, and what resources were affected. It's the foundation of AWS security auditing.
- Amazon CloudWatch provides real-time monitoring for AWS resources and applications, covering metrics, logs, alarms, and automated responses. Recent updates have unified operational and security log management, with native support for third-party sources like CrowdStrike, Okta, and Microsoft Office 365.
- Amazon GuardDuty does continuous threat detection using threat intelligence and anomaly detection, and integrates with CloudWatch for notifications and automated responses.
Popular Monitoring Tools
Prometheus + Grafana
If you've worked with Kubernetes, you've almost certainly seen this pairing. It's the dominant open-source monitoring stack.
Prometheus scrapes metrics from your services via HTTP, stores them in its own time-series database, and lets you query them with PromQL. The pull-based model (Prometheus fetches metrics from your apps, not the other way around) makes service discovery straightforward in dynamic environments. Prometheus 3.0 shipped in November 2024, the first major version in seven years, and brought a new UI, native OpenTelemetry (OTLP) metrics ingestion, native histograms, and UTF-8 support for metric and label names. As of early 2026, the current release is 3.10, with new minor versions shipping every six weeks.
Grafana handles the visualization side, connecting to Prometheus and dozens of other data sources. It's basically the industry standard for infrastructure dashboards at this point. Current version is v12.4.
One thing to know: Prometheus stores data locally on a single node by default, so there's no built-in high availability or long-term retention. If you need that, you'll want Thanos, Cortex, or Grafana Mimir on top. Both Prometheus (Apache 2.0) and Grafana (AGPLv3) are open-source CNCF projects.
Datadog
A fully managed SaaS platform that bundles infrastructure monitoring, APM, logs, tracing, synthetics, and security monitoring under one roof. Their agent auto-discovers services, containers, and cloud resources, and they offer 750+ native integrations.
Datadog launched Bits AI (an AI-powered SRE agent) as generally available in December 2025, and was named a Leader in the 2025 Gartner Magic Quadrant for Digital Experience Monitoring. Revenue hit $3.43 billion for full-year 2025.
The downside is the bill. Datadog charges per host for infrastructure, per GB for logs, and per analyzed span for APM. Teams with moderate workloads find it reasonable. Teams at scale often find themselves in uncomfortable conversations with finance. It's one of those tools where costs creep up gradually until someone notices.
New Relic
New Relic covers similar ground: APM, infrastructure, logs, synthetics, error tracking, all on a single platform. The pricing model is different though. New Relic charges primarily per GB of data ingested (starting at $0.40/GB beyond the free allowance) and per full-platform user seat, rather than per host. For some configurations, this works out noticeably cheaper than Datadog.
The free tier is genuinely useful for small projects: 100 GB/month of data ingest and one full-platform user, with no expiration and no credit card required.
Zabbix
Zabbix has been around since 2001, and it shows in both good and bad ways. It does network monitoring, server monitoring, cloud monitoring, SNMP, IPMI, and JMX. Both agent-based and agentless. The current version is 7.4.8 (March 2026), licensed under AGPLv3 (the license changed from GPLv2 to AGPLv3 starting with Zabbix 7.0 in June 2024).
It's a solid choice for enterprises with traditional infrastructure: physical servers, VMware, network switches, storage appliances. Zabbix handles large-scale deployments reliably and has a massive install base, particularly in Europe and Asia. But the UI and configuration experience feel dated compared to Grafana, and if your stack is mostly containers and cloud services, Prometheus is a better starting point.
Nagios
One of the originals, dating back to 1999. Nagios monitors hosts and services using a plugin architecture, and there are thousands of community plugins for monitoring everything from HTTP endpoints to RAID arrays. Nagios Core 4.5.11 was released in January 2026, and the commercial Nagios XI continues active development.
Most teams starting fresh today wouldn't pick Nagios. The configuration is file-based and verbose, the web UI is minimal, and it doesn't natively understand containers, auto-scaling, or service discovery. Prometheus has largely replaced it in the open-source ecosystem. But existing Nagios deployments are common, and the plugin ecosystem covers niche use cases that newer tools sometimes don't.
Popular Logging Tools
ELK Stack (Elasticsearch, Logstash, Kibana)
The most widely deployed open-source logging stack. Logstash (or the lighter-weight Filebeat) collects and ships logs. Elasticsearch indexes and stores them with full-text search. Kibana provides a web UI for searching, filtering, and building visualizations. If you've ever searched production logs through a web interface, there's a decent chance it was Kibana.
Elasticsearch 9 is the current major version (9.3.1, February 2026), built on Lucene 10. The 8.x line is still supported too. The query capabilities are hard to beat: aggregations, fuzzy matching, geospatial queries, complex filters.
A quick note on licensing, since it matters: Elasticsearch was originally Apache 2.0. In January 2021, Elastic changed the license to dual SSPL/Elastic License, which prompted the OpenSearch fork (more on that below). In August 2024, Elastic added AGPLv3 as a third option.
The catch with Elasticsearch is operational complexity. Running it in production means dealing with cluster sizing, shard management, index lifecycle policies, JVM tuning, and rolling upgrades. Teams that underestimate this tend to end up with slow queries and cluster stability problems at the worst possible times.
OpenSearch
OpenSearch is the fork that AWS created in 2021 after Elastic's license change. It lives under the Apache 2.0 license, which means there are no restrictions on offering it as a managed service or building commercial products on top of it. OpenSearch Dashboards replaces Kibana as the visualization layer.
The current release is 3.5.0 (February 2026), also built on Lucene 10. Out of the box, it includes vector/hybrid search, AI/ML integration, anomaly detection, and security analytics. AWS offers Amazon OpenSearch Service as a managed option, supporting clusters up to 10 PB of hot data, plus a serverless option that auto-scales.
How does it compare to Elasticsearch? Functionally they're similar, since they share the same roots. On raw performance, Elastic's own benchmarks claim Elasticsearch is 40-140% faster in complex query scenarios, though independent benchmarks have shown mixed results depending on the workload. If your main concern is licensing freedom or tight AWS integration, OpenSearch is the way to go. If you need maximum query performance or Elastic's commercial APM and ML features, stick with the original.
Grafana Loki
Loki takes a fundamentally different approach from Elasticsearch and OpenSearch. Instead of indexing the content of every log line, it only indexes the labels (metadata) attached to log streams. The log text itself is stored compressed but not indexed.
This makes Loki dramatically cheaper to run than Elasticsearch, often by an order of magnitude. The trade-off is less powerful querying. You can filter by labels instantly, but searching within log content requires scanning, which is slower. The current version is 3.6.7 (February 2026), licensed under AGPLv3. One thing worth noting: Promtail, the original log collector for Loki, has reached end of life and been replaced by Grafana Alloy.
Loki works best if you're already running Prometheus and Grafana. The Grafana integration is the real selling point: you can click on a spike in a metric graph and jump straight to the corresponding logs from the same time window.
ClickHouse (ClickStack)
ClickHouse is a column-oriented OLAP database that was originally created at Yandex in 2009 and open-sourced in 2016 under the Apache 2.0 license. It's built for analytical queries over massive datasets, with aggressive compression and massively parallel query execution.
In May 2025, ClickHouse, Inc. launched ClickStack, an open-source observability stack that unifies logs, metrics, traces, and session replays on top of ClickHouse. It includes the HyperDX UI for dashboards and trace exploration, a custom OpenTelemetry Collector optimized for ClickHouse ingestion, and ClickHouse as the data store. It positions itself as an open-source alternative to Datadog, and natively supports OTLP ingestion. The current ClickHouse version is v26.2 (March 2026).
The cost story is compelling. According to ClickHouse's benchmarks, the columnar architecture compresses observability data significantly better than Lucene-based engines like Elasticsearch and OpenSearch (they claim 2-5x, and some independent tests have shown even higher ratios). Storage can sit on cheap object storage like S3 rather than expensive block storage.
The downsides: individual row lookups ("find this one specific log entry") are slower than in Elasticsearch, because columnar storage has to reconstruct full rows from separate columns. Full-text search isn't as strong as inverted-index engines. And ClickStack is still fairly new compared to ELK or Splunk, so the ecosystem and community are smaller.
Splunk
The enterprise incumbent. Splunk has been doing log management and machine data analytics since 2003, and it still dominates in large enterprises, especially in security operations. The query language (SPL) is powerful, and the SIEM capabilities are deeply integrated. Cisco completed its acquisition of Splunk in March 2024 for approximately $28 billion, making it the largest acquisition in Cisco's history. Integration with Cisco's security and observability portfolio is ongoing.
Splunk is a great fit for large organizations with security, compliance, and audit requirements. The SOC integrations, compliance reporting, and mature alerting workflows are hard to replicate with open-source alternatives.
The trade-off, as always with Splunk, is cost. Licensing is based on daily data ingestion volume, and enterprise-scale log volumes produce enterprise-scale bills. Six- and seven-figure annual contracts are common, and it's equally common to see those organizations running projects to reduce how much data they send to Splunk.
Fluentd and Fluent Bit
These are log collectors, not storage systems. Their job is to gather logs from many sources (files, containers, system journals, network streams) and route them to whatever backend you use: Elasticsearch, OpenSearch, Loki, ClickHouse, S3, Splunk, Datadog, or multiple destinations at once.
Fluentd is the original, written in Ruby with a huge plugin ecosystem. Current version is v1.19.2. Fluent Bit is its lighter sibling, written in C, built for environments where resource usage matters, like sidecar containers in Kubernetes. Fluent Bit has become the default log forwarder in many Kubernetes distributions. Both are CNCF projects.
Graylog
A dedicated log management platform for collecting, indexing, and analyzing log data. Graylog 7.0 shipped in November 2025. The open edition is licensed under SSPL and includes ingestion, search, dashboards, and alerts. Enterprise and Security editions add compliance packs, cost controls, and SLA features.
Graylog is worth a look if you want something more focused than raw ELK but with similar underlying capabilities and a friendlier setup experience.
OpenTelemetry
OpenTelemetry (OTel) is a vendor-neutral, open-source framework for generating, collecting, and exporting telemetry data: metrics, logs, and traces. It gives you standardized APIs, SDKs for every major language, and a Collector component that works with any backend.
The project reached stable 1.0 specifications across all three signal types in 2024 (though individual language SDK implementations vary in maturity). It's the second most active CNCF project by contributor count, behind only Kubernetes.
Here's why OTel matters: it decouples your instrumentation from your backend choice. You instrument your code once with OTel, and then you can send data to Prometheus, Datadog, Jaeger, Splunk, ClickHouse, or any combination. If you want to switch backends later, you don't have to re-instrument everything. Prometheus 3.0's native OTLP ingestion and ClickStack's native OTel support are good examples of the ecosystem converging around this standard.
If you're starting a new project in 2026, OpenTelemetry is the safe default for instrumentation.
How to Choose the Right Stack
There's no universal "best" combination. It depends on team size, budget, existing infrastructure, and what kind of systems you're running. But some patterns show up again and again:
Small team, limited budget. Prometheus + Grafana for monitoring, Loki for logs, Fluent Bit to collect them. Everything is open-source and integrates natively. You can run the whole stack on a single decent server to start.
Cost-sensitive at scale. ClickStack for unified observability (logs, metrics, traces) on ClickHouse. The columnar compression and cheap object storage keep costs manageable as data volume grows. The trade-off is a newer ecosystem with a smaller community.
Mid-size team that doesn't want to operate infrastructure. Datadog or New Relic. You pay more per GB, but you're not spending engineering time maintaining your own observability stack. That trade-off makes sense when every hour of engineering time counts.
Enterprise with compliance and security needs. Splunk for security logs and SIEM, plus Prometheus/Grafana or Datadog for application and infrastructure monitoring. It's not unusual for large companies to run three or four observability tools serving different teams and use cases.
Kubernetes-native stack. Prometheus for metrics, Loki for logs, Grafana Tempo (or Jaeger) for traces, all visualized through Grafana. This combination, sometimes called the "PLG stack," is the most common cloud-native observability setup.
AWS-centric. CloudWatch for metrics and logs, CloudTrail for API audit trails, GuardDuty for threat detection. If your workloads are on AWS and you want to avoid managing observability infrastructure entirely, the native services integrate tightly with the rest of the platform.
Best Practices
Build observability in from the start. Adding instrumentation after your first major outage is a miserable experience. Bake metrics and structured logging into your code as you write it, not as a panicked retrofit weeks later.
Alert on symptoms, not causes. Set alerts on things users actually experience: high error rates, slow response times, failed transactions. An alert for "CPU above 80%" isn't useful if the service is handling traffic fine. High CPU with zero errors is just a busy server.
Use structured logging from day one. You'll be glad you did the moment you need to trace a single request through five services. Plain-text grep works until it doesn't.
Correlate across all three signals. Use consistent service names, environment labels, and trace IDs across metrics, logs, and traces. The real payoff comes when you can go from a monitoring alert to the relevant logs to the distributed trace in a few clicks.
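One common way to get the trace ID into every log line without threading it through each function call is a context variable plus a logging filter. A hedged stdlib sketch (in a real service the ID would come from an incoming trace header, e.g. W3C `traceparent`, rather than `uuid`):

```python
import contextvars
import logging
import uuid

# Carries the trace ID through a request's entire call stack.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the current trace ID so logs, metric
    labels, and trace spans share one correlation key."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    # Normally parsed from the incoming request's trace header.
    trace_id_var.set(uuid.uuid4().hex[:8])
    logger.info("charge submitted")
    logger.info("charge confirmed")

handle_request()
```

Both log lines from one request now carry the same `trace=` token, which is exactly what lets you pivot from an alert to the logs to the trace.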
Set retention policies before you have a storage crisis. Logs accumulate faster than you expect. Most operational logs are useful for 7 to 30 days. Audit logs might need to stick around for years. Decide early and set up automatic rotation, archival, or deletion.
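For file-based logs, enforcement can be as simple as a scheduled job that removes anything past the window. An illustrative sketch (in production you'd typically archive to object storage before deleting, and tools like `logrotate` handle this for you):

```python
import time
from pathlib import Path

def enforce_retention(log_dir, max_age_days):
    """Delete *.log files older than the retention window.
    Returns the names of the files removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(log_dir).glob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Managed log platforms and index lifecycle policies (e.g. in Elasticsearch or CloudWatch Logs) implement the same idea declaratively: you set the window once and the system ages data out.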
Protect your logs. Encrypt sensitive data, mask personally identifiable information, and restrict access with role-based controls. Logs often contain credentials, API keys, or user data that shouldn't be broadly accessible.
Kill noisy alerts and unused dashboards. If an alert fires daily and nobody investigates, either fix the underlying issue or remove the alert. If a dashboard hasn't been opened in months, delete it. The team that ignores 50 alerts a day will also ignore the one that matters.
Frequently Asked Questions
What is the difference between monitoring and logging?
Monitoring collects numeric metrics (CPU usage, error rates, request latency) at regular intervals and uses them for dashboards and alerting. Logging records individual events (a specific failed request, a user login attempt) as timestamped entries. Monitoring tells you something is wrong. Logging helps you figure out why.
Can I use logging instead of monitoring?
You can derive metrics from logs. Tools like Loki's LogQL and Elasticsearch aggregations support this. But it's slower, more expensive, and more fragile than purpose-built metrics systems. Time-series databases are optimized for fast numerical queries over time ranges. Log stores are optimized for text search. Use each for what it's good at.
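Deriving a metric from logs looks roughly like this: parse each event, count what matters, divide. A toy sketch over structured (JSON-lines) logs, with made-up sample data:

```python
import json

log_lines = [
    '{"level": "info",  "message": "ok"}',
    '{"level": "error", "message": "Connection refused"}',
    '{"level": "info",  "message": "ok"}',
    '{"level": "error", "message": "timeout"}',
]

def error_rate(lines):
    """Derive one metric (fraction of error events) from raw log lines."""
    events = [json.loads(ln) for ln in lines]
    errors = sum(1 for e in events if e["level"] == "error")
    return errors / len(events) if events else 0.0

error_rate(log_lines)  # → 0.5
```

This works, but every evaluation re-reads and re-parses the raw events, which is why a time-series database that stores pre-aggregated counters answers the same question orders of magnitude faster.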
What are the three pillars of observability?
Metrics (monitoring), logs (logging), and traces (distributed tracing). Metrics show aggregate system performance. Logs provide event-level detail. Traces follow individual requests across service boundaries. Together they give you a complete picture of system behavior.
Do I need distributed tracing?
If your application is a monolith, probably not. Your stack traces and logs already have the full picture. Once a single request crosses multiple service boundaries over the network, tracing becomes important for understanding where time is spent and where failures originate.
What is OpenTelemetry and should I use it?
OpenTelemetry is a vendor-neutral open-source framework for generating and collecting metrics, logs, and traces. It provides standardized APIs and SDKs that work with any backend. The specification reached stable 1.0 across all signal types in 2024, and it's become the industry standard for instrumentation. If you're starting a new project, yes, use it.
How much does an observability stack cost?
It varies enormously. A self-hosted open-source stack (Prometheus, Grafana, Loki) can run on a few hundred dollars a month in cloud compute. ClickStack on ClickHouse Cloud offers competitive pricing through columnar compression. Managed platforms like Datadog or New Relic typically range from a few hundred to tens of thousands per month depending on the number of hosts and data volume. Splunk enterprise deployments commonly hit six figures annually. The biggest cost driver in every case is log volume.
What is SIEM and how does it relate to logging?
SIEM (Security Information and Event Management) aggregates logs from across your infrastructure and correlates events to detect security threats. It builds on logging by adding threat intelligence, behavioral analysis, and automated alerting. Common SIEM platforms include Splunk, CrowdStrike Falcon, Microsoft Sentinel, and IBM QRadar.
What's the difference between APM and monitoring?
Application Performance Monitoring (APM) is a specialized form of monitoring focused on application-level behavior: request tracing, slow query detection, error tracking, and service dependency mapping. General infrastructure monitoring covers lower-level resources like CPU, memory, disk, and network. Most modern observability platforms combine both.