Log management is the practice of collecting, processing, storing, and analyzing log data generated by applications, infrastructure, and services. Done well, it gives engineering teams the ability to debug production issues in minutes, detect security incidents early, and maintain compliance audit trails. Done poorly, it becomes a money pit that nobody trusts.
This guide covers the engineering practices that separate effective log management from log chaos.
Structured Logging
The single most impactful log management decision happens at the source — before logs ever reach your pipeline.
Use Structured Formats
Emit logs as structured data (JSON is the most common format), not free-text strings:
{"timestamp":"2025-06-15T14:23:01.234Z","level":"error","service":"payment-api","trace_id":"abc123","user_id":"u_789","message":"Payment processing failed","error_code":"CARD_DECLINED","duration_ms":342}
Not:
```
2025-06-15 14:23:01 ERROR payment-api - Payment processing failed for user u_789: CARD_DECLINED (342ms)
```
Structured logs are:
- Parseable without regex: No grok patterns, no brittle parsing rules
- Queryable: Filter by any field (`service:payment-api AND error_code:CARD_DECLINED`)
- Aggregatable: Compute p99 duration, count errors by code, group by service
- Forward-compatible: Adding new fields doesn't break existing parsers
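For illustration, here is a minimal sketch of emitting that JSON shape from Python using only the standard library. The `JsonFormatter` class, the hard-coded `payment-api` service name, and the chosen context fields are assumptions for the example; a library such as structlog or python-json-logger would be the more complete route.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with standard fields."""

    def format(self, record):
        ts = time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created))
        entry = {
            "timestamp": f"{ts}.{int(record.msecs):03d}Z",  # ISO 8601, UTC
            "level": record.levelname.lower(),
            "service": "payment-api",   # illustrative; usually injected from config/env
            "message": record.getMessage(),
        }
        # Carry structured context passed via the `extra` argument
        for key in ("trace_id", "user_id", "error_code", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123", "user_id": "u_789",
           "error_code": "CARD_DECLINED", "duration_ms": 342},
)
```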
Standardize Field Names
Adopt a consistent schema across all services. At minimum, every log line should include:
| Field | Purpose | Example |
|---|---|---|
| `timestamp` | When the event occurred (ISO 8601 with timezone) | `2025-06-15T14:23:01.234Z` |
| `level` | Severity (debug, info, warn, error, fatal) | `error` |
| `service` | Which service emitted the log | `payment-api` |
| `message` | Human-readable description | `Payment processing failed` |
| `trace_id` | Distributed tracing correlation | `abc123def456` |
| `host` | Machine or container identifier | `prod-web-03` |
Standards like the OpenTelemetry Semantic Conventions or the Elastic Common Schema (ECS) provide comprehensive field naming conventions you can adopt rather than inventing your own.
Log Levels Matter
Use log levels consistently across your organization:
- DEBUG: Detailed diagnostic information for development. Never enable in production by default.
- INFO: Normal operational events — service started, request completed, job finished.
- WARN: Unexpected but recoverable situations — retry succeeded, cache miss, deprecated API called.
- ERROR: Failures requiring attention — request failed, external service unavailable, data inconsistency.
- FATAL: Unrecoverable failures — process crashing, critical dependency down.
The most common mistake is logging everything at INFO level, making it impossible to filter signal from noise.
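A small guardrail that helps here, sketched below under the assumption of a `LOG_LEVEL` environment variable (the variable name is just a convention for the example), is to drive the level from configuration so DEBUG can be switched on per environment or per service without a code change:

```python
import logging
import os

# Resolve the level from the environment; default to INFO so DEBUG
# never ships to production by accident.
level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logging.getLogger("payment-api").debug("Only emitted when LOG_LEVEL=DEBUG")
```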
Collection Pipeline Architecture
The Standard Pipeline
A production log pipeline typically follows this flow:
Applications → Shippers → Message Queue → Processors → Storage → Analysis UI
Shippers (Filebeat, Fluent Bit, Vector, OpenTelemetry Collector) collect logs from files, stdout, or directly from applications and forward them to the pipeline.
Message Queue (Kafka, Amazon Kinesis, Redis) buffers logs between collection and processing. This decouples producers from consumers and absorbs traffic spikes without dropping data.
Processors (Logstash, Fluentd, Vector) parse, enrich, filter, and transform logs before storage. Common operations: parsing timestamps, adding GeoIP data, redacting sensitive fields, routing to different indices.
Storage (Elasticsearch, OpenSearch, ClickHouse, object storage) indexes logs for fast querying and holds them for the configured retention period.
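To make the processor stage concrete, here is a deliberately simplified Python sketch of what such a step does between the queue and storage: parse, enrich, and route a single event. In practice this logic lives in Logstash, Fluentd, or Vector configuration; the `process` function, the `audit-logs` index, and the routing rule are illustrative assumptions, not any tool's API.

```python
import json
from datetime import datetime, timezone


def process(raw_line: str, hostname: str) -> tuple[str, dict]:
    """Parse, enrich, and route one log event; returns (index, document)."""
    event = json.loads(raw_line)                      # parse (already structured)
    event.setdefault("host", hostname)                # enrich with collection metadata
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()

    # Route by criticality: audit events go to a stricter pipeline/index.
    if event.get("service") == "audit":
        return "audit-logs", event
    day = event["timestamp"][:10].replace("-", ".")   # e.g. "2025.06.15"
    return f"logs-{day}", event


index, doc = process(
    '{"timestamp":"2025-06-15T14:23:01.234Z","level":"error",'
    '"service":"payment-api","message":"Payment processing failed"}',
    hostname="prod-web-03",
)
```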
Pipeline Best Practices
Buffer with a message queue: Never ship logs directly from applications to storage. A queue absorbs spikes and prevents log loss during storage maintenance or outages.
Parse at the edge when possible: Ship structured logs and do minimal processing in the pipeline. The less transformation you need, the simpler and more reliable your pipeline becomes.
Use backpressure, not dropping: Configure shippers and queues to apply backpressure (slow down producers) rather than silently dropping logs when the pipeline is overloaded.
Separate pipelines by criticality: Security/audit logs should flow through a separate pipeline from debug logs, with stricter durability guarantees.
Monitor your monitoring: Your log pipeline itself needs health monitoring — queue depth, processing lag, indexing errors, and storage capacity.
Storage and Retention
Tiered Storage
Not all logs need the same storage performance:
- Hot tier (0–7 days): Fast SSDs, full indexing, immediate query access. This is where active debugging and alerting happens.
- Warm tier (7–30 days): Standard storage, reduced replicas. For investigating recent incidents that weren't caught immediately.
- Cold tier (30–90 days): Compressed, minimal replicas, slower queries. For trend analysis and compliance.
- Archive (90+ days): Object storage (S3, GCS). For compliance retention requirements. Searchable snapshots or on-demand rehydration.
Elasticsearch's Index Lifecycle Management (ILM) and OpenSearch's Index State Management (ISM) automate moving data between these tiers.
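As a hedged example of what that automation looks like in Elasticsearch, the sketch below registers a hot/warm/delete ILM policy over the REST API. The `logs-default` name, the cluster URL, and the ages and sizes are illustrative; authentication is omitted, and OpenSearch ISM uses a similar but not identical policy document.

```python
import requests

# Illustrative hot -> warm -> delete lifecycle; adjust ages and sizes to your tiers.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over at ~50 GB per shard or one day, whichever comes first
                    # (older clusters use "max_size" instead of "max_primary_shard_size").
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "shrink": {"number_of_shards": 1},
                    "set_priority": {"priority": 50},
                },
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}

resp = requests.put(
    "http://localhost:9200/_ilm/policy/logs-default",  # assumed local cluster, no auth
    json=policy,
    timeout=10,
)
resp.raise_for_status()
```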
Retention Policies
Define retention per log type based on actual needs:
| Log Type | Typical Retention | Rationale |
|---|---|---|
| Application debug | 3–7 days | Only needed for active debugging |
| Application info/warn | 14–30 days | Operational context for recent issues |
| Application errors | 30–90 days | Root-cause analysis for recurring issues |
| Security/audit logs | 1–7 years | Compliance (SOC 2, HIPAA, PCI-DSS) |
| Infrastructure metrics | 30–90 days | Capacity planning and trending |
| Access logs | 30–90 days | Security investigation, traffic analysis |
Over-retaining logs is the most common cost driver. Audit your retention policies quarterly — if nobody queries logs older than 30 days, don't keep them on expensive hot storage for 90.
Index Strategy
For Elasticsearch and OpenSearch deployments:
- Time-based indices: Create daily or weekly indices (`logs-2025.06.15`) to enable efficient rollover and deletion.
- Index templates: Define mappings and settings once; apply them automatically to new indices (see the sketch after this list).
- Rollover policies: Roll indices based on size (50 GB) or age (1 day) to keep shard sizes manageable.
- Shard sizing: Target 10–50 GB per shard. Too many small shards waste memory; too few large shards slow queries.
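A minimal sketch of such an index template, applied automatically to every new `logs-*` index and wired to the ILM policy sketched earlier, might look like this (the template name, field list, and settings are illustrative rather than a complete mapping):

```python
import requests

template = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {
            "number_of_shards": 1,
            "index.lifecycle.name": "logs-default",   # the ILM policy sketched above
        },
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
                "level": {"type": "keyword"},
                "service": {"type": "keyword"},
                "trace_id": {"type": "keyword"},
                "message": {"type": "text"},
                "duration_ms": {"type": "long"},
            }
        },
    },
}

resp = requests.put(
    "http://localhost:9200/_index_template/logs",  # assumed local cluster, no auth
    json=template,
    timeout=10,
)
resp.raise_for_status()
```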
Analysis and Alerting
Effective Log Queries
Structure your logs and indices to support the queries you actually run:
- Service + time range + level: The most common pattern. Ensure `service`, `timestamp`, and `level` are properly mapped.
- Trace ID lookup: Finding all logs for a single request across services. Index `trace_id` as a keyword field.
- Error aggregation: Counting errors by type, service, and time bucket. Use aggregations, not scrolling through raw logs (see the sketch after this list).
- Full-text search on messages: For hunting unknown issues. Use appropriate analyzers and consider dedicated search fields.
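For the error-aggregation pattern, here is a hedged sketch of the corresponding Elasticsearch query: a filtered terms aggregation over the last hour instead of paging through raw hits. The cluster URL and index pattern are assumptions, and `error_code` must be mapped as a keyword field for the aggregation to work.

```python
import requests

query = {
    "size": 0,  # we only want aggregation buckets, not raw documents
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "error"}},
                {"range": {"timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "by_code": {"terms": {"field": "error_code", "size": 20}},
    },
}

resp = requests.post("http://localhost:9200/logs-*/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_code"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```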
Alerting Best Practices
- Alert on rates, not individual events: "Error rate > 5% for 5 minutes" is actionable. "An error occurred" is noise. See the sketch after this list.
- Use anomaly detection for unknown-unknowns: ML-based anomaly detection on error rates, latency distributions, and log volume catches issues you didn't predict.
- Include context in alerts: Link directly to the relevant dashboard or log query so responders don't waste time navigating.
- Route alerts by severity: Page on-call for critical issues; send to Slack for warnings; write to a ticket for informational trends.
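As an illustration of rate-based alerting, the sketch below derives an error rate from two count queries over a five-minute window and fires only above a threshold. A real deployment would express this rule in its alerting tool (Kibana alerting, Grafana, or similar) rather than a standalone script; the service name, cluster URL, and 5% threshold are assumptions for the example.

```python
import requests

ES_COUNT = "http://localhost:9200/logs-*/_count"   # assumed local cluster, no auth
WINDOW = {"range": {"timestamp": {"gte": "now-5m"}}}


def count(extra_filters: list) -> int:
    """Return the number of events matching the window plus extra filters."""
    query = {"query": {"bool": {"filter": [WINDOW, *extra_filters]}}}
    resp = requests.post(ES_COUNT, json=query, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]


total = count([{"term": {"service": "payment-api"}}])
errors = count([{"term": {"service": "payment-api"}}, {"term": {"level": "error"}}])

# Alert on the rate over the window, not on individual error events.
if total and errors / total > 0.05:
    print(f"ALERT: payment-api error rate {errors / total:.1%} over the last 5 minutes")
```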
Security Considerations
Sensitive Data in Logs
Logs frequently contain data that shouldn't be stored:
- Passwords, API keys, tokens
- Credit card numbers, SSNs
- Personally identifiable information (PII)
- Session identifiers
Implement redaction at the application level (preferred) or in the processing pipeline. Use field-level masking patterns:
```
# In Logstash
filter {
  mutate {
    gsub => ["message", "\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[REDACTED_CC]"]
  }
}
```
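The application-level equivalent, which the paragraph above recommends as the preferred place for redaction, can be as small as a logging filter. This sketch reuses the credit-card pattern from the Logstash example and is illustrative rather than exhaustive:

```python
import logging
import re

CC_PATTERN = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b")


class RedactCreditCards(logging.Filter):
    """Mask card-number-shaped substrings before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = CC_PATTERN.sub("[REDACTED_CC]", str(record.msg))
        return True  # keep the record, just with the sensitive text masked


logger = logging.getLogger("payment-api")
logger.addFilter(RedactCreditCards())
logger.error("Charge failed for card 4111 1111 1111 1111")
```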
Access Control
- Restrict log access by role — developers see application logs, security teams see audit logs, nobody sees everything by default.
- Log all access to the logging system itself for audit compliance.
- Encrypt logs in transit (TLS) and at rest.
Cost Optimization
Log management costs grow linearly with volume. The most effective levers:
- Don't log what you don't need: Remove debug logs from production, suppress health check access logs, and deduplicate repeated messages.
- Sample high-volume, low-value logs: For high-traffic services, sample verbose logs at 10–25% while keeping all errors and warnings (see the sketch after this list).
- Compress aggressively: Use ZSTD compression on warm/cold tiers. Typical savings: 5–10x over uncompressed.
- Use appropriate storage tiers: Don't keep 90 days of logs on hot SSDs when only 7 days are actively queried.
- Monitor ingestion volume by service: Identify which services produce the most log volume and whether that volume is justified. A single misconfigured debug flag can double your storage costs.
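For the sampling lever, a minimal application-side sketch: keep every record at WARN and above, and pass through only a fraction of lower-severity ones (the 10% rate is simply the low end of the range given above):

```python
import logging
import random


class SampleInfoLogs(logging.Filter):
    """Keep all WARN+ records; pass through a fraction of lower-severity ones."""

    def __init__(self, rate: float = 0.10):
        super().__init__()
        self.rate = rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True               # never drop warnings or errors
        return random.random() < self.rate


logging.getLogger("payment-api").addFilter(SampleInfoLogs(rate=0.10))
```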
Frequently Asked Questions
Q: How much log storage should I plan for?
A rough formula: daily_volume_GB = events_per_second × average_event_size_KB × 86400 / 1_000_000. With compression (typical 5–10x in Elasticsearch/OpenSearch), divide by the compression ratio. Plan for 20% headroom above peak volumes.
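Worked through with illustrative numbers (2,000 events per second at 0.5 KB each, with an assumed 8x compression ratio):

```python
# Illustrative numbers: 2,000 events/s averaging 0.5 KB each.
events_per_second = 2_000
average_event_size_kb = 0.5
compression_ratio = 8            # assumed, within the 5-10x range above

daily_volume_gb = events_per_second * average_event_size_kb * 86_400 / 1_000_000
on_disk_gb = daily_volume_gb / compression_ratio
planned_gb = on_disk_gb * 1.20   # 20% headroom above peak

print(f"{daily_volume_gb:.1f} GB/day raw, ~{planned_gb:.1f} GB/day planned on disk")
# -> 86.4 GB/day raw, ~13.0 GB/day planned on disk
```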
Q: Should I use Elasticsearch, OpenSearch, or ClickHouse for logs?
Elasticsearch and OpenSearch excel at full-text search across log messages and offer mature ecosystems (Kibana/OpenSearch Dashboards, alerting, ILM/ISM). ClickHouse is better for structured, high-volume analytical queries (aggregations, time-series analysis) but lacks built-in full-text search UIs. Many organizations use both — ClickHouse for metrics and structured events, Elasticsearch/OpenSearch for text-heavy logs.
Q: How do I handle multi-line logs like stack traces?
Configure your shipper to join multi-line messages before sending. Filebeat and Fluent Bit both support multi-line patterns. Alternatively, emit stack traces as a single JSON field at the application level — this is more reliable than pipeline-based joining.
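For the application-level option, a minimal sketch: serialize the traceback into a single field of the structured event so the shipper never has to re-join lines. The `charge` function and field names are invented for the example.

```python
import json
import logging
import traceback

logger = logging.getLogger("payment-api")


def charge(card_id: str) -> None:
    raise ValueError(f"card {card_id} not found")


try:
    charge("c_42")
except ValueError:
    # The whole traceback travels as one JSON string field, so the shipper
    # sees a single line instead of a multi-line block to stitch together.
    logger.error(json.dumps({
        "message": "Payment processing failed",
        "stack_trace": traceback.format_exc(),
    }))
```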
Q: What's the difference between logging and observability?
Logging is one pillar of observability, alongside metrics and traces. Logs capture discrete events, metrics capture aggregated measurements over time, and traces capture request flows across services. Modern observability correlates all three — a metric alert leads to traces that identify the slow service, which leads to logs that show the root cause.
Q: How do I reduce log volume without losing visibility?
Start by auditing: which log lines are actually queried? Remove or reduce retention for lines nobody reads. Use sampling for high-volume endpoints. Move verbose diagnostic data behind dynamic log levels that can be enabled per-service on demand. And always log errors — those are the logs you'll wish you had.