Database Monitoring Best Practices: Complete Guide (2025)
Effective database monitoring is critical for maintaining application performance, preventing outages, and ensuring data reliability. This comprehensive guide covers monitoring strategies, essential metrics, and best practices for modern database systems.
Why Database Monitoring Matters
Database monitoring helps you:
- Prevent outages by catching issues before they become critical
- Optimize performance through data-driven insights
- Reduce costs by identifying inefficiencies
- Improve user experience with faster query response times
- Plan capacity based on usage trends
- Meet SLAs with proactive alerting
- Debug issues faster with historical data
- Ensure security by detecting anomalous behavior
Key Metrics to Monitor
1. Availability Metrics
Uptime and Downtime
- Database availability percentage
- Time to detect failures
- Time to recovery (MTTR)
- Planned vs unplanned downtime
Connection Health
- Successful vs failed connection attempts
- Connection pool utilization
- Connection errors and timeouts
- Active connections vs maximum allowed
Monitoring Example (PostgreSQL):
-- Check if database is accepting connections
SELECT pg_is_in_recovery(),
now() - pg_postmaster_start_time() as uptime;
-- Monitor connection count
SELECT count(*) as current_connections,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_connections,
round(100.0 * count(*) / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 2) as pct_used
FROM pg_stat_activity;
Monitoring Example (MySQL):
-- Check uptime
SHOW GLOBAL STATUS LIKE 'Uptime';
-- Monitor connections
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';
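To get the same headroom percentage as the PostgreSQL query above, the counters can be combined into one query; a minimal sketch against performance_schema.global_status (MySQL 5.7+):
-- Connection utilization as a percentage of max_connections
SELECT
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Threads_connected') AS current_connections,
@@max_connections AS max_connections,
ROUND(100 * (SELECT VARIABLE_VALUE FROM performance_schema.global_status
 WHERE VARIABLE_NAME = 'Threads_connected') / @@max_connections, 2) AS pct_used;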
2. Performance Metrics
Query Performance
- Average query execution time
- Slow query count
- Query throughput (queries per second)
- Query errors and failures
- Long-running queries
Monitoring Example (PostgreSQL):
-- Enable pg_stat_statements
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 slowest queries by average time
SELECT
query,
calls,
total_exec_time / 1000 as total_sec,
mean_exec_time / 1000 as avg_sec,
max_exec_time / 1000 as max_sec
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Currently running queries
SELECT
pid,
now() - query_start as duration,
usename,
state,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND pid != pg_backend_pid()
ORDER BY duration DESC;
Monitoring Example (MySQL):
-- Enable statement instrumentation in the Performance Schema
-- (scoping the UPDATE avoids enabling every instrument, which adds overhead)
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';
-- Top queries by execution time
SELECT
DIGEST_TEXT as query,
COUNT_STAR as exec_count,
AVG_TIMER_WAIT/1000000000 as avg_ms,
SUM_TIMER_WAIT/1000000000 as total_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
-- Current processes
SHOW FULL PROCESSLIST;
Throughput Metrics
- Reads per second
- Writes per second
- Transactions per second
- Commits and rollbacks
Monitoring Example (PostgreSQL):
-- Transaction statistics
SELECT
datname,
xact_commit,
xact_rollback,
xact_commit + xact_rollback as total_xacts,
round(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) as rollback_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
-- Row operations
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_tup_read + idx_tup_fetch DESC
LIMIT 10;
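The MySQL counterpart uses cumulative status counters (a minimal sketch; sample twice and divide by the elapsed seconds to convert totals into per-second rates):
-- Read/write/transaction counters, cumulative since server start
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME IN ('Com_select', 'Com_insert', 'Com_update',
'Com_delete', 'Com_commit', 'Com_rollback', 'Questions');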
3. Resource Utilization Metrics
CPU Usage
- Database process CPU utilization
- Per-query CPU consumption
- CPU wait time
- Context switches
Memory Usage
- Buffer pool/cache hit ratio
- Memory allocated vs used
- Swap usage
- Memory available
- Cache efficiency
Monitoring Example (PostgreSQL):
-- Buffer cache hit ratio (target > 99%)
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
round(sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2) as cache_hit_ratio
FROM pg_statio_user_tables;
-- Index cache hit ratio
SELECT
sum(idx_blks_read) as idx_read,
sum(idx_blks_hit) as idx_hit,
round(sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit) + sum(idx_blks_read), 0) * 100, 2) as idx_cache_hit_ratio
FROM pg_statio_user_indexes;
Monitoring Example (MySQL):
-- InnoDB buffer pool hit ratio (target > 99%)
SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW STATUS LIKE 'Innodb_buffer_pool_reads';
-- Calculate hit ratio
SELECT
(1 - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
(SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')) * 100
AS buffer_pool_hit_ratio;
Disk I/O
- Read IOPS
- Write IOPS
- Read throughput (MB/s)
- Write throughput (MB/s)
- I/O latency
- Queue depth
Monitoring Example (PostgreSQL):
-- Table I/O statistics
SELECT
schemaname,
relname,
heap_blks_read,
heap_blks_hit,
idx_blks_read,
idx_blks_hit,
toast_blks_read,
toast_blks_hit
FROM pg_statio_user_tables
WHERE heap_blks_read + idx_blks_read > 0
ORDER BY heap_blks_read + idx_blks_read DESC
LIMIT 10;
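For MySQL, a comparable per-file view comes from the Performance Schema's file I/O summaries (a minimal sketch; counters are cumulative since server start):
-- Heaviest data files by cumulative I/O volume
SELECT
FILE_NAME,
COUNT_READ,
COUNT_WRITE,
ROUND(SUM_NUMBER_OF_BYTES_READ / 1024 / 1024, 2) AS read_mb,
ROUND(SUM_NUMBER_OF_BYTES_WRITE / 1024 / 1024, 2) AS write_mb
FROM performance_schema.file_summary_by_instance
ORDER BY SUM_NUMBER_OF_BYTES_READ + SUM_NUMBER_OF_BYTES_WRITE DESC
LIMIT 10;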
Storage Metrics
- Database size
- Table sizes
- Index sizes
- Growth rate
- Free space
- Disk space available
Monitoring Example (PostgreSQL):
-- Database sizes
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Table and index sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename))) AS total_size,
pg_size_pretty(pg_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename))) AS table_size,
pg_size_pretty(pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename)) - pg_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename))) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(quote_ident(schemaname)||'.'||quote_ident(tablename)) DESC
LIMIT 10;
-- Growth tracking (requires periodic sampling)
CREATE TABLE IF NOT EXISTS db_size_history (
measured_at TIMESTAMP DEFAULT NOW(),
database_name TEXT,
size_bytes BIGINT
);
INSERT INTO db_size_history (database_name, size_bytes)
SELECT datname, pg_database_size(datname) FROM pg_database;
Monitoring Example (MySQL):
-- Database sizes
SELECT
table_schema AS database_name,
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS size_mb
FROM information_schema.TABLES
GROUP BY table_schema
ORDER BY size_mb DESC;
-- Table sizes
SELECT
table_name,
ROUND(((data_length + index_length) / 1024 / 1024), 2) AS size_mb,
ROUND((data_length / 1024 / 1024), 2) AS data_mb,
ROUND((index_length / 1024 / 1024), 2) AS index_mb,
table_rows
FROM information_schema.TABLES
WHERE table_schema = 'mydb'
ORDER BY (data_length + index_length) DESC
LIMIT 10;
4. Lock and Concurrency Metrics
Lock Statistics
- Lock wait time
- Lock wait count
- Deadlock count
- Blocking queries
- Lock queue depth
Monitoring Example (PostgreSQL):
-- Current locks and blocking queries
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement,
blocked_activity.application_name AS blocked_application,
blocking_activity.application_name AS blocking_application
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Deadlock counts per database (details are written to the server log)
SELECT datname, deadlocks
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
Monitoring Example (MySQL):
-- Current locks
SELECT * FROM performance_schema.data_locks;
-- Lock waits
SELECT * FROM performance_schema.data_lock_waits;
-- InnoDB status (includes deadlock information)
SHOW ENGINE INNODB STATUS;
-- Metadata locks
SELECT
OBJECT_SCHEMA,
OBJECT_NAME,
LOCK_TYPE,
LOCK_DURATION,
LOCK_STATUS,
OWNER_THREAD_ID,
OWNER_EVENT_ID
FROM performance_schema.metadata_locks
WHERE OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema');
5. Replication Metrics
Replication Health
- Replication lag
- Replication errors
- Replica status (running/stopped)
- Last replicated transaction
- Replication throughput
Monitoring Example (PostgreSQL Streaming Replication):
-- On primary: check replication status
SELECT
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) as write_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) as flush_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag
FROM pg_stat_replication;
-- On replica: check replication delay
SELECT
now() - pg_last_xact_replay_timestamp() AS replication_delay,
pg_is_in_recovery() as is_replica;
Monitoring Example (MySQL Replication):
-- Check replica status (MySQL 8.0.22+; use SHOW SLAVE STATUS on older versions)
SHOW REPLICA STATUS\G
-- Key fields to monitor:
-- Replica_IO_Running: Yes/No
-- Replica_SQL_Running: Yes/No
-- Seconds_Behind_Source: replication lag (Seconds_Behind_Master before 8.0.22)
-- Last_IO_Error: replication errors
-- Last_SQL_Error: SQL errors
-- Or using performance schema (MySQL 8.0+)
SELECT
CHANNEL_NAME,
SERVICE_STATE,
LAST_ERROR_MESSAGE,
LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;
-- Per-worker applier status
SELECT
CHANNEL_NAME,
WORKER_ID,
SERVICE_STATE,
LAST_APPLIED_TRANSACTION,
APPLYING_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;
6. Error and Event Metrics
Error Tracking
- Connection errors
- Query errors
- Replication errors
- Checkpoint failures
- Corruption errors
Monitoring Example:
-- PostgreSQL: Monitor logs
-- Configure in postgresql.conf:
log_min_messages = error
log_min_error_statement = error
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
-- MySQL: Monitor error log
-- Configure in my.cnf:
log_error = /var/log/mysql/error.log
log_error_verbosity = 2
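If data checksums are enabled on the cluster, PostgreSQL 12+ also exposes corruption signals directly in pg_stat_database, which is worth a dedicated alert:
-- Non-zero checksum failures indicate possible on-disk corruption
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;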
7. Maintenance Metrics
Vacuum/Maintenance Statistics (PostgreSQL)
-- Check when tables were last vacuumed/analyzed
SELECT
schemaname,
relname,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze,
n_dead_tup,
n_live_tup,
ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_tuple_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Monitor autovacuum activity
SELECT
pid,
now() - xact_start as duration,
query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
Table Optimization (MySQL)
-- Check table fragmentation
SELECT
table_name,
ROUND(data_length / 1024 / 1024, 2) AS data_mb,
ROUND(data_free / 1024 / 1024, 2) AS free_mb,
ROUND((data_free / (data_length + index_length + data_free)) * 100, 2) AS fragmentation_pct
FROM information_schema.TABLES
WHERE table_schema = 'mydb'
AND data_free > 0
ORDER BY fragmentation_pct DESC;
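When fragmentation is high, rebuilding the table reclaims the free space (the table name below is illustrative; the rebuild copies the table, so schedule it off-peak):
-- For InnoDB, OPTIMIZE TABLE is mapped to ALTER TABLE ... FORCE plus ANALYZE
OPTIMIZE TABLE mydb.orders;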
Monitoring Strategies
1. Real-Time Monitoring
Purpose: Immediate detection of critical issues
Implementation:
- Monitor availability every 10-30 seconds
- Track active connections continuously
- Alert on threshold breaches (CPU > 80%, disk > 90%)
- Monitor long-running queries (> 30 seconds), as in the poll below
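For example, a simple poll catches long-running queries in PostgreSQL (adjust the interval to your own threshold):
-- Active queries running longer than 30 seconds
SELECT
pid,
now() - query_start AS runtime,
usename,
state,
left(query, 100) AS query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '30 seconds'
ORDER BY runtime DESC;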
Tools:
- Prometheus + Grafana
- Datadog
- New Relic
- Pulse (specialized for databases)
2. Historical Analysis
Purpose: Identify trends and patterns
Implementation:
- Store metrics with appropriate retention (e.g., one year)
- Sample high-frequency metrics (1-minute intervals)
- Aggregate data for long-term storage
- Create baseline performance profiles
Example Storage Strategy:
- Raw data (1-min intervals): 7 days
- Hourly aggregates: 90 days
- Daily aggregates: 1 year
- Monthly aggregates: 3 years
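A minimal rollup sketch for such a policy (the metrics_raw and metrics_hourly tables are hypothetical; adapt the names and columns to your metrics store):
-- Aggregate 1-minute samples older than 7 days into hourly buckets,
-- then drop the raw rows that were rolled up
INSERT INTO metrics_hourly (bucket, metric_name, avg_value, max_value)
SELECT date_trunc('hour', measured_at), metric_name, avg(value), max(value)
FROM metrics_raw
WHERE measured_at < now() - interval '7 days'
GROUP BY 1, 2;
DELETE FROM metrics_raw
WHERE measured_at < now() - interval '7 days';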
3. Alerting Strategy
Alert Levels:
Critical (Immediate Response):
- Database down/unreachable
- Replication broken
- Out of disk space (< 5%)
- Connection pool exhausted
- Data corruption detected
Warning (Within Hours):
- High CPU usage (> 80% for 10 min)
- Low disk space (< 20%)
- Replication lag > 30 seconds
- High number of slow queries
- Buffer cache hit ratio < 95%
Info (Monitor):
- Table bloat > 30%
- Unused indexes (see the query after this list)
- Query plan changes
- Moderate resource usage
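Unused indexes, for instance, are easy to surface from PostgreSQL's statistics views:
-- Indexes never scanned since statistics were last reset (review before dropping)
SELECT
schemaname,
relname,
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;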
Alert Best Practices:
# Example alert configuration (Prometheus)
groups:
  - name: database_alerts
    rules:
      - alert: DatabaseDown
        expr: up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL instance {{ $labels.instance }} has been down for more than 1 minute"
      - alert: HighConnectionUsage
        expr: (pg_stat_database_numbackends / pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection usage on {{ $labels.instance }}"
          description: "Connection usage is {{ $value }}%"
      - alert: ReplicationLag
        expr: pg_replication_lag_seconds > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} seconds"
      - alert: SlowQueries
        # Metric name depends on your exporter configuration
        expr: pg_stat_statements_mean_exec_time > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected on {{ $labels.instance }}"
          description: "Average query time is {{ $value }} ms"
4. Capacity Planning
Track Growth Metrics:
-- PostgreSQL: Database growth rate
WITH size_samples AS (
SELECT
measured_at,
database_name,
size_bytes,
LAG(size_bytes) OVER (PARTITION BY database_name ORDER BY measured_at) as prev_size,
LAG(measured_at) OVER (PARTITION BY database_name ORDER BY measured_at) as prev_time
FROM db_size_history
)
SELECT
database_name,
pg_size_pretty(size_bytes) as current_size,
pg_size_pretty((size_bytes - prev_size)) as growth,
EXTRACT(EPOCH FROM (measured_at - prev_time)) / 3600 as hours_elapsed,
pg_size_pretty(
((size_bytes - prev_size) * 24 * 30 * 3600 /
EXTRACT(EPOCH FROM (measured_at - prev_time)))::bigint
) as estimated_monthly_growth
FROM size_samples
WHERE prev_size IS NOT NULL
ORDER BY database_name, measured_at DESC;
Forecast Future Needs:
- Disk space needed in 3/6/12 months (see the projection sketch below)
- Connection pool sizing
- Memory requirements
- CPU requirements
- When to scale up or out
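A rough linear projection can come straight from the db_size_history table created earlier (a sketch only; real forecasts should also account for seasonality and planned workload changes):
-- Average daily growth over the last 30 days, projected 90 days ahead
WITH daily AS (
SELECT
database_name,
max(size_bytes) AS current_bytes,
(max(size_bytes) - min(size_bytes)) /
GREATEST(EXTRACT(EPOCH FROM max(measured_at) - min(measured_at)) / 86400, 1) AS bytes_per_day
FROM db_size_history
WHERE measured_at > now() - interval '30 days'
GROUP BY database_name
)
SELECT
database_name,
pg_size_pretty(current_bytes) AS current_size,
pg_size_pretty((current_bytes + bytes_per_day * 90)::bigint) AS projected_90_days
FROM daily;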
Monitoring Tools Comparison
Open Source Tools
Prometheus + Grafana
- Pros: Flexible, powerful, large ecosystem
- Cons: Requires setup and maintenance
- Best for: Teams with DevOps resources
Postgres Exporter / MySQL Exporter
- Integrates with Prometheus
- Exports database metrics
- Customizable queries
pg_stat_statements (PostgreSQL)
- Built-in query statistics
- Low overhead
- Essential for query optimization
Performance Schema (MySQL)
- Built-in performance monitoring
- Detailed instrumentation
- Can impact performance if over-used
Percona Monitoring and Management (PMM)
- Pros: Free, database-focused, query analytics
- Cons: Resource intensive
- Best for: MySQL, PostgreSQL, MongoDB
Commercial Tools
Datadog
- Pros: Easy setup, comprehensive, great UI
- Cons: Can be expensive
- Best for: Organizations wanting full-stack observability
New Relic
- Pros: APM integration, AI insights
- Cons: Pricing model
- Best for: Application performance correlation
Pulse
- Pros: Specialized for databases, query insights, anomaly detection
- Cons: Database-focused only
- Best for: Teams focused on database performance
SolarWinds Database Performance Analyzer
- Pros: Deep database insights, wait analysis
- Cons: Complex setup
- Best for: Enterprise environments
Best Practices Checklist
Setup and Configuration
- Enable query statistics collection (pg_stat_statements, performance_schema)
- Configure slow query logging with appropriate thresholds (example after this list)
- Set up connection monitoring
- Enable checkpoint/vacuum logging
- Configure log rotation and retention
- Set up automated backups with monitoring
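For the slow query logging item, a sketch of the relevant settings (thresholds are illustrative; tune them to your latency targets):
-- PostgreSQL (postgresql.conf): log statements slower than 500 ms
log_min_duration_statement = 500
-- MySQL (my.cnf): enable the slow query log with a 0.5 s threshold
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5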
Metrics Collection
- Monitor all key metrics (availability, performance, resources)
- Use appropriate sampling intervals (balance granularity vs storage)
- Set up historical data retention policy
- Create dashboards for different audiences (DBAs, developers, ops)
- Document metric baselines and SLAs
Alerting
- Define clear alert thresholds based on baselines
- Implement alert fatigue prevention (grouping, throttling)
- Set up escalation policies
- Configure multiple notification channels
- Test alert delivery regularly
- Document alert response procedures
Security and Compliance
- Monitor failed login attempts (see the config sketch below)
- Track privilege escalations
- Log schema changes
- Monitor data access patterns
- Audit sensitive data queries
- Comply with data retention policies
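For failed login monitoring, both engines can log every connection attempt (a sketch; ship these logs to your SIEM or alerting pipeline):
-- PostgreSQL (postgresql.conf): failed authentications appear in the server log
log_connections = on
log_disconnections = on
-- MySQL (my.cnf): verbosity 3 also records aborted connection attempts
log_error_verbosity = 3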
Performance Optimization
- Review slow queries weekly
- Monitor index usage and effectiveness
- Track query plan changes
- Analyze cache hit ratios
- Monitor table bloat and fragmentation
- Review and optimize configuration regularly
Capacity Planning
- Track growth trends (storage, connections, queries)
- Forecast resource needs quarterly
- Monitor resource utilization trends
- Plan for peak load capacity
- Test scaling strategies
Team and Process
- Assign monitoring responsibilities
- Create runbooks for common issues
- Conduct regular performance reviews
- Share metrics with stakeholders
- Continuous improvement based on incidents
- Train team on monitoring tools
Common Monitoring Pitfalls
1. Monitoring Too Much
Problem: Collecting every possible metric
Impact: Storage costs, analysis paralysis, alert fatigue
Solution: Focus on metrics that drive decisions
2. Not Monitoring Enough
Problem: Missing critical metrics
Impact: Undetected issues, poor performance
Solution: Cover all categories: availability, performance, resources, errors
3. Poor Alert Configuration
Problem: Too many false positives or missing real issues
Impact: Alert fatigue or missed incidents
Solution: Tune thresholds based on baselines, use composite alerts
4. Ignoring Trends
Problem: Only reactive monitoring
Impact: Preventable outages, capacity issues
Solution: Regular trend analysis, capacity planning
5. No Action on Data
Problem: Collecting data but not using it
Impact: Wasted resources, no improvement
Solution: Regular review meetings, automated actions where possible
6. Lack of Context
Problem: Metrics without application context
Impact: Difficulty diagnosing issues
Solution: Correlate database metrics with application metrics
Related Resources
- PostgreSQL vs MySQL
- PostgreSQL Performance Tuning
- MySQL Performance Optimization
- Common PostgreSQL Errors
Implement database monitoring best practices effortlessly with Pulse. Get automated insights, intelligent alerting, and comprehensive visibility into your database performance.