Database Monitoring Best Practices: Complete Guide (2025)

Effective database monitoring is critical for maintaining application performance, preventing outages, and ensuring data reliability. This comprehensive guide covers monitoring strategies, essential metrics, and best practices for modern database systems.

Why Database Monitoring Matters

Database monitoring helps you:

  • Prevent outages by catching issues before they become critical
  • Optimize performance through data-driven insights
  • Reduce costs by identifying inefficiencies
  • Improve user experience with faster query response times
  • Plan capacity based on usage trends
  • Meet SLAs with proactive alerting
  • Debug issues faster with historical data
  • Ensure security by detecting anomalous behavior

Key Metrics to Monitor

1. Availability Metrics

Uptime and Downtime

  • Database availability percentage
  • Time to detect failures
  • Time to recovery (MTTR)
  • Planned vs unplanned downtime
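These roll-ups are simple to compute once incident windows are recorded. A minimal Python sketch (the helper name and sample incidents are illustrative):

```python
from datetime import datetime

def availability_metrics(period_start, period_end, incidents):
    """Compute availability % and MTTR (seconds) from (start, end) incident windows."""
    total = (period_end - period_start).total_seconds()
    downtime = sum((end - start).total_seconds() for start, end in incidents)
    availability_pct = round(100.0 * (total - downtime) / total, 3)
    mttr = downtime / len(incidents) if incidents else 0.0
    return availability_pct, mttr

start = datetime(2025, 1, 1)
end = datetime(2025, 2, 1)
incidents = [
    (datetime(2025, 1, 10, 3, 0), datetime(2025, 1, 10, 3, 12)),   # 12 min outage
    (datetime(2025, 1, 22, 14, 0), datetime(2025, 1, 22, 14, 6)),  # 6 min outage
]
pct, mttr = availability_metrics(start, end, incidents)
print(pct, mttr / 60)  # 99.96 9.0 (availability %, MTTR in minutes)
```

In practice the incident windows come from your alerting history; the point is that availability and MTTR are derived metrics, so the raw detection/recovery timestamps are what need to be stored.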

Connection Health

  • Successful vs failed connection attempts
  • Connection pool utilization
  • Connection errors and timeouts
  • Active connections vs maximum allowed

Monitoring Example (PostgreSQL):

-- Check if database is accepting connections
SELECT pg_is_in_recovery(),
       now() - pg_postmaster_start_time() as uptime;

-- Monitor connection count
SELECT count(*) as current_connections,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') as max_connections,
       round(100.0 * count(*) / (SELECT setting::int FROM pg_settings WHERE name = 'max_connections'), 2) as pct_used
FROM pg_stat_activity;

Monitoring Example (MySQL):

-- Check uptime
SHOW GLOBAL STATUS LIKE 'Uptime';

-- Monitor connections
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW VARIABLES LIKE 'max_connections';

2. Performance Metrics

Query Performance

  • Average query execution time
  • Slow query count
  • Query throughput (queries per second)
  • Query errors and failures
  • Long-running queries

Monitoring Example (PostgreSQL):

-- Enable pg_stat_statements (the module must also be listed in
-- shared_preload_libraries, which requires a server restart)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 slowest queries by average time
SELECT
    query,
    calls,
    total_exec_time / 1000 as total_sec,
    mean_exec_time / 1000 as avg_sec,
    max_exec_time / 1000 as max_sec
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Currently running queries
SELECT
    pid,
    now() - query_start as duration,
    usename,
    state,
    query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

Monitoring Example (MySQL):

-- Enable statement instrumentation (scope the update; enabling every
-- instrument adds measurable overhead)
UPDATE performance_schema.setup_instruments
SET ENABLED = 'YES', TIMED = 'YES'
WHERE NAME LIKE 'statement/%';

-- Top queries by execution time
SELECT
    DIGEST_TEXT as query,
    COUNT_STAR as exec_count,
    AVG_TIMER_WAIT/1000000000 as avg_ms,
    SUM_TIMER_WAIT/1000000000 as total_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;

-- Current processes
SHOW FULL PROCESSLIST;

Throughput Metrics

  • Reads per second
  • Writes per second
  • Transactions per second
  • Commits and rollbacks

Monitoring Example (PostgreSQL):

-- Transaction statistics
SELECT
    datname,
    xact_commit,
    xact_rollback,
    xact_commit + xact_rollback as total_xacts,
    round(100.0 * xact_rollback / NULLIF(xact_commit + xact_rollback, 0), 2) as rollback_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');

-- Row operations
SELECT
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    n_tup_ins,
    n_tup_upd,
    n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_tup_read + idx_tup_fetch DESC
LIMIT 10;

3. Resource Utilization Metrics

CPU Usage

  • Database process CPU utilization
  • Per-query CPU consumption
  • CPU wait time
  • Context switches

Memory Usage

  • Buffer pool/cache hit ratio
  • Memory allocated vs used
  • Swap usage
  • Memory available
  • Cache efficiency

Monitoring Example (PostgreSQL):

-- Buffer cache hit ratio (target > 99%)
SELECT
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    round(sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100, 2) as cache_hit_ratio
FROM pg_statio_user_tables;

-- Index cache hit ratio
SELECT
    sum(idx_blks_read) as idx_read,
    sum(idx_blks_hit) as idx_hit,
    round(sum(idx_blks_hit) / NULLIF(sum(idx_blks_hit) + sum(idx_blks_read), 0) * 100, 2) as idx_cache_hit_ratio
FROM pg_statio_user_indexes;

Monitoring Example (MySQL):

-- InnoDB buffer pool hit ratio (target > 99%)
SHOW STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW STATUS LIKE 'Innodb_buffer_pool_reads';

-- Calculate hit ratio
SELECT
    (1 - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
         (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')) * 100
    AS buffer_pool_hit_ratio;

Disk I/O

  • Read IOPS
  • Write IOPS
  • Read throughput (MB/s)
  • Write throughput (MB/s)
  • I/O latency
  • Queue depth

Monitoring Example (PostgreSQL):

-- Table I/O statistics
SELECT
    schemaname,
    relname,
    heap_blks_read,
    heap_blks_hit,
    idx_blks_read,
    idx_blks_hit,
    toast_blks_read,
    toast_blks_hit
FROM pg_statio_user_tables
WHERE heap_blks_read + idx_blks_read > 0
ORDER BY heap_blks_read + idx_blks_read DESC
LIMIT 10;

Storage Metrics

  • Database size
  • Table sizes
  • Index sizes
  • Growth rate
  • Free space
  • Disk space available

Monitoring Example (PostgreSQL):

-- Database sizes
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Table and index sizes
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;

-- Growth tracking (requires periodic sampling)
CREATE TABLE IF NOT EXISTS db_size_history (
    measured_at TIMESTAMP DEFAULT NOW(),
    database_name TEXT,
    size_bytes BIGINT
);

INSERT INTO db_size_history (database_name, size_bytes)
SELECT datname, pg_database_size(datname) FROM pg_database;

Monitoring Example (MySQL):

-- Database sizes
SELECT
    table_schema AS database_name,
    ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS size_mb
FROM information_schema.TABLES
GROUP BY table_schema
ORDER BY size_mb DESC;

-- Table sizes
SELECT
    table_name,
    ROUND(((data_length + index_length) / 1024 / 1024), 2) AS size_mb,
    ROUND((data_length / 1024 / 1024), 2) AS data_mb,
    ROUND((index_length / 1024 / 1024), 2) AS index_mb,
    table_rows
FROM information_schema.TABLES
WHERE table_schema = 'mydb'
ORDER BY (data_length + index_length) DESC
LIMIT 10;

4. Lock and Concurrency Metrics

Lock Statistics

  • Lock wait time
  • Lock wait count
  • Deadlock count
  • Blocking queries
  • Lock queue depth

Monitoring Example (PostgreSQL):

-- Current locks and blocking queries
SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_statement,
    blocking_activity.query AS blocking_statement,
    blocked_activity.application_name AS blocked_application,
    blocking_activity.application_name AS blocking_application
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Deadlock counts per database (details are written to the server log)
SELECT datname, deadlocks FROM pg_stat_database;

Monitoring Example (MySQL):

-- Current locks
SELECT * FROM performance_schema.data_locks;

-- Lock waits
SELECT * FROM performance_schema.data_lock_waits;

-- InnoDB status (includes deadlock information)
SHOW ENGINE INNODB STATUS;

-- Metadata locks
SELECT
    OBJECT_SCHEMA,
    OBJECT_NAME,
    LOCK_TYPE,
    LOCK_DURATION,
    LOCK_STATUS,
    OWNER_THREAD_ID,
    OWNER_EVENT_ID
FROM performance_schema.metadata_locks
WHERE OBJECT_SCHEMA NOT IN ('mysql', 'performance_schema', 'information_schema');

5. Replication Metrics

Replication Health

  • Replication lag
  • Replication errors
  • Replica status (running/stopped)
  • Last replicated transaction
  • Replication throughput

Monitoring Example (PostgreSQL Streaming Replication):

-- On primary: check replication status
SELECT
    client_addr,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) as write_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) as flush_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag
FROM pg_stat_replication;

-- On replica: check replication delay
SELECT
    now() - pg_last_xact_replay_timestamp() AS replication_delay,
    pg_is_in_recovery() as is_replica;

Monitoring Example (MySQL Replication):

-- Check replica status
SHOW REPLICA STATUS\G

-- Key fields to monitor:
-- Replica_IO_Running: Yes/No
-- Replica_SQL_Running: Yes/No
-- Seconds_Behind_Source: replication lag (Seconds_Behind_Master before MySQL 8.0.22)
-- Last_IO_Error / Last_SQL_Error: replication and SQL errors

-- Or using performance schema (MySQL 8.0+)
SELECT
    CHANNEL_NAME,
    SERVICE_STATE,
    LAST_ERROR_MESSAGE,
    LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;

SELECT
    CHANNEL_NAME,
    WORKER_ID,
    SERVICE_STATE,
    LAST_APPLIED_TRANSACTION,
    APPLYING_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;

6. Error and Event Metrics

Error Tracking

  • Connection errors
  • Query errors
  • Replication errors
  • Checkpoint failures
  • Corruption errors

Monitoring Example:

# PostgreSQL: monitor the server log.
# Configure in postgresql.conf:
log_min_messages = error
log_min_error_statement = error
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

# MySQL: monitor the error log.
# Configure in my.cnf:
log_error = /var/log/mysql/error.log
log_error_verbosity = 2

7. Maintenance Metrics

Vacuum/Maintenance Statistics (PostgreSQL)

-- Check when tables were last vacuumed/analyzed
SELECT
    schemaname,
    relname,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    n_dead_tup,
    n_live_tup,
    ROUND(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_tuple_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Monitor autovacuum activity
SELECT
    pid,
    now() - xact_start as duration,
    query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';

Table Optimization (MySQL)

-- Check table fragmentation
SELECT
    table_name,
    ROUND(data_length / 1024 / 1024, 2) AS data_mb,
    ROUND(data_free / 1024 / 1024, 2) AS free_mb,
    ROUND((data_free / (data_length + index_length + data_free)) * 100, 2) AS fragmentation_pct
FROM information_schema.TABLES
WHERE table_schema = 'mydb'
    AND data_free > 0
ORDER BY fragmentation_pct DESC;

Monitoring Strategies

1. Real-Time Monitoring

Purpose: Immediate detection of critical issues

Implementation:

  • Monitor availability every 10-30 seconds
  • Track active connections continuously
  • Alert on threshold breaches (CPU > 80%, disk > 90%)
  • Monitor long-running queries (> 30 seconds)
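The threshold checks above reduce to a small evaluation step that any collector can run on each sample. A sketch in Python (the metric names and limits mirror the bullets and are illustrative):

```python
# Hypothetical thresholds mirroring the bullets above; names are illustrative.
THRESHOLDS = {
    "cpu_pct": 80,
    "disk_used_pct": 90,
    "longest_query_seconds": 30,
}

def breached(sample: dict) -> list[str]:
    """Return the metrics in this sample that exceed their alert threshold."""
    return [name for name, limit in THRESHOLDS.items() if sample.get(name, 0) > limit]

sample = {"cpu_pct": 85, "disk_used_pct": 72, "longest_query_seconds": 45}
print(breached(sample))  # ['cpu_pct', 'longest_query_seconds']
```

Real systems add a "sustained for N minutes" condition on top of this (as Prometheus does with `for:`) so that a single noisy sample does not page anyone.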

Tools:

  • Prometheus + Grafana
  • Datadog
  • New Relic
  • Pulse (specialized for databases)

2. Historical Analysis

Purpose: Identify trends and patterns

Implementation:

  • Store metrics with appropriate retention (e.g., one year)
  • Sample high-frequency metrics (1-minute intervals)
  • Aggregate data for long-term storage
  • Create baseline performance profiles

Example Storage Strategy:

  • Raw data (1-min intervals): 7 days
  • Hourly aggregates: 90 days
  • Daily aggregates: 1 year
  • Monthly aggregates: 3 years
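The tiering above depends on downsampling: raw 1-minute points are collapsed into coarser min/avg/max rows before long-term storage. A minimal sketch of the hourly step, in Python (the function and sample data are illustrative):

```python
from collections import defaultdict
from datetime import datetime

def hourly_aggregates(samples):
    """Collapse (timestamp, value) 1-minute samples into per-hour min/avg/max rows."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate each timestamp to the start of its hour
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {
        hour: {"min": min(vs), "avg": sum(vs) / len(vs), "max": max(vs)}
        for hour, vs in buckets.items()
    }

samples = [
    (datetime(2025, 1, 1, 9, 0), 40.0),
    (datetime(2025, 1, 1, 9, 1), 60.0),
    (datetime(2025, 1, 1, 10, 0), 90.0),
]
agg = hourly_aggregates(samples)
print(agg[datetime(2025, 1, 1, 9)])  # {'min': 40.0, 'avg': 50.0, 'max': 60.0}
```

Keeping min and max alongside the average matters: an hourly average hides the spikes that usually explain an incident.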

3. Alerting Strategy

Alert Levels:

Critical (Immediate Response):

  • Database down/unreachable
  • Replication broken
  • Out of disk space (< 5%)
  • Connection pool exhausted
  • Data corruption detected

Warning (Within Hours):

  • High CPU usage (> 80% for 10 min)
  • Low disk space (< 20%)
  • Replication lag > 30 seconds
  • High number of slow queries
  • Buffer cache hit ratio < 95%

Info (Monitor):

  • Table bloat > 30%
  • Unused indexes
  • Query plan changes
  • Moderate resource usage

Alert Best Practices:

# Example alert configuration (Prometheus)
groups:
  - name: database_alerts
    rules:
      - alert: DatabaseDown
        expr: up{job="postgresql"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database {{ $labels.instance }} is down"
          description: "PostgreSQL instance has been down for more than 1 minute"

      - alert: HighConnectionUsage
        expr: (pg_stat_database_numbackends / pg_settings_max_connections) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection usage on {{ $labels.instance }}"
          description: "Connection usage is {{ $value }}%"

      - alert: ReplicationLag
        expr: pg_replication_lag_seconds > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag on {{ $labels.instance }}"
          description: "Replication lag is {{ $value }} seconds"

      - alert: SlowQueries
        expr: rate(pg_stat_statements_mean_exec_time[5m]) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries detected on {{ $labels.instance }}"
          description: "Average query time is {{ $value }}ms"

4. Capacity Planning

Track Growth Metrics:

-- PostgreSQL: Database growth rate
WITH size_samples AS (
    SELECT
        measured_at,
        database_name,
        size_bytes,
        LAG(size_bytes) OVER (PARTITION BY database_name ORDER BY measured_at) as prev_size,
        LAG(measured_at) OVER (PARTITION BY database_name ORDER BY measured_at) as prev_time
    FROM db_size_history
)
SELECT
    database_name,
    pg_size_pretty(size_bytes) as current_size,
    pg_size_pretty((size_bytes - prev_size)) as growth,
    EXTRACT(EPOCH FROM (measured_at - prev_time)) / 3600 as hours_elapsed,
    pg_size_pretty(
        ((size_bytes - prev_size) * 3600 * 24 * 30 /
         EXTRACT(EPOCH FROM (measured_at - prev_time)))::bigint
    ) as estimated_monthly_growth
FROM size_samples
WHERE prev_size IS NOT NULL
ORDER BY database_name, measured_at DESC;

Forecast Future Needs:

  • Disk space needed in 3/6/12 months
  • Connection pool sizing
  • Memory requirements
  • CPU requirements
  • When to scale up or out
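A straight-line projection is usually enough for a first forecast. A sketch in Python, assuming growth stays linear (a planning heuristic, not a guarantee; the function name and figures are illustrative):

```python
def months_until_full(current_bytes, monthly_growth_bytes, capacity_bytes):
    """Estimate months of headroom left, assuming linear growth."""
    if monthly_growth_bytes <= 0:
        return float("inf")  # shrinking or flat: no exhaustion forecast
    return (capacity_bytes - current_bytes) / monthly_growth_bytes

GB = 1024 ** 3
# 400 GB used, growing 25 GB/month, 1 TB volume
print(months_until_full(400 * GB, 25 * GB, 1000 * GB))  # 24.0
```

The monthly growth figure would come from the `db_size_history` sampling shown earlier; rerun the forecast quarterly, since growth rates rarely stay linear for long.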

Monitoring Tools Comparison

Open Source Tools

Prometheus + Grafana

  • Pros: Flexible, powerful, large ecosystem
  • Cons: Requires setup and maintenance
  • Best for: Teams with DevOps resources

Postgres Exporter / MySQL Exporter

  • Integrates with Prometheus
  • Exports database metrics
  • Customizable queries

pg_stat_statements (PostgreSQL)

  • Built-in query statistics
  • Low overhead
  • Essential for query optimization

Performance Schema (MySQL)

  • Built-in performance monitoring
  • Detailed instrumentation
  • Can impact performance if over-used

Percona Monitoring and Management (PMM)

  • Pros: Free, database-focused, query analytics
  • Cons: Resource intensive
  • Best for: MySQL, PostgreSQL, MongoDB

Commercial Tools

Datadog

  • Pros: Easy setup, comprehensive, great UI
  • Cons: Can be expensive
  • Best for: Organizations wanting full-stack observability

New Relic

  • Pros: APM integration, AI insights
  • Cons: Pricing model
  • Best for: Application performance correlation

Pulse

  • Pros: Specialized for databases, query insights, anomaly detection
  • Cons: Database-focused only
  • Best for: Teams focused on database performance

SolarWinds Database Performance Analyzer

  • Pros: Deep database insights, wait analysis
  • Cons: Complex setup
  • Best for: Enterprise environments

Best Practices Checklist

Setup and Configuration

  • Enable query statistics collection (pg_stat_statements, performance_schema)
  • Configure slow query logging with appropriate thresholds
  • Set up connection monitoring
  • Enable checkpoint/vacuum logging
  • Configure log rotation and retention
  • Set up automated backups with monitoring
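For the logging items above, these are the usual knobs (the thresholds shown are illustrative starting points, not recommendations):

```ini
# postgresql.conf -- log statements slower than 500 ms, plus maintenance activity
log_min_duration_statement = 500
log_checkpoints = on
log_autovacuum_min_duration = 0

# my.cnf -- MySQL slow query log, 0.5 s threshold
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
```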

Metrics Collection

  • Monitor all key metrics (availability, performance, resources)
  • Use appropriate sampling intervals (balance granularity vs storage)
  • Set up historical data retention policy
  • Create dashboards for different audiences (DBAs, developers, ops)
  • Document metric baselines and SLAs

Alerting

  • Define clear alert thresholds based on baselines
  • Implement alert fatigue prevention (grouping, throttling)
  • Set up escalation policies
  • Configure multiple notification channels
  • Test alert delivery regularly
  • Document alert response procedures

Security and Compliance

  • Monitor failed login attempts
  • Track privilege escalations
  • Log schema changes
  • Monitor data access patterns
  • Audit sensitive data queries
  • Comply with data retention policies

Performance Optimization

  • Review slow queries weekly
  • Monitor index usage and effectiveness
  • Track query plan changes
  • Analyze cache hit ratios
  • Monitor table bloat and fragmentation
  • Review and optimize configuration regularly

Capacity Planning

  • Track growth trends (storage, connections, queries)
  • Forecast resource needs quarterly
  • Monitor resource utilization trends
  • Plan for peak load capacity
  • Test scaling strategies

Team and Process

  • Assign monitoring responsibilities
  • Create runbooks for common issues
  • Conduct regular performance reviews
  • Share metrics with stakeholders
  • Continuous improvement based on incidents
  • Train team on monitoring tools

Common Monitoring Pitfalls

1. Monitoring Too Much

Problem: Collecting every possible metric
Impact: Storage costs, analysis paralysis, alert fatigue
Solution: Focus on metrics that drive decisions

2. Not Monitoring Enough

Problem: Missing critical metrics
Impact: Undetected issues, poor performance
Solution: Cover all categories: availability, performance, resources, errors

3. Poor Alert Configuration

Problem: Too many false positives or missing real issues
Impact: Alert fatigue or missed incidents
Solution: Tune thresholds based on baselines, use composite alerts

4. Ignoring Trends

Problem: Only reactive monitoring
Impact: Preventable outages, capacity issues
Solution: Regular trend analysis, capacity planning

5. No Action on Data

Problem: Collecting data but not using it
Impact: Wasted resources, no improvement
Solution: Regular review meetings, automated actions where possible

6. Lack of Context

Problem: Metrics without application context
Impact: Difficulty diagnosing issues
Solution: Correlate database metrics with application metrics


Implement database monitoring best practices effortlessly with Pulse. Get automated insights, intelligent alerting, and comprehensive visibility into your database performance.
