The Elasticsearch hot threads API provides insights into which threads are consuming the most CPU time. This is essential for diagnosing performance issues, identifying runaway queries, and understanding cluster resource usage.
Understanding Hot Threads
What Are Hot Threads?
Hot threads are the threads currently consuming the most CPU cycles. Analyzing them helps identify:
- Expensive queries
- Indexing bottlenecks
- Garbage collection issues
- Segment merge operations
- Network or transport problems
Using the Hot Threads API
Basic Request
GET /_nodes/hot_threads
With Parameters
GET /_nodes/hot_threads?threads=10&interval=500ms&type=cpu
Parameters Explained
| Parameter | Default | Description |
|---|---|---|
| `threads` | 3 | Number of hot threads to return |
| `interval` | 500ms | Sampling interval |
| `type` | `cpu` | Sampling type: `cpu`, `wait`, or `block` |
| `snapshots` | 10 | Number of samples to take |
| `ignore_idle_threads` | true | Skip idle threads |
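For example, a longer interval with more snapshots smooths out short CPU spikes at the cost of a slower response (the parameter values here are illustrative):
GET /_nodes/hot_threads?threads=5&interval=1s&snapshots=20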
Per-Node Request
Analyze specific nodes:
GET /_nodes/node_name/hot_threads?threads=10
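Node filters also accept comma-separated names, wildcards, and role filters such as `master:false` (node names here are illustrative):
GET /_nodes/data-node-1,data-node-2/hot_threads?threads=10
GET /_nodes/master:false/hot_threads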
Interpreting Hot Threads Output
Sample Output
::: {node_name}{node_id}{host}{ip}
Hot threads at 2024-01-15T10:30:00.000Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   33.3% (166.4ms out of 500ms) cpu usage by thread 'elasticsearch[node_name][search][T#1]'
     5/10 snapshots sharing following 15 elements
       java.base@11.0.11/java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3963)
       java.base@11.0.11/java.util.regex.Pattern$Branch.match(Pattern.java:4766)
       org.apache.lucene.util.automaton.RegExp.parseUnionExp(RegExp.java:509)
       ...
   22.1% (110.5ms out of 500ms) cpu usage by thread 'elasticsearch[node_name][write][T#2]'
     7/10 snapshots sharing following 12 elements
       org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.flush(...)
       ...
Understanding the Output
- Thread name: Shows the thread pool (`search`, `write`, `generic`, etc.)
- CPU percentage: Portion of the sampling interval spent executing
- Stack trace: What the thread is doing
- Snapshots: How many samples showed this stack
Common Hot Thread Patterns
Pattern 1: Search Threads Busy
cpu usage by thread 'elasticsearch[...][search][T#...]'
org.apache.lucene.search.*
Meaning: Active search queries consuming CPU
Investigation:
- Check slow query logs
- Review active tasks: `GET /_tasks?actions=*search*`
- Look for expensive query patterns (regex, wildcards)
Solutions:
- Optimize queries
- Add query timeouts
- Scale search capacity
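A per-request timeout caps how long a search may run. It is best-effort (checked between collection steps), so a single expensive operation can still overrun it. Index and field names here are illustrative:
GET /my-index/_search
{
  "timeout": "5s",
  "query": {
    "match": { "message": "error" }
  }
}
Results collected before the cutoff are returned with `timed_out: true` in the response.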
Pattern 2: Regex or Wildcard Queries
cpu usage by thread 'elasticsearch[...][search][T#...]'
java.util.regex.Pattern*
org.apache.lucene.util.automaton.RegExp*
Meaning: Expensive regex or wildcard query running
Solutions:
- Avoid leading wildcards
- Use simpler patterns
- Consider ngram indexing
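One ngram approach is to index the field with a trigram analyzer, turning substring lookups into cheap term matches instead of index-wide scans. A minimal sketch (index and field names are illustrative; ngram indexing increases index size):
PUT /my-index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "trigrams": { "type": "ngram", "min_gram": 3, "max_gram": 3 }
      },
      "analyzer": {
        "trigram_analyzer": { "tokenizer": "trigrams", "filter": ["lowercase"] }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": { "type": "text", "analyzer": "trigram_analyzer" }
    }
  }
}
A plain `match` query on `message` then finds substrings without regex or wildcard scanning.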
Pattern 3: Index Writer / Merging
cpu usage by thread 'elasticsearch[...][generic][T#...]'
org.apache.lucene.index.IndexWriter*
org.apache.lucene.codecs*
Meaning: Segment merging or flushing
Investigation:
GET /_cat/segments?v
GET /_nodes/stats/indices/merges
Solutions:
- Increase refresh interval
- Tune merge policy
- Schedule force merge during off-peak
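`refresh_interval` is a dynamic setting, so it can be raised during heavy indexing and restored afterwards (index name illustrative):
PUT /my-index/_settings
{
  "index": { "refresh_interval": "30s" }
}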
Pattern 4: Garbage Collection
cpu usage by thread 'GC Thread#...'
<no stack trace or JVM native code>
Meaning: JVM garbage collection consuming CPU
Investigation:
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.gc
Solutions:
- Reduce heap pressure
- Tune GC settings
- Add more nodes
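Sustained heap usage near the configured maximum usually means frequent collections; a quick check across nodes:
GET /_cat/nodes?v&h=name,heap.percent,heap.max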
Pattern 5: Transport/Network Operations
cpu usage by thread 'elasticsearch[...][transport_worker][T#...]'
org.elasticsearch.transport*
io.netty*
Meaning: Network communication overhead
Investigation:
- Check network latency between nodes
- Look for oversized requests
Solutions:
- Reduce bulk request sizes
- Ensure low-latency network
- Use dedicated transport network
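Transport statistics help confirm this pattern; payload sizes (`rx_size_in_bytes`, `tx_size_in_bytes`) that are large relative to request counts point at oversized messages:
GET /_nodes/stats/transport?filter_path=nodes.*.transport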
Pattern 6: Aggregations
cpu usage by thread 'elasticsearch[...][search][T#...]'
org.elasticsearch.search.aggregations*
org.apache.lucene.search.FieldComparator*
Meaning: Aggregation computation
Solutions:
- Reduce aggregation bucket sizes
- Use `composite` aggregations for high-cardinality fields (see the sketch below)
- Ensure `doc_values` are enabled on aggregated fields
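A `composite` aggregation pages through buckets in a deterministic order instead of building them all in one request; a minimal sketch (index and field names are illustrative):
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "users": {
      "composite": {
        "size": 1000,
        "sources": [
          { "user": { "terms": { "field": "user_id" } } }
        ]
      }
    }
  }
}
Fetch subsequent pages by passing the response's `after_key` back as the `after` parameter.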
Hot Threads Analysis Workflow
Step 1: Capture Hot Threads During Issue
# Capture ten snapshots, 30 seconds apart, while the issue is occurring
for i in {1..10}; do
  curl -s "localhost:9200/_nodes/hot_threads?threads=10" > "hot_threads_$(date +%s).txt"
  sleep 30
done
Step 2: Identify Pattern
Look for:
- Same thread pool consistently busy
- Same stack traces repeating
- Correlation with specific operations
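A quick way to surface recurring offenders across the captured files (file names follow the capture script above):
grep -h "cpu usage by thread" hot_threads_*.txt | awk '{print $NF}' | sort | uniq -c | sort -rn | head
This counts how often each thread name appears among the hottest threads.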
Step 3: Correlate with Other Metrics
GET /_nodes/stats/thread_pool
GET /_tasks?detailed=true
GET /_cat/nodes?v&h=name,cpu,load_1m
Step 4: Take Action
Based on the pattern:
- Search threads: Optimize queries, add capacity
- Write threads: Tune indexing, increase workers
- Generic threads: Check merge policy, reduce segments
- GC threads: Address memory pressure
Comparing Thread Types
CPU Type (Default)
GET /_nodes/hot_threads?type=cpu
Shows threads actively using CPU.
Wait Type
GET /_nodes/hot_threads?type=wait
Shows threads waiting (e.g., for I/O, locks).
Block Type
GET /_nodes/hot_threads?type=block
Shows blocked threads (waiting for monitors/locks).
Best Practices
Regular Monitoring
- Capture hot threads during normal operation for a baseline (see the cron sketch below)
- Set up automated capture during performance alerts
- Keep historical data for comparison
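A minimal cron sketch for baseline capture (schedule and paths are assumptions; note that `%` must be escaped in crontab entries):
# Hourly baseline snapshot (hypothetical path)
0 * * * * curl -s "localhost:9200/_nodes/hot_threads?threads=5" > /var/log/elasticsearch/hot_threads_$(date +\%Y\%m\%d\%H).txt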
Analysis Tips
- Compare hot threads across nodes
- Look at consistency of patterns
- Correlate with application events
- Focus on highest percentage threads first
Documentation
Keep notes on observed patterns:
- What workload causes this pattern?
- What was the resolution?
- Are there recurring patterns?