The Custom Chart debug page in the DB Console lets you create one or multiple custom charts showing any combination of available metrics.
The definition of the customized dashboard is encoded in the URL. To share the dashboard with someone, send them the URL. Like any other URL, it can be bookmarked, sit in a pinned tab in your browser, etc.
To view the Custom Chart page, access the DB Console, click Advanced Debug In the left-hand navigation bar, and in the Reports section select Custom Time Series Chart.
Use the Custom Chart page
On the Custom Chart page, you can set the time span for all charts, add new custom charts, and customize each chart:
To set the time span for the page, use the dropdown menu above the charts and select the desired time span. In addition, once you have selected a metric to display, you can drag within the chart itself to a set a custom time range.
To add a chart, click Add Chart and customize the new chart.
To customize each chart, use the Units dropdown menu to set the units to display. Then use the table below the chart to select the metrics being queried, and how they'll be combined and displayed. Options include:
Column Description Metric Name How the system refers to this metric, e.g., sql.bytesin
.Downsampler The "Downsampler" operation is used to combine the individual datapoints over the longer period into a single datapoint. We store one data point every ten seconds, but for queries over long time spans the backend lowers the resolution of the returned data, perhaps only returning one data point for every minute, five minutes, or even an entire hour in the case of the 30 day view.
Options:
- AVG: Returns the average value over the time period.
- MIN: Returns the lowest value seen.
- MAX: Returns the highest value seen.
- SUM: Returns the sum of all values seen.
Aggregator Used to combine data points from different nodes. It has the same operations available as the Downsampler.
Options:
- AVG: Returns the average value over the time period.
- MIN: Returns the lowest value seen.
- MAX: Returns the highest value seen.
- SUM: Returns the sum of all values seen.
Rate Determines how to display the rate of change during the selected time period.
Options:
- Normal: Returns the actual recorded value.
- Rate: Returns the rate of change of the value per second.
- Non-negative Rate: Returns the rate-of-change, but returns 0 instead of negative values. A large number of the stats we track are actually tracked as monotonically increasing counters so each sample is just the total value of that counter. The rate of change of that counter represents the rate of events being counted, which is usually what you want to graph. "Non-negative Rate" is needed because the counters are stored in memory, and thus if a node resets it goes back to zero (whereas normally they only increase).
Source The set of nodes being queried, which is either: - The entire cluster.
- A single, named node.
Per Node If checked, the chart will show a line for each node's value of this metric.
Examples
Query user and system CPU usage
To compare system vs. userspace CPU usage, select the following values under Metric Name:
sys.cpu.sys.percent
sys.cpu.user.percent
The Y-axis label is the Count. A count of 1 represents 100% utilization. The Aggregator of Sum can show the count to be above 1, which would mean CPU utilization is greater than 100%.
Checking Per Node displays statistics for each node, which could show whether an individual node's CPU usage was higher or lower than the average.
Available metrics
Select your CockroachDB deployment to see the metrics available:
This list is taken directly from the source code and is subject to change. Some of the metrics listed below are already visible in other areas of the DB Console.
Name | Description |
---|---|
addsstable.applications |
Number of SSTable ingestions applied (i.e., applied by Replicas) |
addsstable.copies |
Number of SSTable ingestions that required copying files during application |
addsstable.proposals |
Number of SSTable ingestions proposed (i.e., sent to Raft by lease holders) |
build.timestamp |
Build information |
capacity.available |
Available storage capacity |
capacity.reserved |
Capacity reserved for snapshots |
capacity.used |
Used storage capacity |
capacity |
Total storage capacity |
changefeed.failures |
Total number of changefeed jobs which have failed |
changefeed.running |
Number of currently running changefeeds, including sinkless |
clock-offset.meannanos |
Mean clock offset with other nodes in nanoseconds |
clock-offset.stddevnanos |
Std dev clock offset with other nodes in nanoseconds |
cluster.preserve-downgrade-option.last-updated |
Unix timestamp of last updated time for cluster.preserve_downgrade_option |
compactor.compactingnanos |
Number of nanoseconds spent compacting ranges |
compactor.compactions.failure |
Number of failed compaction requests sent to the storage engine |
compactor.compactions.success |
Number of successful compaction requests sent to the storage engine |
compactor.suggestionbytes.compacted |
Number of logical bytes compacted from suggested compactions |
compactor.suggestionbytes.queued |
Number of logical bytes in suggested compactions in the queue |
compactor.suggestionbytes.skipped |
Number of logical bytes in suggested compactions which were not compacted |
distsender.batches.partial |
Number of partial batches processed |
distsender.batches |
Number of batches processed |
distsender.errors.notleaseholder |
Number of NotLeaseHolderErrors encountered |
distsender.rpc.sent.local |
Number of local RPCs sent |
distsender.rpc.sent.nextreplicaerror |
Number of RPCs sent due to per-replica errors |
distsender.rpc.sent |
Number of RPCs sent |
exec.error |
Number of batch KV requests that failed to execute on this node |
exec.latency |
Latency in nanoseconds of batch KV requests executed on this node |
exec.success |
Number of batch KV requests executed successfully on this node |
gcbytesage |
Cumulative age of non-live data in seconds |
gossip.bytes.received |
Number of received gossip bytes |
gossip.bytes.sent |
Number of sent gossip bytes |
gossip.connections.incoming |
Number of active incoming gossip connections |
gossip.connections.outgoing |
Number of active outgoing gossip connections |
gossip.connections.refused |
Number of refused incoming gossip connections |
gossip.infos.received |
Number of received gossip Info objects |
gossip.infos.sent |
Number of sent gossip Info objects |
intentage |
Cumulative age of intents in seconds |
intentbytes |
Number of bytes in intent KV pairs |
intentcount |
Count of intent keys |
jobs.row_level_ttl.num_active_spans |
Number of active spans the TTL job is deleting from |
jobs.row_level_ttl.span_total_duration |
Duration for processing a span during row level TTL |
keybytes |
Number of bytes taken up by keys |
keycount |
Count of all keys |
lastupdatenanos |
Time in nanoseconds since Unix epoch at which bytes/keys/intents metrics were last updated |
leases.epoch |
Number of replica leaseholders using epoch-based leases |
leases.error |
Number of failed lease requests |
leases.expiration |
Number of replica leaseholders using expiration-based leases |
leases.success |
Number of successful lease requests |
leases.transfers.error |
Number of failed lease transfers |
leases.transfers.success |
Number of successful lease transfers |
livebytes |
Number of bytes of live data (keys plus values), including unreplicated data |
livecount |
Count of live keys |
liveness.epochincrements |
Number of times this node has incremented its liveness epoch |
liveness.heartbeatfailures |
Number of failed node liveness heartbeats from this node |
liveness.heartbeatlatency |
Node liveness heartbeat latency in nanoseconds |
liveness.heartbeatsuccesses |
Number of successful node liveness heartbeats from this node |
liveness.livenodes |
Number of live nodes in the cluster (will be 0 if this node is not itself live) |
node-id |
node ID with labels for advertised RPC and HTTP addresses |
queue.consistency.pending |
Number of pending replicas in the consistency checker queue |
queue.consistency.process.failure |
Number of replicas which failed processing in the consistency checker queue |
queue.consistency.process.success |
Number of replicas successfully processed by the consistency checker queue |
queue.consistency.processingnanos |
Nanoseconds spent processing replicas in the consistency checker queue |
queue.gc.info.abortspanconsidered |
Number of AbortSpan entries eligible for removal based on their ages |
queue.gc.info.abortspangcnum |
Number of AbortSpan entries fit for removal |
queue.gc.info.abortspanscanned |
Number of transactions present in the AbortSpan scanned from the engine |
queue.gc.info.intentsconsidered |
Number of intents eligible to be considered because they are at least two hours old |
queue.gc.info.intenttxns |
Number of associated distinct transactions |
queue.gc.info.numkeysaffected |
Number of keys with GC'able data |
queue.gc.info.pushtxn |
Number of attempted pushes |
queue.gc.info.resolvesuccess |
Number of successful intent resolutions |
queue.gc.info.resolvetotal |
Number of attempted intent resolutions |
queue.gc.info.transactionspangcaborted |
Number of GC'able entries corresponding to aborted txns |
queue.gc.info.transactionspangccommitted |
Number of GC'able entries corresponding to committed txns |
queue.gc.info.transactionspangcpending |
Number of GC'able entries corresponding to pending txns |
queue.gc.info.transactionspanscanned |
Number of entries in transaction spans scanned from the engine |
queue.gc.pending |
Number of pending replicas in the GC queue |
queue.gc.process.failure |
Number of replicas which failed processing in the GC queue |
queue.gc.process.success |
Number of replicas successfully processed by the GC queue |
queue.gc.processingnanos |
Nanoseconds spent processing replicas in the GC queue |
queue.raftlog.pending |
Number of pending replicas in the Raft log queue |
queue.raftlog.process.failure |
Number of replicas which failed processing in the Raft log queue |
queue.raftlog.process.success |
Number of replicas successfully processed by the Raft log queue |
queue.raftlog.processingnanos |
Nanoseconds spent processing replicas in the Raft log queue |
queue.raftsnapshot.pending |
Number of pending replicas in the Raft repair queue |
queue.raftsnapshot.process.failure |
Number of replicas which failed processing in the Raft repair queue |
queue.raftsnapshot.process.success |
Number of replicas successfully processed by the Raft repair queue |
queue.raftsnapshot.processingnanos |
Nanoseconds spent processing replicas in the Raft repair queue |
queue.replicagc.pending |
Number of pending replicas in the replica GC queue |
queue.replicagc.process.failure |
Number of replicas which failed processing in the replica GC queue |
queue.replicagc.process.success |
Number of replicas successfully processed by the replica GC queue |
queue.replicagc.processingnanos |
Nanoseconds spent processing replicas in the replica GC queue |
queue.replicagc.removereplica |
Number of replica removals attempted by the replica gc queue |
queue.replicate.addreplica |
Number of replica additions attempted by the replicate queue |
queue.replicate.addreplica.error |
Number of failed replica additions processed by the replicate queue |
queue.replicate.addreplica.success |
Number of successful replica additions processed by the replicate queue |
queue.replicate.pending |
Number of pending replicas in the replicate queue |
queue.replicate.process.failure |
Number of replicas which failed processing in the replicate queue |
queue.replicate.process.success |
Number of replicas successfully processed by the replicate queue |
queue.replicate.processingnanos |
Nanoseconds spent processing replicas in the replicate queue |
queue.replicate.purgatory |
Number of replicas in the replicate queue's purgatory, awaiting allocation options |
queue.replicate.rebalancereplica |
Number of replica rebalancer-initiated additions attempted by the replicate queue |
queue.replicate.removedeadreplica |
Number of dead replica removals attempted by the replicate queue (typically in response to a node outage) |
queue.replicate.removedeadreplica.error |
Number of failed dead replica removals processed by the replicate queue |
queue.replicate.removedeadreplica.success |
Number of successful dead replica removals processed by the replicate queue |
queue.replicate.removedecommissioningreplica.error |
Number of failed decommissioning replica removals processed by the replicate queue |
queue.replicate.removedecommissioningreplica.success |
Number of successful decommissioning replica removals processed by the replicate queue |
queue.replicate.removereplica |
Number of replica removals attempted by the replicate queue (typically in response to a rebalancer-initiated addition) |
queue.replicate.removereplica.error |
Number of failed replica removals processed by the replicate queue |
queue.replicate.removereplica.success |
Number of successful replica removals processed by the replicate queue |
queue.replicate.replacedeadreplica.error |
Number of failed dead replica replacements processed by the replicate queue |
queue.replicate.replacedeadreplica.success |
Number of successful dead replica replacements processed by the replicate queue |
queue.replicate.replacedecommissioningreplica.error |
Number of failed decommissioning replica replacements processed by the replicate queue |
queue.replicate.replacedecommissioningreplica.success |
Number of successful decommissioning replica replacements processed by the replicate queue |
queue.replicate.transferlease |
Number of range lease transfers attempted by the replicate queue |
queue.split.pending |
Number of pending replicas in the split queue |
queue.split.process.failure |
Number of replicas which failed processing in the split queue |
queue.split.process.success |
Number of replicas successfully processed by the split queue |
queue.split.processingnanos |
Nanoseconds spent processing replicas in the split queue |
queue.tsmaintenance.pending |
Number of pending replicas in the time series maintenance queue |
queue.tsmaintenance.process.failure |
Number of replicas which failed processing in the time series maintenance queue |
queue.tsmaintenance.process.success |
Number of replicas successfully processed by the time series maintenance queue |
queue.tsmaintenance.processingnanos |
Nanoseconds spent processing replicas in the time series maintenance queue |
raft.commandsapplied |
Count of Raft commands applied |
raft.enqueued.pending |
Number of pending outgoing messages in the Raft Transport queue |
raft.heartbeats.pending |
Number of pending heartbeats and responses waiting to be coalesced |
raft.process.commandcommit.latency |
Latency histogram in nanoseconds for committing Raft commands |
raft.process.logcommit.latency |
Latency histogram in nanoseconds for committing Raft log entries |
raft.process.tickingnanos |
Nanoseconds spent in store.processRaft() processing replica.Tick() |
raft.process.workingnanos |
Nanoseconds spent in store.processRaft() working |
raft.rcvd.app |
Number of MsgApp messages received by this store |
raft.rcvd.appresp |
Number of MsgAppResp messages received by this store |
raft.rcvd.dropped |
Number of dropped incoming Raft messages |
raft.rcvd.heartbeat |
Number of (coalesced, if enabled) MsgHeartbeat messages received by this store |
raft.rcvd.heartbeatresp |
Number of (coalesced, if enabled) MsgHeartbeatResp messages received by this store |
raft.rcvd.prevote |
Number of MsgPreVote messages received by this store |
raft.rcvd.prevoteresp |
Number of MsgPreVoteResp messages received by this store |
raft.rcvd.prop |
Number of MsgProp messages received by this store |
raft.rcvd.snap |
Number of MsgSnap messages received by this store |
raft.rcvd.timeoutnow |
Number of MsgTimeoutNow messages received by this store |
raft.rcvd.transferleader |
Number of MsgTransferLeader messages received by this store |
raft.rcvd.vote |
Number of MsgVote messages received by this store |
raft.rcvd.voteresp |
Number of MsgVoteResp messages received by this store |
raft.ticks |
Number of Raft ticks queued |
raftlog.behind |
Number of Raft log entries followers on other stores are behind |
raftlog.truncated |
Number of Raft log entries truncated |
range.adds |
Number of range additions |
range.raftleadertransfers |
Number of Raft leader transfers |
range.removes |
Number of range removals |
range.snapshots.recv-in-progress |
Number of non-empty snapshots in progress on a receiver store |
range.snapshots.recv-queue |
Number of queued non-empty snapshots on a receiver store |
range.snapshots.recv-total-in-progress |
Number of empty and non-empty snapshots in progress on a receiver store |
range.snapshots.send-in-progress |
Number of non-empty snapshots in progress on a sender store |
range.snapshots.send-queue |
Number of queued non-empty snapshots on a sender store |
range.snapshots.send-total-in-progress |
Number of empty and non-empty in-progress snapshots on a sender store |
range.snapshots.generated |
Number of generated snapshots |
range.snapshots.normal-applied |
Number of applied snapshots |
range.snapshots.preemptive-applied |
Number of applied preemptive snapshots |
range.snapshots.rcvd-bytes |
Number of snapshot bytes received |
range.snapshots.rebalancing.rcvd-bytes |
Number of rebalancing snapshot bytes received |
range.snapshots.rebalancing.sent-bytes |
Number of rebalancing snapshot bytes sent |
range.snapshots.recovery.rcvd-bytes |
Number of recovery snapshot bytes received |
range.snapshots.recovery.sent-bytes |
Number of recovery snapshot bytes sent |
range.snapshots.recv-in-progress |
Number of non-empty snapshots being received |
range.snapshots.recv-queue |
Number of snapshots queued to receive |
range.snapshots.recv-total-in-progress |
Number of total snapshots being received |
range.snapshots.send-in-progress |
Number of non-empty snapshots being sent |
range.snapshots.send-queue |
Number of snapshots queued to send |
range.snapshots.send-total-in-progress |
Number of total snapshots being sent |
range.snapshots.sent-bytes |
Number of snapshot bytes sent |
range.snapshots.unknown.rcvd-bytes |
Number of unknown snapshot bytes received |
range.snapshots.unknown.sent-bytes |
Number of unknown snapshot bytes sent |
range.splits |
Number of range splits |
rangekeybytes |
Number of bytes taken up by range keys (e.g., MVCC range tombstones) |
rangekeycount |
Count of all range keys (e.g., MVCC range tombstones) |
ranges.unavailable |
Number of ranges with fewer live replicas than needed for quorum |
ranges.underreplicated |
Number of ranges with fewer live replicas than the replication target |
ranges |
Number of ranges |
rangevalbytes |
Number of bytes taken up by range key values (e.g., MVCC range tombstones) |
rangevalcount |
Count of all range key values (e.g., MVCC range tombstones) |
rebalancing.readbytespersecond |
Average number of bytes written recently per second |
rebalancing.readspersecond |
Average number of keys read recently per second |
rebalancing.requestspersecond |
Average number of requests received recently per second |
rebalancing.writebytespersecond |
Average number of bytes read recently per second |
rebalancing.writespersecond |
Number of keys written (i.e., applied by Raft) per second to the store, averaged over a large time period as used in rebalancing decisions |
replicas.commandqueue.combinedqueuesize |
Number of commands in all CommandQueues combined |
replicas.commandqueue.combinedreadcount |
Number of read-only commands in all CommandQueues combined |
replicas.commandqueue.combinedwritecount |
Number of read-write commands in all CommandQueues combined |
replicas.commandqueue.maxoverlaps |
Largest number of overlapping commands seen when adding to any CommandQueue |
replicas.commandqueue.maxreadcount |
Largest number of read-only commands in any CommandQueue |
replicas.commandqueue.maxsize |
Largest number of commands in any CommandQueue |
replicas.commandqueue.maxtreesize |
Largest number of intervals in any CommandQueue's interval tree |
replicas.commandqueue.maxwritecount |
Largest number of read-write commands in any CommandQueue |
replicas.leaders_invalid_lease |
Number of replicas that are Raft leaders whose lease is invalid |
replicas.leaders_not_leaseholders |
Number of replicas that are Raft leaders whose range lease is held by another store |
replicas.leaders |
Number of Raft leaders |
replicas.leaseholders |
Number of lease holders |
replicas.quiescent |
Number of quiesced replicas |
replicas.reserved |
Number of replicas reserved for snapshots |
replicas |
Number of replicas |
requests.backpressure.split |
Number of backpressured writes waiting on a Range split |
requests.slow.commandqueue |
Number of requests that have been stuck for a long time in the command queue |
requests.slow.distsender |
Number of requests that have been stuck for a long time in the dist sender |
requests.slow.lease |
Number of requests that have been stuck for a long time acquiring a lease |
requests.slow.raft |
Number of requests that have been stuck for a long time in Raft |
rocksdb.block.cache.hits |
Count of block cache hits |
rocksdb.block.cache.misses |
Count of block cache misses |
rocksdb.block.cache.pinned-usage |
Bytes pinned by the block cache |
rocksdb.block.cache.usage |
Bytes used by the block cache |
rocksdb.bloom.filter.prefix.checked |
Number of times the bloom filter was checked |
rocksdb.bloom.filter.prefix.useful |
Number of times the bloom filter helped avoid iterator creation |
rocksdb.compactions |
Number of table compactions |
rocksdb.flushes |
Number of table flushes |
rocksdb.memtable.total-size |
Current size of memtable in bytes |
rocksdb.num-sstables |
Number of storage engine SSTables |
rocksdb.read-amplification |
Number of disk reads per query |
rocksdb.table-readers-mem-estimate |
Memory used by index and filter blocks |
round-trip-latency |
Distribution of round-trip latencies with other nodes in nanoseconds |
security.certificate.expiration.ca |
Expiration timestamp in seconds since Unix epoch for the CA certificate. 0 means no certificate or error. |
security.certificate.expiration.node |
Expiration timestamp in seconds since Unix epoch for the node certificate. 0 means no certificate or error. |
sql.bytesin |
Number of sql bytes received |
sql.bytesout |
Number of sql bytes sent |
sql.conns |
Number of active sql connections |
sql.ddl.count |
Number of SQL DDL statements |
sql.delete.count |
Number of SQL DELETE statements |
sql.distsql.exec.latency |
Latency in nanoseconds of SQL statement executions running on the distributed execution engine. This metric does not include the time to parse and plan the statement. |
sql.distsql.flows.active |
Number of distributed SQL flows currently active |
sql.distsql.flows.total |
Number of distributed SQL flows executed |
sql.distsql.queries.active |
Number of distributed SQL queries currently active |
sql.distsql.queries.total |
Number of distributed SQL queries executed |
sql.distsql.select.count |
Number of DistSQL SELECT statements |
sql.distsql.service.latency |
Latency in nanoseconds of SQL statement executions running on the distributed execution engine, including the time to parse and plan the statement. |
sql.exec.latency |
Latency in nanoseconds of all SQL statement executions. This metric does not include the time to parse and plan the statement. |
sql.guardrails.max_row_size_err.count |
Number of times a large row violates the corresponding sql.guardrails.max_row_size_err limit. |
sql.guardrails.max_row_size_log.count |
Number of times a large row violates the corresponding sql.guardrails.max_row_size_log limit. |
sql.insert.count |
Number of SQL INSERT statements |
sql.mem.current |
Current sql statement memory usage |
sql.mem.distsql.current |
Current sql statement memory usage for distsql |
sql.mem.distsql.max |
Memory usage per sql statement for distsql |
sql.mem.max |
Memory usage per sql statement |
sql.mem.session.current |
Current sql session memory usage |
sql.mem.session.max |
Memory usage per sql session |
sql.mem.txn.current |
Current sql transaction memory usage |
sql.mem.txn.max |
Memory usage per sql transaction |
sql.misc.count |
Number of other SQL statements |
sql.pgwire_cancel.total |
Counter of the number of pgwire query cancel requests |
sql.pgwire_cancel.ignored |
Counter of the number of pgwire query cancel requests that were ignored due to rate limiting |
sql.pgwire_cancel.successful |
Counter of the number of pgwire query cancel requests that were successful |
sql.query.count |
Number of SQL queries |
sql.select.count |
Number of SQL SELECT statements |
sql.service.latency |
Latency in nanoseconds of SQL request execution, including the time to parse and plan the statement. |
sql.txn.abort.count |
Number of SQL transaction ABORT statements |
sql.txn.begin.count |
Number of SQL transaction BEGIN statements |
sql.txn.commit.count |
Number of SQL transaction COMMIT statements |
sql.txn.contended.count |
Number of SQL transactions that experienced contention |
sql.txn.rollback.count |
Number of SQL transaction ROLLBACK statements |
sql.update.count |
Number of SQL UPDATE statements |
storage.l0-level-score |
Compaction score of level 0 |
storage.l1-level-score |
Compaction score of level 1 |
storage.l2-level-score |
Compaction score of level 2 |
storage.l3-level-score |
Compaction score of level 3 |
storage.l4-level-score |
Compaction score of level 4 |
storage.l5-level-score |
Compaction score of level 5 |
storage.l6-level-score |
Compaction score of level 6 |
storage.l0-level-size |
Size of the SSTables in level 0 |
storage.l1-level-size |
Size of the SSTables in level 1 |
storage.l2-level-size |
Size of the SSTables in level 2 |
storage.l3-level-size |
Size of the SSTables in level 3 |
storage.l4-level-size |
Size of the SSTables in level 4 |
storage.l5-level-size |
Size of the SSTables in level 5 |
storage.l6-level-size |
Size of the SSTables in level 6 |
storage.keys.range-key-set.count |
Approximate count of RangeKeySet internal keys across the storage engine. |
storage.marked-for-compaction-files |
Count of SSTables marked for compaction |
sys.cgo.allocbytes |
Current bytes of memory allocated by cgo |
sys.cgo.totalbytes |
Total bytes of memory allocated by cgo, but not released |
sys.cgocalls |
Total number of cgo call |
sys.cpu.sys.ns |
Total system cpu time in nanoseconds |
sys.cpu.sys.percent |
Current system cpu percentage |
sys.cpu.user.ns |
Total user cpu time in nanoseconds |
sys.cpu.user.percent |
Current user cpu percentage |
sys.fd.open |
Process open file descriptors |
sys.fd.softlimit |
Process open FD soft limit |
sys.gc.count |
Total number of GC runs |
sys.gc.pause.ns |
Total GC pause in nanoseconds |
sys.gc.pause.percent |
Current GC pause percentage |
sys.go.allocbytes |
Current bytes of memory allocated by go |
sys.go.totalbytes |
Total bytes of memory allocated by go, but not released |
sys.goroutines |
Current number of goroutines |
sys.rss |
Current process RSS |
sys.uptime |
Process uptime in seconds |
sysbytes |
Number of bytes in system KV pairs |
syscount |
Count of system KV pairs |
timeseries.write.bytes |
Total size in bytes of metric samples written to disk |
timeseries.write.errors |
Total errors encountered while attempting to write metrics to disk |
timeseries.write.samples |
Total number of metric samples written to disk |
totalbytes |
Total number of bytes taken up by keys and values including non-live data |
tscache.skl.read.pages |
Number of pages in the read timestamp cache |
tscache.skl.read.rotations |
Number of page rotations in the read timestamp cache |
tscache.skl.write.pages |
Number of pages in the write timestamp cache |
tscache.skl.write.rotations |
Number of page rotations in the write timestamp cache |
txn.abandons |
Number of abandoned KV transactions |
txn.aborts |
Number of aborted KV transactions |
txn.autoretries |
Number of automatic retries to avoid serializable restarts |
txn.commits1PC |
Number of committed one-phase KV transactions |
txn.commits |
Number of committed KV transactions (including 1PC) |
txn.durations |
KV transaction durations in nanoseconds |
txn.restarts.deleterange |
Number of restarts due to a forwarded commit timestamp and a DeleteRange command |
txn.restarts.possiblereplay |
Number of restarts due to possible replays of command batches at the storage layer |
txn.restarts.serializable |
Number of restarts due to a forwarded commit timestamp and isolation=SERIALIZABLE |
txn.restarts.writetooold |
Number of restarts due to a concurrent writer committing first |
txn.restarts |
Number of restarted KV transactions |
valbytes |
Number of bytes taken up by values |
valcount |
Count of all values |
Name | Description |
---|---|
addsstable.applications |
Number of SSTable ingestions applied (i.e., applied by Replicas) |
addsstable.copies |
number of SSTable ingestions that required copying files during application |
addsstable.proposals |
Number of SSTable ingestions proposed (i.e., sent to Raft by lease holders) |
admission.wait_sum.kv-stores |
Total wait time in micros |
admission.wait_sum.kv |
Total wait time in micros |
admission.wait_sum.sql-kv-response |
Total wait time in micros |
admission.wait_sum.sql-sql-response |
Total wait time in micros |
capacity.available |
Available storage capacity |
capacity.reserved |
Capacity reserved for snapshots |
capacity.used |
Used storage capacity |
capacity |
Total storage capacity |
changefeed.backfill_count |
Number of changefeeds currently executing backfill |
changefeed.backfill_pending_ranges |
Number of ranges in an ongoing backfill that are yet to be fully emitted |
changefeed.commit_latency |
Event commit latency: a difference between event MVCC timestamp and the time it was acknowledged by the downstream sink. If the sink batches events, then the difference between the earliest event in the batch and acknowledgement is recorded; Excludes latency during backfill |
changefeed.emitted_messages |
Messages emitted by all feeds |
changefeed.error_retries |
Total retryable errors encountered by all changefeeds |
changefeed.failures |
Total number of changefeed jobs which have failed |
changefeed.max_behind_nanos |
Largest commit-to-emit duration of any running feed |
changefeed.message_size_hist |
Message size histogram |
changefeed.running |
Number of currently running changefeeds, including sinkless |
clock-offset.meannanos |
Mean clock offset with other nodes |
clock-offset.stddevnanos |
Stddev clock offset with other nodes |
distsender.batches.partial |
Number of partial batches processed after being divided on range boundaries |
distsender.batches |
Number of batches processed |
distsender.errors.notleaseholder |
Number of NotLeaseHolderErrors encountered from replica-addressed RPCs |
distsender.rpc.sent.local |
Number of replica-addressed RPCs sent through the local-server optimization |
distsender.rpc.sent.nextreplicaerror |
Number of replica-addressed RPCs sent due to per-replica errors |
distsender.rpc.sent |
Number of replica-addressed RPCs sent |
exec.error |
Number of batch KV requests that failed to execute on this node. This count excludes transaction restart/abort errors. However, it will include other errors expected during normal operation, such as ConditionFailedError. This metric is thus not an indicator of KV health. |
exec.latency |
Latency of batch KV requests (including errors) executed on this node. This measures requests already addressed to a single replica, from the moment at which they arrive at the internal gRPC endpoint to the moment at which the response (or an error) is returned. This latency includes in particular commit waits, conflict resolution and replication, and end-users can easily produce high measurements via long-running transactions that conflict with foreground traffic. This metric thus does not provide a good signal for understanding the health of the KV layer. |
exec.success |
Number of batch KV requests executed successfully on this node. A request is considered to have executed 'successfully' if it either returns a result or a transaction restart/abort error. |
gcbytesage |
Cumulative age of non-live data |
gossip.bytes.received |
Number of received gossip bytes |
gossip.bytes.sent |
Number of sent gossip bytes |
gossip.connections.incoming |
Number of active incoming gossip connections |
gossip.connections.outgoing |
Number of active outgoing gossip connections |
gossip.connections.refused |
Number of refused incoming gossip connections |
gossip.infos.received |
Number of received gossip Info objects |
gossip.infos.sent |
Number of sent gossip Info objects |
intentage |
Cumulative age of intents |
intentbytes |
Number of bytes in intent KV pairs |
intentcount |
Count of intent keys |
jobs.changefeed.resume_retry_error |
Number of changefeed jobs which failed with a retryable error |
keybytes |
Number of bytes taken up by keys |
keycount |
Count of all keys |
leases.epoch |
Number of replica leaseholders using epoch-based leases |
leases.error |
Number of failed lease requests |
leases.expiration |
Number of replica leaseholders using expiration-based leases |
leases.success |
Number of successful lease requests |
leases.transfers.error |
Number of failed lease transfers |
leases.transfers.success |
Number of successful lease transfers |
livebytes |
Number of bytes of live data (keys plus values) |
livecount |
Count of live keys |
liveness.epochincrements |
Number of times this node has incremented its liveness epoch |
liveness.heartbeatfailures |
Number of failed node liveness heartbeats from this node |
liveness.heartbeatlatency |
Node liveness heartbeat latency |
liveness.heartbeatsuccesses |
Number of successful node liveness heartbeats from this node |
liveness.livenodes |
Number of live nodes in the cluster (will be 0 if this node is not itself live) |
queue.consistency.pending |
Number of pending replicas in the consistency checker queue |
queue.consistency.process.failure |
Number of replicas which failed processing in the consistency checker queue |
queue.consistency.process.success |
Number of replicas successfully processed by the consistency checker queue |
queue.consistency.processingnanos |
Nanoseconds spent processing replicas in the consistency checker queue |
queue.gc.info.abortspanconsidered |
Number of AbortSpan entries eligible for removal based on their ages |
queue.gc.info.abortspangcnum |
Number of AbortSpan entries fit for removal |
queue.gc.info.abortspanscanned |
Number of transactions present in the AbortSpan scanned from the engine |
queue.gc.info.intentsconsidered |
Number of intents eligible to be considered because they are at least two hours old |
queue.gc.info.intenttxns |
Number of associated distinct transactions |
queue.gc.info.numkeysaffected |
Number of keys with data that is eligible for garbage collection |
queue.gc.info.pushtxn |
Number of attempted pushes |
queue.gc.info.resolvesuccess |
Number of successful intent resolutions |
queue.gc.info.resolvetotal |
Number of attempted intent resolutions |
queue.gc.info.transactionspangcaborted |
Number of entries eligible for garbage collection that correspond to aborted txns |
queue.gc.info.transactionspangccommitted |
Number of entries eligible for garbage collection that correspond to committed txns |
queue.gc.info.transactionspangcpending |
Number of entries eligible for garbage collection that correspond to pending txns |
queue.gc.info.transactionspanscanned |
Number of entries in transaction spans scanned from the engine |
queue.gc.pending |
Number of pending replicas in the MVCC garbage collection queue |
queue.gc.process.failure |
Number of replicas which failed processing in the MVCC garbage collection queue |
queue.gc.process.success |
Number of replicas successfully processed by the MVCC garbage collection queue |
queue.gc.processingnanos |
Nanoseconds spent processing replicas in the MVCC garbage collection queue |
queue.raftlog.pending |
Number of pending replicas in the Raft log queue |
queue.raftlog.process.failure |
Number of replicas which failed processing in the Raft log queue |
queue.raftlog.process.success |
Number of replicas successfully processed by the Raft log queue |
queue.raftlog.processingnanos |
Nanoseconds spent processing replicas in the Raft log queue |
queue.raftsnapshot.pending |
Number of pending replicas in the Raft repair queue |
queue.raftsnapshot.process.failure |
Number of replicas which failed processing in the Raft repair queue |
queue.raftsnapshot.process.success |
Number of replicas successfully processed by the Raft repair queue |
queue.raftsnapshot.processingnanos |
Nanoseconds spent processing replicas in the Raft repair queue |
queue.replicagc.pending |
Number of pending replicas in the replica queue |
queue.replicagc.process.failure |
Number of replicas which failed processing in the replica garbage collection queue |
queue.replicagc.process.success |
Number of replicas successfully processed by the replica garbage collection queue |
queue.replicagc.processingnanos |
Nanoseconds spent processing replicas in the replica garbage collection queue |
queue.replicagc.removereplica |
Number of replica removals attempted by the replica garbage collection queue |
queue.replicate.addreplica |
Number of replica additions attempted by the replicate queue |
queue.replicate.pending |
Number of pending replicas in the replicate queue |
queue.replicate.process.failure |
Number of replicas which failed processing in the replicate queue |
queue.replicate.process.success |
Number of replicas successfully processed by the replicate queue |
queue.replicate.processingnanos |
Nanoseconds spent processing replicas in the replicate queue |
queue.replicate.purgatory |
Number of replicas in the replicate queue's purgatory, awaiting allocation options |
queue.replicate.rebalancereplica |
Number of replica rebalancer-initiated additions attempted by the replicate queue |
queue.replicate.removedeadreplica |
Number of dead replica removals attempted by the replicate queue (typically in response to a node outage) |
queue.replicate.removereplica |
Number of replica removals attempted by the replicate queue (typically in response to a rebalancer-initiated addition) |
queue.replicate.transferlease |
Number of range lease transfers attempted by the replicate queue |
queue.split.pending |
Number of pending replicas in the split queue |
queue.split.process.failure |
Number of replicas which failed processing in the split queue |
queue.split.process.success |
Number of replicas successfully processed by the split queue |
queue.split.processingnanos |
Nanoseconds spent processing replicas in the split queue |
queue.tsmaintenance.pending |
Number of pending replicas in the time series maintenance queue |
queue.tsmaintenance.process.failure |
Number of replicas which failed processing in the time series maintenance queue |
queue.tsmaintenance.process.success |
Number of replicas successfully processed by the time series maintenance queue |
queue.tsmaintenance.processingnanos |
Nanoseconds spent processing replicas in the time series maintenance queue |
raft.commandsapplied |
Count of Raft commands applied. This measurement is taken on the Raft apply loops of all Replicas (leaders and followers alike), meaning that it does not measure the number of Raft commands proposed (in the hypothetical extreme case, all Replicas may apply all commands through snapshots, thus not increasing this metric at all). Instead, it is a proxy for how much work is being done advancing the Replica state machines on this node. |
raft.heartbeats.pending |
Number of pending heartbeats and responses waiting to be coalesced |
raft.process.commandcommit.latency |
Latency histogram for applying a batch of Raft commands to the state machine. This metric is misnamed: it measures the latency for applying a batch of committed Raft commands to a Replica state machine. This requires only non-durable I/O (except for replication configuration changes). Note that a "batch" in this context is really a sub-batch of the batch received for application during Raft ready handling. The 'raft.process.applycommitted.latency' histogram is likely more suitable in most cases, as it measures the total latency across all sub-batches (i.e., the sum of commandcommit.latency for a complete batch). |
raft.process.logcommit.latency |
Latency histogram for committing Raft log entries to stable storage. This measures the latency of durably committing a group of newly received Raft entries as well as the HardState entry to disk. This excludes any data processing, i.e., we measure purely the commit latency of the resulting Engine write. Homogeneous bands of p50-p99 latencies (in the presence of regular Raft traffic), make it likely that the storage layer is healthy. Spikes in the latency bands can either hint at the presence of large sets of Raft entries being received, or at performance issues at the storage layer. |
raft.process.tickingnanos |
Nanoseconds spent in store.processRaft() processing replica.Tick() |
raft.process.workingnanos |
Nanoseconds spent in store.processRaft() working. This is the sum of the measurements passed to the raft.process.handleready.latency histogram. |
raft.rcvd.app |
Number of MsgApp messages received by this store |
raft.rcvd.appresp |
Number of MsgAppResp messages received by this store |
raft.rcvd.dropped |
Number of incoming Raft messages dropped (due to queue length or size) |
raft.rcvd.heartbeat |
Number of (coalesced, if enabled) MsgHeartbeat messages received by this store |
raft.rcvd.heartbeatresp |
Number of (coalesced, if enabled) MsgHeartbeatResp messages received by this store |
raft.rcvd.prevote |
Number of MsgPreVote messages received by this store |
raft.rcvd.prevoteresp |
Number of MsgPreVoteResp messages received by this store |
raft.rcvd.prop |
Number of MsgProp messages received by this store |
raft.rcvd.snap |
Number of MsgSnap messages received by this store |
raft.rcvd.timeoutnow |
Number of MsgTimeoutNow messages received by this store |
raft.rcvd.transferleader |
Number of MsgTransferLeader messages received by this store |
raft.rcvd.vote |
Number of MsgVote messages received by this store |
raft.rcvd.voteresp |
Number of MsgVoteResp messages received by this store |
raft.ticks |
Number of Raft ticks queued |
raftlog.behind |
Number of Raft log entries followers on other stores are behind. This gauge provides a view of the aggregate number of log entries the Raft leaders on this node think the followers are behind. Since a Raft leader may not always have a good estimate for this information for all of its followers, and since followers are expected to be behind (when they are not required as part of a quorum) and the aggregate thus scales like the count of such followers, it is difficult to meaningfully interpret this metric. |
raftlog.truncated |
Number of Raft log entries truncated |
range.adds |
Number of range additions |
range.raftleadertransfers |
Number of Raft leader transfers |
range.removes |
Number of range removals |
range.snapshots.generated |
Number of generated snapshots |
range.snapshots.recv-in-progress |
Number of non-empty snapshots in progress on a receiver store |
range.snapshots.recv-queue |
Number of queued non-empty snapshots on a receiver store |
range.snapshots.recv-total-in-progress |
Number of empty and non-empty snapshots in progress on a receiver store |
range.snapshots.send-in-progress |
Number of non-empty snapshots in progress on a sender store |
range.snapshots.send-queue |
Number of queued non-empty snapshots on a sender store |
range.snapshots.send-total-in-progress |
Number of empty and non-empty in-progress on a sender store |
range.splits |
Number of range splits |
ranges.overreplicated |
Number of ranges with more live replicas than the replication target |
ranges.unavailable |
Number of ranges with fewer live replicas than needed for quorum |
ranges.underreplicated |
Number of ranges with fewer live replicas than the replication target |
ranges |
Number of ranges |
rebalancing.writespersecond |
Number of keys written (i.e., applied by raft) per second to the store, averaged over a large time period as used in rebalancing decisions |
replicas.leaders_not_leaseholders |
Number of replicas that are Raft leaders whose range lease is held by another store |
replicas.leaders |
Number of Raft leaders |
replicas.leaseholders |
Number of lease holders |
replicas.quiescent |
Number of quiesced replicas |
replicas.reserved |
Number of replicas reserved for snapshots |
replicas |
Number of replicas |
requests.backpressure.split |
Number of backpressured writes waiting on a range split. A range will backpressure (roughly) non-system traffic when the range is above the configured size until the range splits. When the rate of this metric is nonzero over extended periods of time, it should be investigated why splits are not occurring. |
requests.slow.distsender |
Number of replica-bound RPCs currently stuck or retrying for a long time. Note that this is not a good signal for KV health. The remote side of the RPCs tracked here may experience contention, so an end user can easily cause values for this metric to be emitted by leaving a transaction open for a long time and contending with it using a second transaction. |
requests.slow.lease |
Number of requests that have been stuck for a long time acquiring a lease. A nonzero value usually indicates range or replica unavailability, and should be investigated. Commonly, requests.slow.raft is also a nonzero value, which indicates that the lease requests are not getting a timely response from the replication layer. |
requests.slow.raft |
Number of requests that have been stuck for a long time in the replication layer. An (evaluated) request has to pass through the replication layer, notably the quota pool and Raft. If it fails to do so within a highly permissive duration, this metric is incremented (and decremented again once the request is either applied or returns an error). A nonzero value indicates range or replica unavailability, and should be investigated. |
rocksdb.block.cache.hits |
Count of block cache hits |
rocksdb.block.cache.misses |
Count of block cache misses |
rocksdb.block.cache.pinned-usage |
Bytes pinned by the block cache |
rocksdb.block.cache.usage |
Bytes used by the block cache |
rocksdb.bloom.filter.prefix.checked |
Number of times the bloom filter was checked |
rocksdb.bloom.filter.prefix.useful |
Number of times the bloom filter helped avoid iterator creation |
rocksdb.compactions |
Number of table compactions |
rocksdb.flushes |
Number of table flushes |
rocksdb.memtable.total-size |
Current size of memtable in bytes |
rocksdb.num-sstables |
Number of storage engine SSTables |
rocksdb.read-amplification |
Number of disk reads per query |
rocksdb.table-readers-mem-estimate |
Memory used by index and filter blocks |
round-trip-latency |
Distribution of round-trip latencies with other nodes |
sql.bytesin |
Number of sql bytes received |
sql.bytesout |
Number of sql bytes sent |
sql.conn.latency |
Latency to establish and authenticate a SQL connection |
sql.conns |
Number of active sql connections |
sql.ddl.count |
Number of SQL DDL statements successfully executed |
sql.delete.count |
Number of SQL DELETE statements successfully executed |
sql.distsql.contended_queries.count |
Number of SQL queries that experienced contention |
sql.distsql.exec.latency |
Latency of DistSQL statement execution |
sql.distsql.flows.active |
Number of distributed SQL flows currently active |
sql.distsql.flows.total |
Number of distributed SQL flows executed |
sql.distsql.queries.active |
Number of SQL queries currently active |
sql.distsql.queries.total |
Number of SQL queries executed |
sql.distsql.select.count |
Number of DistSQL SELECT statements |
sql.distsql.service.latency |
Latency of DistSQL request execution |
sql.exec.latency |
Latency of SQL statement execution |
sql.failure.count |
Number of statements resulting in a planning or runtime error |
sql.full.scan.count |
Number of full table or index scans |
sql.insert.count |
Number of SQL INSERT statements successfully executed |
sql.mem.distsql.current |
Current sql statement memory usage for distsql |
sql.mem.distsql.max |
Memory usage per sql statement for distsql |
sql.mem.internal.session.current |
Current sql session memory usage for internal |
sql.mem.internal.session.max |
Memory usage per sql session for internal |
sql.mem.internal.txn.current |
Current sql transaction memory usage for internal |
sql.mem.internal.txn.max |
Memory usage per sql transaction for internal |
sql.misc.count |
Number of other SQL statements successfully executed |
sql.query.count |
Number of SQL queries executed |
sql.select.count |
Number of SQL SELECT statements successfully executed |
sql.service.latency |
Latency of SQL request execution |
sql.statements.active |
Number of currently active user SQL statements |
sql.txn.abort.count |
Number of SQL transaction abort errors |
sql.txn.begin.count |
Number of SQL transaction BEGIN statements successfully executed |
sql.txn.commit.count |
Number of SQL transaction COMMIT statements successfully executed |
sql.txn.latency |
Latency of SQL transactions |
sql.txn.rollback.count |
Number of SQL transaction ROLLBACK statements successfully executed |
sql.txns.open |
Number of currently open user SQL transactions |
sql.update.count |
Number of SQL UPDATE statements successfully executed |
sys.cgo.allocbytes |
Current bytes of memory allocated by cgo |
sys.cgo.totalbytes |
Total bytes of memory allocated by cgo, but not released |
sys.cgocalls |
Total number of cgo calls |
sys.cpu.combined.percent-normalized |
Current user+system cpu percentage, normalized 0-1 by number of cores |
sys.cpu.sys.ns |
Total system cpu time |
sys.cpu.sys.percent |
Current system cpu percentage |
sys.cpu.user.ns |
Total user cpu time |
sys.cpu.user.percent |
Current user cpu percentage |
sys.fd.open |
Process open file descriptors |
sys.fd.softlimit |
Process open FD soft limit |
sys.gc.count |
Total number of garbage collection runs |
sys.gc.pause.ns |
Total garbage collection pause |
sys.gc.pause.percent |
Current garbage collection pause percentage |
sys.go.allocbytes |
Current bytes of memory allocated by go |
sys.go.totalbytes |
Total bytes of memory allocated by go, but not released |
sys.goroutines |
Current number of goroutines |
sys.host.net.recv.bytes |
Bytes received on all network interfaces since this process started |
sys.host.net.send.bytes |
Bytes sent on all network interfaces since this process started |
sys.rss |
Current process RSS |
sys.uptime |
Process uptime |
sysbytes |
Number of bytes in system KV pairs |
syscount |
Count of system KV pairs |
timeseries.write.bytes |
Total size in bytes of metric samples written to disk |
timeseries.write.errors |
Total errors encountered while attempting to write metrics to disk |
timeseries.write.samples |
Total number of metric samples written to disk |
totalbytes |
Total number of bytes taken up by keys and values including non-live data |
txn.aborts |
Number of aborted KV transactions |
txn.commits1PC |
Number of KV transaction one-phase commit attempts |
txn.commits |
Number of committed KV transactions (including 1PC) |
txn.durations |
KV transaction durations |
txn.restarts.serializable |
Number of restarts due to a forwarded commit timestamp and isolation=SERIALIZABLE |
txn.restarts.writetooold |
Number of restarts due to a concurrent writer committing first |
txn.restarts |
Number of restarted KV transactions |
valbytes |
Number of bytes taken up by values |
valcount |
Count of all values |