#959 statsmanager should have a proper counter metric

Reporter

Jonas Wielicki

Owner

Nobody

Created

Updated

Stars

★★★ (5)

Tags

Priority-Medium

Status-New

Type-Enhancement

Jonas Wielicki
on

Note: In this issue, I’ll link to Prometheus often, because that’s what I’m currently familiar with. Unless noted otherwise, however, all statements apply to most modern time-series databases (TSDBs), including InfluxDB, Graphite and to some extent even RRDtool-based systems.
While writing a mod_prometheus proof of concept (a module which exports data for the Prometheus Time Series Database [1]), I found that statsmanager does not have a real "counter" metric type.
A "counter" metric (see also [2]) is a metric which either increases monotonically or resets to zero (typically when the instrumented service restarts). Counter metrics have a few advantages, for example when calculating the derivative to obtain a rate: since the derivative can never be negative, any observed decrease must be a reset or wraparound and can be handled properly.
TSDBs and their frontends are really good at calculating the derivative of counters, which makes "rate" metrics (as currently supported by statsmanager) useless in 99.9% of cases. All major TSDBs (including RRDtool!) support rate calculation based on counter values.
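To illustrate why a counter is enough, here is a minimal sketch of how a consumer could derive a rate from raw counter samples while tolerating resets. The function name `counter_rate` and the sample layout are made up for this example; real TSDBs do more (extrapolation, staleness handling), so treat it as the idea only, not as what any particular database implements.

```python
def counter_rate(samples):
    """Average per-second rate over a list of (timestamp_seconds, value)
    counter samples, sorted by timestamp. Returns None if undefined."""
    if len(samples) < 2:
        return None

    increase = 0.0
    prev_value = samples[0][1]
    for _, value in samples[1:]:
        if value < prev_value:
            # A counter never decreases on its own, so a drop means it was
            # reset to zero; everything counted since then is new increase.
            increase += value
        else:
            increase += value - prev_value
        prev_value = value

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# The service restarted between t=10 and t=20, so the counter dropped.
samples = [(0, 100), (10, 150), (20, 30)]
rate = counter_rate(samples)
```

With a pre-computed rate metric, the consumer has no way to recover the events that happened between two scrapes; with the counter, it only needs two samples.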
An example of how not to do things is mod_measure_cpu [3] (sorry, Zash), which uses the rate metric instead of simply forwarding the tick counters as provided by the OS, ideally normalised to a reasonable unit such as (milli|micro|nano)seconds.
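As a rough sketch of what "forward the counter, normalised to seconds" could look like: the helper names below are hypothetical, and the default of 100 ticks per second is only an assumption (a common value on Linux; the real value should be queried from the OS). This is not how mod_measure_cpu works, just the shape of the alternative.

```python
import time


def ticks_to_seconds(ticks, ticks_per_second=100):
    """Convert an OS tick counter to seconds.
    ticks_per_second=100 is an assumed default, not a universal constant."""
    return ticks / ticks_per_second


def cpu_seconds_total():
    """Cumulative CPU time of this process in seconds. The value only grows
    while the process lives, so it can be exported as a counter as-is."""
    return time.process_time()


exported = ticks_to_seconds(250)  # 250 ticks at 100 ticks/s
```

The consumer can then compute CPU usage over any window it likes by taking the rate of the exported counter.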
TL;DR:
1. statsmanager needs a proper counter metric
2. rate metrics should be removed
3. counter metrics should enforce that they’re not accidentally decreased
4. we may possibly want to introduce a wraparound when the counter reaches a value where float precision cannot handle unit increments anymore
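Points 3 and 4 could be sketched roughly as follows. The class name `Counter`, its method `add`, and the choice of 2^53 as the wraparound limit are assumptions made for this illustration (2^53 is the point above which an IEEE 754 double can no longer represent every integer); none of this is existing statsmanager API.

```python
MAX_EXACT = 2.0 ** 53  # above this, a double cannot represent unit increments


class Counter:
    """Hypothetical counter metric: can only grow, wraps before precision is lost."""

    def __init__(self):
        self.value = 0.0

    def add(self, delta=1.0):
        # "not delta >= 0" also rejects NaN, which "delta < 0" would let through.
        if not delta >= 0:
            raise ValueError("counters must not decrease")
        # Wrap around instead of silently dropping increments; to a consumer
        # this looks like an ordinary counter reset.
        self.value = (self.value + delta) % MAX_EXACT
        return self.value


c = Counter()
c.add()
c.add(2)
```

Because consumers already have to cope with resets on restart, a wraparound does not need any special handling on their side.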
[1]: https://prometheus.io/
[2]: https://prometheus.io/docs/concepts/metric_types/#counter
[3]: https://modules.prosody.im/mod_measure_cpu.html
