Upscale-Prometheus Cortex ready to spark joy in production


Four years in, Cortex, the horizontally scalable, clustered Prometheus implementation, has hit the big 1.0 and is now deemed ready for production.

Cortex was initially developed by Julius Volz, co-creator of monitoring tool Prometheus, and Grafana Labs’ Tom Wilkie. It provides long-term storage for Prometheus metrics and a single query view across metrics from multiple Prometheus servers.
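In practice, Prometheus ships samples to Cortex via its standard remote_write mechanism, and queries are then directed at Cortex instead of the individual servers. A minimal sketch of the Prometheus-side configuration might look like the following; the hostname is a placeholder, and the `/api/prom/push` path reflects how Cortex was commonly addressed around the 1.0 release:

```yaml
# prometheus.yml (fragment) -- the endpoint address is hypothetical
remote_write:
  - url: http://cortex.example.internal/api/prom/push
# Cortex is multi-tenant; requests are typically tagged with an
# X-Scope-OrgID header, often injected by a proxy in front of Cortex.
```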

The project made its way into the CNCF sandbox in 2018 and, even though it was only deemed stable enough for production use this week, it is already part of the Weave Cloud and Grafana Cloud offerings. That endorsement has to be taken with a grain of salt, however, since Grafana Labs employs most of the Cortex maintainers and Wilkie worked at Weaveworks during Cortex’s inception.

While the majority of the changes in v1.0 are flag removals and renamings, the release also includes a couple of new features. The ruler service, for example, has been fitted with an experimental storage API, while work on write-ahead logging (WAL) has given users a flusher target. The latter can be run as a job to flush the WAL, should other mechanisms for some reason fail to do so.
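As a sketch, the flusher would be run as a one-off job pointed at the WAL directory a crashed ingester left behind; the flag names here are assumptions based on Cortex’s single-binary target conventions and may differ in your version:

```shell
# Hypothetical invocation: start only the flusher module and point it
# at the write-ahead log of a failed ingester so its data is flushed
# to the long-term store.
cortex -target=flusher \
       -flusher.wal-dir=/data/ingester-wal \
       -config.file=/etc/cortex/config.yaml
```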

Cortex 1.0 provides admins with a way to set availability zones for ingesters, which is meant to help “ensure metric replication is distributed across zones”. Other improvements range from new FIFO cache metrics for the current number of entries and memory usage, to distributors and ingesters now returning the first validation error for failed samples rather than the last.
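Zone-aware replication is configured on both sides: the distributor has to be told to spread replicas across zones, and each ingester has to announce which zone it lives in. The flag names below are assumptions drawn from Cortex’s zone-awareness settings, not from this announcement:

```shell
# Hypothetical flags -- distributor side:
cortex -target=distributor -distributor.zone-awareness-enabled=true ...

# Ingester side, one zone label per instance:
cortex -target=ingester -ingester.availability-zone=us-east-1a ...
```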

The experimental time series database (TSDB) backend also saw some enhancements, including support for a local filesystem backend and for memcached as the TSDB index cache, as well as dropping the gRPC server previously used for communication between the querier and the BucketStore.

Moreover, the TSDB backend now comes with an option to set a delay between a block being marked for deletion and its actual deletion from the bucket. Users have to be careful, though, since setting it to 0 could cause query failures if the block is still in use somewhere else. To help with that, there is also a parameter setting a duration after which marked blocks are filtered out when fetching blocks for querying.
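The announcement does not spell out the parameter names, so the flags below are illustrative assumptions about how the two settings would be combined: keep deleted blocks around for a while, but stop considering them for queries somewhat earlier.

```shell
# Hypothetical flags: retain blocks marked for deletion for 12h, but
# have queriers ignore such blocks 30m after they were marked.
cortex -target=compactor -compactor.deletion-delay=12h ...
cortex -target=querier \
       -experimental.tsdb.bucket-store.ignore-deletion-marks-delay=30m ...
```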

The changelog has all the details and is well worth a look, given the huge number of name changes that could otherwise lead to broken setups.