What is Prometheus
Prometheus is an open-source system monitoring and alerting toolkit. Its main characteristics are:
A multi-dimensional data model (time series identified by a metric name and a set of key/value labels)
A flexible query language (PromQL) to exploit this dimensionality
No dependence on distributed storage; single server nodes are autonomous
Time series collection via a pull model over HTTP
Pushing time series is supported through an intermediary push gateway
Scrape targets are discovered via service discovery or static configuration
Support for multiple modes of graphing and dashboarding
Prometheus collects data using a pull model: it scrapes metrics over HTTP. Any application that can expose an HTTP endpoint can be hooked into the monitoring system, which makes integration open and simple compared with proprietary or binary protocols.
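As a sketch of the pull model, any process that serves plain-text metrics over HTTP can be scraped. The metric name, handler class, and port below are illustrative, not taken from any real application:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(request_count):
    # Prometheus text exposition format: HELP/TYPE comments plus samples.
    return (
        "# HELP app_requests_total Total requests handled.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {request_count}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    requests_seen = 0

    def do_GET(self):
        MetricsHandler.requests_seen += 1
        body = render_metrics(MetricsHandler.requests_seen).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To let Prometheus scrape http://localhost:8000/metrics:
#   HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

A target like this is then listed under `static_configs` (or found via service discovery) and scraped on every `scrape_interval`.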
For short-lived scheduled jobs, the pull model can miss data: the job may finish before Prometheus has had a chance to scrape it. In that case an intermediate layer can be added: the client pushes its data to a Push Gateway, which caches it, and Prometheus pulls the metrics from the gateway. (This requires running an additional Push Gateway and adding a new job to scrape the gateway.)
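A minimal sketch of the push path, assuming a Pushgateway listening on localhost:9091 (the job name and metric are made up for illustration):

```python
import urllib.request

def build_push_request(gateway, job, body):
    # The Pushgateway groups metrics by job (and optional labels) in the
    # URL path; the body is ordinary text exposition format.
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(url, data=body.encode(), method="PUT")

req = build_push_request("localhost:9091", "nightly_backup",
                         "batch_duration_seconds 12.7\n")
# urllib.request.urlopen(req)  # Prometheus then pulls this from the gateway
```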
Composition and structure
Prometheus Server: mainly responsible for data collection and storage, and provides the PromQL query language.
Client libraries: official libraries exist for Go, Java/Scala, Python, and Ruby, and many third-party libraries cover Node.js, PHP, Erlang, and others.
Push Gateway: an intermediate gateway that lets short-lived jobs actively push their metrics.
PromDash: a dashboard developed in Rails for visualizing metric data.
Exporters: import metrics from other systems into Prometheus, covering databases, hardware, message middleware, storage systems, HTTP servers, JMX, and more.
Alertmanager: the (at the time experimental) component for alerting.
Command-line tools
Other auxiliary tools
The architecture diagram is as follows:
docker exec -it a9bd827a1d18 less /etc/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']
scrape_interval sets how often data is scraped, here every 15 seconds.
evaluation_interval sets the interval at which rules are evaluated.
Pushgateway ships as a separate Docker image.
docker pull prom/pushgateway
For applications that prefer the push model, a dedicated push gateway can be set up to adapt them.
Prometheus uses Google's LevelDB for indexing (PromQL relies heavily on LevelDB), and has its own storage layer for the large volume of sampled data: Prometheus creates a local file for each time series, organized in 1024-byte chunks.
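The idea behind the delta and double-delta chunk encodings described below can be sketched as follows (a toy illustration of the principle, not Prometheus's actual chunk layout):

```python
def delta_encode(values):
    # First-order deltas: store the first value, then successive differences.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def double_delta_encode(values):
    # Second-order deltas: timestamps sampled at a steady scrape_interval
    # become mostly zeros, which compress into very few bits.
    return delta_encode(delta_encode(values))

timestamps = [1000, 1015, 1030, 1045, 1061]  # ~15s scrape interval
print(delta_encode(timestamps))         # [1000, 15, 15, 15, 16]
print(double_delta_encode(timestamps))  # [1000, -985, 0, 0, 1]
```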
Disk files
Prometheus stores the files under the path given by storage.local.path, which defaults to ./data. There are three chunk encodings:
Type 0: the first-generation encoding format, simple delta encoding
Type 1: the current default encoding format, double-delta encoding
Type 2: variable bit-width encoding, the method used by Beringei, Facebook's time series database
Prometheus keeps the most recently used chunks in memory. The maximum number can be set with storage.local.memory-chunks; the default is 1048576 chunks, i.e. 1 GiB at 1024 bytes per chunk.
Besides the sampled data itself, Prometheus performs various operations on it, so overall memory consumption will certainly exceed the configured storage.local.memory-chunks size. The official recommendation is therefore to reserve three times that amount:
"As a rule of thumb, you should have at least three times more RAM available than needed by the memory chunks alone."
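The rule of thumb works out roughly as follows; the 1024-byte chunk size and default chunk count come from the text above, and this is back-of-envelope arithmetic, not an official formula:

```python
CHUNK_BYTES = 1024
memory_chunks = 1048576                      # default storage.local.memory-chunks
chunk_memory = memory_chunks * CHUNK_BYTES   # 1 GiB held by chunks alone
recommended_ram = 3 * chunk_memory           # reserve ~3x for queries, checkpoints, etc.

print(chunk_memory // 2**30, "GiB in chunks")      # 1 GiB
print(recommended_ram // 2**30, "GiB recommended") # 3 GiB
```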
You can watch this through the server's own metrics, prometheus_local_storage_memory_chunks and process_resident_memory_bytes.
prometheus_local_storage_memory_chunks: the current number of chunks held in memory, excluding cloned chunks
process_resident_memory_bytes: the resident memory size of the Prometheus process, in bytes
The persistence urgency score lies between 0 and 1. Prometheus leaves rushed mode when the score falls to 0.7 or below,
and enters rushed mode when it rises above 0.8.
The rushed-mode metric is 1 when in rushed mode and 0 otherwise. While in rushed mode, Prometheus uses the storage.local.series-sync-strategy and storage.local.checkpoint-interval settings to speed up the persistence of chunks.
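The enter/leave thresholds form a simple hysteresis. A sketch of that logic (the 0.8 and 0.7 thresholds come from the text; the class itself is illustrative):

```python
class RushedModeTracker:
    ENTER_ABOVE = 0.8       # urgency score above this -> enter rushed mode
    LEAVE_AT_OR_BELOW = 0.7 # score at or below this -> leave rushed mode

    def __init__(self):
        self.rushed = False

    def update(self, urgency_score):
        # Hysteresis: between 0.7 and 0.8 the current state is kept,
        # so the mode does not flap around a single threshold.
        if urgency_score > self.ENTER_ABOVE:
            self.rushed = True
        elif urgency_score <= self.LEAVE_AT_OR_BELOW:
            self.rushed = False
        return self.rushed

t = RushedModeTracker()
print(t.update(0.85))  # True: enters rushed mode
print(t.update(0.75))  # True: stays rushed inside the hysteresis band
print(t.update(0.60))  # False: drops out at or below 0.7
```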
docker run -p 9090:9090 \
  -v /tmp/prometheus-data:/prometheus-data \
  prom/prometheus \
  -storage.local.retention 168h0m0s \
  -storage.local.max-chunks-to-persist 3024288 \
  -storage.local.memory-chunks=50502740 \
  -storage.local.num-fingerprint-mutexes=300960
storage.local.memory-chunks: sets the maximum number of chunks kept in Prometheus's memory; the default is 1048576, about 1 GiB.
storage.local.retention: configures how long data is kept; 168h0m0s is 24×7 hours, i.e. one week.
storage.local.series-file-shrink-ratio: controls when series files are rewritten; by default a file is rewritten once 10% of its chunks have been removed. If disk space is plentiful and you do not want frequent rewrites, raise the value, e.g. 0.3 to trigger a rewrite only after 30% of chunks are removed.
storage.local.max-chunks-to-persist: controls the maximum number of chunks waiting to be written to disk. When this number is exceeded, Prometheus throttles sample ingestion until the backlog falls to 95% of the threshold. A recommended value is 50% of storage.local.memory-chunks. While throttled, Prometheus does its best to speed up persistence to get out of the limited state.
storage.local.num-fingerprint-mutexes: when the Prometheus server performs a checkpoint or handles an expensive query, metric collection can briefly stall because the mutexes pre-allocated to time series run out. This flag raises the number of pre-allocated mutexes; values in the tens of thousands are sometimes useful.
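The fingerprint-mutex pool is essentially lock striping: each series fingerprint hashes to one of a fixed number of pre-allocated locks. A sketch of that idea (the pool size and hashing here are illustrative, not Prometheus's implementation):

```python
import threading

class MutexPool:
    def __init__(self, size=4096):  # cf. -storage.local.num-fingerprint-mutexes
        self.locks = [threading.Lock() for _ in range(size)]

    def lock_for(self, fingerprint):
        # Many fingerprints share each lock; a larger pool means fewer
        # collisions and fewer stalls during checkpoints or heavy queries.
        return self.locks[hash(fingerprint) % len(self.locks)]

pool = MutexPool()
with pool.lock_for("series:up{job='prometheus'}"):
    pass  # e.g. append a sample to this series
```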
storage.local.series-sync-strategy: controls whether data is synced to disk after being written; the options are 'never', 'always', and 'adaptive'. Syncing reduces data loss from an operating-system crash but lowers write performance.
The default, 'adaptive', does not sync to disk immediately after each write, and instead lets the operating system's page cache batch the syncs.
storage.local.checkpoint-interval: the interval between checkpoints of in-memory chunks that have not yet been written to disk.