HAProxy 2.0 contains, among other improvements, the ability to expose Prometheus metrics natively.
As I am building HAProxy 2.0 and 2.1 rpm’s, I wanted to try it out and move away from the haproxy_exporter.
While the haproxy_exporter is developed by the Prometheus team, and is therefore a high-quality exporter, there are good reasons to move to HAProxy's native Prometheus exporter. Comparing the two sets of metrics, a few things stand out:
- Some metrics are gone, but can be calculated better with queries.
- Some are replaced by better metrics.
- Some are gone because they were linked to the exporter itself (the up metric of the scrape now directly represents HAProxy).
- And there are new metrics.
HAProxy must be compiled with Prometheus support:
$ make TARGET=linux-glibc EXTRA_OBJS="contrib/prometheus-exporter/service-prometheus.o"
To enable those new metrics in HAProxy 2.x (I reuse 9101, the port of the haproxy_exporter):
frontend prometheus
    mode http
    bind 127.0.0.1:9101
    http-request use-service prometheus-exporter if { path /metrics }
Of course, you can also add just the last line to an existing frontend.
If you want to keep your old dashboards, here is what you need to know:
The haproxy_exporter used backend or frontend as labels, whereas HAProxy uses the proxy label. You can retain the old behaviour using Prometheus metric relabelling:
scrape_configs:
  - job_name: haproxy
    static_configs:
      - targets: ['127.0.0.1:9101']
    metric_relabel_configs:
      - source_labels: [__name__, proxy]
        regex: "haproxy_frontend.+;(.+)"
        target_label: frontend
        replacement: "$1"
      - source_labels: [__name__, proxy]
        regex: "haproxy_server.+;(.+)"
        target_label: backend
        replacement: "$1"
      - source_labels: [__name__, proxy]
        regex: "haproxy_backend.+;(.+)"
        target_label: backend
        replacement: "$1"
      - regex: proxy
        action: labeldrop
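As an illustration (the metric and frontend names here are just examples), this is what the relabelling does to a native series:

# exposed natively by HAProxy
haproxy_frontend_http_requests_total{proxy="web"}
# stored by Prometheus after the relabelling above
haproxy_frontend_http_requests_total{frontend="web"}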
With that configuration you will have a painless migration to the native HAProxy metrics!
I was positively surprised to see that most of the metrics were still there, and that we have access to a lot of new metrics! This is really a big step for HAProxy monitoring.
Goodbye haproxy_exporter and thanks for all the fish!
Here are some snippets you can use to get the Terraform Docker provider to work with the Google Container Registry (gcr.io):
# Your config
provider "google" {}

data "google_client_config" "default" {}

provider "docker" {
  registry_auth {
    address  = "gcr.io"
    username = "oauth2accesstoken"
    password = "${data.google_client_config.default.access_token}"
  }
}

data "google_container_registry_image" "myapp_tagged" {
  name = "${var.docker_image_myapp}"
  tag  = "${var.docker_image_tag_myapp}"
}

data "docker_registry_image" "myapp" {
  name = "${data.google_container_registry_image.myapp_tagged.image_url}"
}

data "google_container_registry_image" "myapp" {
  name   = "${var.docker_image_myapp}"
  digest = "${data.docker_registry_image.myapp.sha256_digest}"
}
Now you can use ${data.google_container_registry_image.myapp.image_url} as the image in your pods, and get predictable container image updates! That URL is scoped as needed (gcr.io/projectname/imagename…) and is ready to use in your pod definitions.
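As an illustration, here is a minimal sketch of what the resolved URL ends up looking like in a pod manifest (the project, image name and digest are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      # placeholder value; the real URL comes from
      # data.google_container_registry_image.myapp.image_url
      image: gcr.io/myproject/myapp@sha256:<digest>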
Your service account must have storage read access.
The round trip between google_container_registry_image and docker_registry_image makes it possible to fetch the exact checksum (sha256 digest) of the tagged version.
Note: this example is not complete (I did not include vars and google provider auth).
Here is a small example of how to use Prometheus to scrape your GCE instances.
I recommend you look at the Prometheus documentation to see how you can pass the credentials to your Prometheus instance.
scrape_configs:
  - job_name: node_gce
    gce_sd_configs:
      - zone: europe-west1-b
        project: myproject
      - zone: europe-west1-d
        project: myproject
      - zone: europe-west1-c
        project: myproject
    relabel_configs:
      - source_labels: [__meta_gce_public_ip]
        target_label: __address__
        replacement: "${1}:9090"
      - source_labels: [__meta_gce_zone]
        regex: ".+/([^/]+)"
        target_label: zone
      - source_labels: [__meta_gce_project]
        target_label: project
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
      - regex: "__meta_gce_metadata_(.+)"
        action: labelmap
Let’s analyze it.
gce_sd_configs:
  - zone: europe-west1-b
    project: project1
  - zone: europe-west1-d
    project: project1
  - zone: europe-west1-c
    project: project2
We have a job named node_gce, which has three gce_sd_config objects. Each object is attached to one zone and one project.
This example will replace the private IP of your node with its public IP, and use port 9090. __address__ is a hidden label used by Prometheus to know which address to scrape.
- source_labels: [__meta_gce_public_ip]
  target_label: __address__
  replacement: "${1}:9090"
Now, let’s automatically add a zone label, which will match the GCE zone:
- source_labels: [__meta_gce_zone]
  regex: ".+/([^/]+)"
  target_label: zone
Let’s get a project label, too:
- source_labels: [__meta_gce_project]
  target_label: project
And a human-readable instance label, which will match the GCE instance name:
- source_labels: [__meta_gce_instance_name]
  target_label: instance
The last part of the config will turn every metadata entry of the instance into a label in Prometheus:
- regex: "__meta_gce_metadata_(.+)"
  action: labelmap
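For example (the instance and metadata names are hypothetical), an instance named my-instance in europe-west1-b carrying a metadata key env=prod would produce series labelled like this:

up{job="node_gce", instance="my-instance", zone="europe-west1-b", project="myproject", env="prod"}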
Prometheus allows you to get metrics from a lot of systems.
We are integrated with third-party suppliers that expose a balance to us: an amount of resources we can use.
That is exposed as the following metric:
available_sms{route="1",env="prod"} 1000
This is a gauge, therefore we can write an alerting rule like this:
- alert: No more SMS
  expr: |
    available_sms < 1000
That works well... when the provider API is available. In our case, the API sometimes refuses access for 10 minutes, which means that if our balance is below 1000 we will get two tickets, as the alert will fire twice.
An alternative would be to do:
- alert: No more SMS
  expr: |
    max_over_time(available_sms[1h]) < 1000
Picking min_over_time instead would mean that the alert is resolved only one hour after the balance is back above the threshold; max_over_time means that the alert will be triggered one hour too late.
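For completeness, the min_over_time variant discussed above would look like this (same metric, just a sketch):

- alert: No more SMS
  expr: |
    min_over_time(available_sms[1h]) < 1000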
We use an alternative approach, which is to record the last known value:
- record: available_sms_last
  expr: available_sms or available_sms_last
- alert: No more SMS
  expr: |
    available_sms_last < 1000
- alert: No more SMS balance
  expr: |
    absent(available_sms)
  for: 1h
That rule ensures that when the API is not available, the available_sms_last metric still contains the last known value. We can therefore alert on it, without alerting too soon or too late! This is using Prometheus 1-to-1 vector matching.
Another alert, on absent(available_sms), lets us know when the API is down for a long time.
Prometheus only deals with GMT. It does not even try to do anything else. But when you want to compare your business metrics with your usual traffic, you need to take DST into account.
Here is my take on the problem. Note that I am in TZ=Europe/Brussels; we had a DST change on October 29.
Let’s say that I want to compare one metric with the same metric 2 weeks ago. In this example, the metric would be rate(app_requests_count{env="prod"}[5m]).
If we are on the 1st of December, we need to look back 14 days. But if we are on the 1st of November, we need to look back 14 days + 1 hour, because DST happened on October 29.
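Without a DST change in the window, a plain fixed offset would be enough; a sketch of that naive query:

rate(app_requests_count{env="prod"}[5m] offset 14d)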
To achieve that, we will take advantage of Prometheus recording rules and functions. This example is based on Prometheus 2.0.
First, I set up a recording rule that tells me when I need to add an extra hour:
- record: dst
  expr: |
    0*(month() < 12) + 0*(day_of_month() < 13) + 0*(year() == 2017)
  labels:
    when: 14d
That metric, dst{when="14d"}, will be 0 until the 13th of November, and will have no value otherwise. If you really care, you can play with the hour() function as well.
Then, I create a second rule with two different offsets and an or. Note that within a rule group, Prometheus computes the rules sequentially.
- record: app_request_rate
  expr: |
    (
        sum(dst{when="14d"})
      + (
          sum(
            rate(
              app_requests_count{env="prod"}[5m] offset 337h
            )
          )
          or vector(0)
        )
    )
    or
    (
      sum(
        rate(
          app_requests_count{env="prod"}[5m] offset 14d
        )
      )
      or vector(0)
    )
  labels:
    when: 14d
Let’s analyze this.
The recording rule is split in two parts by an or:
(
    sum(dst{when="14d"})
  + (
      sum(
        rate(
          app_requests_count{env="prod"}[5m] offset 337h
        )
      )
      or vector(0)
    )
)
(
  sum(
    rate(
      app_requests_count{env="prod"}[5m] offset 14d
    )
  )
  or vector(0)
)
If the first part does not return any value, then we get the second part.
The second part is easy, so let's start with it: it is simply the sum of the rate of our metric, offset by 14 days, falling back to vector(0) when there is no data.
The first part is, however, a bit more complex. Part of it is like the second part, but with an offset of 14d+1h (337h).
Now, to detect whether we need the first or the second offset, we add sum(dst{when="14d"}) to the first part. When we need to add an extra hour, the value of sum(dst{when="14d"}) is 0, so the first part returns the 337h-offset rate. Otherwise, sum(dst{when="14d"}) has no value, the whole first part is empty, and Prometheus falls back to the second part of the rule.
Note: in this rule, the sum in sum(dst{when="14d"}) is here to remove the labels, and allow the + operation.
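Concretely (hypothetical output), dst{when="14d"} carries the when label, while sum(dst{when="14d"}) has no labels at all, which lets it match the label-less result of the other sum(...) in the + operation:

dst{when="14d"}       => {when="14d"} 0
sum(dst{when="14d"})  => {} 0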
It is a bit tricky, but it should do the job. In the future, I think I will also create recording rules for day_of_month(), month() and year(), so I can apply an offset to their values (see the sketch below).
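A minimal sketch of what such rules could look like (the rule names are my own):

- record: date_day_of_month
  expr: day_of_month()
- record: date_month
  expr: month()
- record: date_year
  expr: year()

With those in place, an expression like date_month offset 14d gives the month as it was two weeks ago.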
I will probably revisit this in March 2018…