Blog

A look at HAProxy native Prometheus metrics

HAProxy 2.0 contains, among other improvements, the ability to expose Prometheus metrics natively.

As I am building HAProxy 2.0 and 2.1 RPMs, I wanted to try it out and move away from the haproxy_exporter.

While the haproxy_exporter is developed by the Prometheus team, and is therefore a high-quality exporter, there are good reasons to move to HAProxy's native Prometheus exporter:

  • Performance
  • New metrics
  • Operations (no need to manage an exporter, and the up metric represents HAProxy itself)
  • Security (no need to expose the stats CSV)

Metrics

Gone, but they can be calculated better with queries (see the example after the list):

  • haproxy_backend_current_server
  • haproxy_backend_current_session_rate
  • haproxy_backend_http_connect_time_average_seconds
  • haproxy_backend_http_queue_time_average_seconds
  • haproxy_backend_http_response_time_average_seconds
  • haproxy_backend_http_total_time_average_seconds
  • haproxy_frontend_current_session_rate
  • haproxy_server_current_session_rate
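
For example, the session rate gauges can be recomputed from the session counters with rate(). A sketch, assuming the haproxy_frontend_sessions_total, haproxy_backend_sessions_total and haproxy_server_sessions_total counters exposed by the native exporter; adjust the window to your scrape interval:

# Replacement for haproxy_frontend_current_session_rate:
rate(haproxy_frontend_sessions_total[5m])

# Same idea for backends and servers:
rate(haproxy_backend_sessions_total[5m])
rate(haproxy_server_sessions_total[5m])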

Replaced by better metrics:

  • haproxy_backend_up: replaced by haproxy_backend_status
  • haproxy_server_up: replaced by haproxy_server_status with more values (0=DOWN, 1=UP, 2=MAINT, 3=DRAIN, 4=NOLB)

Gone because they were linked to the exporter itself:

  • haproxy_exporter_build_info
  • haproxy_exporter_csv_parse_failures
  • haproxy_exporter_total_scrapes
  • go metrics

New metrics:

  • haproxy_backend_active_servers
  • haproxy_backend_backup_servers
  • haproxy_backend_check_last_change_seconds
  • haproxy_backend_check_up_down_total
  • haproxy_backend_client_aborts_total
  • haproxy_backend_connect_time_average_seconds
  • haproxy_backend_connection_attempts_total
  • haproxy_backend_connection_reuses_total
  • haproxy_backend_downtime_seconds_total
  • haproxy_backend_failed_header_rewriting_total
  • haproxy_backend_http_cache_hits_total
  • haproxy_backend_http_cache_lookups_total
  • haproxy_backend_http_comp_bytes_bypassed_total
  • haproxy_backend_http_comp_bytes_in_total
  • haproxy_backend_http_comp_bytes_out_total
  • haproxy_backend_http_comp_responses_total
  • haproxy_backend_http_requests_total
  • haproxy_backend_last_session_seconds
  • haproxy_backend_loadbalanced_total
  • haproxy_backend_max_connect_time_seconds
  • haproxy_backend_max_queue_time_seconds
  • haproxy_backend_max_response_time_seconds
  • haproxy_backend_max_total_time_seconds
  • haproxy_backend_queue_time_average_seconds
  • haproxy_backend_requests_denied_total
  • haproxy_backend_response_time_average_seconds
  • haproxy_backend_responses_denied_total
  • haproxy_backend_server_aborts_total
  • haproxy_backend_total_time_average_seconds
  • haproxy_frontend_connections_rate_max
  • haproxy_frontend_denied_connections_total
  • haproxy_frontend_denied_sessions_total
  • haproxy_frontend_failed_header_rewriting_total
  • haproxy_frontend_http_cache_hits_total
  • haproxy_frontend_http_cache_lookups_total
  • haproxy_frontend_http_comp_bytes_bypassed_total
  • haproxy_frontend_http_comp_bytes_in_total
  • haproxy_frontend_http_comp_bytes_out_total
  • haproxy_frontend_http_comp_responses_total
  • haproxy_frontend_http_requests_rate_max
  • haproxy_frontend_intercepted_requests_total
  • haproxy_frontend_responses_denied_total
  • haproxy_process_active_peers
  • haproxy_process_busy_polling_enabled
  • haproxy_process_connected_peers
  • haproxy_process_connections_total
  • haproxy_process_current_backend_ssl_key_rate
  • haproxy_process_current_connection_rate
  • haproxy_process_current_connections
  • haproxy_process_current_frontend_ssl_key_rate
  • haproxy_process_current_run_queue
  • haproxy_process_current_session_rate
  • haproxy_process_current_ssl_connections
  • haproxy_process_current_ssl_rate
  • haproxy_process_current_tasks
  • haproxy_process_current_zlib_memory
  • haproxy_process_dropped_logs_total
  • haproxy_process_frontent_ssl_reuse
  • haproxy_process_hard_max_connections
  • haproxy_process_http_comp_bytes_in_total
  • haproxy_process_http_comp_bytes_out_total
  • haproxy_process_idle_time_percent
  • haproxy_process_jobs
  • haproxy_process_limit_connection_rate
  • haproxy_process_limit_http_comp
  • haproxy_process_limit_session_rate
  • haproxy_process_limit_ssl_rate
  • haproxy_process_listeners
  • haproxy_process_max_backend_ssl_key_rate
  • haproxy_process_max_connection_rate
  • haproxy_process_max_connections
  • haproxy_process_max_fds
  • haproxy_process_max_frontend_ssl_key_rate
  • haproxy_process_max_memory_bytes
  • haproxy_process_max_pipes
  • haproxy_process_max_session_rate
  • haproxy_process_max_sockets
  • haproxy_process_max_ssl_connections
  • haproxy_process_max_ssl_rate
  • haproxy_process_max_zlib_memory
  • haproxy_process_nbproc
  • haproxy_process_nbthread
  • haproxy_process_pipes_free_total
  • haproxy_process_pipes_used_total
  • haproxy_process_pool_allocated_bytes
  • haproxy_process_pool_failures_total
  • haproxy_process_pool_used_bytes
  • haproxy_process_relative_process_id
  • haproxy_process_requests_total
  • haproxy_process_ssl_cache_lookups_total
  • haproxy_process_ssl_cache_misses_total
  • haproxy_process_ssl_connections_total
  • haproxy_process_start_time_seconds
  • haproxy_process_stopping
  • haproxy_process_unstoppable_jobs
  • haproxy_server_check_failures_total
  • haproxy_server_check_last_change_seconds
  • haproxy_server_check_up_down_total
  • haproxy_server_client_aborts_total
  • haproxy_server_connect_time_average_seconds
  • haproxy_server_connection_attempts_total
  • haproxy_server_connection_reuses_total
  • haproxy_server_current_throttle
  • haproxy_server_downtime_seconds_total
  • haproxy_server_failed_header_rewriting_total
  • haproxy_server_last_session_seconds
  • haproxy_server_loadbalanced_total
  • haproxy_server_max_connect_time_seconds
  • haproxy_server_max_queue_time_seconds
  • haproxy_server_max_response_time_seconds
  • haproxy_server_max_total_time_seconds
  • haproxy_server_queue_limit
  • haproxy_server_queue_time_average_seconds
  • haproxy_server_response_time_average_seconds
  • haproxy_server_responses_denied_total
  • haproxy_server_server_aborts_total
  • haproxy_server_server_idle_connections_current
  • haproxy_server_server_idle_connections_limit
  • haproxy_server_total_time_average_seconds

Enabling Prometheus support

HAProxy must be compiled with Prometheus support:

$ make TARGET=linux-glibc EXTRA_OBJS="contrib/prometheus-exporter/service-prometheus.o"

To enable those new metrics on HAProxy 2.x, add a frontend like this (I reuse 9101, the port of the haproxy_exporter):

frontend prometheus
    bind 127.0.0.1:9101
    http-request use-service prometheus-exporter if { path /metrics }

Of course, you can also add just the http-request use-service line to an existing frontend.
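
If you do not need compatibility with the old dashboards (see the next section), a plain scrape job pointing at that frontend is enough. A minimal sketch, matching the bind above:

scrape_configs:
- job_name: haproxy
  static_configs:
    - targets: [127.0.0.1:9101]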

Compatibility with the exporter

If you want to keep your old dashboards, here is what you need to know:

The haproxy_exporter used backend or frontend as label names, whereas HAProxy uses the proxy label. You can retain the old behaviour using Prometheus metric relabelling:

scrape_configs:
- job_name: haproxy
  static_configs:
    - targets: [127.0.0.1:9101]
  metric_relabel_configs:
  - source_labels: [__name__, proxy]
    regex: "haproxy_frontend.+;(.+)"
    target_label: frontend
    replacement: "$1"
  - source_labels: [__name__, proxy]
    regex: "haproxy_server.+;(.+)"
    target_label: backend
    replacement: "$1"
  - source_labels: [__name__, proxy]
    regex: "haproxy_backend.+;(.+)"
    target_label: backend
    replacement: "$1"
  - regex: proxy
    action: labeldrop

With that configuration you will have a painless migration to the native HAProxy metrics!
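
For example, a series exposed by HAProxy for a hypothetical backend named web is rewritten like this:

# Exposed by the native exporter:
haproxy_backend_http_requests_total{proxy="web"}
# Stored by Prometheus after the relabelling above:
haproxy_backend_http_requests_total{backend="web"}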

Conclusion

I was positively surprised to see that most of the metrics were still there, and that we have access to a lot of new metrics! This is really a big step for HAProxy monitoring.

Goodbye haproxy_exporter and thanks for all the fish!

Category: prometheus. Tags: haproxy.
First published on Wed 27 November 2019.

GCP container registry and terraform provider docker

Here are some snippets you can use to get the Terraform Docker provider to work with the Google Container Registry (gcr.io):

# Your config
provider "google" {}

data "google_client_config" "default" {}

provider "docker" {
  registry_auth {
    address  = "gcr.io"
    username = "oauth2accesstoken"
    password = "${data.google_client_config.default.access_token}"
  }
}

data "google_container_registry_image" "myapp_tagged" {
  name = "${var.docker_image_myapp}"
  tag  = "${var.docker_image_tag_myapp}"
}

data "docker_registry_image" "myapp" {
  name = "${data.google_container_registry_image.myapp_tagged.image_url}"
}

data "google_container_registry_image" "myapp" {
  name   = "${var.docker_image_myapp}"
  digest = "${data.docker_registry_image.myapp.sha256_digest}"
}

Now you can use ${data.google_container_registry_image.myapp.image_url} as the image for your pods, and get predictable container image updates! That URL will be scoped as needed (gcr.io/projectname/imagename…) and is ready to use in your pod definitions.

Your service account must have storage read access.

The round-trip between google_container_registry_image and docker_registry_image makes it possible to fetch the exact checksum (digest) of the tagged version.

Note: this example is not complete (I did not include vars and google provider auth).

Category: cloud. Tags: gcp docker.
First published on Mon 22 April 2019.

Prometheus Google Compute Engine discovery example

Here is a small example of how to use Prometheus to scrape your GCE instances.

I recommend that you look at the Prometheus documentation to see how you can pass the credentials to your Prometheus instance.

scrape_configs:
  - job_name: node_gce
    gce_sd_configs:
      - zone: europe-west1-b
        project: myproject
      - zone: europe-west1-d
        project: myproject
      - zone: europe-west1-c
        project: myproject
    relabel_configs:
      - source_labels: [__meta_gce_public_ip]
        target_label: __address__
        replacement: "${1}:9090"
      - source_labels: [__meta_gce_zone]
        regex: ".+/([^/]+)"
        target_label: zone
      - source_labels: [__meta_gce_project]
        target_label: project
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
      - regex: "__meta_gce_metadata_(.+)"
        action: labelmap

Let’s analyze it.

Zones and projects

    gce_sd_configs:
      - zone: europe-west1-b
        project: project1
      - zone: europe-west1-d
        project: project1
      - zone: europe-west1-c
        project: project2

We have a job named node_gce, which has 3 gce_sd_configs entries. Each entry is attached to one zone and one project.

Relabeling

Setting the address

This example will substitute the public IP of your node for the private one, and use port 9090. __address__ is a hidden label used by Prometheus to know which address to scrape.

      - source_labels: [__meta_gce_public_ip]
        target_label: __address__
        replacement: "${1}:9090"

Zones and project

Now, let’s automatically get a zone label, which will match the GCE zone:

      - source_labels: [__meta_gce_zone]
        regex: ".+/([^/]+)"
        target_label: zone

Let’s get a project label, too:

      - source_labels: [__meta_gce_project]
        target_label: project

Instance name

And a human-readable instance label, which will match the GCE instance name:

      - source_labels: [__meta_gce_instance_name]
        target_label: instance

Metadata

The last part of the config turns every metadata entry of the instance into a label in Prometheus:

      - regex: "__meta_gce_metadata_(.+)"
        action: labelmap
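
Putting it all together, a scraped series ends up carrying labels like these (the instance name mynode-1 and the env metadata key are made up for illustration):

up{job="node_gce", instance="mynode-1", zone="europe-west1-b", project="myproject", env="prod"} 1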

Category: monitoring. Tags: prometheus.
First published on Sat 9 March 2019.

Dealing with flapping metrics in prometheus

Prometheus allows you to get metrics from a lot of systems.

We are integrated with third-party suppliers that expose a balance to us: an amount of resources we can use.

That is exposed as the following metric:

available_sms{route="1",env="prod"} 1000

This is a gauge, therefore we can write an alerting rule like this:

- alert: No more SMS
  expr: |
    available_sms < 1000

That works well... when the provider API is available. In our case, the API sometimes refuses access for 10 minutes, which means that if our balance is below 1000 we will get two tickets, as the alert will start twice.

An alternative would be to do:

- alert: No more SMS
  expr: |
    max_over_time(available_sms[1h]) < 1000

Picking min_over_time would mean that the alert is only resolved one hour after the balance recovers. max_over_time means that the alert is triggered one hour too late.
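
For reference, the min_over_time variant mentioned above would look like this (same rule, mirrored trade-off):

- alert: No more SMS
  expr: |
    min_over_time(available_sms[1h]) < 1000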

We use an alternative approach, which is to record the last known value:

- record: available_sms_last
  expr: available_sms or available_sms_last
- alert: No more SMS
  expr: |
    available_sms_last < 1000
- alert: No more SMS balance
  expr: |
    absent(available_sms)
  for: 1h

That rule ensures that when the API is not available, the available_sms_last metric still contains the last known value. We can therefore alert on that, without alerting too soon or too late! This relies on Prometheus' or operator, which keeps the left-hand side and only falls back to the right-hand side when there is no matching series on the left.

Another alert, on absent(available_sms), lets us know when the API has been down for a long time.

Category: monitoring. Tags: prometheus.
First published on Thu 21 February 2019.

Prometheus and DST

Prometheus only deals with GMT. It does not even try to do something else. But, when you want to compare your business metrics with your usual traffic, you need to take DST into account.

Here is my take on the problem. Note that I am in TZ=Europe/Brussels. We had DST on October 29.

Let’s say that I want to compare one metric with the same metric 2 weeks ago. In this example, the metric would be rate(app_requests_count{env="prod"}[5m]). If we are the 1st of December, we need to look back 14 days. But if we are the 1st of November, we need to look back 14 days + 1 hour (14 × 24 h + 1 h = 337 hours), because the DST change happened on October 29.

To achieve that, we will take advantage of Prometheus recording rules and functions. This example is based on Prometheus 2.0.

First, I set up a recording rule that tells me when I need to add an extra hour:

- record: dst
  expr: |
    0*(month() < 12) + 0*(day_of_month() < 13) + 0*(year() == 2017)
  labels:
    when: 14d

That metric dst{when="14d"} will be 0 until the 13th of November, and will have no value otherwise. If you really care, you can play with the hour() function as well.

Then, I create a second rule with two different offsets and an or. Note that within a rule group, Prometheus computes the rules sequentially.

- record: app_request_rate
  expr: |
    (
      sum(dst{when="14d"})
      + (
         sum(
          rate(
           app_requests_count{env="prod"}[5m]
           offset 337h
          )
         )
         or vector(0)
        )
    )
    or
    (
     sum(
      rate(
       app_requests_count{env="prod"}[5m]
       offset 14d)
      )
      or vector(0)
    )
  labels:
    when: 14d

Let’s analyze this.

The recording rule is split into two parts by an or:

    (
      sum(dst{when="14d"})
      + (
         sum(
          rate(
           app_requests_count{env="prod"}[5m]
           offset 337h
          )
         )
         or vector(0)
        )
    )
    (
     sum(
      rate(
       app_requests_count{env="prod"}[5m]
       offset 14d)
      )
      or vector(0)
    )

If the first part does not return any value, then we get the second part.

The second part is easy, so let’s start with it:

  • We sum the 5min rates of app_requests_count, env=prod, 14 days ago.
  • If we get no metrics (e.g. Prometheus was down) we get 0.

The first part is however a bit more complex. Part of it is like the second part, but with an offset of 14d+1h (337h).

Now, to detect whether we need the first or the second offset, we add sum(dst{when="14d"}) to the first part. When we need to add an extra hour, the value of sum(dst{when="14d"}) is 0. Otherwise, there is no value and Prometheus falls back to the second part of the rule.

Note: in this rule, the sum in sum(dst{when="14d"}) is there to remove the labels and allow the + operation.

It is a bit tricky, but it should do the job. In the future, I think I will also create recording rules for day_of_month(), month() and year(), so I can apply an offset to their values.
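
A minimal sketch of what such recording rules could look like (the time_ metric names are made up):

- record: time_day_of_month
  expr: day_of_month()
- record: time_month
  expr: month()
- record: time_year
  expr: year()

Once those are stored as regular series, offset applies to them like to any other metric.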

I will probably revisit this in March 2018…

Category: monitoring. Tags: prometheus.
First published on Thu 9 November 2017.