
Graph Docker Container Memory Usage with Gnuplot

Sometimes you need to know how much memory your docker containers are using. Wouldn't it be nice to be able to view their memory usage on a graph? Let's build that.

Get the data from docker

First, we're going to need a docker container to be running so we can read how much memory it is using. You can use any container you want, but personally I used:

  docker run --rm -i -t alpine:3.20

Now in another terminal, we want to run the docker stats command to see the stats of all running containers. This gets the output:

  CONTAINER ID   NAME              CPU %     MEM USAGE / LIMIT   MEM %     NET I/O       BLOCK I/O     PIDS
  1165735d5e55   sharp_chaplygin   0.00%     984KiB / 90.17GiB   0.00%     4.63kB / 0B   1.01MB / 0B   1

That's great, but we want to read this in a script so we can graph the memory usage over time. Our current docker stats command has a couple of problems:

  1. We don't want it to continuously run. We want a single snapshot of the memory and then the program should exit.
  2. It would be better if we didn't have to parse this bespoke format.
  3. Personally, I'd rather the command didn't truncate the container ID. That is a convenience for human users, but our script can handle the full ID, and we can choose to truncate it later if we want.

We can solve all three with a couple of additional flags. Note that the command emits one JSON object per line, one per running container; we pipe through jq here just to pretty-print it:

  $ docker stats --no-stream --no-trunc --format json | jq
  {
    "BlockIO": "1.01MB / 0B",
    "CPUPerc": "0.00%",
    "Container": "1165735d5e55",
    "ID": "1165735d5e558652cb70345367dca725e3c48e5a9b6e1aba772f4496274b1fea",
    "MemPerc": "0.00%",
    "MemUsage": "984KiB / 90.17GiB",
    "Name": "sharp_chaplygin",
    "NetIO": "4.84kB / 0B",
    "PIDs": "1"
  }

Read the stats in Python

We are going to generate a graph of this data using gnuplot. For the user's convenience, we are going to generate a single gnuplot file that contains both the gnuplot commands and the data. In our gnuplot file, all the stats for one container will appear first, followed by a separator, and then all the stats for the next docker container. This means we are going to buffer all of the metrics in memory and write them out at the end, rather than streaming them out to the file as we go. So we need to define a class that will record our samples:

  #!/usr/bin/env python
  from __future__ import annotations

  from dataclasses import dataclass
  from typing import NewType
  from datetime import datetime

  ContainerId = NewType("ContainerId", str)


  @dataclass
  class Sample:
      instant: datetime
      stats: dict[ContainerId, Stats]


  @dataclass
  class Stats:
      memory_usage_bytes: int

Each time we run take_sample(), it will save the memory usage for every running container at that instant. Let's add code that does that:

  import logging
  from typing import Tuple
  import subprocess
  import json

  ContainerName = NewType("ContainerName", str)


  def main():
      logging.basicConfig(level=logging.INFO)
      samples: list[Sample] = []
      labels: dict[ContainerId, ContainerName] = {}
      print(take_sample())


  def take_sample() -> Tuple[Sample, dict[ContainerId, ContainerName]]:
      labels: dict[ContainerId, ContainerName] = {}
      stats: dict[ContainerId, Stats] = {}
      docker_inspect = subprocess.run(
          ["docker", "stats", "--no-stream", "--no-trunc", "--format", "json"],
          stdout=subprocess.PIPE,
      )
      for container_stat in (
          json.loads(l) for l in docker_inspect.stdout.decode("utf8").splitlines()
      ):
          if not container_stat["ID"]:
              # When containers are starting up, they sometimes have no ID and "--" as the name.
              continue
          labels[ContainerId(container_stat["ID"])] = ContainerName(
              container_stat["Name"]
          )
          memory_usage = parse_mem_usage(container_stat["MemUsage"])
          stats[ContainerId(container_stat["ID"])] = Stats(
              memory_usage_bytes=memory_usage
          )

      return Sample(instant=datetime.now(), stats=stats), labels


  def parse_mem_usage(mem_usage: str) -> int:
      # TODO: We will implement this.
      return 0


  if __name__ == "__main__":
      main()

Note that take_sample() returns the labels (a mapping from container ID to container name) alongside the Sample object; we will need the names later to title each line on the graph.

Unfortunately, despite exporting in a machine-readable format, docker stats still uses human-readable values for the memory, like "984KiB / 90.17GiB". This means we still need to parse that data. We are going to use a regular expression to grab the number and the unit, and then use the unit to convert the value to bytes:

  import re

  def parse_mem_usage(mem_usage: str) -> int:
      parsed_mem_usage = re.match(
          r"(?P<number>[0-9]+\.?[0-9]*)(?P<unit>[^\s]+)", mem_usage
      )
      if parsed_mem_usage is None:
          raise Exception(f"Invalid Mem Usage: {mem_usage}")
      number = float(parsed_mem_usage.group("number"))
      unit = parsed_mem_usage.group("unit")
      for multiplier, identifier in enumerate(["B", "KiB", "MiB", "GiB", "TiB"]):
          if unit == identifier:
              return int(number * (1024**multiplier))
      raise Exception(f"Unrecognized unit: {unit}")

And finally, we can run our script to get a single sample (note that 1007616 bytes is exactly 984KiB, so the parser agrees with the docker stats output above):

  $ python graph_docker_memory.py
  (Sample(instant=datetime.datetime(2024, 10, 22, 17, 31, 25, 529549), stats={'1165735d5e558652cb70345367dca725e3c48e5a9b6e1aba772f4496274b1fea': Stats(memory_usage_bytes=1007616)}), {'1165735d5e558652cb70345367dca725e3c48e5a9b6e1aba772f4496274b1fea': 'sharp_chaplygin'})

Adding a loop

Now that we can read docker memory usage in Python, it's time to add our business logic. The flow for our program will be:

  1. Wait for any docker containers to exist.
  2. Record memory every 2 seconds until no docker containers exist.
  3. Print out the gnuplot file to stdout.

We will use an empty response from our take_sample() function to indicate that no containers are running. Let's update our main function. First, wait for any containers to exist:

  from time import sleep
  from typing import Final

  SAMPLE_INTERVAL_SECONDS: Final[int] = 2

  def main():
      logging.basicConfig(level=logging.INFO)
      samples: list[Sample] = []
      labels: dict[ContainerId, ContainerName] = {}
      first_pass = True
      # First wait for any docker container to exist.
      while True:
          sample, labels_in_sample = take_sample()
          if labels_in_sample:
              break
          if first_pass:
              first_pass = False
              logging.info("Waiting for a docker container to exist to start recording.")
          sleep(1)

Then we want to take samples every 2 seconds, updating our map of docker IDs to docker names, and recording their memory usage:

      # And then record memory until no containers exist.
      while True:
          sample, labels_in_sample = take_sample()
          if not labels_in_sample:
              break
          samples.append(sample)
          labels = {**labels, **labels_in_sample}
          sleep(SAMPLE_INTERVAL_SECONDS)
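
One caveat: because we sleep for a fixed 2 seconds after each sample, the real period is 2 seconds plus however long docker stats takes to respond. That's fine for our purposes, but if you wanted evenly spaced samples, a deadline-based loop is one option (a sketch reusing the names from the snippet above; this variant is not part of the final script):

  # Hypothetical variant of the recording loop: schedule samples on a fixed
  # 2-second grid so the time take_sample() itself takes does not stretch
  # the interval between samples.
  from time import monotonic, sleep

  next_deadline = monotonic()
  while True:
      sample, labels_in_sample = take_sample()
      if not labels_in_sample:
          break
      samples.append(sample)
      labels = {**labels, **labels_in_sample}
      next_deadline += SAMPLE_INTERVAL_SECONDS
      sleep(max(0.0, next_deadline - monotonic()))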

Finally, when no containers exist anymore, we want to print out our gnuplot graph:

      if labels:
          write_plot(
              samples,
              labels,
          )

  from typing import Collection

  def write_plot(
      samples: Collection[Sample],
      labels: dict[ContainerId, ContainerName],
  ):
      # TODO: We will implement this
      return

Generate the gnuplot file

Finally, we need to write out our gnuplot file. First, we print out a header that tells gnuplot information about our graph:

  def write_plot(
      samples: Collection[Sample],
      labels: dict[ContainerId, ContainerName],
  ):
      print(
          """set terminal svg background '#FFFFFF'
  set title 'Docker Memory Usage'
  set xdata time
  set timefmt '%s'
  set format x '%tH:%tM:%tS'
  # Please note this is in SI units (base 10), not IEC (base 2). So, for example, this would show a Gigabyte, not a Gibibyte.
  set format y '%.0s%cB'
  set datafile separator "|"
  """
      )

This is telling gnuplot to:

  1. Output an SVG with a white background.
  2. Sets the title of the graph.
  3. Sets the x-axis to be time.
  4. Sets the input format for the x-axis data to unix timestamps (number of seconds since Jan 1st 1970 in UTC).
  5. Sets the display format for the x-axis to be Hours:Minutes:Seconds using relative time (data starts at "0").
  6. Sets the y-axis format to use SI units (base 10, 1GB = 1,000,000,000 bytes) for bytes; a concrete example follows this list.
  7. Tells gnuplot that we will use the pipe character to separate values in our data.
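
To make the SI-versus-IEC point in item 6 concrete, here is a rough Python mimic of how '%.0s%cB' renders a byte count (si_bytes is a hypothetical helper for illustration, not part of our script):

  def si_bytes(n: float) -> str:
      # Scale by powers of 1000 (SI), not 1024 (IEC), mirroring '%.0s%cB'.
      for suffix in ["", "k", "M", "G", "T"]:
          if abs(n) < 1000:
              return f"{n:.0f}{suffix}B"
          n /= 1000
      return f"{n:.0f}PB"

  # The 984KiB (1007616 bytes) from earlier would be labeled roughly 1MB.
  assert si_bytes(984 * 1024) == "1MB"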

Since we're using relative time, we're going to need to know when each container started so we can normalize the timestamps to starting at 0:

      starting_time_per_container = {
          container_id: min(
              (sample.instant for sample in samples if container_id in sample.stats)
          )
          for container_id in labels.keys()
      }

Now, we need to tell gnuplot about each line (each docker container). The output will look like:

  "-" using 1:2 title 'my-first-container' with lines, \
  "-" using 1:2 title 'my-second-container' with lines

This is telling gnuplot to read the data inline from the same gnuplot file (that is what "-" means), with the x and y axes defined as the first and second pipe-separated values on each row (the using 1:2), and with the specified titles.

To accomplish this in the code, we'll add:

      line_definitions = ", ".join(
          [
              f""""-" using 1:2 title '{name}' with lines"""
              for container_id, name in sorted(labels.items())
          ]
      )
      print("plot", line_definitions)

And then finally we need to output the data for each container, ending each container's data with an "e" line. For example, the data for two containers, one that goes from 100 bytes to 120 bytes of memory over the span of 8 seconds, and another that goes from 300 bytes to 900 bytes of memory over the span of 6 seconds, would look like:

  0|100
  2|140
  4|290
  6|118
  8|120
  e
  0|300
  2|450
  4|650
  6|900
  e

To accomplish this in code, we loop over the containers, and over the samples for each container. We subtract the container's start time from each timestamp (which aligns all containers to start at 0 on the left of the graph, regardless of when they actually started up) and print out the data:

      for container_id in sorted(labels.keys()):
          start_time = int(starting_time_per_container[container_id].timestamp())
          for sample in sorted(samples, key=lambda x: x.instant):
              if container_id in sample.stats:
                  print(
                      "|".join(
                          [
                              str(int((sample.instant).timestamp()) - start_time),
                              str(sample.stats[container_id].memory_usage_bytes),
                          ]
                      )
                  )
          print("e")

Run and render

That is everything we need. The full script is below, but first, to test the script:

  1. Run python graph_docker_memory.py | tee memory.gnuplot
  2. Launch a docker container or two, and perform some actions like installing a package. This gives us changing data to create a more interesting line.
  3. Close all docker containers you have open.

Now memory.gnuplot should contain a full gnuplot definition to graph the memory usage of your containers. We can generate an SVG with:

  gnuplot memory.gnuplot > memory.svg

And then open memory.svg in your preferred image viewer and/or web browser to see your graph. It should look like:

[Figure: memory.svg, a line graph of each container's memory usage over time]

You may notice that the underscore in the container names renders oddly: gnuplot's enhanced text mode treats an underscore as a subscript marker. We just need to escape the underscore when writing the container names. This has been added to the full script below. The full script also adds support for defining horizontal lines (for example, if you have a memory limit of 1GiB, you might want to put a red horizontal line at 1GiB).
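
For example, using the gnuplot_escape helper from the full script below:

  # Prefixing "_" with a backslash stops gnuplot's enhanced text mode from
  # rendering the rest of the name as a subscript.
  assert gnuplot_escape("sharp_chaplygin") == "sharp\\_chaplygin"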

Full script

Below is the full version of the script in one piece:

  #!/usr/bin/env python
  from __future__ import annotations

  import json
  import logging
  import re
  import subprocess
  from dataclasses import dataclass
  from datetime import datetime, timedelta
  from time import sleep
  from typing import Collection, Final, NewType, Tuple

  ContainerId = NewType("ContainerId", str)
  ContainerName = NewType("ContainerName", str)

  SAMPLE_INTERVAL_SECONDS: Final[int] = 2


  @dataclass
  class Sample:
      instant: datetime
      stats: dict[ContainerId, Stats]


  @dataclass
  class Stats:
      memory_usage_bytes: int


  def main():
      logging.basicConfig(level=logging.INFO)
      samples: list[Sample] = []
      labels: dict[ContainerId, ContainerName] = {}
      first_pass = True
      # First wait for any docker container to exist.
      while True:
          sample, labels_in_sample = take_sample()
          if labels_in_sample:
              break
          if first_pass:
              first_pass = False
              logging.info("Waiting for a docker container to exist to start recording.")
          sleep(1)
      # And then record memory until no containers exist.
      while True:
          sample, labels_in_sample = take_sample()
          if not labels_in_sample:
              break
          samples.append(sample)
          labels = {**labels, **labels_in_sample}
          sleep(SAMPLE_INTERVAL_SECONDS)
      if labels:
          # Draws a red horizontal line at 32 GiB since that is the memory limit for Cloud Run.
          write_plot(
              samples,
              labels,
              horizontal_lines=[(32 * 1024**3, "red", "Cloud Run Max Memory")],
          )


  def write_plot(
      samples: Collection[Sample],
      labels: dict[ContainerId, ContainerName],
      *,
      horizontal_lines: Collection[Tuple[int, str, str | None]] = [],
  ):
      starting_time_per_container = {
          container_id: min(
              (sample.instant for sample in samples if container_id in sample.stats)
          )
          for container_id in labels.keys()
      }
      print(
          """set terminal svg background '#FFFFFF'
  set title 'Docker Memory Usage'
  set xdata time
  set timefmt '%s'
  set format x '%tH:%tM:%tS'
  # Please note this is in SI units (base 10), not IEC (base 2). So, for example, this would show a Gigabyte, not a Gibibyte.
  set format y '%.0s%cB'
  set datafile separator "|"
  """
      )
      for y_value, color, label in horizontal_lines:
          print(
              f'''set arrow from graph 0, first {y_value} to graph 1, first {y_value} nohead linewidth 2 linecolor rgb "{color}"'''
          )
          if label is not None:
              print(f"""set label "{label}" at graph 0, first {y_value} offset 1,-0.5""")

      # Include the horizontal lines in the range
      if len(horizontal_lines) > 0:
          print(f"""set yrange [*:{max(x[0] for x in horizontal_lines)}<*]""")
      line_definitions = ", ".join(
          [
              f""""-" using 1:2 title '{gnuplot_escape(name)}' with lines"""
              for container_id, name in sorted(labels.items())
          ]
      )
      print("plot", line_definitions)
      for container_id in sorted(labels.keys()):
          start_time = int(starting_time_per_container[container_id].timestamp())
          for sample in sorted(samples, key=lambda x: x.instant):
              if container_id in sample.stats:
                  print(
                      "|".join(
                          [
                              str(int((sample.instant).timestamp()) - start_time),
                              str(sample.stats[container_id].memory_usage_bytes),
                          ]
                      )
                  )
          print("e")


  def gnuplot_escape(inp: str) -> str:
      # Prefix each underscore with a backslash so gnuplot's enhanced text
      # mode does not render it as a subscript.
      out = ""
      for c in inp:
          if c == "_":
              out += "\\"
          out += c
      return out


  def take_sample() -> Tuple[Sample, dict[ContainerId, ContainerName]]:
      labels: dict[ContainerId, ContainerName] = {}
      stats: dict[ContainerId, Stats] = {}
      docker_inspect = subprocess.run(
          ["docker", "stats", "--no-stream", "--no-trunc", "--format", "json"],
          stdout=subprocess.PIPE,
      )
      for container_stat in (
          json.loads(l) for l in docker_inspect.stdout.decode("utf8").splitlines()
      ):
          if not container_stat["ID"]:
              # When containers are starting up, they sometimes have no ID and "--" as the name.
              continue
          labels[ContainerId(container_stat["ID"])] = ContainerName(
              container_stat["Name"]
          )
          memory_usage = parse_mem_usage(container_stat["MemUsage"])
          stats[ContainerId(container_stat["ID"])] = Stats(
              memory_usage_bytes=memory_usage
          )
      for container_id, container_stat in stats.items():
          logging.info(
              f"Recorded stat {labels[container_id]}: {container_stat.memory_usage_bytes} bytes"
          )
      return Sample(instant=datetime.now(), stats=stats), labels


  def parse_mem_usage(mem_usage: str) -> int:
      parsed_mem_usage = re.match(
          r"(?P<number>[0-9]+\.?[0-9]*)(?P<unit>[^\s]+)", mem_usage
      )
      if parsed_mem_usage is None:
          raise Exception(f"Invalid Mem Usage: {mem_usage}")
      number = float(parsed_mem_usage.group("number"))
      unit = parsed_mem_usage.group("unit")
      for multiplier, identifier in enumerate(["B", "KiB", "MiB", "GiB", "TiB"]):
          if unit == identifier:
              return int(number * (1024**multiplier))
      raise Exception(f"Unrecognized unit: {unit}")


  if __name__ == "__main__":
      main()