How to monitor the size of a directory via Telegraf
K

5

9

We need to monitor the size of a directory (for example the data directory of InfluxDB) to set up alerts in Grafana. As mentioned in "How to configure telegraf to send a folder-size to influxDB", there is no built-in plugin for this.

We don't mind using the inputs.exec section of Telegraf. The directories are not huge (low filecount + dircount), so deep scanning (like the use of du) is fine by us.

One of the directories we need to monitor is /var/lib/influxdb/data.

What would be a simple script to execute, and what are the caveats?

Kremlin answered 6/6, 2017 at 9:27 Comment(0)
E
8

It's possible natively with the filecount plugin:

[[inputs.filecount]]
directories = ["/var/lib/influxdb/engine/data"]

Output:

> filecount,directory=/var/lib/influxdb/engine/data,host=psg count=424i,size_bytes=387980393i 1652195855000000000
Eason answered 10/5, 2022 at 15:14 Comment(1)
Great, that seems to be the only good answer as of now. I switched my answer flag. - Kremlin
K
11

You could create a simple bash script metrics-exec_du.sh with the following content (chmod 755):

#!/usr/bin/env bash
du -bs "${1}" | awk '{print "[ { \"bytes\": "$1", \"dudir\": \""$2"\" } ]";}'

And activate it by putting the following in the Telegraf config file:

[[inputs.exec]]
  commands = [ "YOUR_PATH/metrics-exec_du.sh /var/lib/influxdb/data" ]
  timeout = "5s"
  name_override = "du"
  name_suffix = ""
  data_format = "json"
  tag_keys = [ "dudir" ]

Caveats:

  1. The du command can stress your server, so use with care
  2. The user telegraf must be able to scan the dirs. There are several options, but since InfluxDB's directory mask is a bit unspecified (see: https://github.com/influxdata/influxdb/issues/5171#issuecomment-306419800), we applied a rather crude workaround (examples are for Ubuntu 16.04.2 LTS):
    • Add the influxdb group to the user telegraf : sudo usermod --groups influxdb --append telegraf
    • Put the following in the crontab to run, for example, every 10 minutes: */10 * * * * chmod -R g+rX /var/lib/influxdb/data > /var/log/influxdb/chmodfix.log 2>&1
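
Both caveats can be checked from a shell before wiring the script into Telegraf. The following is only a minimal sketch, assuming GNU find (for -readable and -executable) and GNU du (for -b); the function names unscannable_dirs and du_bytes_niced are made up for illustration and are not part of the original script. Run it as the telegraf user, because root passes every permission test.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, GNU find and GNU du are assumed.

# List directories under a tree that the current user cannot scan.
# A directory needs read permission (to list it) and execute permission (to enter it).
unscannable_dirs() {
  find "$1" -type d \( ! -readable -o ! -executable \) -print 2>/dev/null
}

# Same byte count as "du -bs", but run at the lowest CPU priority
# to reduce the impact on a busy server. Prints only the number.
du_bytes_niced() {
  nice -n 19 du -bs -- "$1" | cut -f1
}
```

For example, `unscannable_dirs /var/lib/influxdb/data` printing nothing suggests that du can walk the whole tree as the current user.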

Result, configured in Grafana (data source: InfluxDB): [image: Grafana dirsize monitoring]

Cheers, TW

Kremlin answered 6/6, 2017 at 9:29 Comment(1)
Hi Tw Bert. I am trying to do something similar to this but with telegraf and influxdb running in containers. Here is my question: #77085692. Do you know a way to fix the permissions properly? It seems a little clunky to periodically change permissions. Any suggestion would be awesome. Thanks - Aeroembolism
D
9

If you need to monitor multiple directories, I updated the answer by Tw Bert and extended it to allow you to pass them all on one command line. This saves you having to add multiple [[inputs.exec]] entries to your telegraf.conf file.

Create the file /etc/telegraf/scripts/disk-usage.sh containing:

#!/bin/bash

echo "["
du -ks "$@" | awk '{if (NR!=1) {printf ",\n"};printf "  { \"directory_size_kilobytes\": "$1", \"path\": \""$2"\" }";}'
echo
echo "]"

I want to monitor two directories: /mnt/user/appdata/influxdb and /mnt/user/appdata/grafana. I can do something like this:

# Get disk usage for multiple directories
[[inputs.exec]]
  commands = [ "/etc/telegraf/scripts/disk-usage.sh /mnt/user/appdata/influxdb /mnt/user/appdata/grafana" ]
  timeout = "5s"
  name_override = "du"
  name_suffix = ""
  data_format = "json"
  tag_keys = [ "path" ]

Once you've updated your config, you can test this with:

telegraf --debug --config /etc/telegraf/telegraf.conf --input-filter exec --test

Which should show you what Telegraf will push to influx:

bash-4.3# telegraf --debug --config /etc/telegraf/telegraf.conf --input-filter exec --test
> du,host=SomeHost,path=/mnt/user/appdata/influxdb directory_size_kilobytes=80928 1536297559000000000
> du,host=SomeHost,path=/mnt/user/appdata/grafana directory_size_kilobytes=596 1536297559000000000
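
One thing to watch with the awk call above: it splits on any whitespace, so a path containing a space would be cut off at the first space. A possible variant, shown here only as a sketch, splits on the tab that du puts between size and path. The function name disk_usage_json is made up for illustration, and the sketch assumes paths contain no tabs, newlines, double quotes or backslashes, since it does no JSON escaping.

```shell
#!/usr/bin/env bash
# Sketch only: disk_usage_json is a hypothetical name.
# Assumes paths without tabs, newlines, double quotes or backslashes.

disk_usage_json() {
  # du prints "<size><TAB><path>", so use the tab as the only field separator.
  du -ks -- "$@" | awk -F'\t' '
    BEGIN { printf "[" }
    { printf "%s\n  { \"directory_size_kilobytes\": %s, \"path\": \"%s\" }", (NR > 1 ? "," : ""), $1, $2 }
    END { print "\n]" }
  '
}
```

Saved as a script that ends with `disk_usage_json "$@"`, it could be called from the same [[inputs.exec]] entry with the same tag_keys setting.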
Deterioration answered 7/9, 2018 at 5:20 Comment(1)
You better use "$@" though with du. https://mcmap.net/q/1170236/-difference-between-and-duplicate - Independent
R
1

The solutions already provided look good to me, and highlighting caveats such as read permission is great. An alternative worth mentioning is using Telegraf's disk input to collect the data, as proposed in "monitor diskspace on influxdb with telegraf".

[[outputs.influxdb]]
  urls = ["udp://your_host:8089"]
  database = "telegraf_metrics"

  ## Retention policy to write to. Empty string writes to the default rp.
  retention_policy = ""
  ## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
  write_consistency = "any"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s" 

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gather stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points = ["/"]

  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  ignore_fs = ["tmpfs", "devtmpfs"]

Note: the timeout should be considered carefully. Maybe hourly readings would be sufficient to avoid exhaustion by logging.
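
For comparison, inputs.disk reports usage per mount point, which corresponds to what df shows for the filesystem holding a path, whereas du sums up a single directory tree. A minimal shell sketch of the two readings, assuming the POSIX output format of df -P; the function names fs_used_kb and dir_size_kb are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Kilobytes used on the whole filesystem that contains the path (df view).
# With -P the data line has the columns: Filesystem, 1024-blocks, Used, ...
fs_used_kb() {
  df -Pk -- "$1" | awk 'NR == 2 { print $3 }'
}

# Kilobytes used by the directory tree rooted at the path (du view).
dir_size_kb() {
  du -ks -- "$1" | cut -f1
}
```

The two numbers match only when the directory is the sole content of its filesystem, so the df figure is best read as an upper bound for the directory.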

Rhapsodist answered 4/11, 2020 at 9:32 Comment(1)
Just skimming this, but I think this is df vs du. Different stats. Good to mention the alternative though. - Kremlin
D
1

I'm posting a new answer because the other solutions did not consider the fact that telegraf runs under the telegraf user by default, and that user does not have permission to list files under most directories. Shell scripts cannot have the suid bit set, and all of the provided solutions (so far) either require the telegraf user to have access to all monitored directories, or run telegraf with a different user. All of these "solutions" pose security risks.

I have created a small project here https://github.com/nagylzs/dudir to overcome these problems. It contains instructions about how to use it.

  • The safest way is to hard-code the directory names into a new executable, set the suid bit and call it from telegraf.
  • The second safest way is to use the original version of the program, and pass directory names on the command line. It still increases security because you cannot read file contents or list directory contents; it only lets you get the directory size. In this case, I would do chown root:telegraf dudir and then chmod 4550 dudir.
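
After running the chown and chmod commands above, the result can be double-checked from a shell. This is only a sketch, assuming GNU stat; the helper names helper_perms and has_perms are made up for illustration and are not part of the dudir project.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, GNU stat is assumed.

# Print "owner:group mode" for a file, for example "root:telegraf 4550".
helper_perms() {
  stat -c '%U:%G %a' -- "$1"
}

# Succeed only when the file has exactly the given ownership and octal mode.
has_perms() {
  [ "$(helper_perms "$1")" = "$2" ]
}
```

For example, `has_perms ./dudir "root:telegraf 4550" || echo "unexpected permissions"` would flag a binary whose owner, group or mode differs from the second setup described above.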
Dandify answered 23/8 at 7:17 Comment(7)
This is interesting, thanks. Just to check I understand - it's almost the same method as the answer by @Deterioration, but you are allowed to execute the command by any user (in this case telegraf), rather than needing actual permissions? Is the SUID bit set on your dudir script itself only? Or do you need to set it on the files that are trying to be measured with du? - Aeroembolism
Yes, you need to set the suid bit on the dudir executable only. It will run in the name of the user who owns the executable. So if you do chown root:root dudir then it will be able to read all files in all directories. If you do chown postgres: dudir then it will be able to read PGDATA dir etc. The suid flag works on executables, but it does not work on shell scripts. See faqs.org/faqs/unix-faq/faq/part4/section-7.html - you can still do it by patching the kernel, but that is a bad idea. - Dandify
Ah I think I understand. So this is one of the reasons that you have this .go script, which is acting as a bridge, and running the commands instead, such as this: exec.Command("du", "-b", "-d0", d) ? - Aeroembolism
Since the du program is also an executable, it would also be possible to copy that and set the suid flag on it. But du is part of the system, and can be updated by the OS anytime. It is better to write your own program, and restrict it as much as possible. For example, only the telegraf group should have the eXecute flag, and preferably it should not accept the directories as arguments. It is safer to compile the directories into the program as a fixed list. The more flexible the program, the more security risks it poses. - Dandify
Oh, and it is not a ".go script". Go is not an interpreted language. It is a go source file, and you cannot directly run it. You have to compile it into a binary executable. - Dandify
Great, makes sense, thanks for clarifying. - Aeroembolism
I have a bunch of external NFS mounted drives, and want to monitor them. I think yours is a nice solution for that, because it is a bit of a pain to manage telegraf permissions across multiple drives owned by all different users. Thanks for presenting it. - Aeroembolism
