Grafana Loki - Accidental Cardinality

For ROSE Online, we use Grafana and Grafana Loki to ingest our logs. Previously our services output logfmt, and we simply dumped stdout/stderr into one giant log file per server, e.g.

-a---           7/25/2023  9:11 PM          10927 worldserver.log
-a---           7/25/2023  9:11 PM           1907 loginserver.log
-a---           7/25/2023  7:59 PM           9042 gameserver.log

For a while this worked for us because we run our services under NSSM, which has an option to rotate the logs every time a service is restarted. In the early days of our deployment, issues in our code base meant this happened often enough that we accumulated a fair number of rotated logs, which we could pull off the server before its disk filled up.

As our services became more reliable, the time between restarts grew longer and longer (3-4 weeks between game updates on average), and we risked having to force a restart just to rotate the logs.

We decided to rewrite our logging code so that the services themselves rotate their logs daily rather than relying on NSSM to do so. While we were at it, we updated our logs to output JSONL instead of logfmt. However, after this switch we started to have some problems.
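
For context, grafana-agent tails these rotated files and pushes them to Loki. A rough sketch of the relevant scrape config is below; the job name, paths, and Loki URL are illustrative stand-ins rather than our exact production values.

logs:
  positions_directory: C:\rose\grafana-agent\data
  configs:
    - name: rose
      clients:
        # Illustrative Loki endpoint, not our real one.
        - url: http://loki.example.com:3100/loki/api/v1/push
      scrape_configs:
        - job_name: gameserver
          static_configs:
            - targets: [localhost]
              labels:
                job: gameserver
                # The glob picks up each day's rotated file as it appears.
                __path__: C:\rose\server\logs\gameserver.*.log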

Timestamp sensitivity

After switching, we immediately noticed that Loki was ingesting our logs but we could not search them. After hours of investigation we discovered that the Loki timestamp pipeline stage parses times with Go's time package (https://golang.org/pkg/time/#Parse). One of our services was outputting timestamps with nine digits of fractional-second precision, i.e. nanoseconds (e.g. 2023-08-18T01:01:01.12345689). It appears this timestamp could not be parsed even when we gave the timestamp stage a source format with nine fractional digits; the format we ultimately kept, '2006-01-02T15:04:05.999999Z', has six. Updating our timestamps to use only six fractional digits (microsecond precision) resolved the issue, and we could search our logs.
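
Concretely, the timestamp stage we ended up with is the one shown in the full pipeline config further down; pulling it out here with a comment on the precision issue:

# The format string uses Go's reference-time layout. Our services now emit six
# fractional digits (microseconds), which this layout handles; the original
# nine-digit (nanosecond) output was what Loki refused to parse for us.
- timestamp:
    source: timestamp
    format: '2006-01-02T15:04:05.999999Z'
    location: UTC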

Slow ingest

Once we solved the timestamp issue, searches were still coming up short. After some investigation we realized Loki was simply missing the logs we expected: anything recent would not show up in a search. We took a look at the grafana-agent positions file and noticed something interesting:

PS C:\rose\server> ls .\logs\

    Directory: C:\rose\server\logs

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---           8/11/2023 12:00 AM      249845292 gameserver.2023-08-10.log
-a---           8/12/2023 12:00 AM     7244976598 gameserver.2023-08-11.log
-a---           8/13/2023 12:00 AM     5733769523 gameserver.2023-08-12.log
-a---           8/14/2023 12:00 AM     5472904189 gameserver.2023-08-13.log
-a---           8/15/2023 12:00 AM     4637788245 gameserver.2023-08-14.log
-a---           8/16/2023 12:00 AM     5250456776 gameserver.2023-08-15.log
-a---           8/17/2023 12:00 AM     5069492384 gameserver.2023-08-16.log
-a---           8/17/2023 12:00 AM              0 gameserver.2023-08-17.log

PS C:\rose\server> cat ..\grafana-agent\data\positions.yaml
positions:
  C:\rose\server\logs\gameserver.2023-08-10.log: "249845292"
  C:\rose\server\logs\gameserver.2023-08-11.log: "454698984"
  C:\rose\server\logs\gameserver.2023-08-12.log: "383480538"
  C:\rose\server\logs\gameserver.2023-08-13.log: "274764448"
  C:\rose\server\logs\gameserver.2023-08-14.log: "183574525"
  C:\rose\server\logs\gameserver.2023-08-15.log: "117410015"
  C:\rose\server\logs\gameserver.2023-08-16.log: "66995791"
  C:\rose\server\logs\gameserver.2023-08-17.log: "27146761"

Basically, grafana-agent was falling behind. Each time a new file was rotated in, the agent would start ingesting it, but the older files were never completed, and as time went on it fell further and further behind on each of them. In fact, we reached a point where all of the log files were partially ingested but none were finished!

After we opened a ticket with Grafana support, they pointed us to a suspicious message in our grafana-agent logs:

"server returned HTTP status 429 Too Many Requests (429): Maximum active stream limit exceeded, reduce the number of active streams (reduce labels or reduce label values), or contact your Loki administrator to see if the limit can be increased"

This led us to evaluate the dynamic label usage in our grafana-agent config file. This is what one of our pipeline sections looked like:

 pipeline_stages:
    - json:
        expressions:
            timestamp:
            level:
            target:
            filename:
            line_number:
            fields:
    - labels:
        level: level
        target: target
        filename: filename
        line_number: line_number
        time: timestamp
    - timestamp:
        source: timestamp
        format: '2006-01-02T15:04:05.999999Z'
        location: UTC
    - output:
        source: fields

Our error is obvious to an experienced Grafana Loki user: we violated the most fundamental rule of Loki, avoid high cardinality. The main offender was using our timestamp as a dynamic label, which effectively turns every log line into its own stream and produces an enormous number of active streams. For good measure we also removed the line_number and filename labels, but since we still wanted that information to be available to us, we removed the output section so the whole JSON line is stored and those fields can still be filtered at query time with LogQL's json parser. Our final config looks like this:

 pipeline_stages:
    - json:
        expressions:
            timestamp:
            level:
            target:
            filename:
            line_number:
            fields:
    - labels:
        level: level
        target: target
    - timestamp:
        source: timestamp
        format: '2006-01-02T15:04:05.999999Z'
        location: UTC

After restarting grafana-agent with this new config, ingestion was cruising and the issue was resolved; our entire backlog was cleared in less than an hour.