InfluxDB Corrupted Data

March 26th, 2023

InfluxDB (as of version 2.6) does not seem to recover normal operation after the disk runs out of space, even once space has been freed. New data is kept in memory, which makes it look as if everything works, while nothing is written to disk anymore. Moreover, a (potentially) partially written WAL (write-ahead log) file is recognized as corrupt on every subsequent startup of InfluxDB and is ignored for subsequent writes.

The solution is easy: just delete the corrupt WAL file. To identify the right WAL file, check the logs with journalctl -u influxdb.service and analyze the WAL files with influxd inspect verify-wal --wal-path ... or influxd inspect dump-wal <wal_file_path>.
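If you want to try recovering data from the file later, moving it out of the WAL directory may be safer than deleting it outright. Below is a minimal sketch of that idea. The function name set_aside_wal, the backup directory and the ".corrupt" suffix are my own choices for illustration, and I am assuming InfluxDB is stopped while the file is moved.

```python
import shutil
import tempfile
import time
from pathlib import Path


def set_aside_wal(wal_file, backup_dir):
    """Move wal_file into backup_dir under a timestamped name; return the new path."""
    wal_file = Path(wal_file)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp plus ".corrupt" suffix (hypothetical naming) avoids clobbering older backups
    target = backup_dir / f"{wal_file.name}.{int(time.time())}.corrupt"
    shutil.move(str(wal_file), str(target))
    return target


# Demo on a throwaway file; "_00001.wal" is only a placeholder name
with tempfile.TemporaryDirectory() as tmp:
    wal = Path(tmp) / "_00001.wal"
    wal.write_bytes(b"\x01\x00\x00\x00\x00")
    moved = set_aside_wal(wal, Path(tmp) / "backup")
    still_there = wal.exists()          # original location is now empty
    moved_size = moved.stat().st_size   # content was carried over unchanged
```

The moved copy can then serve as the input file for any recovery attempt, while InfluxDB starts without the corrupt WAL in place.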

However, how can the data be restored from the WAL file? I could not find any command-line tool doing exactly that, but with the help of In-memory indexing with TSM I was able to write some simple code that, depending on your needs, can be used as a base for a more complex restoration.

Since the data is compressed using Google's Snappy algorithm, we need a library for that: pip install python-snappy.

import snappy

in_path = "PATH TO YOUR CORRUPTED WAL FILE"
out_path = "PATH TO A NEW WAL FILE"

count = 0
with open(in_path, mode='rb') as in_file:
  with open(out_path, mode='wb') as out_file:
    while True:
      op_type = in_file.read(1) # first byte is an operation code

      if op_type == b"":
        print('file end', in_file.tell())
        break

      if op_type[0] == 1 or op_type[0] == 0: # use it to identify out-of-sync
        count += 1 # just for statistics
        length_b = in_file.read(4) # length of the field
        length = int.from_bytes(length_b, "big") # in my case they were big-endian
        print('id', count, 'op_type', op_type, 'length', length)
        d_raw = in_file.read(length)
        try:
          d = snappy.uncompress(d_raw) # a real test if the data is not corrupt
          # print(d)

          # copy good data to a new file
          out_file.write(op_type)
          out_file.write(length_b)
          out_file.write(d_raw)

        except Exception as e:
          # the current entry wasn't readable, skip it
          print('exception', e)

      else:
        print('id', count, 'unexpected op type', op_type, 'at file position', in_file.tell())
        # if this is your case and you expect more valid data in the file, you may try to re-sync here
        break

print('total entries found', count)
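
The script above simply stops at the first unexpected op type. A rough sketch of how the re-sync mentioned in its comment might look: scan forward byte by byte until a position is found where the op type, the length and the compressed payload all look valid. The name find_next_entry, the max_length sanity limit and the skipping of zero-length entries are my assumptions, not anything InfluxDB defines, and zlib is used only as a stand-in codec so the example runs without extra packages.

```python
import zlib  # stand-in codec for the demo only; the real WAL payload is Snappy

VALID_OP_TYPES = (0, 1)  # the same op codes the script above accepts
HEADER_SIZE = 5          # 1 byte op type + 4 bytes big-endian length


def find_next_entry(buf, start, decompress, max_length=64 * 1024 * 1024):
    """Return the offset of the next plausible entry at or after start, or -1."""
    for pos in range(start, len(buf) - HEADER_SIZE + 1):
        if buf[pos] not in VALID_OP_TYPES:
            continue
        length = int.from_bytes(buf[pos + 1:pos + HEADER_SIZE], "big")
        end = pos + HEADER_SIZE + length
        # Assumed sanity checks: non-empty, not absurdly large, fits in the buffer
        if length == 0 or length > max_length or end > len(buf):
            continue
        try:
            decompress(buf[pos + HEADER_SIZE:end])
        except Exception:
            continue  # header looked plausible, but the payload is not valid
        return pos
    return -1


# Demo: one valid entry hidden behind 9 bytes of garbage
payload = zlib.compress(b"cpu,host=a value=1 1679800000000000000")
entry = bytes([1]) + len(payload).to_bytes(4, "big") + payload
damaged = b"\xff\x07garbage" + entry
offset = find_next_entry(damaged, 0, zlib.decompress)
print(offset)  # 9
```

With a real WAL file, one could read the remainder of the file into memory at the point of the break, pass snappy.uncompress as the decompress argument, and seek to the returned offset before continuing the loop. A short garbage sequence can still decompress by accident, so treat any entries recovered this way with some suspicion.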
