New project: comparqter

Today I released comparqter publicly. It’s a small CLI tool I’ve been working on in my spare time for a few weeks; it compacts (and, optionally, re-compresses) the files in a data lake, typically one hosted on S3.

One of my clients had this exact issue: their data lakes were filled with hundreds or thousands of tiny files, most holding just one or two rows, some a few hundred. This meant that whenever Athena or another tool needed to query the data, thousands upon thousands of files had to be scanned and analysed.
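
The fix is conceptually simple: rewrite lots of tiny objects as a handful of large ones. Here’s a minimal sketch of that planning step; the function name, the ~128 MiB target, and the greedy batching strategy are my own illustration of the general idea, not comparqter’s actual implementation.

```rust
// A sketch of greedy compaction planning: walk the object listing and cut a
// new batch whenever adding the next object would push the batch past the
// target size. Each batch can then be rewritten as one large file.
fn plan_batches(objects: &[(String, u64)], target_bytes: u64) -> Vec<Vec<String>> {
    let mut batches = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_size: u64 = 0;

    for (key, size) in objects {
        if current_size + size > target_bytes && !current.is_empty() {
            batches.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(key.clone());
        current_size += size;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // 5,000 hypothetical ~40 KiB files collapse into two ~128 MiB batches.
    let objects: Vec<(String, u64)> = (0..5_000)
        .map(|i| (format!("lake/events/part-{i:05}.parquet"), 40 * 1024))
        .collect();
    let batches = plan_batches(&objects, 128 * 1024 * 1024);
    println!("{} objects -> {} batches", objects.len(), batches.len());
}
```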

This hurts throughput, but more importantly for them, it generates an enormous number of HEAD and GET requests against S3, which gets expensive quickly. Based on what I could see in their AWS billing, I would expect re-compacting their lakes to cut their ongoing S3 costs by about 20-40%.
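
To make the request-cost angle concrete, here’s a back-of-the-envelope calculation. The per-request price is the published S3 Standard rate in us-east-1; the file counts and query volume are invented purely for illustration, not my client’s figures.

```rust
fn main() {
    // Hypothetical volumes, invented for illustration only.
    let queries_per_day = 100.0_f64; // scans that touch the whole prefix
    let requests_per_file = 2.0;     // e.g. one HEAD plus one GET per object
    // Published S3 Standard rate (us-east-1): ~$0.0004 per 1,000 GET/HEAD requests.
    let price_per_1k = 0.0004;

    let daily_cost =
        |files: f64| files * queries_per_day * requests_per_file / 1_000.0 * price_per_1k;

    let before = daily_cost(2_000_000.0); // two million tiny files
    let after = daily_cost(10_000.0);     // same data compacted into larger files
    println!("request cost per day: ${before:.2} before vs ${after:.2} after");
}
```

Even at these made-up volumes the tiny-file layout costs 200x more in requests alone, before counting the throughput win.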

And now they have a tool to do it for free :).

Well, the basis for a tool. This is just a quick first iteration (it is version 0.1.0, after all) and it still needs a great deal of extra testing. I have a few ideas for what could be implemented next:

  • Support getting configuration from env vars instead of CLI args (see the sketch after this list)
  • Support for local file systems
  • Support for other clouds
    • The AWS stuff would have to be gated behind a feature
  • Generate a report with the source files that were compacted
  • Generate a report of the newly created files
  • Improved logging
    • JSON logging, probably feature-gated
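
For the env-var item: if the CLI is (or ends up) built on clap, which is an assumption on my part, the fallback comes almost for free via clap’s env cargo feature. The flag and variable names below are hypothetical, not comparqter’s real interface:

```rust
// Requires clap with its "derive" and "env" cargo features enabled.
use clap::Parser;

#[derive(Parser, Debug)]
struct Args {
    /// Bucket holding the data lake; falls back to $COMPARQTER_BUCKET
    /// when the flag is omitted. (Hypothetical names.)
    #[arg(long, env = "COMPARQTER_BUCKET")]
    bucket: String,

    /// Target size for compacted files, in bytes.
    #[arg(long, env = "COMPARQTER_TARGET_SIZE", default_value_t = 128 * 1024 * 1024)]
    target_size: u64,
}

fn main() {
    let args = Args::parse();
    println!("compacting s3://{} into ~{}-byte files", args.bucket, args.target_size);
}
```

With a definition like that, setting COMPARQTER_BUCKET in the environment and passing --bucket on the command line would be interchangeable.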

And many more things! I’m happy to review any and all contributions, provided they aren’t just AI slop. If you need a particular feature added, feel free to get in touch—I’m happy to be sponsored.