New project: comparqter
Today I released comparqter publicly. It's a small CLI tool that I've been working on for a few weeks in my spare time that enables the compaction (and potentially re-compression) of many small Parquet files in a data lake, typically hosted on S3.
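If you're wondering what "compaction" actually means here, the idea is simple: read lots of tiny Parquet files and rewrite them as one big one. As a rough sketch (not comparqter's actual code; the bucket, prefix, region, and codec below are all made up for illustration), it boils down to something like this with pyarrow:

```python
# A minimal sketch of Parquet compaction, assuming hypothetical
# bucket/prefix names. comparqter itself does more than this; the
# point is only to show the read-many, write-one idea.
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(region="eu-west-1")  # assumed region

# Treat every small file under the prefix as one logical dataset.
dataset = ds.dataset("my-lake/events/2024/", filesystem=s3, format="parquet")

# Stream the rows back out as a single compacted file, optionally
# switching the compression codec at the same time.
with pq.ParquetWriter(
    "my-lake/events-compacted/2024.parquet",
    schema=dataset.schema,
    filesystem=s3,
    compression="zstd",  # re-compression happens here
) as writer:
    for batch in dataset.to_batches():
        writer.write_batch(batch)
```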
One of my clients had exactly this issue: their data lakes were filled with hundreds or thousands of tiny files containing anywhere from one or two rows to a few hundred. This meant that when Athena or another tool needed to query the data, thousands upon thousands of files had to be scanned and analysed. That creates a real throughput problem, but more importantly for them, it generates massive numbers of HEAD and GET requests against S3, which ends up being quite expensive. From what I could see in their AWS billing, I would expect re-compacting their lakes to reduce their ongoing S3 costs by about 20-40%. And now they have a tool to do it for free :).

Well, the basis for a tool. This is just a quick first iteration (it is version 0.1.0, after all) and it still needs a massive amount of extra testing. I have a few ideas for what could be implemented next from a logical project perspective:

And many more things! I'm happy to review any and all contributions, provided they aren't just AI slop. If you need a particular feature added, feel free to get in touch; I'm happy to be sponsored.