Hacker News

It was really hard to resist spilling the beans about OpenZL on this recent HN post about compressing genomic sequence data [0]. It's a great example of the really simple transformations you can perform on data that can unlock significant compression improvements. OpenZL can perform that transformation internally (quite easily with SDDL!).

[0] https://news.ycombinator.com/item?id=45223827
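For a flavor of the kind of simple transformation meant here, a minimal sketch (not OpenZL's actual API, and assuming a plain ACGT alphabet with no ambiguity codes): pack each base into 2 bits before handing the stream to a generic compressor, quartering the input size up front.

```python
# Hypothetical illustration only: pack ACGT bases into 2 bits each.
# OpenZL would express a step like this as a node in its transform graph.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_bases(seq: str) -> bytes:
    """Pack a DNA string (ACGT only) into 2 bits per base."""
    out = bytearray()
    acc = 0
    nbits = 0
    for base in seq:
        acc = (acc << 2) | CODES[base]
        nbits += 2
        if nbits == 8:
            out.append(acc)
            acc = 0
            nbits = 0
    if nbits:  # pad the final partial byte with zero bits
        out.append(acc << (8 - nbits))
    return bytes(out)

packed = pack_bases("ACGTACGTACGT")
assert len(packed) == 3  # 12 bases -> 24 bits -> 3 bytes
```

The win is that the entropy coder downstream no longer spends effort rediscovering that only four symbols occur.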



That post immediately came to my mind too! Do you have a comparison to share against the specialized compressor mentioned in the OP there?

> Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. (...) Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together.


Author of [0] here. Congratulations and well done for resisting. Eager to try it!

Edit: Do you have any specific advice for training a FASTA compressor beyond that given in e.g. "Using OpenZL" (https://openzl.org/getting-started/using-openzl/)?


I'd love to see some benchmarks for this on some common genomic formats (fa, fq, sam, vcf). Will be doubly interesting to see its applicability to nanopore data - lots of useful data is lost because storing FAST5/POD5 is a pain.


OpenZL compressed SAM/BAM vs. CRAM is the interesting comparison. It would really test the flexibility of the framework. Can OpenZL reach the same level of compression, and how much effort does it take?

I would not expect much improvement in compressing nanopore data. If you have a useful model of the data, creating a custom compressor is not that difficult. It takes some effort, but those formats are popular enough that compressors using the known models should already exist.


Do you happen to have a pointer to a good open source dataset to look at?

Naively, and knowing little about CRAM, I would expect that OpenZL would beat Zstd handily out of the box, but need additional capabilities to match the performance of CRAM, since genomics hasn't been a focus as of yet. But it would be interesting to see how much of what we need to add is generic to all compression (but useful for genomics), vs. techniques that are specific only to genomics.

We're planning on setting up a blog on our website to highlight use cases of OpenZL. I'd love to make a post about this.


For BAM this could be a good place to start: https://www.htslib.org/benchmarks/CRAM.html

Happy to discuss further


Amazing, thank you!

I will take a look as soon as I get a chance. Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting.
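To illustrate why that split matters (a toy sketch, with a record layout invented for the example, not the actual BAM layout): tokenizing fixed-layout records into one stream per field lets each stream be modeled on its own terms, e.g. delta-coding nearly sorted positions, which a generic compressor applied to interleaved records cannot easily do.

```python
import struct
import zlib

# Invented toy records: (position: uint32, mapq: uint8) pairs.
# Real BAM records carry many more fields; this only shows the principle.
records = [(1000 + i, 60) for i in range(1000)]

# Interleaved encoding (array-of-structs): fields alternate in one stream.
interleaved = b"".join(struct.pack("<IB", pos, q) for pos, q in records)

# Split encoding (struct-of-arrays): one stream per field, with positions
# delta-coded since they are nearly sorted in a coordinate-sorted file.
positions = [pos for pos, _ in records]
deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
pos_stream = b"".join(struct.pack("<I", d) for d in deltas)
mapq_stream = bytes(q for _, q in records)

before = len(zlib.compress(interleaved))
after = len(zlib.compress(pos_stream)) + len(zlib.compress(mapq_stream))
assert after < before  # homogeneous streams compress far better
```

The interesting part, as noted above, is the compression side: choosing the right per-stream model once the fields are separated.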


Another format that might be worth looking at in the bioinformatics world is HDF5. It's sort of a generic file format, often used for storing multiple related large tables. It has some built-in compression (gzip IIRC) but supports plugins. There may be an opportunity to integrate the self-describing nature of the HDF5 format with the self-describing decompression routines of OpenZL.



And a comparison between CRAM and OpenZL on a SAM/BAM file. Is OpenZL indexable, where you can just extract and decompress the data you need from a file if you know where it is?


> Is openzl indexable

Not today. However, we are considering this as we are continuing to evolve the frame format, and it is likely we will add this feature in the future.


Update: let's continue discussing genomic sequence compression on https://github.com/facebook/openzl/issues/76.



