If you want to try it out, you can lazily load the dataset from HF and apply filtering this way:

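  import polars as pl

  # Build a lazy query against the Parquet file on Hugging Face;
  # nothing is read until .collect() runs the query with the filter applied.
  df = (
      pl.scan_parquet("hf://datasets/minimaxir/mtg-embeddings/mtg_embeddings.parquet")
      .filter(
          pl.col("type").str.contains("Sorcery"),
          pl.col("manaCost").str.contains("B"),
      )
      .collect()
  )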

Polars is awesome to use and I would highly recommend it. On a single node it is excellent at saturating the CPUs. If you need to distribute the work, put it in a Ray actor and set POLARS_MAX_THREADS according to how much it saturates a single node.
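
Roughly what I mean by the Ray actor setup, as a minimal sketch rather than a tuned implementation. The helper names (threads_per_actor, polars_runtime_env, make_filter_actor) and the SorceryFilter class are made up for illustration, and the even split of cores across actors is an assumption you would adjust after measuring. The env var goes in through Ray's runtime_env so it is set before Polars is imported in the worker process.

```python
def threads_per_actor(node_cpus: int, actors_per_node: int) -> int:
    """Split a node's cores evenly across actors (hypothetical policy), at least 1 each."""
    return max(1, node_cpus // actors_per_node)


def polars_runtime_env(threads: int) -> dict:
    """Ray runtime_env that caps the Polars thread pool in the worker process."""
    return {"env_vars": {"POLARS_MAX_THREADS": str(threads)}}


def make_filter_actor(threads: int):
    """Start one actor that runs the filter query with a capped thread pool."""
    import ray  # imported lazily so the helpers above work without Ray installed

    @ray.remote(num_cpus=threads, runtime_env=polars_runtime_env(threads))
    class SorceryFilter:  # hypothetical actor name
        def run(self, path: str):
            # Import inside the worker: Polars reads POLARS_MAX_THREADS at import time.
            import polars as pl

            return (
                pl.scan_parquet(path)
                .filter(
                    pl.col("type").str.contains("Sorcery"),
                    pl.col("manaCost").str.contains("B"),
                )
                .collect()
            )

    return SorceryFilter.remote()
```

Usage would be something like actor = make_filter_actor(threads_per_actor(16, 4)) followed by ray.get(actor.run.remote(path)), with one actor per shard of the data.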