I'll give that a look: the feature set of GPU-accelerated ops seems right up my alley for this pipeline: https://github.com/rapidsai/cuml
EDIT: looking through the docs, it's just GPU-accelerated UMAP, not a parametric UMAP that trains a NN model. That's easy to work around, though, by training a separate NN model to predict the reduced-dimension coordinates, minimizing RMSE.
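A minimal CPU sketch of that workaround: cuml's UMAP is non-parametric, so to embed new points you can train a small regressor to mimic the fitted UMAP coordinates by minimizing squared error. Everything here is a stand-in: random vectors play the role of the input embeddings, random 2-D targets play the role of what `cuml.UMAP.fit_transform` would return (cuml itself needs a GPU), and sklearn's `MLPRegressor` stands in for whatever NN you'd actually train.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)).astype("float32")  # stand-in for the input embeddings
Y = rng.normal(size=(1000, 2)).astype("float32")   # stand-in for UMAP's 2-D output

# In the real pipeline this target would come from something like:
#   import cuml
#   Y = cuml.UMAP(n_components=2).fit_transform(X)

# Fit a small MLP on (embedding -> UMAP coords); sklearn minimizes squared
# error, which is equivalent to minimizing RMSE.
model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=200, random_state=0)
model.fit(X, Y)

# The payoff: a parametric "transform" for unseen points.
X_new = rng.normal(size=(10, 64)).astype("float32")
coords = model.predict(X_new)
print(coords.shape)  # (10, 2)
```

The trade-off versus true Parametric UMAP is that the student network only approximates the frozen UMAP layout; you'd refit both when the data drifts.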
Tested it out, and the UMAP implementation in this library is dramatically faster than Parametric UMAP: running it on 100k embeddings took about 7 seconds, where the same pipeline on the same GPU took about half an hour. I will definitely be playing around with it more.
Yeah, we advise Graphistry users to keep GPU UMAP training sets to < 100k rows, and instead focus on careful sampling within that, plus multiple models for going beyond it. It'd be more accessible for teams if we could raise the limit, but quality-wise it's generally fine for security logs, customer activity, genomes, etc.
RAPIDS UMAP is darn impressive tho. Instead of focusing on improving it further, we found it did the job: our bottleneck shifted to the ingest pipeline feeding UMAP, so we released cu_cat, a GPU-accelerated automated feature engineering library, to get all that data into UMAP. RAPIDS cudf helps take care of the intermediate IO and wrangling in between.
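A CPU sketch of that ingest step, since the shape of the work is the same whether it runs in cu_cat/cudf on the GPU or in pandas: turn a mixed-type table into a dense numeric matrix that UMAP can consume. The column names and rows here are made up for illustration, and pandas stands in for the GPU libraries.

```python
import pandas as pd

# Hypothetical mixed-type event table, like the security/activity logs mentioned.
df = pd.DataFrame({
    "event":  ["login", "logout", "login", "purchase"],
    "device": ["mobile", "desktop", "desktop", "mobile"],
    "amount": [0.0, 0.0, 0.0, 19.99],
})

# One-hot encode the categoricals; numeric columns pass through unchanged.
features = pd.get_dummies(df, columns=["event", "device"], dtype="float32")
X = features.to_numpy(dtype="float32")
print(X.shape)  # (4, 6): amount + 3 event dummies + 2 device dummies
```

`X` is then what you'd hand to UMAP; cu_cat's pitch is automating these encoding decisions on the GPU rather than hand-writing them per dataset.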
Downstream, we generally stopped doing DBSCAN, despite how pretty it is. We replaced it with cugraph/GFQL on the UMAP similarity graph, to avoid quality issues we see in practice, and then visually and interactively investigate the similarity graph in pygraphistry. Once you can see the k-NN similarity edges, and the lack thereof, you realize why scatter-plot clusterings (visual or algorithmic) are so misleading to analysts, and you treat them with more caution. There are a variety of UMAP contenders nowadays, but with this pipeline we haven't felt the need to go beyond it. That's a multi-year testament to Leland and team.
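A small sketch of the "similarity graph instead of DBSCAN" idea: build a k-NN graph over the UMAP coordinates and work with the explicit edges rather than eyeballing the scatter plot. sklearn's `kneighbors_graph` stands in here for the GPU-side cugraph/GFQL step, and the points are synthetic.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
coords = rng.normal(size=(200, 2))  # stand-in for UMAP output

# Sparse adjacency: edge (i, j) iff j is among i's 5 nearest neighbors,
# weighted by distance.
A = kneighbors_graph(coords, n_neighbors=5, mode="distance")

# Edge-list form, which is the shape graph tools like cugraph and
# pygraphistry ingest (src, dst, weight columns).
src, dst = A.nonzero()
weight = A.data
print(src.shape[0])  # 200 points * 5 neighbors = 1000 directed edges
```

Having the edges explicit is the point: two blobs that overlap in the 2-D scatter may share almost no k-NN edges, and vice versa, which is exactly the failure mode of clustering on the projected coordinates.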
The result is that we can now UMAP and interactively visualize most real-world large datasets, database query results, and LLM embeddings that pygraphistry & louie.ai users encounter in seconds. Many years to get here, and now it is so easy!