BERTopic is great, but some people forget that even magic, er, UMAP+HDBSCAN on embeddings cannot solve some problems:
- statistical tools (including LDA and variants) define topics as coherent latent clusters of words/embeddings. These correspond to a mixture of real-world concepts: events, topics, issues, etc. So when you apply BERTopic, you often get clusters that represent very different things on a conceptual level.
- the end-to-end pipeline is very nice, especially when adding things like LLM-based cluster labeling on top. But we should not forget that this stacks many steps, each with its own implicit errors, on top of each other. It is not easy to get a transparent and robust story for why one clustering solution is better than any other.
- one of the implicit choices is the UMAP+HDBSCAN combination, which tends to find very coherent clusters but "throws out" many cases (up to ~50%) into HDBSCAN's outlier cluster (-1). Sometimes that's not what we want, and then tuning is needed (e.g. swapping in k-means instead; see the sketch below).
- random footnote: cuML for really fast BERTopic is great, but it seems to produce inferior solutions. Better to test that before putting it into production.
With all that said, I love that we can now use this tool and debate its merits at this level, rather than everyone implementing their own homegrown and probably bug-ridden version of it.
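To make the k-means swap concrete: a minimal sketch, assuming a recent BERTopic where the hdbscan_model argument accepts any scikit-learn-style clusterer (documented behavior, but treat n_clusters=50 and the rest as illustrative):

```python
# Minimal sketch: swap HDBSCAN for k-means so no document lands in the
# -1 outlier cluster. Assumes `pip install bertopic scikit-learn`.
from bertopic import BERTopic
from sklearn.cluster import KMeans

docs = [...]  # your corpus: a list of raw document strings

cluster_model = KMeans(n_clusters=50, random_state=42)  # 50 is arbitrary
topic_model = BERTopic(hdbscan_model=cluster_model)
topics, probs = topic_model.fit_transform(docs)
# Trade-off: every document now gets a topic (no -1 cluster), but
# per-cluster coherence can drop relative to HDBSCAN.
```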
Just on that UMAP point, I've found a lot of success using PaCMAP (https://github.com/YingfanWang/PaCMAP) instead, which AFAIK is the SOTA among dimensionality reduction algorithms.
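If anyone wants to try that inside BERTopic: a sketch, relying on BERTopic accepting any dimensionality-reduction model that exposes fit/transform. The wrapper is my own workaround (PaCMAP's transform on unseen data is limited), so treat the details as illustrative:

```python
# Sketch: PaCMAP as BERTopic's dimensionality-reduction step.
# Assumes `pip install pacmap bertopic`. The wrapper caches the fit-time
# embedding and returns it from transform(), which is fine for a one-shot
# topic model but not for transforming new documents later.
import pacmap
from bertopic import BERTopic

class PaCMAPReducer:
    def __init__(self, n_components: int = 5):
        self.model = pacmap.PaCMAP(n_components=n_components)

    def fit(self, X, y=None):
        self.embedding_ = self.model.fit_transform(X)
        return self

    def transform(self, X):
        return self.embedding_  # cached fit-time embedding (see caveat above)

topic_model = BERTopic(umap_model=PaCMAPReducer(n_components=5))
```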
We do a variant of GPU UMAP+HDBSCAN => viz + scores a LOT. We are big fans!
We've grown in two pretty different directions from BERTopic to support our users' investigations over the last few years:
-- At least for our users, before going deep into BERTopic and the tweaks you are suggesting, we find a much earlier and more basic step is to play with which columns to roll in (sketched after this list). BERTopic is about text columns, but in practice, that's often just 1-3 of the 10-100 columns our users are working with! For example, in Splunk logs, beyond some message column, we also care about timestamp, risk level, IP address columns, etc. Same thing if you are, say, analyzing transactions or user activity in Databricks or Snowflake: there's a lot of impactful metadata outside of the text columns. IMO, much of the beauty of UMAP is its success with 100s and, with GPUs, 1000s of columns.
-- For interactive visual analysis, we have found it super valuable since early on to show the similarity connections that UMAP finds, and to make them interactive for reclustering. Most UMAP visualizers are instead static, basically a scatterplot you can zoom in on. In contrast, being able to filter, recluster, recolor, etc., is a pretty important part of the iteration flow, as it eliminates needing to go back to code for every little step. By making UMAP's inferred similarity edges 'live', you can treat the result as an interactive similarity graph and filter->recluster on the fly. (It also helps you understand nuance within a cluster, as you can see which edges exist, with what strength, and even interactive summaries of why they exist.)
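To make the column-rolling point from the first bullet concrete, a hand-rolled sketch with plain umap-learn + scikit-learn + sentence-transformers (column names, models, and toy data are made up; this is not our actual pipeline):

```python
# Sketch: roll metadata columns into UMAP alongside a text column.
# Assumes sentence-transformers, scikit-learn >= 1.2, and umap-learn.
import numpy as np
import pandas as pd
import umap
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sentence_transformers import SentenceTransformer

df = pd.DataFrame({
    "message":  ["failed login", "disk quota exceeded", "failed login", "port scan"],
    "risk":     [3, 1, 3, 4],
    "severity": ["high", "low", "high", "high"],
})

text_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(df["message"].tolist())
num_vecs  = StandardScaler().fit_transform(df[["risk"]])
cat_vecs  = OneHotEncoder(sparse_output=False).fit_transform(df[["severity"]])

# One wide feature matrix: text embedding + scaled numerics + one-hots
X = np.hstack([text_vecs, num_vecs, cat_vecs])
coords = umap.UMAP(n_neighbors=2, n_components=2).fit_transform(X)  # tiny toy n
```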
That talk also deals with the scale problems of extending this to, say, all of your customer data or log data. Especially when supporting more than just some text columns, we need to easily & quickly encode those as well for UMAP to pick them up. We recently released cu_cat (our GPU fork of dirty_cat) to preprocess all these wild datatypes, and will soon be turning it on by default for pygraphistry's "g.nodes(df).umap().plot()" -- these three pieces have become the lego pieces we use for enabling workflows like in the talk. It's super fun, and for so little code, surprisingly effective!
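In code, that flow is roughly the following (a sketch: assumes pygraphistry with its AI extras installed and an account registered for the hosted plot; the dataframe is made up):

```python
# Sketch of the g.nodes(df).umap().plot() flow: featurize mixed columns,
# run UMAP, and plot the inferred similarity edges as a live graph.
import pandas as pd
import graphistry

# graphistry.register(api=3, username="...", password="...")  # your credentials

df = pd.DataFrame({
    "message": ["failed login", "disk quota exceeded", "failed login", "port scan"],
    "risk":    [3, 1, 3, 4],
    "src_ip":  ["10.0.0.5", "10.0.0.7", "10.0.0.9", "10.0.0.5"],
})

g = graphistry.nodes(df).umap()  # encode columns -> UMAP -> similarity edges
g.plot()                         # interactive: filter, recluster, recolor live
```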
One issue, too, is with the embeddings: word-vector embeddings lead to problems with polysemy, and BERT embeddings from the last layer tend to all point in about the same direction for each batch (anisotropy).
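The last-layer claim is easy to sanity-check with HuggingFace transformers (a quick sketch; mean-pooling over tokens and ignoring the padding mask for brevity; exact numbers vary, but off-diagonal cosines tend to be surprisingly high even for unrelated sentences):

```python
# Quick anisotropy check: pairwise cosine similarity of last-layer BERT
# embeddings for unrelated sentences. Assumes `pip install transformers torch`.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sents = ["the bank raised interest rates",
         "we sat on the river bank",
         "cats sleep most of the day"]
with torch.no_grad():
    out = model(**tok(sents, return_tensors="pt", padding=True))

emb = out.last_hidden_state.mean(dim=1)              # mean-pooled last layer
emb = torch.nn.functional.normalize(emb, dim=1)
print(emb @ emb.T)  # off-diagonals are typically high, even across topics
```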
Not sure what you're asking. "Lesser resourced", "less resourced", "low density", and this "low resource" are all terms I've heard. Not to be confused with "less commonly taught" languages, obviously.
If you're asking whether Serbian is really low/less resource, there's no defining line.
And of course there are still unwritten languages.