Eh, their analysis method is not too hot. From the comments section:
"The phrases included in the black boxes are the top 50 phrases most statistically correlated to that group. We calculated this as follows:
1. We calculated the frequency of every 1, 2, and 3 word phrase for the whole population.
2. We calculated those same frequencies within each race/gender pair.
3. For each phrase, we divided #2 by #1.
4. This is the propensity of a given group to use a given phrase.
5. The list you see is the phrases with the 50 highest ratios of #2/#1."
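For concreteness, here is a rough sketch of the five steps quoted above. It is not OkCupid's code: the function names, whitespace tokenization, and lowercasing are all assumptions for illustration, and it assumes the group's texts are a subset of the population's texts so every group phrase has a population frequency.

```python
from collections import Counter

def ngrams(text, max_n=3):
    """Yield every 1-, 2-, and 3-word phrase in a text (naive whitespace split)."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def phrase_frequencies(texts):
    """Relative frequency of each phrase across a list of texts (steps 1 and 2)."""
    counts = Counter(p for t in texts for p in ngrams(t))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def top_ratios(group_texts, all_texts, k=50):
    """Rank phrases by group frequency / population frequency (steps 3 to 5)."""
    pop = phrase_frequencies(all_texts)
    grp = phrase_frequencies(group_texts)
    # Assumes group_texts is a subset of all_texts, so pop[p] always exists.
    ratios = {p: f / pop[p] for p, f in grp.items()}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Note that the function always returns k phrases if the group uses at least k distinct ones, whatever the ratios are. If a group writes exactly like the population, every ratio is 1.0 and the "top 50" is just whatever the sort happens to put first.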
So even if a group uses a phrase 1.001x more than the population average, it might still be listed, if there are no actual phrase-usage differences (i.e., all phrase ratios will be small, and the top 50 will be arbitrary).
They're talking about the top fifty such phrases. It seems unlikely that there would be a demographic group that is proportionately less likely to use any arbitrary phrase. The only possibility is for a group's phrase usage to be statistically indistinguishable from the average.
Fortunately, we can perform a sanity check: read some of the phrases to someone, and ask that person which group they think the phrases came from. I bet people will guess with high enough accuracy to establish that it's nonrandom.
I'm not saying the results were random. I'm saying they don't report anything like a "confidence interval" on how many times more often a group uses a phrase than average. For example, if black men use "soul food" 30x more than average, that seems like a solid result. But if it's only 1.01x more than average, that seems like noise.
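To make the "confidence interval" idea concrete, here is one way it could be done. This is my own sketch, not anything from the OkCupid post: it uses the standard log-scale normal approximation for a ratio of two proportions and treats the group and population counts as independent samples, which is only approximately true since the group is part of the population. The function name and the counts in the usage note are made up.

```python
import math

def ratio_interval(group_count, group_total, pop_count, pop_total, z=1.96):
    """Frequency ratio with an approximate 95% interval (log-scale normal approx).

    Assumes both counts are positive and treats the two samples as independent.
    """
    ratio = (group_count / group_total) / (pop_count / pop_total)
    # Standard error of log(ratio) for a ratio of two proportions.
    se = math.sqrt(1 / group_count - 1 / group_total
                   + 1 / pop_count - 1 / pop_total)
    low = ratio * math.exp(-z * se)
    high = ratio * math.exp(z * se)
    return ratio, low, high
```

With hypothetical counts, `ratio_interval(300, 10_000, 1_000, 1_000_000)` gives a ratio of 30 whose interval stays far above 1, while `ratio_interval(101, 100_000, 1_000, 1_000_000)` gives a ratio of 1.01 whose interval straddles 1. The second is the kind of result I'd call noise.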
In an older version of the post, we did have the actual numbers, but they didn't seem to add much. Black women use "soul food" 20 times (!) more frequently than the site-wide average; for black men, "soul food" is 11 times as frequent.
AFAIK, nothing we put up for this article is less than twice as frequent for that group as it is for the general (OkC) population.
RE: Soul Food, one of the comments on the OKC site mentioned that Soul Food applies not only to a style of cuisine, but also to a movie and a subsequent spin-off series on Showtime.
The reason it pops up so frequently is that the movie is hailed within the black community as one of the few Hollywood productions that presents black people as multi-dimensional humans who deal with a number of problems related to race, class, and life in general. It's a classic.
I think it should actually give you better results. It's monotonic in KL divergence, and does a much better job of taking into account how common the feature is rather than just how different it is. You no longer need to do things like throwing out phrases that appear fewer than x times if you use it.
If they only use a phrase 1.001x more than the population average, it's going to be much further down on the list than the top 50, unless they use nearly every phrase less than the average (which, barring some extremely pathological outlier cases, is impossible).
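A quick way to see why the ratios can't all sit below 1: if you weight each phrase's ratio by its population frequency, the weighted average is exactly 1, because the group's frequencies sum to 1 just like the population's. The sketch below checks that numerically. The function name and the toy frequencies are made up, and it assumes every phrase the group uses also appears in the population.

```python
def weighted_mean_ratio(group_freq, pop_freq):
    """Population-weighted average of group/population frequency ratios."""
    return sum(
        pop_freq[p] * (group_freq.get(p, 0.0) / pop_freq[p])
        for p in pop_freq
    )

# Toy frequencies (hypothetical); each distribution sums to 1.
pop = {"a": 0.5, "b": 0.3, "c": 0.2}
grp = {"a": 0.4, "b": 0.3, "c": 0.3}

ratios = {p: grp[p] / pop[p] for p in pop}  # a is below 1, so c must be above 1
mean = weighted_mean_ratio(grp, pop)        # equals 1 up to rounding
```

So any phrase a group uses less than average has to be offset by phrases it uses more. This alone doesn't guarantee there are fifty such phrases with large ratios, but it does rule out a group being below average across the board.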