
On the proper use of this data:

Perhaps the easiest way to think of it is that the phrases are predictors for the race/sex, not the other way around. For example, you shouldn't expect every white male you meet to like Van Halen. However, if someone says to you "I have a friend who's a big Van Halen fan", you're pretty safe in assuming that the friend is a white male.

Likewise, it might be that only 10% of blacks like soul food. But if almost no other demographics like it, it will still show up high on their list. So "is black" does not strongly imply "loves soul food", but "loves soul food" does strongly imply "is black".

In other words, http://en.wikipedia.org/wiki/Bayes_theorem
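The asymmetry described above falls straight out of Bayes' theorem. Here's a minimal sketch using the soul food example; all the numbers are made up for illustration, not taken from the actual data:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)

def bayes(p_b_given_a, p_a, p_b):
    """Posterior P(A|B) from the likelihood, prior, and evidence."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: suppose 10% of black people like soul food,
# only 1% of everyone else does, and 13% of the population is black.
p_black = 0.13
p_soul_given_black = 0.10
p_soul_given_other = 0.01

# Total probability that a random person loves soul food.
p_soul = p_soul_given_black * p_black + p_soul_given_other * (1 - p_black)

p_black_given_soul = bayes(p_soul_given_black, p_black, p_soul)
print(round(p_black_given_soul, 2))  # ~0.60
```

So with these toy numbers, "is black" implies "loves soul food" only 10% of the time, but "loves soul food" implies "is black" about 60% of the time.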



A commenter on there made a good point about the "soul food" term: it's also the name of a movie, a book, and a TV series, so it ends up being three (probably black) data points all counting toward one.


For that reason, I'd be happier if they sorted by KL Divergence instead of just log-odds. That'd give a much better tradeoff between commonality and predictive power.


You have a valid point, but I'm guessing they avoided KL divergence because 1) it's harder to explain, 2) it's not symmetric, e.g. KL(asian, white) != KL(white, asian), and 3) it needs a smoothing function when comparing distributions where not every element appears in both.
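Points 2 and 3 are easy to demonstrate. Below is a minimal sketch of KL divergence over phrase distributions with additive smoothing; the phrase frequencies are invented purely for illustration:

```python
import math

def kl_divergence(p, q, smooth=1e-9):
    """KL(p || q) over the union vocabulary, with additive smoothing
    so phrases missing from q don't cause division by zero."""
    vocab = set(p) | set(q)
    p_s = {w: p.get(w, 0.0) + smooth for w in vocab}
    q_s = {w: q.get(w, 0.0) + smooth for w in vocab}
    zp, zq = sum(p_s.values()), sum(q_s.values())  # renormalize
    return sum((p_s[w] / zp) * math.log((p_s[w] / zp) / (q_s[w] / zq))
               for w in vocab)

# Toy phrase distributions (made-up numbers):
asian = {"boba": 0.5, "van halen": 0.1, "soul food": 0.4}
white = {"van halen": 0.6, "soul food": 0.4}

# Asymmetry: the two directions generally differ.
print(kl_divergence(asian, white) != kl_divergence(white, asian))  # True
```

Without the smoothing term, "boba" (present in one distribution but not the other) would make one direction infinite and break the comparison entirely.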



