
First I get the text of each article using Newspaper[1] (no RSS involved). Then I use tf-idf and k-means to cluster the articles (I followed this tutorial[2]). Then I combine the clusters with my collaborative filtering model using feature augmentation: for each cluster, I generate N "fake users" who each like every item in that cluster, and I add their ratings to the rest of the rating data.
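
To make the fake-user step concrete, here is a minimal sketch of the feature augmentation, not the actual code behind this comment. It assumes the cluster labels already exist (for example from k-means over tf-idf vectors) and that ratings are stored as (user, article, rating) tuples; the function name add_fake_users, the fake user ID format, and the rating value 1 for "like" are all made up for illustration.

```python
from collections import defaultdict


def add_fake_users(ratings, cluster_of, n_fake=3):
    """Return the real ratings plus n_fake synthetic users per cluster.

    ratings: list of (user_id, article_id, rating) tuples from real users.
    cluster_of: dict mapping article_id -> cluster label (assumed to come
        from a clustering step such as k-means).
    """
    # Group article ids by their cluster label.
    members = defaultdict(list)
    for article_id, label in cluster_of.items():
        members[label].append(article_id)

    augmented = list(ratings)  # copy so the real data is left untouched
    for label, articles in sorted(members.items()):
        for i in range(n_fake):
            fake_user = f"fake_c{label}_{i}"  # hypothetical id scheme
            # Each fake user "likes" every article in its cluster.
            for article_id in articles:
                augmented.append((fake_user, article_id, 1))
    return augmented


real = [("alice", "a1", 1), ("bob", "a2", 1)]
clusters = {"a1": 0, "a2": 0, "a3": 1}
augmented = add_fake_users(real, clusters, n_fake=2)
```

The augmented list can then be fed to whatever collaborative filtering model you already have, so the model itself does not need to know about the content features.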

So in effect, content-based filtering handles cold start: when a new article is submitted and doesn't have many ratings yet, the fake user ratings dominate. Then, as real user ratings are gathered, the model gradually shifts toward relying only on collaborative filtering.
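
As a rough intuition for that gradual shift, the sketch below computes what fraction of an article's ratings come from fake users as real ratings accumulate. This is only a back-of-the-envelope illustration: a real collaborative filtering model does not weight ratings by a simple share, and the function name fake_share and the choice of 10 fake users are assumptions.

```python
def fake_share(n_fake, n_real):
    """Fraction of an article's ratings that come from fake users."""
    return n_fake / (n_fake + n_real)


# With 10 fake users per cluster, watch the share fall as real ratings arrive.
for n_real in (0, 5, 50, 500):
    print(n_real, round(fake_share(10, n_real), 3))
```

A brand-new article is described entirely by fake ratings, and the share shrinks toward zero as real users rate it.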

[1] https://newspaper.readthedocs.io/en/latest/

[2] http://brandonrose.org/clustering_mobile


