
That's not thoroughly ridiculous.

For many problems in machine learning, k-nearest-neighbors with a large dataset is very hard to beat in terms of error rate. Of course, the time to run a query is beyond atrocious, so other models are favored even when kNN has a lower error rate.
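A minimal sketch of that trade-off, using scikit-learn on a synthetic dataset (the dataset, models, and sizes here are my own choices, not anything from the thread): kNN can match a simple parametric model on accuracy, but its prediction time scales with the size of the training set because every query scans or searches the stored examples.

  import time

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  # Synthetic stand-in for a "large dataset"; real gaps show up at much larger scales.
  X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  for name, model in [("k-NN (k=5)", KNeighborsClassifier(n_neighbors=5)),
                      ("logistic regression", LogisticRegression(max_iter=1000))]:
      model.fit(X_train, y_train)
      start = time.perf_counter()
      acc = model.score(X_test, y_test)  # for k-NN this runs the expensive neighbor queries
      elapsed = time.perf_counter() - start
      print(f"{name}: accuracy={acc:.3f}, prediction time={elapsed:.2f}s")

In practice the query cost is usually tamed with approximate nearest-neighbor indexes rather than brute force, at some cost in error rate.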



According to [1], k-NN is pretty far from being the top general classifier. Admittedly, all the data sets are small, with no more than 130,000 instances. When does it start becoming "very hard to beat", and what are you basing that on?

1. http://jmlr.org/papers/volume15/delgado14a/delgado14a.pdf


The vast majority of the data sets are very small (on the order of 100 to 1000 examples!). In fact, in the paper they discarded some of the larger UCI datasets.

It's not surprising that they found random forests perform better, even though the conventional wisdom is that boosting outperforms them: random forests are much harder to overfit, which matters on data sets this small.
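A quick way to check that intuition yourself is a cross-validated comparison on a small, UCI-sized dataset; the sketch below uses scikit-learn's breast-cancer data and default-ish hyperparameters, all of which are my own assumptions rather than the setup in the paper.

  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
  from sklearn.model_selection import cross_val_score

  # 569 examples -- "small" by the standards discussed in this thread.
  X, y = load_breast_cancer(return_X_y=True)

  models = [
      ("random forest", RandomForestClassifier(n_estimators=500, random_state=0)),
      ("gradient boosting", GradientBoostingClassifier(n_estimators=500, random_state=0)),
  ]

  for name, model in models:
      scores = cross_val_score(model, X, y, cv=10)
      print(f"{name}: mean accuracy={scores.mean():.3f} (+/- {scores.std():.3f})")

On tiny datasets like this, boosting's edge tends to depend heavily on tuning (learning rate, depth, number of rounds), while a forest with defaults is usually close to its best performance, which is consistent with the paper's ranking.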



