
It's more a case of providing a quick and easy way to share large datasets, backed by HDFS. Researchers don't currently have a good way to share datasets (apart from AWS/GCE).

We work with climate science researchers who have multi-TB datasets and no efficient way to share them. The same goes for genomics researchers, who routinely pay a lot of money for Aspera licenses just to download datasets faster than TCP allows. We use a LEDBAT protocol tuned to give good bandwidth over high-latency links, but it only scavenges available bandwidth, since it runs at a lower priority than TCP.
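For anyone curious how LEDBAT yields to TCP: it is delay-based rather than loss-based, shrinking its window as soon as queuing delay exceeds a target. A minimal illustrative sketch of the RFC 6817 window update (constants and segment size here are assumptions, not our actual tuning):

```python
# Minimal sketch of the LEDBAT congestion-window update (RFC 6817).
# TARGET, GAIN, and MSS are illustrative values, not a real tuning.
TARGET = 0.100   # target queuing delay, seconds (RFC 6817 ceiling)
GAIN = 1.0       # window gain
MSS = 1452       # bytes per segment (assumed)

def update_cwnd(cwnd, base_delay, current_delay, bytes_acked):
    """One controller step.

    queuing_delay = current one-way delay minus the minimum observed
    ("base") delay. The window grows while delay is below TARGET and
    shrinks once queued packets push delay above it, so the flow backs
    off before loss-based TCP does -- i.e. it only scavenges spare
    bandwidth.
    """
    queuing_delay = current_delay - base_delay
    off_target = (TARGET - queuing_delay) / TARGET
    cwnd += GAIN * off_target * bytes_acked * MSS / cwnd
    return max(cwnd, MSS)  # never drop below one segment

# Below target: window grows; above target: it backs off.
grown = update_cwnd(10 * MSS, 0.050, 0.080, MSS)   # 30 ms of queue
shrunk = update_cwnd(10 * MSS, 0.050, 0.200, MSS)  # 150 ms of queue
```

The key design point is that a standing queue (delay above target) makes `off_target` negative, so LEDBAT surrenders bandwidth to any competing TCP flow, which keeps pushing until it sees loss.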

For the machine learning researcher: I'd like to test this RNN on the Reddit comments dataset... three days later, after finding a poor-quality torrent... now I can do it. On our system: search, find, click to download. We will move towards downloading (random) samples of very large datasets (even into Kafka, from where they can be processed as they are downloaded).
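Sampling a dataset while it streams in (e.g. into Kafka) is a natural fit for reservoir sampling, which keeps a uniform fixed-size sample without knowing the total record count in advance. A sketch (not our actual implementation; the function name and seeded RNG are for illustration):

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Algorithm R: maintain a uniform random sample of k records
    from a stream of unknown length, using O(k) memory. Suitable
    for sampling records as they are downloaded, without ever
    holding the full dataset.
    """
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)          # fill the reservoir
        else:
            j = rng.randrange(i + 1)       # replace with prob k/(i+1)
            if j < k:
                sample[j] = record
    return sample

# e.g. a 5-record sample from a simulated million-record download
sample = reservoir_sample(range(1_000_000), 5)
```

Each arriving record replaces a reservoir slot with probability k/(i+1), which is exactly what makes the final sample uniform over everything seen so far, no matter when the stream is cut off.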



Sounds nice. Could you consider making it more general than sharing datasets for ML? It sounds like a really generic solution that anyone could benefit from, not just researchers.



