
Correct, there are no images in the data except for 68 PNGs. It's just HTML files.


How is it possible that a bunch of HTML files would add up to 200 GB? Is it because of some kind of overhead?

Would a database dump maybe be smaller?


Well, "a bunch" is an understatement, I bet they have a bit more than just a bunch! It does pass a sniff test, since from https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia:

>As of May 2015, the current version of the English Wikipedia article / template / redirect text was about 51 GB uncompressed in XML format.

The compressed data at the same time was 11.5 GB. And that's data from nine years ago, and just the English Wikipedia.
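
As a rough back-of-envelope check (the article count and the implied per-page overhead below are illustrative assumptions, not figures from the dump itself), 200 GB of rendered HTML over ~51 GB of raw wikitext works out to roughly a 4x markup/page-chrome overhead per article, which is plausible for fully rendered pages:

  # Hypothetical sniff-test numbers; only the 51 GB figure comes from the quote above.
  wikitext_gb = 51           # uncompressed article text, per the 2015 figure
  html_dump_gb = 200         # size of the HTML snapshot in question
  articles = 6_000_000       # assumed article count, order of magnitude only

  avg_text_kb = wikitext_gb * 1e6 / articles    # ~8.5 KB of text per article
  avg_html_kb = html_dump_gb * 1e6 / articles   # ~33 KB of HTML per article
  print(f"overhead factor: {avg_html_kb / avg_text_kb:.1f}x")  # ~3.9x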

For comparison, I collect leaked password dumps, and combined (after deduplication) they run into hundreds of GB too. And those are just username:password lines, not full article text.




