
> Let's be reasonable here: a decently sized server farm could probably keep the entirety of documents hosted on JSTOR in RAM.

One largish science publisher I worked with had 9-10TB of (PDF) document data. On top of that there's a search engine, image versions (of everything) to allow for a "look inside" feature, and dynamically generated HTML for online viewing. Electronic publishers deal with large amounts of data, and it isn't wholly static content. I don't know much about JSTOR, but I think it's safe to assume that the implementation is non-trivial.



> I don't know much about JSTOR but I think it's safe to assume that the implementation is non-trivial.

It's non-trivial, and it's also one of the smallest chunks of their budget. As a (bloated, inefficient, high-salary) non-profit, you can read JSTOR's financial filings for yourself:

- http://www.generalist.org.uk/blog/2011/jstor-where-does-your...
- http://lists.wikimedia.org/pipermail/wikien-l/2011-July/1092...

They spend ~$4m a year on all computer costs. (To put that in perspective, they spend $1.3m a year on 'travel' & 'conferences, conventions, and meetings'.)


I believe Swartz's point is that 10TB of data is no longer something you can build a self-perpetuating bureaucracy around, and that the implementation is, in fact, simple enough that it could be made available and maintained for a tiny fraction of JSTOR's cost.


I'm not really talking about what Swartz's argument is, but you seem to have missed the point that there's more to being a publisher than holding data.


I don't understand your point either. Yes, publishers do more than merely hold PDF files. Ergo what?


Ergo keeping all the documents in RAM isn't a technical solution. I think you've mentally lost track of what I originally said.


10TB isn't a large amount of data. I have more data than that sitting in my $1500 home NAS.


I was trying to make the point that there is more to being a publisher than just holding static content.


As a business, sure.

But I don't see the complication on the tech side.

Getting the scanned copies is likely the most complex portion of this, and it's outside the realm of the site. The rest is just serving static content and filling a Solr cluster with OCR data for search, which is a trivial task with off-the-shelf OSS tools. Seems like a two-week MVP project given how clunky the website is (why on earth does it jump to the top of the page when I click the 'next' arrow while viewing an article? argh).
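To illustrate, the "fill a Solr cluster with OCR data" step really is mostly bookkeeping. A minimal sketch, assuming a hypothetical core named "jstor" and made-up field names (this is not JSTOR's actual schema): map each OCR'd article to a flat document and send the batch to Solr's JSON update handler.

```python
import json

def make_solr_docs(articles):
    """Map (id, title, ocr_text) tuples to Solr JSON update documents.

    Field names (title_t, ocr_text_t) are illustrative dynamic-field-style
    names, not a real JSTOR schema.
    """
    return [
        {"id": doc_id, "title_t": title, "ocr_text_t": text}
        for doc_id, title, text in articles
    ]

docs = make_solr_docs([
    ("10.2307/0001", "Sample Article", "full OCR text of the scanned pages"),
])
payload = json.dumps(docs)

# Indexing is then a single HTTP POST to the update handler, e.g.:
#   curl -X POST -H 'Content-Type: application/json' \
#        'http://localhost:8983/solr/jstor/update?commit=true' \
#        --data-binary @payload.json
print(payload)
```

Everything else (the static PDFs and page images) can sit behind any web server or CDN; only the search index needs to be built.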



