
> Let's be reasonable here: a decently sized server farm could probably keep the entirety of documents hosted on JSTOR in RAM.

One largish science publisher I worked with had 9-10TB of (PDF) document data. On top of that there's a search engine, image versions (of everything) to allow for a "look inside" feature, and dynamically generated HTML for online viewing. Electronic publishers deal with large amounts of data, and it isn't wholly static content. I don't know much about JSTOR, but I think it's safe to assume that the implementation is non-trivial.



> I don't know much about JSTOR but I think it's safe to assume that the implementation is non-trivial.

It's non-trivial, and it's also one of the smallest chunks of their budget. As a (bloated, inefficient, high-salary) non-profit, you can read JSTOR's financial filings for yourself:

- http://www.generalist.org.uk/blog/2011/jstor-where-does-your...
- http://lists.wikimedia.org/pipermail/wikien-l/2011-July/1092...

They spend ~$4m a year on all computer costs. (To put that in perspective, they spend $1.3m a year on 'travel' & 'conferences, conventions, and meetings'.)


I believe Swartz's point is that 10TB of data is no longer something you can build a self-perpetuating bureaucracy around, and that the implementation is, in fact, simple enough that it could be made available and maintained for a tiny fraction of JSTOR's cost.


I'm not really talking about what Swartz's argument is, but you seem to have missed the point that there's more to being a publisher than holding data.


I don't understand your point either. Yes, publishers do more than merely hold PDF files. Ergo what?


Ergo keeping all the documents in RAM isn't a technical solution. I think you've mentally lost track of what I originally said.


10TB isn't a large amount of data. I have more data than that sitting in my $1500 home NAS.


I was trying to make the point that there is more to being a publisher than just holding static content.


As a business, sure.

But I don't see the complication on the tech side.

Getting the scanned copies is likely the most complex portion of this, and it's outside the realm of the site. The rest is just serving static content and filling a Solr cluster with OCR data for search, which is a trivial task with off-the-shelf OSS tools. Seems like a two-week MVP project given how clunky the website is (why on earth does it jump to the top of the page when I click the 'next' arrow while viewing an article? argh).
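To illustrate, the "fill a Solr cluster with OCR data" step really is mostly bookkeeping. A minimal sketch, assuming a hypothetical core named "jstor" and made-up field names (this is not JSTOR's actual schema): map each OCR'd article to a flat document and send the batch to Solr's JSON update handler.

```python
import json

def make_solr_docs(articles):
    """Map (id, title, ocr_text) tuples to Solr JSON update documents.

    Field names (title_t, ocr_text_t) are illustrative dynamic-field-style
    names, not a real JSTOR schema.
    """
    return [
        {"id": doc_id, "title_t": title, "ocr_text_t": text}
        for doc_id, title, text in articles
    ]

docs = make_solr_docs([
    ("10.2307/0001", "Sample Article", "full OCR text of the scanned pages"),
])
payload = json.dumps(docs)

# Indexing is then a single HTTP POST to the update handler, e.g.:
#   curl -X POST -H 'Content-Type: application/json' \
#        'http://localhost:8983/solr/jstor/update?commit=true' \
#        --data-binary @payload.json
print(payload)
```

Everything else (the static PDFs and page images) can sit behind any web server or CDN; only the search index needs to be built.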



