
Exactly this just happened where I work. The CIO was recommending Hadoop on AWS for our image processing/analysis jobs. We process a single set of images at a time, which comes in at around 1.5GB. The output data size is about 1.2GB. Not a good candidate for Hadoop, but, you know... "big data", right?


Indeed. For more real-world examples of people who thought that they had "big data", but didn't, see https://news.ycombinator.com/item?id=6398650 ("Don't use Hadoop - your data isn't that big"). The linked essay has:

> They handed me a flash drive with all 600MB of their data on it (not a sample, everything). For reasons I can't understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.
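At 600MB the whole dataset fits comfortably in a laptop's RAM, which is the essay's point. A minimal sketch of what that solution looks like (the CSV contents and column names here are made up for illustration; the real data was never described):

```python
import io
import pandas as pd

# Stand-in for the client's CSV; columns are hypothetical.
csv_data = io.StringIO(
    "order_id,amount\n"
    "1,19.99\n"
    "2,5.00\n"
)

# One call loads the entire dataset into memory -- no cluster required.
df = pd.read_csv(csv_data)
print(df["amount"].sum())
```

For a real 600MB file you would pass a path instead of a `StringIO`, and the load still takes seconds on commodity hardware.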

User w_t_payne commented:

> I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.

> All of them wanted to feel like they were doing something special.

I think that last line is critical to understanding why a CIO might feel this way.


"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." -Dan Ariely


In a similar situation. In fact stupider. We have a 120GB baseline of data inside a relational store. The vendor has a file stream option that allows blobs to be stored on disk instead of in the transaction log, and to be pushed through the DB server rather than using our current CIFS DFS cluster. So let's stick our 950GB static document load in there too (while I was on holiday, typically) and off we go.

So the thing starts going like shit, as expected, now that it has to stream all of that through the DB bottleneck.

So what's the solution? Well, we're in big data territory now apparently, at 1.2TiB (comically small, and an almost entirely static data set), and we have every vendor licking arse with the CEO and CTO to sell us Hadoop, more DB features and SAN kit.

We don't even need it for processing. It's just a big CRUD system. Total muppets.


Resume-Driven Development


Being in charge of large budgets is better for CEOs and CTOs than being in charge of small budgets. It makes you look more impressive so they go for it like lemmings over the cliff, sometimes taking entire companies with them.


No, it's called enterprise software, and it's driven by appealing to the enormous egos and matching budgets of leaders of companies that are held back more by their own bloat than by anything resembling competition. This is just standard practice for the enterprise software sector in general, and after everyone's spent their billions and the next recession is upon us, they'll probably axe most of these utter failures to produce business value. But all the consultants will blame something other than the client's own dysfunction, because pointing that out is the fastest way to get kicked off a contract.

Consulting for enterprise customers tends to be a lot like marriage - you can be right, or you can be happy (or paid). It takes a unique customer to have gotten past their cultural dysfunctions, accepted responsibility for their problems, and taken legitimate, serious action. But like marriage, there can be great, great upsides when everyone gets on the same page and works towards mutual goals in a spirit of selflessness and growth. Yeah....


I have had this exact conversation at various past employers, usually when they started talking about "big data" and Hadoop and friends:

"How much data do you expect to have?"

"We don't know, but we want it to scale up to be able to cover the whole market."

"Okay, so let's make some massive overestimates about the size of the market and scope of the problem... and that works out to about 100Mb/sec. That's about the speed at which you can write data to two hard drives. This is small data even in the most absurdly extreme scaling that I can think of. Use postgres."

Even experienced people do not have meaningful intuitions about what things are big or small on modern hardware. Always work out the actual numbers. If you don't know what they are then work out an upper bound. Write all these numbers down and compare them to your measured growth rates. Make your plans based on data. Anything that you've read about in the news is rare or it wouldn't be news, so it is unlikely to be relevant to your problem space.


That's not even medium data. Most people probably would be surprised to find out that their data could be stored and processed on an iPhone, and that using heavier duty tools isn't necessary or worthwhile.


Perhaps a good way of explaining it to management is, "If it fits on a smartphone SD card, it's tiny data. If it fits on a laptop hard drive, it's maybe medium data." I think at that point many of these conversations would end.


If the data can fit on a thumb drive it's not big data.


I think I read this somewhere here a few months ago (paraphrasing, obviously): "When the indices for your DB don't fit into a single machine's RAM, then you're dealing with Big Data, not before."


And following up: Your laptop does not count as a "single machine" for purposes of RAM size. If you can fit the index of your DB in memory on anything you can get through EC2, it's still not Big Data.
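The rule of thumb in this exchange can be sanity-checked with arithmetic. The row count and the per-entry overhead below are rough assumptions for a B-tree primary-key index, not measurements:

```python
# Rough estimate: does a primary-key index fit in one big machine's RAM?
rows = 2_000_000_000            # two billion rows, a generous guess
bytes_per_index_entry = 40      # key + pointer + B-tree overhead, ballpark

index_gb = rows * bytes_per_index_entry / 1_000_000_000
ec2_ram_gb = 244                # largest EC2 instance at the time of this thread

print(f"index ~{index_gb:.0f} GB; fits in {ec2_ram_gb} GB of RAM: {index_gb < ec2_ram_gb}")
```

In practice you would measure rather than estimate, e.g. Postgres's `pg_relation_size()` on the index, but even a two-billion-row table clears the bar here with room to spare.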


There's still a ~25x difference between the biggest EC2 instance and a maxed-out Dell server (244 GB for EC2 vs 6 TB for an R920). Not to mention non-PC hardware like SPARC, POWER and SGI UV systems that fit even more.


This is true, but at the upper end the "it isn't Big Data if it fits in a single system's memory" rule starts to get fuzzy. If you're using an SGI UV 2000 with 64 TB of memory to do your processing, I'm not going to argue with you about using the words "Big Data". ;-) I figured using an EC2 instance was a decent compromise.


Would it be fair to approximate it as, "if you can lift it, it's not big data"?


If a single file of the data can fit on the biggest disk commonly available, it's not big data.


Another explanation is that your CIO is not an idiot, but rather knows about future projects that you don't. CIOs want to build capabilities (skills and technologies), not just one-off implementations.

Not saying this is the case but CIO bashing is all too easy when you're an engineer.


A good CIO would know that leaving out key parts of the project is unlikely to produce good results. Even if the details aren't final, a simple "... and we probably need to scale this up considerably by next year" would be useful when weighing tradeoffs.


The CIO is probably better at playing corporate politics than a grunt. (That's why they are CIO and not a grunt.)


The last people you want to keep secrets from are your engineers.


I think that hiding requirements from the people building the system would, in fact, be an idiotic move.



