crcsmnky's comments | Hacker News

Bummer. I loved his read of the news segments during Morning Edition. So much so that I asked him to record my voicemail greeting with a similarly styled opening.


Do you still have it?



Shameless self-promotion: I just wrote up a little doc [1] on using Locust and Kubernetes to run and scale distributed load tests, along with some sample code [2].

I looked at JMeter and Gatling as alternatives but they were either a bit clunky (JMeter) or not distributed out of the box (Gatling). Late in the game I did discover Tsung (Erlang) but didn't get time to try it out - looks pretty powerful though.

[1] https://cloud.google.com/solutions/distributed-load-testing-...

[2] https://github.com/GoogleCloudPlatform/distributed-load-test...
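The core mechanic Locust provides -- many concurrent simulated users hitting a target while latencies are aggregated -- can be sketched with nothing but the standard library. This is a toy single-process sketch, not Locust's API; `fake_request` is a stand-in for a real HTTP call:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    """Stand-in for an HTTP call; a real locustfile would use an HTTP client."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulated server latency
    return time.perf_counter() - start

def run_load_test(num_users=20, requests_per_user=5):
    """Run concurrent simulated users and collect per-request latencies."""
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        futures = [pool.submit(fake_request)
                   for _ in range(num_users * requests_per_user)]
        latencies = [f.result() for f in futures]
    return {
        "requests": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }

stats = run_load_test()
print(stats["requests"])  # 100
```

What Locust layers on top of this idea is the distributed master/worker mode and the web UI -- which is exactly the part the Kubernetes setup in [1] scales out.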


Great writeup. It's very close to what we built at tradewave.net (except for Locust). I'm glad I'm not the only one who saw the value of load testing with scalable containers. At some point we'll be moving toward HFT, and having a true load testing suite is handy.


While this generally works best for larger code bases, I tend to start reading through open bugs/tickets and find things that appear easy. Then I will assign them to myself and do what I can to fix them, or at least track them down.

Generally I find it hard to just start reading through packages, source, functions, etc. and find it much easier to try and solve some sort of problem. By tracking and debugging a particular issue through to the end, I find I learn a lot about the codebase.


Perhaps I'm missing something. It appears that the author is recommending against using Hadoop (and related tools) for processing 3.5GB of data. Who in the world thought that would be a good idea to begin with?

The underlying problem here isn't unique to Hadoop. People who are minimally familiar with how technology works and who are very much into BuzzWords™ will always throw around the wrong tool for the job so they can sound intelligent with a certain segment of the population.

That said, I like seeing how people put together their own CLI-based processing pipelines.


I've used Hadoop at petabyte scale (2+ PB input; 10+ PB sorted for the job) for machine learning tasks. If you have such a thing on your resume, you will be inundated with employers who have "big data", and at least half will be under 50 GB, with a good chunk of those under 10 GB. You'll also see multiple (shitty) 16-machine clusters, any of which -- for any task -- could be destroyed by code running on a single decent server with SSDs. Let alone Hadoop jobs running in EMR, which is glacially slow (slow disk, slow network, slow everything).

Also, hadoop is so painfully slow to develop in it's practically a full employment act for software engineers. I imagine it's similar to early ejb coding.


> Also, hadoop is so painfully slow to develop in it's practically a full employment act for software engineers.

It's comical how bad Hadoop is compared even to the CM Lisp described in Daniel Hillis' PhD dissertation. How do you devolve all the way from that down to "It's like map/reduce. You get one map and one reduce!"
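To make the jab concrete, the entire programming model really is one map phase and one reduce phase. Word count -- the canonical Hadoop example -- sketched in plain Python, no framework assumed:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # The one map you get: emit (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # The one reduce you get: sum counts per key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(counts["the"])  # 2
```

Everything Hadoop adds -- shuffling pairs between machines, spilling to disk, retrying failed tasks -- is plumbing around those two functions.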


Programming is very faddish. It's amazing how bad commonly used technologies are. I'm so happy I'm mostly a native developer and don't have to use the shitty web stack and its shitty replacements.


What really puzzles me is that Doug Cutting worked at Xerox PARC and Mike Cafarella has two (!) CS Masters degrees, a PhD degree, and is a professor at the University of Michigan. It's not like they were unaware of the previous work in the field (Connection Machine languages, Paralations, NESL).


It sounds a little bit like BizTalk :-)


Imho, Hadoop is only for 100 TB or more. Anything less can be easily handled by other tools.


Exactly this just happened where I work. The CIO was recommending Hadoop on AWS for our image processing/analysis jobs. We process a single set of images at a time which come in around ~1.5GB. The output data size is about 1.2GB. Not a good candidate for Hadoop but, you know... "big data", right?


Indeed. For more real-world examples of people who thought that they had "big data", but didn't, see https://news.ycombinator.com/item?id=6398650 ("Don't use Hadoop - your data isn't that big"). The linked essay has:

> They handed me a flash drive with all 600MB of their data on it (not a sample, everything). For reasons I can't understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.

User w_t_payne commented:

> I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.

> All of them wanted to feel like they were doing something special.

I think that last line is critical to understanding why a CIO might feel this way.
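For perspective on the pandas.read_csv quip: aggregating a 600 MB CSV is a streaming loop even without pandas (which collapses it to a read_csv plus a sum). A minimal stdlib sketch, with a made-up column name and a tiny in-memory stand-in for the file:

```python
import csv
import io

# In-memory stand-in for the 600 MB file; DictReader streams row by row,
# so memory use stays constant regardless of file size.
data = io.StringIO("user,amount\na,10\nb,5\na,3\n")

total = 0.0
for row in csv.DictReader(data):
    total += float(row["amount"])
print(total)  # 18.0
```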


"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." -Dan Ariely


We're in a similar situation. In fact a stupider one. We have a 120 GB baseline of data inside a relational store. The vendor has a filestream option that allows blobs to be stored on disk instead of the transaction log and be pushed through the DB server rather than using our current CIFS DFS cluster. So let's stick our 950 GB static document load in there too (while I was on holiday, typically) and off we go.

So the thing starts going like shit, as expected, now that it has to stream all that through the DB bottleneck.

So what's the solution? Well, we're in big data territory now apparently at 1.2 TiB (comically small, and an almost entirely static data set), and we have every vendor licking arse with the CEO and CTO to sell us Hadoop, more DB features and SAN kit.

We don't even need it for processing. Just a big CRUD system. Total Muppets.


Resume-Driven Development


Being in charge of large budgets is better for CEOs and CTOs than being in charge of small budgets. It makes you look more impressive so they go for it like lemmings over the cliff, sometimes taking entire companies with them.


No, it's called enterprise software, and it's driven by "appeal to the enormous egos and matching budgets of leaders of companies that are held back more by their own bloat than by anything resembling competition." This is just standard practice for the enterprise software sector in general, and after everyone's spent their billions and the next recession is upon us, they'll probably axe most of these utter failures to produce business value. But all the consultants will just blame someone other than the client being dysfunctional because that's just the fastest way to get kicked off a contract.

Consulting for enterprise customers tends to be a lot like marriages - you can be right, or you can be happy (or paid). It takes a unique customer to have gotten past their cultural dysfunctions to accept responsibilities for their problems and to take legitimate, serious action. But like marriage, there can be great, great upsides when everyone gets on the same page and works towards mutual goals with the spirit of selflessness and growth. Yeah....


I have had this exact conversation at various past employers, usually when they started talking about "big data" and hadoop/friends:

"How much data do you expect to have?"

"We don't know, but we want it to scale up to be able to cover the whole market."

"Okay, so let's make some massive overestimates about the size of the market and scope of the problem... and that works out to about 100Mb/sec. That's about the speed at which you can write data to two hard drives. This is small data even in the most absurdly extreme scaling that I can think of. Use postgres."

Even experienced people do not have meaningful intuitions about what things are big or small on modern hardware. Always work out the actual numbers. If you don't know what they are then work out an upper bound. Write all these numbers down and compare them to your measured growth rates. Make your plans based on data. Anything that you've read about in the news is rare or it wouldn't be news, so it is unlikely to be relevant to your problem space.
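The "work out the actual numbers" habit can be made mechanical. A sketch with illustrative, made-up rates (100 events/sec at 1 KB each, retained for three years -- deliberately an overestimate for most products):

```python
# Hypothetical upper bound: 100 events/sec, 1 KB per event, 3-year retention.
events_per_sec = 100
bytes_per_event = 1_000
seconds_per_year = 60 * 60 * 24 * 365

# Sustained ingest rate and total stored volume.
ingest_mb_per_sec = events_per_sec * bytes_per_event / 1e6
three_year_tb = events_per_sec * bytes_per_event * seconds_per_year * 3 / 1e12

print(ingest_mb_per_sec)        # 0.1 -- MB/sec; a rounding error for one disk
print(round(three_year_tb, 1))  # 9.5 -- TB; fits on a couple of drives
```

Compare the output to what a single Postgres box handles comfortably before even considering a cluster.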


That's not even medium data. Most people probably would be surprised to find out that their data could be stored and processed on an iPhone, and that using heavier duty tools isn't necessary or worthwhile.


Perhaps a good way of explaining it to management is, "If it fits on a smartphone SD card, it's tiny data. If it fits on a laptop hard drive, it's maybe medium data." I think at that point many of these conversations would end.


If the data can fit on a thumb drive it's not big data.


I think I read this somewhere here a few months ago (paraphrasing, obviously): "When the indices for your DB don't fit into a single machine's RAM, then you're dealing with Big Data, not before."


And following up: Your laptop does not count as a "single machine" for purposes of RAM size. If you can fit the index of your DB in memory on anything you can get through EC2, it's still not Big Data.


There's still a 40x difference between the biggest EC2 instance and a maxed-out Dell server (244 GB for EC2 vs 6 TB for an R920). Not to mention non-PC hardware like SPARC, POWER and SGI UV systems that fit even more.


This is true, but at the upper end the "it isn't Big Data if it fits in a single system's memory" rule starts to get fuzzy. If you're using an SGI UV 2000 with 64 TB of memory to do your processing, I'm not going to argue with you about using the words "Big Data". ;-) I figured using an EC2 instance was a decent compromise.


Would it be fair to approximate it as, "if you can lift it, it's not big data"?


If a single file of the data can fit on the biggest disk commonly available, it's not big data.


Another explanation is that your CIO is not an idiot but rather knows about future projects that you don't. CIOs want to build capabilities (skills and technologies), not just one-off implementations every time.

Not saying this is the case but CIO bashing is all too easy when you're an engineer.


A good CIO would know that leaving out key parts of the project is unlikely to produce good results. Even if the details aren't final, a simple "… and we probably need to scale this up considerably by next year" would be useful when weighing tradeoffs.


The CIO is probably better at playing corporate politics than a grunt. (That's why they are CIO and not a grunt.)


The last people you want to keep secrets from are your engineers.


I think that hiding requirements from the people building the system would, in fact, be an idiotic move.


"Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data (tm) tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques."

I think the point the author is making is that although they knew from the start that Hadoop wasn't necessary for the job, many people probably don't.


Lots of people think that is "big data". For most people, if it's too big for an Excel spreadsheet, it's "big data", and the way you process big data is with Hadoop. Of course, once you show them the billable-hours difference between setting up a Hadoop cluster and (in my case at least) using Python libraries on a MBP, they change their minds real fast. It's just a matter of "big data" being a new thing; people will figure it out as time goes on and things settle down.


People love the idea of being big and having "big problems". Wanting to use Hadoop isn't that different from wanting to use all sorts of fancy "web-scale" databases.

Most of us don't have scaling issues or big data, but that sort of excludes us from using all the fancy new tools that we want to play with. I'm still convinced that most of the stuff I work on at work could run on SQLite, if designed a bit more carefully.

The truth is that most of us will never do anything that couldn't be solved with 10-year-old technology. And honestly we should be happy; there's a certain comfort in being able to use simple and generally understood tools.


Totally agree. It's usually pretty clear who has taken the time out to read about your background and who hasn't. I like responding and connecting with the ones that do their research.


How do you know that Apple didn't prove their efficacy to the city?

Your comment comes off as incredibly naive - why should Apple have to foot the cost? Are you suggesting that all technology products should be given for free to educational institutions unless their educational efficacy can be proven?

I fail to see how this is different from schools buying desktop or laptop computers. They're still computers running applications - which is where the true educational component comes into play. If the software is worthless then sure, it's probably pointless, but with a working software solution, why can't this be effective?


Wow, there are a lot of issues with this article.

First off, startups don't represent all of Silicon Valley, so saying this is a Valley problem using startup-centric data is disingenuous at best.

I'll admit that I find the lack of diversity amongst founding teams problematic but this article doesn't raise any kind of awareness to the source of the diversity breakdown. The author just jumps to a broad, unsupported conclusion so quickly that it undermines any actual rational, legitimate discussion of the topic.


"There is nothing special in Go that makes it particularly well suited to video games, but on the other hand there is nothing inherent in the language that makes it a bad choice, except for the fact that it's new and not well supported."

I would say that "new and not well supported" makes this a potentially bad choice for shipping code. Not to say that it can't be used but the problem with new technologies is that not all warts have been properly exposed and you can open yourself up to major headaches.


A screenshot is a nice-to-have, or something that qualifies as a legitimate request, but it's not critical, nor would I expect it to be off-putting if not provided.

