Hadoop is insane. The elephant is fitting. Is it really the best choice, or has ...

Alupis · on April 22, 2015

> Is it really the best choice, or has someone done something cleaner in golang or c++11?

What does the language have to do with the program?

Hadoop is what it is because it's a complex problem with a fittingly complex solution. Simply re-writing it in your pet language won't somehow make it "better".

getsat · on April 22, 2015

Go and modern C++ are both quite a bit more terse than Java. They also produce binaries which don't necessarily require a runtime to be available on every server (just ABI compatibility).

(I have no horse in this race, I am just writing what I think the grandparent comment was referring to)

pjmlp · on April 22, 2015

> They also produce binaries which don't necessarily require a runtime to be available on every server

Just like Java[0]. It is just a matter of choosing the right compiler for the use case at hand.

[0] - http://www.excelsiorjet.com/ (one of many vendors)

getsat · on April 22, 2015

Cool concept, I didn't realise this existed. Can you run Hadoop and friends under this? I've worked at companies with over 500 servers in a Hadoop cluster and literally never once heard about anything other than using Oracle's JRE aside from one proposal to use OpenJDK which was shot down pretty quickly.

pjmlp · on April 22, 2015

I don't have experience with Hadoop.

Almost all commercial JVMs have some form of AOT or JIT caching, especially those that target embedded systems.

Sun never added support to the reference JVM for political reasons, as they would rather push for plain JIT.

Oracle is now finally thinking about adding support for it, with no official statement if it will make it into 9 or later.

JEP 197 is the start of those changes, http://openjdk.java.net/jeps/197

Oracle Labs also has SubstrateVM, which is an AOT compiler built with Graal and Truffle.

Alupis · on April 22, 2015

Way back in the day, GCC's gcj compiler would do AOT compilation of Java; however, I believe it stopped being developed at JDK 5 support.

pjmlp · on April 23, 2015

If I am not mistaken, most of the developers abandoned the project to work on the Eclipse compiler and OpenJDK when those projects became available.

GCC only keeps gcj around due to its unit tests.

Alupis · on April 22, 2015

There are also things like exec4j, which bundles everything including a JVM into an executable which one can just run... and things like AdvancedInstaller and Install4j will also allow one to bundle a JVM.

So producing a binary which doesn't require a separate runtime really isn't a problem.

pjmlp · on April 22, 2015

Since you mention it, Java 8 brings bundling and installer support into the reference JDK.

swills · on April 22, 2015

C++ does usually require a runtime.

yoklov · on April 22, 2015

C++'s runtime is small and ubiquitous. Depending on how the software is written (if it allows disabling exceptions and RTTI), it might be the same size as C's runtime, which is practically (but not totally) nonexistent.

I'm not an expert on Java, but my experience with it is that its runtime is fairly huge and requires custom installation.

bandrami · on April 23, 2015

C++'s runtime is worse than Java's in that sense. Most JVMs can run most Java bytecode, but your libstdc++ has to be from the same version of the same compiler that your application was compiled with.

Alupis · on April 23, 2015

It was quite surprising for me the first time I did a little embedded work and discovered I couldn't run binaries that were compiled against glibc on my musl-libc based system, and vice-versa. I had initially thought they all just supported the same C89 spec, so they should work...

bandrami · on April 23, 2015

Yep. It's 99% ABI compatible, but that 1% will kill you.

For that matter, as you allude to, even C has a runtime.

swills · on April 23, 2015

Which C++ runtime is ubiquitous? I can think of at least 3 C++ runtimes (MS, libstdc++, libc++).

mjibson · on April 22, 2015

I spent an entire day last week attempting to build Hadoop with LZO compression support. There are many outdated guides on the internet about how to do this, and I eventually gave up and spent a few hours getting the Cloudera packages to install in a Dockerfile so I could reproduce my work later.

Figuring out which software packages I needed, how to modify my environment variables, which compiler to get, and where to put everything in the correct directory was the entire difficulty.

If it were written in Go instead of Java, I could have done `go get apache.org/hadoop` and it would have been done instead of giving up after hours of frustration.

Go has almost no new features that make it an interesting language from a programming language perspective. Go's win is that it makes the actual running of real software in production better. Hadoop's difficulty is exactly why InfluxDB exists at all.

Alupis · on April 22, 2015

> If it were written in Go instead of Java, I could have done `go get apache.org/hadoop`

This complaint is just about packaging, and not the language itself. Any project can have good or bad packaging scripts, and for Java there are plenty of ways to make it "good".

Not to mention, the BUILDING.txt document clearly states they use maven[1] and to build you just do: mvn compile

> Go's win is that it makes the actual running of real software in production better

This might just be a familiarity issue, because once you launch the program, all things are equal.

And yes, you can bundle a JVM with your Java app, which makes it exactly like Go's statically linked runtime and just as portable without any fuss.

[1] https://github.com/apache/hadoop/blob/trunk/BUILDING.txt

kbar13 · on April 22, 2015

> no new features

Go gets us better performance and concurrency out of the box.

Alupis · on April 22, 2015

> Go gets us better performance

Than Java? At best, Go performs on par with Java, but is often measured 10-20% slower.[1][2][3]

This is usually attributed to the far more mature optimizing compiler in the JVM, which ultimately compiles bytecode down to native machine code, especially for hot paths. Java performance for long running applications is on par with C (one of the reasons it's a primary choice for very high performing applications such as HFT, Stock Exchanges, Banking, etc).

> concurrency out of the box.

Java absolutely supports concurrency "out of the box"...[4]

[1] http://zhen.org/blog/go-vs-java-decoding-billions-of-integer...

[2] http://stackoverflow.com/questions/20875341/why-golang-is-sl...

[3] http://www.reddit.com/r/golang/comments/2r1ybd/speed_of_go_c...

[4] http://docs.oracle.com/javase/7/docs/api/java/util/concurren...
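
To make the "out of the box" point concrete, here is a minimal sketch using only `java.util.concurrent` from the standard library, with no third-party dependencies. It assumes Java 8 or later for the lambda syntax; the class name `ConcurrencySketch` and the method `parallelSum` are hypothetical names made up for illustration, not anything from Hadoop or the linked docs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class: sums 1..n by splitting the range across a thread pool.
class ConcurrencySketch {

    static long parallelSum(int n, int chunks) throws Exception {
        // Fixed-size pool from the standard library; one thread per chunk.
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            int step = (n + chunks - 1) / chunks; // ceiling division
            for (int start = 1; start <= n; start += step) {
                final int lo = start;
                final int hi = Math.min(n, start + step - 1);
                // Each task sums its own sub-range [lo, hi] independently.
                tasks.add(() -> {
                    long sum = 0;
                    for (int i = lo; i <= hi; i++) {
                        sum += i;
                    }
                    return sum;
                });
            }
            long total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(1_000_000, 4));
    }
}
```

This only shows that thread pools and futures ship with the JDK; it says nothing about how Java's threads compare with goroutines in cost or ergonomics.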

AlisdairO · on April 22, 2015

Hell, if we look at real-world-ish applications, the TechEmpower benchmarks show Go at easily 50% slower than a bunch of different Java options.

hysterix · on April 22, 2015

>What does the language have to do with the program?

I happen to agree with you wholeheartedly. If you spend enough time here, though, you'll see the inevitable comment about how anything made in PHP is worthless, insecure garbage and anyone who spends their time developing a PHP application is an amateur at best.

This isn't really a comment at you, just wanting to point out how much that convention is challenged.

elyase · on April 22, 2015

http://www.pachyderm.io is a modern alternative.

century19 · on April 22, 2015

Apache Spark is a good replacement for Hadoop now. It's written in Scala.

growse · on April 22, 2015

Spark is a good replacement for MapReduce. MapReduce != Hadoop.

century19 · on April 22, 2015

Fair enough, but the original article was about Hadoop MapReduce, wasn't it? It specifically says:

"without even using any of the HBaseGiraphFlumeCrunchPigHiveMahoutSolrSparkElasticsearch (or any other of the Apache chaos) mess yet."