Spark 2.0.0 Released (apache.org)
91 points by Gimpei on Aug 1, 2016 | 22 comments


Spark 2.0 was released last week and has been submitted about 10 times, but the only time it hit the front page (https://news.ycombinator.com/item?id=12171026) it received zero discussion. The behavior of data science submissions on HN is always weird. :p

I spent the weekend playing around with Spark/PySpark 2.0 for a blog post which will be released in a couple of days. PySpark is almost at parity with scikit-learn in terms of available data structures and algorithms. (Although some Spark design decisions are more enterprise-friendly than friendly to individual users. For example, Spark 2.0 is the first release that can natively read CSVs. And let's not get started on exporting CSVs: https://twitter.com/minimaxir/status/759830773893509125)
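
For reference, the new native CSV read looks roughly like this (a sketch; the path and app name are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-sketch").getOrCreate()

    # header/inferSchema mirror the options from the old spark-csv package
    df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
    df.printSchema()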


I use Spark at work and there just isn't anything super exciting about this release. They did a bunch of work under the hood to make it go faster, and added some more SQL, but there's no big change in capabilities here.

Spark is great if you really do need to parallelize your code across a pile of machines, but besides wanting to play with shiny toys, there isn't any reason to use this unless you have to.

Personally I'm looking forward to seeing what h2o does with their deep learning integration (they're going to wrap Caffeine, TensorFlow & money somehow) in Sparkling Water, since being able to easily train modern neural net models from Spark would be great, and if they can continue their deployment story into the DL space, that would be amazing for me.


Damn mobile; I meant to write Caffe, TensorFlow & mxnet.


What's wrong with exporting to csv? Seems like it's supported: https://github.com/apache/spark/blob/master/sql/core/src/mai...


Added a link to a screenshot. What happens is that each Spark partition outputs its own CSV, which is never what anyone would want!

The expected (undocumented) solution is to compress into one partition, but then non-primitive data structures get output incorrectly.

My solution was to just cast the Spark DataFrame to a pandas DataFrame and save the CSV from there.
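
Roughly this (a sketch; only sensible when the result fits in driver memory, and spark_df is a placeholder name):

    # Pull the Spark DataFrame down to the driver as pandas, then write one CSV.
    pandas_df = spark_df.toPandas()
    pandas_df.to_csv("output.csv", index=False)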


> What happens is that each Spark partition outputs its own CSV, which is never what anyone would want!

One file per partition is exactly what you want when the output is large, so that multiple executors can share the work of writing it out.

> The expected (undocumented) solution is to compress into one partition, but then non-primitive data structures get output incorrectly.

By "compress" do you mean coalesce? Yes, coalescing the RDD/DataFrame into one partition is the commonly accepted solution for when you want to force 1 output file. The cost of doing so is that you lose parallelism, since only 1 task will be able to write out that file.

And what do you mean by "non-primitive data structure" in this case?


Yes, I meant coalesce.

It makes sense that output is distributed, but it caught me by surprise since I had not seen anything about that on the official site, nor in any literature. It is unusual since the other Spark libraries have smart defaults.

My CSV had a column of DenseVectors containing floats, and when exported, it used the internal representation of the numbers (see the second tweet in reply to the first tweet).
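
The workaround I'd try is stringifying the vector column before the export (a hypothetical sketch; the column name "features" is made up):

    # Turn the DenseVector column into a plain comma-separated string so the
    # CSV writer doesn't fall back to the internal representation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    vec_to_str = udf(lambda v: ",".join(str(x) for x in v.toArray()), StringType())
    df_out = df.withColumn("features", vec_to_str(df["features"]))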


I think it would be dangerous to have the default be "coalesce to 1 partition before writing", but I agree this should be better documented since it takes many people by surprise.

As for the DenseVectors, that looks strange and is perhaps worthy of a report on the project tracker.


Why wouldn't someone occasionally want to output data per partition? I have that need fairly often, and I appreciate having the flexibility to do it. Spark requires reading the docs, for sure, but once you get the basics down it's easy going compared to other options.

I've been using Spark for a while now, and the only thing that's still frustrating is underestimating memory use. Nothing like crunching a couple hundred TiB only to have executors start dropping because they've run out of memory (because a task created just a few too many objects). It'd be really nice to be able to recover more easily.

Spark is awesome!


Also:

   df_out.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("filename")
I'm not sure if this is the same as repartition(1), but I haven't noticed any problems with data structures. Or the toPandas() solution works.
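
As far as I can tell, the difference is roughly this (a sketch; df and the paths are placeholders):

    df.coalesce(1).write.csv("out/coalesced", header=True)         # narrows existing partitions to 1, avoids a full shuffle
    df.repartition(1).write.csv("out/repartitioned", header=True)  # full shuffle first, then 1 partition
    # Either way, a single task ends up writing the whole file.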

The problem with all these approaches is that they bring all the data into memory on the driver. The real solution should be to stream the rows to the node and write them out as they arrive.


Yeah, using repartition and coalesce gave the same result for my output.


Parallel output by default is just how it's going to work on a distributed computing platform.


Outputting like that works exactly as I'd like for a Hadoop use case.


Ummmm, no, I would disagree with you. If you expect all of it to be compressed into one partition, each executor has to send its data back to the parent/driver, and the driver has to handle the task.

In most cases, like ours, we have a driver that is skimpy on RAM (around 2 GB) and each executor/slave has around 4-8 GB. Each slave/executor thread spawned has to do its own job and return when it's done.
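
That kind of sizing looks roughly like this at session creation time (values illustrative, not a recommendation):

    from pyspark.sql import SparkSession

    # Executor memory can be set when the application starts; driver memory
    # itself is normally set via spark-submit --driver-memory, since the
    # driver JVM is already running by the time this code executes.
    spark = (SparkSession.builder
             .appName("memory-sizing-sketch")
             .config("spark.executor.memory", "4g")
             .getOrCreate())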

I am planning to start a blog for our NLP search engine, www.shoten.xyz.

Our pains and our aims.


I'd hardly say MLlib matches scikit-learn in the number of algorithms available! For example, we recently had to resort to a third-party implementation of DBSCAN.

It does have most of the important ones, though. Also, the pipeline API feels slightly cleaner than the one in scikit-learn.
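
For reference, a spark.ml pipeline looks roughly like this (a sketch; the column names and training_df are placeholders):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression

    # Assumes training_df has "text" and "label" columns (placeholders).
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)

    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    model = pipeline.fit(training_df)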


Speaking of Python wrappers for Spark, do you know if this release allows use of GraphX from Python? Or is that still only available through the Scala libraries?


You may be thinking of GraphFrames: https://databricks.com/blog/2016/03/03/introducing-graphfram... (apparently it is not ready for 2.0)

There is no native support otherwise.
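
For the curious, the GraphFrames API from Python looks roughly like this (a sketch; it assumes the graphframes spark-package is on the classpath, spark is an existing SparkSession, and the vertex/edge data is made up):

    from graphframes import GraphFrame

    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()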


Thanks for the pointer to GraphFrames. I was unaware that it effectively supersedes GraphX.

I had been thinking back to https://issues.apache.org/jira/browse/SPARK-3789 and now that I've double-checked that issue, I see that Python bindings for GraphX are moot, given GraphFrames.


GraphX and Datasets are both Scala/Java only.


I don't know, but if it were written in Python, Go, or Java, things would have been different here. Just guessing.


For anyone interested in Spark, I created http://www.reddit.com/r/apachespark a couple years ago. It's a relatively small subreddit given the popularity of Spark, and could use more users :)


Thanks for creating it. It's one of only 8 subreddits I subscribe to.



