> What happens is that each Spark partition outputs its own CSV, which is never what anyone would want!
One file per partition is exactly what you want when the output is large, so that multiple executors can share the work of writing it out.
> The expected (undocumented) solution is to compress into one partition, but then non-primitive data structures get output incorrectly.
By "compress" do you mean coalesce? Yes, coalescing the RDD/DataFrame into one partition is the commonly accepted solution for when you want to force 1 output file. The cost of doing so is that you lose parallelism, since only 1 task will be able to write out that file.
And what do you mean by "non-primitive data structure" in this case?
It makes sense that output is distributed, but it caught me by surprise since I had not seen anything about that on the official site, nor in any literature. It is unusual since the other Spark libraries have smart defaults.
My CSV had a column of DenseVectors containing floats, and when exported, it used the internal representation of the numbers (see second tweet in reply to first tweet)
I think it would be dangerous to have the default be "coalesce to 1 partition before writing", but I agree this should be better documented since it takes many people by surprise.
As for the DenseVectors, that looks strange and is perhaps worthy of a report on the project tracker.
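In the meantime, a common workaround is to flatten or stringify the vector column before writing, since the CSV writer only really knows how to render primitive column types. This is a sketch under assumptions, not the reporter's actual code: the column name `features` and the vector length are hypothetical, and `vector_to_array` requires Spark 3.0+:

```python
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

# Turn the ML vector column into a plain array of doubles.
flat = df.withColumn("features_arr", vector_to_array("features"))

# Option 1: one CSV column per vector element (assumes a known, fixed size).
n = 3  # hypothetical vector length
cols = [F.col("features_arr")[i].alias(f"f{i}") for i in range(n)]
flat.select(*cols).write.mode("overwrite").csv("/out/features_csv", header=True)

# Option 2: keep the whole vector as a single comma-joined string column.
as_str = flat.withColumn(
    "features_str",
    F.concat_ws(",", F.col("features_arr").cast("array<string>")),
)
```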
Why wouldn't someone occasionally want to output data per partition? I have that need fairly often, and I appreciate having the flexibility to do it. Spark requires reading the docs, for sure, but once you get the basics down it's easy going compared to other options.
I've been using Spark for a while now, and the only thing that's still frustrating is underestimating memory use. Nothing like crunching a couple hundred TiB only to have executors start dropping because they've run out of memory (because a task created just a few too many objects). It'd be really nice to be able to recover more easily.
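It doesn't fix the underlying estimation problem, but the knobs that usually matter here are the executor memory settings. A hedged sketch of setting them from Python; the values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning-example")
    # Heap available to each executor JVM.
    .config("spark.executor.memory", "8g")
    # Headroom outside the heap (native buffers, Python workers, etc.).
    .config("spark.executor.memoryOverhead", "1g")
    # Fraction of the heap shared by execution and storage (unified memory).
    .config("spark.memory.fraction", "0.6")
    .getOrCreate()
)
```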
I'm not sure if this is the same as repartition(1), but I haven't noticed any problems with data structures. Or the to_pandas() solution works.
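Assuming that refers to `DataFrame.toPandas()` (pandas-on-Spark spells it `to_pandas()`), the pattern is roughly the following; it collects everything onto the driver, so it only works when the result fits in driver memory (output path is hypothetical):

```python
# Collect the (small) result to the driver as a pandas DataFrame,
# then write a single CSV file with pandas.
pdf = df.toPandas()
pdf.to_csv("/out/result.csv", index=False)
```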
The problem with all these approaches is that they bring all the data into memory on the driver. The real solution would be to stream to one node and write the output out as it goes.
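One way to approximate that without holding the whole dataset in driver memory at once is `toLocalIterator()`, which pulls partitions to the driver one at a time. A rough sketch, with a made-up local output path:

```python
import csv

# Stream partitions to the driver one at a time instead of collecting
# the whole DataFrame, appending rows to a single local CSV file.
with open("/out/streamed.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(df.columns)  # header row
    for row in df.toLocalIterator():
        writer.writerow(row)
```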
Ummmm no, I would disagree with you. If you expect all of it to be compressed into one partition, each executor has to send its data back to the parent/driver, and the driver has to handle the task.
In most cases, like ours, we have a driver that is skimpy on RAM (around 2 gig) and each executor/slave has around 4-8 gigs.
Each slave/executor thread that is spawned has to do its own job and return when it's done.
I am planning to start a blog for our NLP search engine, www.shoten.xyz