I was part of an experimental neuroimaging group that tested Pachyderm OSS years ago, and at the time we were really impressed with the versioning capabilities it provided. It made it easy for each researcher to grab and change data as needed for their own development without requiring support from eng.
How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?
Pachyderm does a ton of data deduplication, both for input data that's added to Pachyderm repos and for output files.
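I'm not claiming this is how Pachyderm actually implements it internally, but the usual idea behind that kind of dedup is content addressing: chunks are stored under a hash of their contents, so identical data is only ever stored once no matter how many files or versions reference it. A toy Python sketch of the idea (everything here is made up for illustration):

```python
import hashlib

# Toy content-addressed store: identical chunks are stored once, no matter
# how many files or commits reference them. (Illustrative only, not
# Pachyderm's actual storage layer.)
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, arbitrary for the example

store = {}       # chunk hash -> chunk bytes
manifests = {}   # file path -> ordered list of chunk hashes

def put_file(path, data):
    hashes = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # a duplicate chunk costs nothing extra
        hashes.append(h)
    manifests[path] = hashes

def get_file(path):
    return b"".join(store[h] for h in manifests[path])

# Two "versions" of mostly identical data share almost all their storage.
put_file("v1/scan.nii", b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE)
put_file("v2/scan.nii", b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE)
print(len(store))  # 3 unique chunks stored, not 4
```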
Pachyderm's pipelines are also smart enough to know what data has changed and what hasn't, and only process the incremental data "diffs" as needed. If your pipeline is one giant reduce or training job that can't be broken up at all, then this isn't valuable, but most workloads include lots of map steps where processing only the diffs can be incredibly powerful.
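To give a rough feel for what "only process the diffs" means (this is a generic sketch of the idea, not Pachyderm's real mechanism, and all the names are made up): treat each input file as a unit of work, remember the content hash of what you've already processed, and only rerun the units whose hash changed.

```python
import hashlib
import json
import os

# Toy "only process what changed" loop. Each input file is a unit of work;
# we remember the content hash of inputs we've already processed and skip
# them on the next run. (Illustrative only.)
STATE_FILE = "processed.json"

def content_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_pipeline(input_dir, process):
    seen = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            seen = json.load(f)

    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        h = content_hash(path)
        if seen.get(name) == h:
            continue              # unchanged unit: skip it entirely
        process(path)             # new or changed: reprocess just this one
        seen[name] = h

    with open(STATE_FILE, "w") as f:
        json.dump(seen, f)

# Example: run_pipeline("input/", lambda p: print("processing", p))
```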
Pachyderm does it. That's basically half of what Pachyderm does: manage the versioned data and schedule workers to run your containerized processes against it.
FYI, it's ridiculously easy to get going playing with Pachyderm if you just want to check it out. You can run it on Minikube.
I have a “data science pipeline” coordinated with a Makefile and run on CI/CD (GitLab) with reports generated as build artifacts. Big stuff checked in with Git LFS.
Pachyderm pipelines can be run in a massively distributed fashion, with data sharded across many workers. Pachyderm also offers much better failure semantics than Make + CI. For example, if processing one shard of data fails because a node or container dies, Pachyderm will automatically make sure that data gets rescheduled to another worker.
Each pipeline can have separate resource requirements (e.g. GPUs, lots of memory, etc) and gets scheduled by Kubernetes.
Finally, Pachyderm versions the output of every intermediate step in your data pipeline, so if a downstream step fails you don't have to restart from scratch; you can pick up right where it left off.
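To make the "separate resource requirements per pipeline" point concrete, a pipeline spec looks roughly like this. I'm showing it as a Python dict rather than the real JSON/YAML, and the field names are from memory of the 1.x spec, so double-check them against the docs:

```python
# Rough shape of a Pachyderm pipeline spec (shown as a Python dict; the real
# spec is JSON/YAML and these field names are approximate, from memory).
edges_pipeline = {
    "pipeline": {"name": "edges"},
    "transform": {
        "image": "my-registry/edge-detect:1.2",   # your container
        "cmd": ["python3", "/edges.py"],
    },
    "input": {
        # the glob pattern controls how the input repo is sharded into units of work
        "pfs": {"repo": "images", "glob": "/*"},
    },
    "parallelism_spec": {"constant": 8},          # 8 workers share the work
    "resource_requests": {"cpu": 1, "memory": "2G"},
    # GPUs, if a step needs them, go through Kubernetes resource limits,
    # e.g. something like:
    # "resource_limits": {"gpu": {"type": "nvidia.com/gpu", "number": 1}},
}
```

Kubernetes then schedules each pipeline's workers according to those requests, independently of every other pipeline.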
Congrats on your round! FWIW (I'm in a space adjacent to yours), I see these kinds of "my one-off pipeline seems good enough" comments all the time.
Most people don't need "industrial strength" until they hit a certain scale; they tend to optimize for ease of use and simplicity. It's one thing if they don't have to change their workflow; it's another if they not only have to change their workflow but also learn something new.
Convenience matters more. There's a pain threshold of "new tool" vs "this costs me x amount of time".
How are you guys overcoming this? Even though we're also in the infra space, I've never seen you guys out in the wild. Where would I bump into you, and at what scale would I want to add the complexity of k8s + Pachyderm + whatever other deps you have over just using an S3 bucket?
Another question: Why hasn't AWS just packaged this up and offered it as an extension of their k8s service? How are you guys going to overcome that?
Usually I see "hybrid cloud" or "on prem" as the response.
If it is on prem, are you guys relying on k8s already being present at customer sites? Do you use something like Gravitational?
Thanks for the response. I forgot to mention that the Make/CI jobs run on a Slurm HPC cluster (actually multiple, since the CI file uses runners in several Slurm clusters for different resources). Make ensures we don't redo stuff unless needed.
Point taken about restarts, but I usually want to see why a thing fails before retrying.
It seems like Pachyderm mainly offers better granularity, especially for the data management.
I think it'd be overkill if what you have is working for you. At least, replacing what you have may not be worth it.
For me, there are various properties I like about Pachyderm (though it's missing some key workflows I'd want). Looking at some output data, I can see exactly what versions of everything were run against what data, and the data is all version controlled. This should also allow experimentation by individual people and easier approval of updates. I work with multiple teams and many components, where each one may impact the others; when some final result looks weird, tracing back to figure out what's going on is hard. Tools like Pachyderm help with that.
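To make that concrete, the kind of record you want attached to every output looks roughly like this (purely illustrative; not Pachyderm's actual data model, and all names are invented):

```python
from dataclasses import dataclass, field

# Illustrative provenance record: the bookkeeping that lets you start from a
# weird final result and walk back to the exact inputs and code that produced
# it. (Not Pachyderm's actual data model.)
@dataclass
class Provenance:
    output_commit: str        # commit ID of the result you're looking at
    pipeline: str             # which pipeline produced it
    pipeline_version: str     # image/spec version of that pipeline
    input_commits: dict = field(default_factory=dict)  # input repo -> commit ID

result = Provenance(
    output_commit="a1b2c3",
    pipeline="segmentation",
    pipeline_version="seg:0.9.4",
    input_commits={"raw-scans": "f00dbeef", "model-weights": "cafe1234"},
)

# "Why does a1b2c3 look weird?" -> check exactly which inputs and code made it.
print(result.input_commits)
```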
For a job where that workflow is sufficient, you probably shouldn't.
But if you need an always-on, always-ready pipeline that can scale horizontally on an as-needed basis, and perhaps mixes many different modules/languages, then Pachyderm might be a good fit.
Well, you can use https://github.com/argoproj/argo for workflow needs without having to version everything. If you need data versioning, you can use https://dvc.org/. This gives me the freedom to choose whether I want data versioning and/or workflow orchestration, rather than one big bundle.
Hi, Pachyderm CEO here.
We have been discussing this comment internally for the last 15 minutes and still don't understand it. But we really enjoy it. Would you mind explaining a bit more what you are saying?