Pachyderm Raises $10M to Bring Data Provenance to the Enterprise (pachyderm.io)
102 points by jaz46 on Nov 16, 2018 | hide | past | favorite | 27 comments


I was part of an experimental neuroimaging group that tested Pachyderm OSS years ago and at the time we were really impressed with the versioning capabilities it provided. For us at the time it made it easy for each researcher to grab and change data as needed for their own development without requiring support from eng.


How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?


Pachyderm does a ton of data deduplication, both for input data that's added to Pachyderm repos and for output files.

Pachyderm's pipelines are also smart enough to know what data has changed and what hasn't, and only process the incremental data "diffs" as needed. If your pipeline is just one giant reduce or training job that can't be broken up at all, then this isn't valuable, but most workloads include lots of Map steps where only processing diffs can be incredibly powerful.
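(Not affiliated, but for anyone curious how this kind of dedup works in general: the usual trick is content addressing — store each chunk under a hash of its contents so identical data is only kept once, and re-run only the datums whose hashes changed between commits. A rough Python sketch of the idea, not Pachyderm's actual implementation:)

```python
import hashlib

class ContentStore:
    """Deduplicating store: identical chunks share a single entry."""
    def __init__(self):
        self.chunks = {}  # content hash -> bytes

    def put(self, data: bytes) -> str:
        h = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(h, data)  # stored once, however often it's added
        return h

def incremental_run(store, old_commit, new_commit, process):
    """Re-process only datums whose content hash changed between commits.

    Commits here are just {path: content_hash} dicts.
    """
    results = {}
    for path, h in new_commit.items():
        if old_commit.get(path) == h:
            continue  # unchanged datum: skip it, reuse the prior output
        results[path] = process(store.chunks[h])
    return results
```

So adding the same file to two repos (or two commits) only costs storage once, and a Map-style step only touches the changed paths.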


This is super cool, thanks for pointing that out. Is the hard part done by Pachyderm itself, or by some layer over container file systems?


Pachyderm does it. It's about half of what Pachyderm does: managing the versioned data, and scheduling workers to run your containerized processes against it.

FYI, it's ridiculously easy to get going playing with Pachyderm if you just want to check it out. You can run it on Minikube.


> You can run it on Minikube

Thanks for the tip. I just started down the k8s path from bare metal cluster and will try this.


I have a “data science pipeline” coordinated with a Makefile and run on CI/CD (GitLab) with reports generated as build artifacts. Big stuff checked in with Git LFS.

Why would I use Pachyderm?


Pachyderm pipelines can be run in a massively distributed fashion, with data sharded across many workers. Pachyderm also offers much better failure semantics than Make + CI. For example, if one shard of your pipeline fails, say because a node or container dies, Pachyderm will automatically make sure that data gets rescheduled to another worker.

Each pipeline can have separate resource requirements (e.g. GPUs, lots of memory, etc) and gets scheduled by Kubernetes.

Finally, Pachyderm versions all of the intermediate steps in your data pipeline, so if a downstream step fails you don't have to restart from scratch; you can pick up right where it left off.


Congrats on your round! FWIW (me being in a space adjacent to yours), I see these kinds of "my 1 off pipeline seems good enough" comments all the time.

Most people don't need "industrial strength" until they hit a certain scale. They tend to optimize for ease of use/simplicity. It's one thing if people don't have to change their workflow; it's another if they not only have to change their workflow but also learn something new.

Convenience matters more. There's a pain threshold of "new tool" vs "this costs me x amount of time".

How are you guys overcoming this? Even though we're also in the infra space, I've never seen you guys out in the wild. Where would I bump into you, and at what scale would I want to take on the complexity of k8s + Pachyderm + whatever other deps you have over just using an S3 bucket?

Another question: Why hasn't AWS just packaged this up and offered it as an extension of their k8s service? How are you guys going to overcome that?

Usually I see "hybrid cloud" or "on prem" as the response. If it is on prem, are you guys relying on the presence of k8s at customers? Do you guys use something like gravitational?


> my 1 off pipeline seems good enough

Not what I was describing. We’ve been using the setup for a while with multiple projects and for the data science end of a clinical trial.

I like building new setups incrementally out of tools I already know, so the obvious question is why this new tool is a radical improvement.


Thanks for the response. I forgot to mention that the Make/CI jobs are run on a Slurm HPC cluster (actually multiple since the CI file uses runners running in multiple Slurm clusters for different resources). Make ensures we don’t redo stuff unless needed.

Point taken about restarts, but I usually want to see why a thing fails before retrying.

It seems like Pachyderm mainly offers better granularity, especially for the data management.


I think it'd be overkill if what you have is working for you. At least, replacing what you have may not be worth it.

For me, there are various properties I like about Pachyderm (though it's missing some key workflows I'd want). Looking at some output data, I can see exactly what versions of everything were run against what data, and the data is all version controlled. This should also allow experimentation by individual people and easier approval of updates. I work with multiple teams with many components, where each one may impact the others, and being able to look at some final result that's weird and trace back to figure out what's going on is hard. Tools like Pachyderm help with that.
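To make the "look at an output and see exactly what produced it" point concrete: the core of data provenance is just that every output artifact records which input commits and which code version it was derived from, so you can walk backwards from a weird result. A minimal sketch of that ledger idea (illustrative only, not Pachyderm internals — the `repo@commit` naming is made up):

```python
def record_provenance(ledger, output_id, input_ids, code_version):
    """Attach lineage metadata to an output artifact."""
    ledger[output_id] = {"inputs": list(input_ids), "code": code_version}

def trace(ledger, output_id):
    """Walk lineage back to the raw inputs (anything with no recorded parents)."""
    entry = ledger.get(output_id)
    if entry is None:
        return [output_id]  # raw input: nothing produced it
    raw = []
    for parent in entry["inputs"]:
        raw.extend(trace(ledger, parent))
    return raw
```

Given a suspicious model artifact, `trace` tells you exactly which raw data commits fed into it, and the ledger entry tells you which pipeline version built it.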


For a job where that workflow is sufficient, you probably shouldn't.

But if you need an always-on, always-ready pipeline that can scale horizontally on an as-needed basis, and perhaps mixes many different modules/languages, then Pachyderm might be a good fit.


Well, you can use https://github.com/argoproj/argo for workflow needs without needing to version everything. For data versioning, if you need it, you can use https://dvc.org/ This gives me the freedom to choose whether I want data versioning and/or a workflow engine, rather than one big bundle.


He was holding these two pieces of pizza...


Hi, Pachyderm CEO here. We have been discussing this comment internally for the last 15 minutes and still don't understand it. But we really enjoy it. Would you mind explaining a bit more what you are saying?


Not the person you are responding to, but the comment you're replying to is referencing Seinfeld, due to the name of your company.

https://en.wikipedia.org/wiki/The_Stand_In_(Seinfeld)


What a wonderfully obscure reference.


This is lovely, thank you.


I've also wondered why I should use Pachyderm. I decided to give it a try and wrote the following blog post about it: https://medium.com/bigdatarepublic/pachyderm-for-data-scient... "Finally, version control for your data"


Congrats guys! Data provenance is only becoming more important.


This is great news. An amazing team with a good solution to a huge problem. I look forward to following your progress.


Love this project! Git for data is such a brilliant concept


Congrats to the team at Pachyderm!


Much deserved. I've been following this project for some time and have continued to be impressed.


Congrats Pachyderm team.


nice!



