I was part of an experimental neuroimaging group that tested Pachyderm OSS years ago, and at the time we were really impressed with the versioning capabilities it provided. It made it easy for each researcher to grab and change data as needed for their own development without requiring support from eng.
How well does that work when your datasets are a sizeable percentage of available storage capacity, though? Is there some sort of deduplication at work?
Pachyderm does a ton of data deduplication, both for input data that's added to Pachyderm repos and for output files.
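I'm not claiming this is how Pachyderm actually implements it internally, but the usual idea behind that kind of dedup is content addressing: chunks are stored under a hash of their contents, so identical data is only ever stored once no matter how many files or versions reference it. A toy Python sketch of the idea (everything here is made up for illustration):

```python
import hashlib

# Toy content-addressed store: identical chunks are stored once, no matter
# how many files or commits reference them. (Illustrative only, not
# Pachyderm's actual storage layer.)
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, arbitrary for the example

store = {}       # chunk hash -> chunk bytes
manifests = {}   # file path -> ordered list of chunk hashes

def put_file(path, data):
    hashes = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)   # a duplicate chunk costs nothing extra
        hashes.append(h)
    manifests[path] = hashes

def get_file(path):
    return b"".join(store[h] for h in manifests[path])

# Two "versions" of mostly identical data share almost all their storage.
put_file("v1/scan.nii", b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE)
put_file("v2/scan.nii", b"A" * CHUNK_SIZE + b"C" * CHUNK_SIZE)
print(len(store))  # 3 unique chunks stored, not 4
```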
Pachyderm's pipelines are also smart enough to know what data has changed and what hasn't, and only process the incremental data "diffs" as needed. If your pipeline is one giant reduce or training job that can't be broken up at all, then this isn't valuable, but most workloads include lots of map steps where processing only the diffs can be incredibly powerful.
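To give a rough feel for what "only process the diffs" means (this is a generic sketch of the idea, not Pachyderm's real mechanism, and all the names are made up): treat each input file as a unit of work, remember the content hash of what you've already processed, and only rerun the units whose hash changed.

```python
import hashlib
import json
import os

# Toy "only process what changed" loop. Each input file is a unit of work;
# we remember the content hash of inputs we've already processed and skip
# them on the next run. (Illustrative only.)
STATE_FILE = "processed.json"

def content_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_pipeline(input_dir, process):
    seen = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            seen = json.load(f)

    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        h = content_hash(path)
        if seen.get(name) == h:
            continue              # unchanged unit: skip it entirely
        process(path)             # new or changed: reprocess just this one
        seen[name] = h

    with open(STATE_FILE, "w") as f:
        json.dump(seen, f)

# Example: run_pipeline("input/", lambda p: print("processing", p))
```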
Pachyderm does it. That's basically half of what Pachyderm does: manage the versioned data and schedule workers to run your containerized processes against it.
FYI, it's ridiculously easy to get going playing with Pachyderm if you just want to check it out. You can run it on Minikube.
I have a “data science pipeline” coordinated with a Makefile and run on CI/CD (GitLab) with reports generated as build artifacts. Big stuff checked in with Git LFS.
Pachyderm pipelines can be run in a massively distributed fashion, with data sharded across many workers. Pachyderm also offers much better failure semantics than Make + CI. For example, if processing one shard of data fails because a node or container dies, Pachyderm will automatically make sure that data gets rescheduled to another worker.
Each pipeline can have separate resource requirements (e.g. GPUs, lots of memory, etc) and gets scheduled by Kubernetes.
Finally, Pachyderm versions the output of every intermediate step in your data pipeline, so if a downstream step fails you don't have to restart from scratch; you can pick up right where it left off.
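To make the "separate resource requirements per pipeline" point concrete, a pipeline spec looks roughly like this. I'm showing it as a Python dict rather than the real JSON/YAML, and the field names are from memory of the 1.x spec, so double-check them against the docs:

```python
# Rough shape of a Pachyderm pipeline spec (shown as a Python dict; the real
# spec is JSON/YAML and these field names are approximate, from memory).
edges_pipeline = {
    "pipeline": {"name": "edges"},
    "transform": {
        "image": "my-registry/edge-detect:1.2",   # your container
        "cmd": ["python3", "/edges.py"],
    },
    "input": {
        # the glob pattern controls how the input repo is sharded into units of work
        "pfs": {"repo": "images", "glob": "/*"},
    },
    "parallelism_spec": {"constant": 8},          # 8 workers share the work
    "resource_requests": {"cpu": 1, "memory": "2G"},
    # GPUs, if a step needs them, go through Kubernetes resource limits,
    # e.g. something like:
    # "resource_limits": {"gpu": {"type": "nvidia.com/gpu", "number": 1}},
}
```

Kubernetes then schedules each pipeline's workers according to those requests, independently of every other pipeline.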
Congrats on your round! FWIW (I'm in a space adjacent to yours), I see these kinds of "my one-off pipeline seems good enough" comments all the time.
Most people don't need "industrial strength" until they hit a certain scale; they tend to optimize for ease of use and simplicity. It's one thing if they don't have to change their workflow; it's another if they not only have to change their workflow but also learn something new.
Convenience matters more. There's a pain threshold of "new tool" vs "this costs me x amount of time".
How are you guys overcoming this? Even though we're also in the infra space, I've never seen you guys out in the wild. Where would I bump into you, and at what scale would I want to add the complexity of k8s + Pachyderm + whatever other deps you have over just using an S3 bucket?
Another question: Why hasn't AWS just packaged this up and offered it as an extension of their k8s service? How are you guys going to overcome that?
Usually I see "hybrid cloud" or "on prem" as the response.
If it is on prem, are you guys relying on k8s already being present at customer sites? Do you use something like Gravitational?
Thanks for the response. I forgot to mention that the Make/CI jobs run on a Slurm HPC cluster (actually multiple, since the CI file uses runners in several Slurm clusters for different resources). Make ensures we don't redo stuff unless needed.
Point taken about restarts, but I usually want to see why a thing fails before retrying.
It seems like Pachyderm mainly offers better granularity, especially for the data management.
I think it'd be overkill if what you have is working for you. At least, replacing what you have may not be worth it.
For me, there are various properties I like about Pachyderm (though it's missing some key workflows I'd want). Looking at some output data, I can see exactly what versions of everything were run against what data, and the data is all version controlled. This should also allow experimentation by individual people and easier approval of updates. I work with multiple teams and many components, where each one may impact the others; when some final result looks weird, tracing back to figure out what's going on is hard. Tools like Pachyderm help with that.
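To make that concrete, the kind of record you want attached to every output looks roughly like this (purely illustrative; not Pachyderm's actual data model, and all names are invented):

```python
from dataclasses import dataclass, field

# Illustrative provenance record: the bookkeeping that lets you start from a
# weird final result and walk back to the exact inputs and code that produced
# it. (Not Pachyderm's actual data model.)
@dataclass
class Provenance:
    output_commit: str        # commit ID of the result you're looking at
    pipeline: str             # which pipeline produced it
    pipeline_version: str     # image/spec version of that pipeline
    input_commits: dict = field(default_factory=dict)  # input repo -> commit ID

result = Provenance(
    output_commit="a1b2c3",
    pipeline="segmentation",
    pipeline_version="seg:0.9.4",
    input_commits={"raw-scans": "f00dbeef", "model-weights": "cafe1234"},
)

# "Why does a1b2c3 look weird?" -> check exactly which inputs and code made it.
print(result.input_commits)
```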
For a job where that workflow is sufficient, you probably shouldn't.
But if you need an always-on, always-ready pipeline that can scale horizontally on an as-needed basis, and perhaps mixes many different modules/languages, then Pachyderm might be a good fit.
Well, you can use https://github.com/argoproj/argo for workflow needs without having to version everything. If you need data versioning, you can use https://dvc.org/. This gives me the freedom to choose whether I want data versioning and/or workflow orchestration, rather than one big bundle.
Hi, Pachyderm CEO here.
We have been discussing this comment internally for the last 15 minutes and still don't understand it. But we really enjoy it. Would you mind explaining a bit more what you are saying?