Launch HN: Sematic (YC S22) – Open-source framework to build ML pipelines faster
121 points by neutralino1 on Aug 10, 2022 | 47 comments
Hi HN – I’m Emmanuel, founder of Sematic (https://sematic.dev). Sematic is an open-source framework to prototype and productionize end-to-end Machine Learning (ML) and Data Science (DS) pipelines in days instead of weeks or months. The idea is to do for ML development what Rails and Heroku did for web development.

I started my career searching for Supersymmetry and the Higgs boson on the Large Hadron Collider at CERN, then moved to industry. I spent the last four years building ML infrastructure at Cruise. In both academia and industry, I witnessed researchers, data scientists, and ML engineers spending an absurd share of their time building makeshift tooling, stitching up infrastructure, and battling obscure systems, instead of focusing on their core area of expertise: extracting insights and predictions from data.

This was painfully apparent at Cruise where the ML Platform team needed to grow linearly with the number of users to support and models to ship to the car. What should have just taken a click (e.g. retraining a model when world conditions change – COVID parklets, road construction sites, deployment to new cities) often required weeks of painstaking work. Existing tools for prototyping and productionizing ML/DS models did not enable developers to become autonomous and tackle new projects instead of babysitting current ones.

For example, a widely adopted tool such as Kubeflow Pipelines requires users to learn an obscure Python API, package and deploy their code and dependencies by hand, and does not offer exhaustive tracking and visualization of artifacts beyond simple metadata.

In order to become autonomous, users needed a dead-simple way to iterate seamlessly between local and cloud environments (change code, validate locally, run at scale in the cloud, repeat) and visualize objects (metrics, plots, datasets, configs) in a UI. Strong guarantees around dependency packaging, traceability of artifact lineage, and reproducibility would have to be provided out-of-the-box.

Sematic lets ML/DS developers build and run pipelines of arbitrary complexity with nothing more than minimalistic Python APIs. Business logic, dynamic pipeline graphs, configurations, resource requirements, etc. — all with only Python. We are bringing the lovable aspects of Jupyter Notebooks (iterative development, visualizations) to the actual pipeline.

How it works: Sematic resolves dynamic nested graphs of pipeline steps (simple Python functions) and intercepts all inputs and outputs of each step to type-check, serialize, version, and track them. Individual steps are orchestrated as Kubernetes jobs according to required resources (e.g. GPU, high-memory), and all tracking and visualization information is surfaced in a modern UI. Build assets (user code, third-party dependencies, drivers, static libraries) are packaged and shipped to remote workers at runtime, which enables a fast and seamless iterative development experience.

Sematic lets you achieve results much faster by not wasting time on packaging dependencies, foraging for output artifacts to visualize, investigating obscure failures in black-box container jobs, bookkeeping configurations, writing complex YAML templates to run multiple experiments, etc.

It can run on a local machine or be deployed to leverage cloud resources (e.g. GPUs, high-memory instances, map/reduce clusters, etc.) with minimal external dependencies: Python, PostgreSQL, and Kubernetes.

Sematic is open-source and free to use locally or self-hosted in your own cloud. We will provide a SaaS offering to enable access to cloud resources without the hassle of maintaining a cloud deployment. To get started, simply run `$ pip install sematic; sematic start`. Check us out at https://sematic.dev, star our GitHub repo, and join our Discord for updates, feature requests, and bug reports.

We would love to hear from everyone about your experience building reliable end-to-end ML training pipelines, and anything else you’d like to share in the comments!



For people in this thread interested in what this tool is an alternative to: Airflow, Luigi, Kubeflow, Kedro, Flyte, Metaflow, Sagemaker Pipelines, GCP Vertex Workbench, Azure Data Factory, Azure ML, Dagster, DVC, ClearML, Prefect, Pachyderm, and Orchest.

Disclaimer: author of Orchest https://github.com/orchest/orchest


Yup, most of these tools fall under the definition of "orchestration" in one way, shape, or form, though not all of them are targeted at Machine Learning pipelines, and not all of them focus as much on the low-barrier-to-entry space we're aiming at. In general, they also don't optimize for local workflows while still enabling cloud access. We understand there are a lot of tools that may superficially look similar, so we've put together a page describing what we do differently from other tools: https://docs.sematic.dev/sematic-vs

Btw, Orchest looks like a cool way to orchestrate notebooks!

Also, hi! I'm founding engineer here at Sematic. Happy to answer any questions I can!


How does Sematic compare to Metaflow? It optimizes for many of the same goals as Sematic: local workflows, cloud access, lineage tracking, state transfer, etc.


There are several differences, but I'd say these are some of the main ones:

UI: whereas Metaflow provides the ability to build your own result visualizations explicitly in your workflow (via their "cards" feature), Sematic gives your outputs (and inputs) automatic rich visualizations based on the type of the data being passed around.

API: Instead of being based around explicitly building up a graph, where you have to specify the I/O connections between steps, Sematic makes defining your steps look like writing and calling Python functions.

Packaging: Whereas Metaflow requires you to include packaging information in the code defining your steps (the @conda decorator, etc.), Sematic plugs into your existing dependency management to bundle up dependencies for execution in the cloud.


Interesting approach on passing data between steps and constructing the overall graph - it will be interesting to see what the take rate is between the two approaches (of Sematic and Metaflow). On the UI front, Metaflow generates visualizations for all objects by default in @card. But how does Sematic package up the PyTorch dependency referenced in the example (https://docs.sematic.dev/real-example) for execution in the cloud? IIRC, Metaflow packages the cwd (in addition to @conda, @pip, etc.) and relies on existing packages for local execution?

Edit: Digging deeper, Sematic relies on Bazel (https://docs.sematic.dev/execution-modes#dependency-packagin...) and needs a BUILD file to specify all the dependencies for cloud execution. It seems that the entire pipeline will execute as a single (or multiple) k8s pod(s) using the same environment?

I am quite interested in trying out Sematic. Any guidelines on what kind of scale Sematic can support today (and the near future)?


The way packaging is designed to work in Sematic is for us to hook into your existing dependency management solution to determine what your dependencies are, then build a docker image for you based on those. As you point out, right now we only integrate with bazel for this purpose, but we hope to add more. A simple plugin for requirements.txt -> Docker image is probably next on the TODO list.

> It seems that the entire pipeline will execute as a single (or multiple) k8s pod(s) using the same environment?

Single Docker image, but multiple pods (when you are using the full cloud mode). This was an intentional decision to avoid confusion around what things can be imported in what places (mimicking more closely what it would be like in one Python instance), and also to avoid weird version inconsistencies across the pipeline.

> Any guidelines on what kind of scale Sematic can support today (and the near future)?

Based on some prior tooling experiences, the main bottleneck should be what your Kubernetes cluster can handle.

> I am quite interested in trying out Sematic

Glad to hear it! We'd love to hear about your experiences. You can join our discord if you want help while you're trying it out: https://discord.com/invite/4KZJ6kYVax


Nice comparison section! And thank you kindly for the compliment.

Our approach has been to support both Python scripts (R, Julia, JS, and Bash too, for that matter) and notebooks, as it gives users the ability to choose the right tool for the job and, in the case of migrations from notebooks to scripts, makes the process more incremental.

Welcome to the thread :wave:


Why do people keep reinventing pipelines? There are so many already, all with almost identical syntax. This one is similar to Flyte.


Looks cool!

> Sematic makes I/O between steps in your pipelines as simple as passing an output of one python function as the input of another. Airflow provides APIs which can pass data between tasks, but involves some boilerplate around explicitly pushing/pulling data around, and coupling producers and consumers via named data keys.

In robotics you sometimes need high-performance data transformation, e.g. convert a pile of raw robot log data protos --> pile of simulation inputs --> pile of extracted data --> munged into net input format.

Does Sematic support this if the communication between tasks uses Python functions? Like, if my simulator is C++, will I have to use SWIG?

In some of the competing systems, the inputs/outputs between nodes are just files produced as side effects, which is nice because the system doesn't care what language/infra you use as long as you produce the required input/output.


Thank you!

What we have seen done in the past is using things like pybind11 to expose C++ APIs in Python, which I guess is a similar concept to SWIG. If you are using build tools such as Bazel, you can even get the C++ compiled at runtime when submitting your pipeline.

Regarding I/O artifacts, Sematic lets users choose how they are serialized. We offer reasonable baseline defaults, but certain artifacts require serialization formats that are cross-language (e.g. ROS messages, since you mention robotics).

As a last resort, users are free to serialize and persist artifacts by hand as part of their pipeline functions (e.g. storing them in a cloud bucket) and only return a reference to said artifact (e.g. a Python dataclass with the artifact location and metadata).


I will check it out after work. Let me just say that this is indeed a legitimate problem. In my experience, after you train the model, it takes at least 3x the effort to deploy it and push it to production.

I wish it were as easy as dragging and dropping the model onto target servers after building it.


Absolutely. Software development has many nice CI/CD patterns to generate assets, test, and deploy. These enable software engineers to work fast with a safety net and have all pipelines automated.

I think ML development is where web development was in the late 2000s. Clear patterns and best practices have not yet emerged. For example, many ML developers work without reproducibility and traceability enabled, which does not allow for fast and safe work.


Ok, I'll bite.

disclaimer: this is a nice framework, will happily try it.

Imho: The underlying patterns are quite clear, and there are various approaches to build stable pipelines.

I have used automation with basic containers + GitLab actions/custom runners, ClearML, Earthly pipelines, Kubeflow, etc. for this.

All of those can give reproducibility (experiment tracking, code & dependencies, etc.) without much effort.

The last mile (model deployment) is often very specific, so let's keep that out of scope.

But: The basic problem is cultural, not technical.

One stated goal of this project hits close to the root cause: "Facilitating the transition from Jupyter Notebook prototype code to steady production-grade pipelines".

As ML developers, we have to stop regarding notebooks as anything that produces acceptable output (apart from initial exploration). Work has to happen in structured, tracked, and versioned codebases (~production code).

Anything that happened locally/in a notebook might as well not exist from my point of view.


That is very true. In our careers we have repeatedly incentivized users to exit Notebooks early on.

Of course we understand why Notebooks are so appealing and useful, and we think they have their place in the toolbox (like a Python console for developers).

And you are correct that there are already many tools to build reproducible, traceable pipelines. What we have found is that they are still too difficult to adopt, which is why we are trying to greatly lower the barrier to entry.


Notebooks, with the right workflow, are just fine; just use nbdev or an equivalent.


I have an idea where I want to build an ML system that generates different sets of board game rules (think tic-tac-toe type games), then trains models to play that game, and scores each set of rules based on a set of criteria. For example: no side should always win, the skill ceiling should be high (models should keep improving when trained more). A less skilled (trained) model should sometimes be able to beat a more skilled model. The games should end within a reasonable number of turns. Etc. The high level system should then generate new rulesets, searching for a ruleset that scores optimally on the criteria. Would Sematic be good for this?


Thanks for your question! Yes, Sematic has a neat feature that can help: Dynamic Graphs. Because Sematic uses simple Python to declare the control and data flow of your graph, you can simply loop over configurations (in your case different sets of board game rules) and train a model for each config, and eventually aggregate results to determine the winner.

Join our Discord if you want to discuss this further – https://discord.gg/4KZJ6kYVax


Awesome. Thanks!


> Machine Learning (ML) and Data Science (DS) developers are not software engineers, and vice-versa.

Woof, I'm out.


Your comment struck me because superficially I agreed, but thinking about it, the statement is quite true. Both are strongly related but quite different, just as a marathon runner (though better than average) will not set world records on a 100m sprint.

A software engineer can easily use an XGBoost model and get some outputs, but is a long way from having a deep understanding of statistical distributions. An ML engineer can easily put together a Django website, but is far from being able to design a maintainable software application.


Maybe on average it's true, but it's also a weird tone to strike in marketing materials. Where I work, the MLEs are expected to write production-level code and deploy their own services. We explicitly hire one level down on the SDE ladder when hiring MLEs, so a senior MLE should basically be almost a senior SDE.


> with minimal external dependencies: Python, PostgreSQL, and Kubernetes

What a time to be alive.

All jokes aside, this is really awesome and I'm glad to see more and more tools making ML more developer-friendly and accessible. Out of curiosity, do you guys come from a TF, PyTorch, JAX, etc. background?


Indeed, Kubernetes is hardly "light" :D

However, it's only required when running pipelines in the cloud. When running locally (`$ sematic start`), nothing other than Python is required.

Regarding background, we are folks with experience building ML tooling. We've built infra around TF and PyTorch, but not JAX. That being said, Sematic is agnostic to the framework you want to use.


How about if "locally" is a handful of servers reachable by ssh?


You can certainly deploy the web app on one server, and run your pipelines on another (or the same). In this case, "locally" would mean that the pipeline and all its steps run on the same host machine. This is totally sufficient in many cases.

Kubernetes becomes interesting when using heterogeneous resources (e.g. GPU nodes for training, high-memory for data processing, etc.), but is not a necessity.


Congratulations on the launch! Best wishes! I would absolutely love to dive into it soon.

Here are some high-level questions:

- How does it handle failure of individual tasks in the pipeline?

- What if the underlying jobs (e.g. training, dataset extraction, or metrics evaluation) need to run outside the k8s cluster (e.g. running bare-metal, Slurm, SageMaker, or even a separate k8s cluster)?

- How does caching work if multiple pipelines can share some common components (e.g. dataset extraction)?


> - How does it handle failure of individual tasks in the pipeline?

At this time there is no handling of failures (Sematic is 6 weeks old :). In the near future we will have fault-tolerance mechanisms: retries, try/except.

> - What if the underlying jobs need to run outside the k8s cluster?

You are free to launch jobs on third-party platforms from one of your pipeline steps. This is a pretty common pattern, for instance launching a Spark job, or a training job on a dedicated GPU cluster. In this case, the pipeline step that launches the job (the Sematic function) needs to wait for the third-party job to complete, or pass a reference to the job to a downstream step that will do the waiting.

> - How does caching work?

At this time there is no caching (as mentioned, Sematic is very new :). We will implement memoization soon. What you can do is run a data processing pipeline separately and then use the generated dataset as input to other pipelines. This is a pretty common pattern: having a number of sub-pipelines (e.g. a data processing loop, a train/eval loop, a testing/metrics loop, etc.) that you can run independently, but that you can also put together in an end-to-end pipeline for automation. Sematic lets you nest pipelines in arbitrary ways, and each sub-pipeline can still have its own entry point for independent execution.


Sounds great! Very interested when the SaaS offering opens up. Definitely not keen on running a Kubernetes cluster for the sake of simplifying ML operations.


Glad it looks interesting to you! Regarding running a Kubernetes cluster, that's only required if you want the steps in your pipeline to all execute in their own containers. If your workflow is such that everything can run on a single machine, you can still use Sematic to track your experiments. One advantage here is that if you ever do need to scale up to containerized workflows, you can do so without changing the code for your pipeline.


Do I still need to manage a kubernetes cluster?


In the current open-source product, yes. But you can also simply use Sematic locally without any infrastructure (`$ sematic start`).

In the coming months, we will provide a fully-hosted SaaS offering to save you the hassle of maintaining your own infrastructure.


Love this. As an MLOps practitioner who has repeatedly needed to build this at multiple banks, Sematic finally seems like a real solution for the wider world, a real place to bring best practices to data science pipelining.


What are my options for big data that won’t fit completely into memory? Is it easy to hook up to a Spark cluster?

Do I have the option to access the underlying infra through a Unix shell, when the UI isn’t enough?


You are free to launch jobs on Spark clusters from any of your pipeline steps, whether they run locally or remotely. Be mindful that if you do so, the function that launches the job also needs to wait for its completion. In the future we will provide deeper integrations with Spark.

When running on cloud instances, Sematic runs container images on your Kubernetes nodes. You are free to SSH into those and run the images manually to have the same dependency context, but any ephemeral context of the job (e.g. storage) will no longer be there.


Uber needed to expend a huge effort to make PyTorch play with Spark (aka Horovod/Petastorm), and, all things considered, it's still a complete and utter mess if you do anything but load preprocessed Parquet custom-partitioned/row-optimized for batch/cluster size. So, I am not about to blame you for a lack of Spark integration, as everyone pretty much rolls their own solution at this point anyway.

What does bother me here, much as with many other ML orchestration frameworks, is that examples and even docs don't tell me how you play with serious situations in terms of scale.

In fact, what you describe as "a real ML pipeline" is - in my view - not a good example of a real ML pipeline because, for instance, it doesn't tell me how I'd solve the issue of scaling out multi-node training when the data doesn't fit in a neat and standard PyTorch map-style Dataset that loads some CSV from the web.

I mean, maybe it's because I am dumb, and maybe the majority of people do in fact train single-node models on MNIST data, but I'd appreciate some more information on how you deal with more diverse sources in your pipeline. Will I have to squeeze cloud provider X's data solutions (which I am obligated to use by the client, say) into submission until they fit your examples? Because these days I get the feeling that claims of "easily orchestrating your ML pipeline" often amount to that. I see you started some of these topics in the "integrations" part of the docs. However, these pages do not seem to exist yet (for me). Furthermore, the "roadmap" link goes to a 404.

For me, these are the important topics. I can get a nice simple ML pipeline easily on Azure, AWS or Databricks if I am willing to conform to whatever they are doing already. It seems you are in a position to tackle more challenging problems, so that would be nice to show.

Cool product, and good luck!


Thank you – you are right that these are very important topics, and we also had to expend a lot of effort at Cruise to scale training beyond a single node. We had training jobs running over dozens of GPU nodes for many days. For example, we had a dedicated team to optimize streaming of training data into PyTorch dataloaders. This evidently requires more infrastructure, as well as many features around fault tolerance, checkpointing, warm restarts, etc.

We are a very new framework (launched publicly July 1st :-), so there is much work to be done to cover many more example use cases.

What we have found powerful about this plain-function approach is that users can submit jobs on remote platforms (e.g. Spark, Google Dataflow, etc.) and use heterogeneous resources (e.g. standard nodes to launch third-party jobs, then GPU nodes for training, etc.). So whatever "cloud provider X's data solutions" you have to use, if it has a Python API to submit and wait for jobs, you should be fine.


> The idea is to do for ML development what Rails and Heroku did for web development.

I think this is a great way to explain what you're doing. I'm working in the same space (ML/DS tooling) and I feel like we, as the ML/DS community, haven't cracked exactly what Rails for data looks like. I actually wrote some ideas on this a while ago (https://ploomber.io/blog/rails4ml/).

Congrats on the launch and best of luck with the product!


Thanks! I was a pretty heavy user of Rails in past jobs. It has a nice mix of good abstractions (Model/View/Controller), tooling (CLI, local web server), and best practices (how to name tables, fields, lifecycle timestamps, etc.).

We think that is a good way to go: solid abstractions that experts can build on top of, but also more junior folks can get started quickly by following best practices.

Ploomber looks great too, I like the breakdown of your SaaS offering.


How is it different from MLflow? And does the recent MLflow Pipelines feature have any similarities?


MLflow generally focuses on the "lineage tracking" piece alone, i.e. registering your models and such. They do have an experimental pipeline product, as you mention, but it's different in a number of ways. I'd say the biggest one is that MLflow Pipelines has fixed, pre-determined structures for your pipelines, while Sematic lets you build your own. We also have a more Python-dev-friendly API and better access to cloud computing resources. You can find more details on our comparison page: https://docs.sematic.dev/sematic-vs#...-mlflow-pipelines


Just a note to say that even though the name "Sematic" is the same, this is not the same open source project as mine that I posted to Show HN about a week ago here: https://news.ycombinator.com/item?id=32364193.


As they say, naming things is one of the two hardest problems in Computer Science :)

Your project is cool and has a cool name!


Haha, thank you, and yours has a great name too!

Both our projects are ML-adjacent. I don't want to cause any confusion, and will start thinking of some alternate names for my project.


Independently of the name collision, your Show HN looks good, and it didn't get any attention. If you email me at hn@ycombinator.com, I'll send you a repost invite for it.


How is this different from or complementary to Tecton?


I'll confess I haven't used Tecton myself, but reading through their documentation, it seems they are much more focused on ETL-style data pipelines with the final output going to a feature store, whereas Sematic targets general end-to-end ML pipelines (i.e. not just dataset transformations/feature extraction, but also model training and evaluation). In case it's helpful, we do have a page that compares Sematic with some other tools in a similar domain to us: https://docs.sematic.dev/sematic-vs#...-mlflow-pipelines


This is amazing. Long live open-source platforms that let developers and data scientists focus on the interesting parts of their jobs and make them more productive.



