Hacker News | karakanb's comments

I have recently been looking into extracting a bunch of details from a set of legacy invoice PDFs and had a subpar experience. Gemini was the best among the ones that I tried, but even that missed quite a bit. I'll definitely give this a look.

It seems like such a crowded space, with many tools doing document extraction. I wonder if there's anything in particular pulling more attention into the space?


Disclaimer: I am the co-founder of a dbt alternative, Bruin. (https://github.com/bruin-data/bruin)

I think consolidation in the space has been coming for quite some time now, and this merger only confirms what we, along with many others, have been saying: data tooling is in a miserable state, and we have had to glue together a bunch of different tools that don't work with each other.

At this point, I think it is quite obvious that Fivetran is going for Snowflake/Databricks's market share. They own the ingestion for many companies already, and they will offer a managed data lake product in order to compete with the data giants. By owning the means of bringing the data in (Fivetran) as well as the transformation layer (dbt/sqlmesh), they will aim to get ahead of the Snowflakes of the world.

I think it'll be a win for the data community if they maintain and continue investing in the existing tooling, as it is running in quite a few places already, especially dbt core running in a self-managed way. I certainly hope they won't try to squeeze revenue from their combined users for the sake of it.

It's an interesting time to be in the space, and it feels great to be one of the few independent players in the market.


How will Fivetran and dbt, who are detested for being overpriced and underfeatured in the segment they are supposed to be good at (ETL/ELT), take on being a data lake? That's orders of magnitude more complex to engineer and operate, and they have no experience. This is really a play to consolidate, get rid of duplicate functions and provide a better experience to customers.


This seems to underestimate the engineering muscle these companies have. Fivetran is well capable of building a query engine, and with this merger, they also get access to SDF's query engine. They have the engineering capabilities, as well as the capital to attract talent where needed.

I would not underestimate any of these players in the space.


With Snowflake's new OpenFlow offering based on Apache Nifi, Snowflake will be able to become Fivetran faster than Fivetran can become Snowflake/Databricks, though....

Thoughts?


Fivetran has 2 pieces: data movement and integration/access (via linking to APIs, etc.). Sure, there's Airbyte and others, but no one has as complete a catalog of integration plugs as Fivetran.

Yes, OpenFlow may be able to replace the data movement part (though prefect, airflow, yadda yadda all tried to be the one ring), but it's a pretty small bunch who support all connections.

It's not the only part of the business, but it's an important one. Just like Plaid's approach became the standard to code against in accessing financial services, Fivetran has become the default in getting data out of tools you already pay for directly into a controlled space.

If they don't muck that up, they've got an embed in every large integration. No one is looking to OpenFlow for that.

Still won't become Snowflake or Databricks (and it's silly to try, imho), but they do have a good moat for a small castle.


OpenFlow feels like an attempt at keeping customers within Snowflake's boundaries, and while it might work for some, I see no way Snowflake will be able to keep up with data integration needs unless they allow another way of extending their capabilities beyond pre-built integrations.

On the other hand, I do agree with you that it is quite a big challenge for Fivetran to try and become Snowflake.


They also acquired Census recently - reverse ETL.


This seems really interesting, thanks for sharing.

I can see myself deploying something like this for some of our internal use cases. However, the deployment options around such tooling seem to be rather complicated, especially when it comes to the need for RAG and the like. Would I need to host something else on top of Upsonic for these sorts of use cases?


In the framework, we are trying to support all of these kinds of requests. Integrations like MCP will also help with this process by connecting to other developers.


while MySQL is very popular, it is very rare to see it in analytical/ML use cases, which is why we haven't added it yet. There's nothing from a technical POV that prevents us from adding it; it just hasn't been a priority. I am happy to pull it up if that would help your use cases.


thanks! we already use DLT under the hood with ingestr, so some of our connectors already come from there. is that what you meant?


Oh didn’t know that. The last time I looked at ingestr it had fewer — seemingly — connections than DLT advertises.


hey, thanks a ton for sharing your thoughts, I appreciate that!

I am sorry to hear that it didn't work. We do have a dedicated page for duckdb specifically here: https://bruin-data.github.io/bruin/platforms/duckdb.html

Would this help with it? I'd love to see how we can improve if you'd like to share your thoughts on that. Please feel free to join our slack community as well, we can talk directly there too.


hey, thanks a lot for sharing your thoughts.

I like the comparison page in Hamilton. In their examples they operate at the asset level, whereas Bruin crosses from the asset level into the orchestrator level as well, effectively bridging the gap there. Bruin goes beyond a single asset, which might be a group of functions: it is basically about being able to build and run pipelines of those assets.

In terms of distributed execution, it is on our roadmap to make running distributed workloads as simple as possible, and Postgres as a pluggable queue backend is one of the options as well. Currently, Bruin is meant as a single-node CLI tool that does the orchestration and the execution within the same machine.


love it, thanks!


glad to hear you like it, thanks!!


that's definitely coming, thanks!

