
Here's an interesting performance comparison of PostgreSQL with PostGIS and MongoDB. PostgreSQL outperformed MongoDB in almost all of their test cases. https://link.springer.com/article/10.1007/s10707-020-00407-w

I've decided to use PostgreSQL in my projects.



Thanks for sharing this! I've been thinking about alternatives to PostGIS for handling larger datasets (millions of rows), and nothing seems to come close to the level of functionality, performance, and community support of Postgres.


If you really need to scale beyond what Postgres/PostGIS can handle, then you might want to check out GeoMesa[1], which is (very loosely) "PostGIS for HBase, Cassandra, or Google BigTable".

That being said, you may not need it, because Postgres/PostGIS can scale vertically to handle larger datasets than most people realize. I recommend loading your intended data (or your best simulation of it) into a Postgres instance running on one of the extremely large VMs available on your cloud provider, and running a load test with a distribution of the queries you'd expect. Assuming the deliberately over-provisioned instance is able to handle the queries, you can then run some experiments to "right-size" the instance to find the right balance of compute, memory, SSD, etc. If it can handle the queries but not at the QPS you need, then read replicas may also be a good solution.
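
To make the load-testing step concrete, here's a rough sketch of what it might look like once the data is in. The places table, geog column, and coordinates are just placeholders for whatever your real schema and query mix are, and the last query assumes the pg_stat_statements extension is installed and preloaded:

    -- Hypothetical "places" table with a geography column "geog" (SRID 4326)
    CREATE INDEX IF NOT EXISTS places_geog_gist ON places USING GIST (geog);
    ANALYZE places;

    -- Time one representative query and confirm the index is actually used
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT id, name
    FROM places
    WHERE ST_DWithin(geog, ST_SetSRID(ST_MakePoint(-122.42, 37.77), 4326)::geography, 5000);

    -- After replaying your expected query mix, pull the slowest statements
    -- (column is mean_time instead of mean_exec_time before PostgreSQL 13)
    SELECT query, calls, mean_exec_time
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;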

[1] https://github.com/locationtech/geomesa


Yeah, at my current job we run RDS in AWS and scale it up to an m5.12xlarge when we need to get sh*t done fast. It normally sits at around a 4xlarge simply because the 12xlarge is far too expensive.

I'll check out GeoMesa though, looks interesting!


Millions is a pretty small dataset these days. I'm increasingly of the opinion that the boundary for considering "big data" tools starts at about 100 million rows.


I'd put it higher, depending on your dataset, usage patterns, etc.

This is kinda the hill I keep nearly dying on at work. Team X wants to spend months investigating and deploying BigDataToolY because "we have big data". This is not our core business or core competency. I tell them to dump the data into Postgres and just see how it performs out of the box so we can get back to working on stuff that matters. They don't, I do, we end up using Postgres.

We had one team ignore the advice and go straight to Redshift (which is a data warehousing product and totally inappropriate for their use case, but they ignored that advice too). When they finished and woke up to the reality that Redshift wasn't going to work, I literally just dumped their data into a plain vanilla Postgres instance, pointed their app at it, and... everything worked out of the box. That was ~1 billion rows in a single poorly modeled table.
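
For anyone wondering what "just dumped their data into a plain vanilla Postgres instance" amounts to in practice, it's usually not much more than a bulk load. A rough sketch, with made-up table and file names:

    -- Hypothetical table matching the exported CSV; names are made up
    CREATE TABLE events (
        id          BIGINT PRIMARY KEY,
        occurred_at TIMESTAMPTZ NOT NULL,
        payload     JSONB
    );

    -- Bulk load a client-side CSV export using psql's \copy
    \copy events FROM 'events.csv' WITH (FORMAT csv, HEADER true)

    -- Add secondary indexes after the load, then refresh planner statistics
    CREATE INDEX events_occurred_at_idx ON events (occurred_at);
    ANALYZE events;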

Another team was certain that Postgres could never work and was looking to build out a solution around BigQuery, Firestore, and some other utilities to pre-process a bunch of data and pre-render responses. One of our main products was operating in a severely degraded state while they spent months on "research" before deciding this would take about four more months to implement. So I dumped all the data into a Postgres instance (using TimescaleDB) and it... worked out of the box. The current data is a few billion rows across half a dozen tables (the bulk of it in a single table), but I'd tested to 4x that volume without any significant performance degradation.
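
For reference, the TimescaleDB part really is just Postgres plus one extension. A minimal sketch, assuming the extension is installed, with a made-up metrics schema standing in for the real one:

    -- Plain Postgres plus the TimescaleDB extension; "metrics" schema is made up
    CREATE EXTENSION IF NOT EXISTS timescaledb;

    CREATE TABLE metrics (
        time      TIMESTAMPTZ      NOT NULL,
        device_id BIGINT           NOT NULL,
        value     DOUBLE PRECISION
    );

    -- Turn the plain table into a hypertable partitioned on time
    SELECT create_hypertable('metrics', 'time');

    -- Typical query against a few billion rows: hourly averages for one device
    SELECT time_bucket('1 hour', time) AS bucket, avg(value)
    FROM metrics
    WHERE device_id = 42
      AND time > now() - interval '7 days'
    GROUP BY bucket
    ORDER BY bucket;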

These are just a couple "notable" examples, but I've done this probably a dozen times now on fairly sizeable revenue-generating real-world products. Often I'm migrating this data _off_ of big data solutions which are performing poorly due to either the team's lack of knowledge and experience to use them properly or the tool having been the wrong one to use in the first place.

I haven't even had to make these teams model their data properly yet. Usually just a lift and shift into Postgres solves the problem.

I've told every one of these teams, "We'll use Postgres until it stops working or starts getting needy, _then_ we'll look at the big data tools." I've yet to migrate any of these datasets back _off_ of Postgres.


Love this comment.

I've promoted the idea in the past that if you're going to use something other than PostgreSQL (or MySQL, if that's the DB that's already in place), you need to PROVE that what you need to build can't work with that standard relational database before adopting some new datastore.

It's surprisingly hard to prove this. The most common exception is anything involving processing logs that generate millions of new lines a day, in which case some kind of big data thing might be a better fit.


Every time Postgres comes up I'm impressed.



