MongoDB has successfully played the 'hype first, features later' strategy. Now it is well on the way to being a decent Swiss-Army-knife database.
The RethinkDB retrospective[0] contains a lot of insight into how MongoDB has succeeded despite being vastly inferior on a technical level when it first launched. I have to grant them a certain respect for executing their strategy so successfully.
Choice quote:
Every time MongoDB shipped a new release and people congratulated them on making improvements, I felt pangs of resentment. They’d announce they fixed the BKL, but really they’d get the granularity level down from a database to a collection. They’d add more operations, but instead of a composable interface that fits with the rest of the system, they’d simply bolt on one-off commands. They’d make sharding improvements, but it was obvious they were unwilling or unable to make even rudimentary data consistency guarantees.
But over time I learned to appreciate the wisdom of the crowds. MongoDB turned regular developers into heroes when people needed it, not years after the fact. It made data storage fast, and let people ship products quickly. And over time, MongoDB grew up. One by one, they fixed the issues with the architecture, and now it is an excellent product. It may not be as beautiful as we would have wanted, but it does the job, and it does it well.
> MongoDB has successfully played the 'hype first, features later' strategy. Now it is well on the way to being a decent Swiss-Army-knife database.
I have no idea how capable MongoDB is these days, as I haven't used Mongo in years (and even then it was not for long).
However, I do not know any developers who, after living through the "hype first, features later" strategy, have been left with a positive enough opinion of MongoDB to ever want to use it again.
Epic had a post-mortem blog post here that mentioned in passing they had stumped all the experts they could find to look at unsolvable issues they had with MongoDB.
https://news.ycombinator.com/item?id=16340462 I kind of assumed the fix was going to be a rewrite with Postgres or MySQL.
- You think people replace a MongoDB cluster with a single Postgres instance? You should really use HA clusters in real life and stop reading the Reddit/HN hype behind PG. With 3.5M+ CCU, no one would use an architecture with a single master/slave (which is what PG is).
MongoDB and MySQL get bad press from people who never used them in real life and just repeat what they read online.
I could tell you horror stories about PG not having an official replication system until 2011, when PG 9.0 landed.
> I could tell you horror stories about PG not having an official replication system until 2011, when PG 9.0 landed.
I could tell you a horror story that happened to me just a few weeks ago, where MariaDB corrupted data out of nowhere due to a bug[1]. This happened multiple times and cost us multiple hours of work (including the service being down) each time, until we realized the issue wasn't hardware but a software bug.
If you ask me, I'd take PostgreSQL's approach of not shipping broken replication before 2011 over MySQL's still corrupting data.
Data usually is the most valuable asset a company has.
> I could tell you horror stories about PG not having an official replication system until 2011, when PG 9.0 landed.
And I would reiterate that just because something isn't in mainline doesn't mean it's not possible. Did you know that Pg didn't have native partitioning until Pg10? Somehow we managed to do partitioning before then.
I don't buy the argument that you need to ship broken features just to have them; Pg doesn't include a feature in core until it's a /good/ solution, well engineered and with appropriate toggles. That is not a horror story.
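For anyone who never saw it, here's a minimal sketch of how partitioning was done before Pg10, via inheritance plus a routing trigger (table names and the connection string are hypothetical; Python/psycopg2 is used just for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection
    cur = conn.cursor()

    # Children inherit from the parent; CHECK constraints let the planner
    # prune partitions via constraint exclusion.
    cur.execute("""
        CREATE TABLE events (id bigserial, created date NOT NULL, payload text);
        CREATE TABLE events_2018_01
            (CHECK (created >= '2018-01-01' AND created < '2018-02-01'))
            INHERITS (events);
        CREATE TABLE events_2018_02
            (CHECK (created >= '2018-02-01' AND created < '2018-03-01'))
            INHERITS (events);
    """)

    # A trigger routes inserts on the parent into the right child.
    cur.execute("""
        CREATE FUNCTION events_insert() RETURNS trigger AS $$
        BEGIN
            IF NEW.created < DATE '2018-02-01' THEN
                INSERT INTO events_2018_01 VALUES (NEW.*);
            ELSE
                INSERT INTO events_2018_02 VALUES (NEW.*);
            END IF;
            RETURN NULL;  -- row already stored in a child table
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER events_route BEFORE INSERT ON events
            FOR EACH ROW EXECUTE PROCEDURE events_insert();
    """)
    conn.commit()

Pg10's declarative partitioning replaces all of that boilerplate with PARTITION BY RANGE, which is exactly the kind of well-engineered version that's worth waiting for.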
> - You think people replace a MongoDB cluster with a single Postgres instance? You should really use HA clusters in real life and stop reading the Reddit/HN hype behind PG. With 3.5M+ CCU, no one would use an architecture with a single master/slave (which is what PG is).
I shipped a game with similar CCUs (within an order of magnitude), and I can confirm that you can't do it with one PostgreSQL machine. Or, actually, you could: we chose to fsync() constantly to prevent corruption from ever happening and removed the RAID cache. You can also shard on top of your database solution.
I feel I need to quote the post-mortem back at you so you can point out where I misread it.
"Our top focus right now is to ensure service availability. Our next steps are below:
Identify and resolve the root cause of our DB performance issues. We’ve flown Mongo experts on-site to analyze our DB and usage, as well as provide real-time support during heavy load on weekends."
You're implying they couldn't fix MongoDB or had reached its limit, which is false. In the current (HN) post they said they fixed it. I'm pretty sure they didn't have much experience with DBs in the first place, hence why they asked for help.
Nowhere in the original post do they mention issues related to MongoDB itself; it was probably bad design on their side.
OK, I should have been clearer about my interpretation: I read 'flying in experts' as meaning they flew in experts from MongoDB and stumped them, which had me thinking that maybe this was not solvable if they were stumped. Earlier in this thread one of the engineers from MongoDB says Epic solved the issue but had not updated the blog, so I was wrong about that.
> mentioned in passing they had stumped all the experts they could find to look at unsolvable issues they had with MongoDB
That's not really what the article is saying, unless we interpret the following text differently.
"We’ve flown Mongo experts on-site to analyze our DB and usage, as well as provide real-time support during heavy load on weekends." -> "We have started to look into the problem together with experts" and not "Experts have tried and failed".
JSON support has been great in MySQL since 5.7, a few years now, and recursive CTEs are coming in the next couple of months with 8.0. I don't think you can make a wrong choice between the two these days, but choosing Mongo over either is almost always the wrong decision.
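For anyone who hasn't used them, a recursive CTE walks a hierarchy in a single query. A minimal sketch of what that looks like once 8.0 lands (the `categories` table and the mysql-connector-python driver are just assumptions for illustration):

    import mysql.connector

    conn = mysql.connector.connect(database="app")  # hypothetical connection
    cur = conn.cursor()
    cur.execute("""
        WITH RECURSIVE tree AS (
            SELECT id, parent_id, name, 0 AS depth
            FROM categories
            WHERE parent_id IS NULL            -- anchor: the root rows
            UNION ALL
            SELECT c.id, c.parent_id, c.name, t.depth + 1
            FROM categories c
            JOIN tree t ON c.parent_id = t.id  -- recurse one level down
        )
        SELECT id, name, depth FROM tree ORDER BY depth, name;
    """)
    for row in cur.fetchall():
        print(row)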
This is actually where I think the biggest power of JSONB lies.
I can, for example, use jsonb_agg() and get a hierarchical response for 1:N joins. It returns a JSON value even though none of the columns contain JSON.
Previously, in that scenario, I would either need to make more than one query or get a response with a lot of repeated data.
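A minimal sketch of the pattern, assuming hypothetical `authors` and `posts` tables (psycopg2 is used just for illustration):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection
    cur = conn.cursor()

    # One row per author, with that author's posts aggregated into a JSON
    # array, instead of the author columns repeated once per post.
    cur.execute("""
        SELECT a.id, a.name,
               jsonb_agg(jsonb_build_object('id', p.id, 'title', p.title)) AS posts
        FROM authors a
        JOIN posts p ON p.author_id = a.id
        GROUP BY a.id, a.name;
    """)
    for author_id, name, posts in cur.fetchall():
        print(name, posts)  # psycopg2 decodes jsonb into Python lists/dicts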
Another way of reading that post is that MongoDB is being used for some of the highest-throughput concurrent workloads out there... and those are always hard to optimize. Doing a lift-and-shift to a "grass is greener" alternate solution is not a clear-cut path to victory at all... but it's certainly a giant science project to contemplate.
Yes, I kind of wish MongoDB had come out and said what they are doing to help Epic Games here, whether this is something meant to address that issue, and what the plan/thoughts are on the most newsworthy usage of MongoDB in a while.
I am one of the team of MongoDB engineers working with Epic on this issue, and I can assure you that the situation is under control and we have everything in place to scale this application to much higher numbers. However, we're not publishing details about our support cases, especially while they are in progress. That is something for Epic to decide, and I do assume they will eventually say in public just how well MongoDB is, in fact, performing for them.
Not sure if you are aware but what you are asking for never happens.
It is not professional or appropriate for vendors to be revealing (a) that clients are having issues and need support and (b) the specific workings of technologies or processes within the client's business.
There's a whole 'nother generation of devs coming through who have never been burned by MongoDB though. Obviously they will be eventually, but by then another generation will come along to repeat the cycle.
I love deriding MongoDB as much as the next dev who hasn't used it much, but I'll just note that, while I'd still be hard pressed to prefer MySQL over Postgres, there was a long period where MySQL was put to tasks it was ill suited for, especially prior to around version 4.x.
So while "hype first" might reap a deservedly abundant and bitter harvest of developer hatred, it doesn't preclude evolving into a genuinely useful product...
Both true, although I can completely understand why devs went with MySQL over PostgreSQL at that time. I remember that during the same period that MySQL was drawing seemingly endless criticism for generally poor RDBMS behavior (3.x and 4.x), PostgreSQL was notorious for having poor performance due to insanely undersized default settings. Out of the box it was sized to run with at most 10 MB of RAM or something similarly unrealistic.
I also remember it had a lot of quirks and missing features prior to v8. I assume it was leftover cruft from Ingres, but I remember PostgreSQL v6 and v7 being unreasonably complicated to get configured just because the defaults were so far from reality.
One thing you can say about PostgreSQL, though, is that its developers don't rest on their laurels. Every major release packs in a ton of new features. They've gone from being fairly low or middling on the feature set to being pretty near the top. Even point releases have me saying, "Wow, that's really nice to have."
I used it about 3 years ago and my first thought was "How broken will multi-document ACID transactions be?"
I still want to like MongoDB, I still miss its style of query vs SQL, but I'd have a hard time advocating its use again...
Sometimes it's tempting to use it for projects that I know will remain small, but even then it's not worth the overhead of standing up a different DB when I have a perfectly good SQL server I can muddle through already.
We are in the process of evaluating CockroachDB vs Rethink internally, and we've found CockroachDB to perform very poorly without obvious disk or CPU issues. I'm curious if you've seen anything different, especially as it relates to Rethink.
I didn't do comparisons. But RethinkDB has straightforward issues, like a slow QL implementation that uses a lot of CPU, and high disk space usage. Changefeeds have a few scaling issues if you want a lot of them. I don't know that it has mysterious kinds of excessive resource usage. I'm a RethinkDB dev, not an end user, so I might be seeing the worst side of it. I haven't used Cockroach or TiDB.
I work in a Danish municipality. Traditionally we've built everything on SQL because it's the world we function in, but we adopted the MEAN stack as a proof of concept a few years back and Mongo has been growing ever since.
It does require building and maintaining schemas in a different manner, but when you do that, it's pretty great to work with, especially when we're doing design-driven development that consists of a lot of prototyping.
I'm a fan, but I'm a manager of business development and digitisation, so I may be a little sheltered from whatever annoyances it may cause in operations.
I am curious why a municipality needs custom software. I mean, the Scandinavian countries had standardised paper forms for most municipal tasks (population register, ledgers, etc.) as early as the 17th-18th centuries, and those were used nationwide, or at least throughout a single province. Why can't the same be done with software?
Well, there are 98 municipalities, each operating in a thousand different ways.
I've worked on quite a few multi-municipality open source projects, like handling employee reimbursements for driving.
Basically, I drive x kilometers for a meeting, I get paid accordingly, and the taxman gets the report. Simple stuff.
Well, among the 6 parties involved there were 6 ways to interpret tax laws, 4 different agreements with unions on what rates to pay, 3 different payment systems with 3 very different ways of taking the reported data from a flat file to a REST interface, at least one political decision to overrule tax laws for a certain set of employees, several different ideas on how to host it and do single sign-on, oh, and 4 different ways to obtain employee data.
That's for a simple system with basically 1 function. We have more than 350 IT systems.
Another example is in automation. We have scanner software and we have an archiving system. They both have APIs, but the APIs speak very different languages. This meant that our local scanner people were tasked with distribution after they scanned things, a task taking several hours each week, because putting files into many different areas of an archive sucks. What we did was ask the scanning company to build a QR reader into their software, and then we made a piece of software that put the archive recipient addresses into QR codes. We also made a MOX agent that accepts the output of our scanning software and loads it into the archive through the API. So now the process of distributing is automated.
You can certainly run a municipality without developers, using standard software and outside hires, it’s just really expensive.
Would it be fair to say that the political entity one step above the municipalities (whatever that is in Denmark) are not doing their job? I mean not doing their job on standardising things that can be in common between the municipalities. Some things will of course have to differ, but a lot of stuff likely differ just because not-invented-here. It sounds like the legislative environment is too complex, and that you have to work around it with a ton of software. Could it even be the case that computer systems have somewhat removed the incentive for the administration to rationalise the various systems? With just manual labor and typewriters all of that would have been very expensive, but with a server hall and a medium-size IT-team it kind of works out.
Perhaps digitalisation only having come half-way is a factor - you mention scanning, but by now the so called "paper free office" that was a buzzword in the 1990s should be here already. Or is it perhaps just another sign that the IT industry overall is still very immature and this will sort itself out with time?
I think it's too complicated to blame anyone, really. I mean, we are working on standardising as much as possible, but it's often impossible because business practices are just so different. Often big standard products fall extremely short, or end up as complete failures, because you can't jam people into boxes on an enterprise scale, especially not when the people who build the systems have next to no domain knowledge and the people who write contracts have no technical knowledge. :)
I guess our government should work on writing laws that are more friendly to digitisation and stop expecting IT to fix business practices that don't really make sense in the first place. There has been a genuine movement toward that, but it's slow because none of our top politicians or bureaucrats are from technical fields, and they operate on such a high strategic level that they're often rather far from the daily challenges in a daycare institution.
Local political leadership and bureaucracy could certainly do more to focus on cooperation, standardisation and digital transformation, and they actually do, but political views differ and they change every 4 years, and the truth is that there just isn't any voter interest in IT unless it goes wrong.
We're trying to build national standards; we've had a set of architectural standards called Rammearkitekturen for a few years now, but getting them implemented is slow. For one, they're made by municipalities, and our structure of government is split in three: municipalities, counties and the state, and each branch has its own ideas, leading to bureaucracy and political differences. Some want us to use EU standards, others want us to build our own, and even if we decided, there are different sets of EU standards as well as different sets of Danish standards.
I personally think the best we can do is try to use whatever national standards are in favour, build smaller applications on them with open APIs, and run everything as SaaS on infrastructure such as AWS or Azure. I also think we should do a lot more work on business development, modifying business practices before we throw IT at something.
But it's complicated, and it's on a giant scale where even minor changes take years to implement.
I used it quite happily circa 2009 and at a different company in 2014. In both cases it was being added to systems that already had mature functionality built atop a RDBMS. In the first case it was used to store events that had started to overwhelm the main RDBMS with write volume. (Originally a system with one database as the monolithic data store.) Probably Kafka would have been even better for this use case, had Kafka been available at the time. But MongoDB did the job very well. I did a prototype in Cassandra too before settling on MongoDB, but MongoDB had much better docs, drivers, and single-node read performance at the time.
The second time I used MongoDB to automatically track templated email bodies that were being delivered through a third party mail platform. We had dozens of recurring templates and many more one-off templates for different curated campaigns. If somebody complained that a link or image or token was wrong in their email, we wanted to be able to look back at the history to see if the problem was in the template data or potentially a client issue on their side. Most of the queries were ad-hoc and not very performance-sensitive. This was where a flexible JSON document format came in handy. Modern Postgres would have worked well for it too, but that wasn't available in the company at the time. With MongoDB I got good flexibility, adequate speed, and I avoided reinventing wheels by not trying to shoehorn the data into another MySQL table. I was able to solve a customer support pain point in less than a week and the system has worked well for nearly 4 years now.
I'd be really frustrated if I had to use MongoDB as my only data store. I would guess that much of the hate for it comes from people who were forced into that position, or maybe from people who didn't take its documented limitations seriously enough before productionizing its use.
I don't know much about MongoDB. I am mostly a client-side developer after all.
But every time I see a team transitioning from Mongo to something else, they transition to a relational database. Maybe their problem is not with MongoDB, but that their data is relational after all?
Personally, I'd take a relational db over NoSQL for most of my needs, but all these stories don't really say anything about how Mongo compares to other NoSQL databases.
That's because nearly all data is relational. At first it seems like you don't need a relational database; in fact, NoSQL seems easier to use.
As your data grows, though, you realize that your application becomes more and more complex. A single query might translate to multiple queries to the database, you need to handle scenarios where fields might not exist, etc.
With relational data you might have more work up front, but then the database solves many of the problems for you.
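A small sketch of what I mean, with hypothetical collection and field names:

    from pymongo import MongoClient

    db = MongoClient().app  # hypothetical database

    # With no schema, every read must defend against absent fields:
    # older documents may simply predate the `email` field.
    doc = db.users.find_one({"_id": 42})
    email = doc.get("email") if doc else None

    # And fetching a user plus their orders is a second round trip,
    # where SQL would be one JOIN with NOT NULL guarantees.
    orders = list(db.orders.find({"user_id": 42}))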
As another person said, when you're using databases like MongoDB you're going back in time and reliving history, because databases in the past looked a lot like that before Codd invented the relational model, for example [1].
Also, the whole NoSQL thing seems to be cyclical; we had XML databases in the early 2000s[2].
Or maybe it's useful to get started with a low-overhead, easy-to-implement DB like Mongo, and then as you grow larger, spin off the uses it doesn't serve well to other, more specialized and complicated DBs?
One of the biggest problems with relational DBs is that once you decide on a schema, if it's the wrong one, you're gonna be in a lot of pain. Which makes a NoSQL DB a great fit for an early-stage product where you are still figuring out what your product needs to do and contain. Once you have some more experience with it, and have a better understanding of your data, it's far easier to build the correct relationships.
> Or maybe it's useful to get started with a low-overhead, easy-to-implement DB like Mongo, and then as you grow larger, spin off the uses it doesn't serve well to other, more specialized and complicated DBs?
Not really, converting to relational data is quite a bit of work.
Actually, the reverse is the correct approach. You start with normalized data; when there's a bottleneck you start denormalizing it; if that's still not enough, you move a /subset/ of the data to a NoSQL database.
> One of the biggest problems with relational DBs is that once you decide on a schema, if it's the wrong one, you're gonna be in a lot of pain.
Not really; in my experience all migrations were done through SQL. Also, if multiple people (who understand relational databases) come up with a schema, they'll pretty much arrive at the same normalized result.
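For example, a typical migration is a couple of SQL statements in one transaction (table and column names are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection
    cur = conn.cursor()

    # Rename a column and add a new one; when this commits, every row
    # conforms to the new schema -- no "multiple schemas" in queries.
    cur.execute("ALTER TABLE users RENAME COLUMN mail TO email;")
    cur.execute(
        "ALTER TABLE users ADD COLUMN email_verified boolean NOT NULL DEFAULT false;")
    conn.commit()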
Yep, this is me. I would require some pretty amazing reasons to even consider using Mongo again, especially now that all the relational databases I trust support JSON column types.
> However, I do not know any developers who, after living through the "hype first, features later" strategy, have been left with a positive enough opinion of MongoDB to ever want to use it again.
A new crop of developers is always just a year away, though. I feel future adoption depends a lot on how well-suited the tools are for younger devs. That's where MongoDB found its initial audience!
> MongoDB has successfully played the 'hype first, features later' strategy. Now it is well on the way to being a decent Swiss-Army-knife database.
I was going to say that I won't believe that it is on its way to being a decent database until after an article appears on https://aphyr.com/tags/jepsen saying that MongoDB actually delivers on what it claims.
So I looked for the most recent analysis of MongoDB and found https://jepsen.io/analyses/mongodb-3-4-0-rc3. I still want to see verification of the latest release, and hear battle stories from it in production. But I'm provisionally optimistic that a lot of the glaring "it is a pile of shit that doesn't work when the chips are down" issues are now addressed.
That said, I bet that it will be many years before most people who got burned by MongoDB ever rethink their attitudes about it. Once burned, twice shy. And it really was an overhyped steaming pile of shit for a very long time.
I have used MongoDB in production for a number of Fortune 100-sized companies. It has always been a unique database, ideal for scenarios where your data model was document-oriented.
> was an overhyped steaming pile of shit for a very long time
No it wasn't. This is something you heard from people who never really used it. It had its faults but it was never a pile of shit nor was it substantially worse than other databases.
> This is something you heard from people who never really used it
I used it at a previous job. The project was to move a multi-terabyte dataset from an Oracle box (24 CPUs, 24G RAM, SAN) to a MongoDB cluster (10 boxes, each with 48 cores, 96G RAM and internal SSD). MongoDB couldn't perform for shit, and it couldn't stay up in a usable state for more than a few hours at a time. This is with 20x the processors and 40x the memory of the system it was replacing. It's a complete joke of a product, sold on the basis of outright lies as far as what they told us versus what it could actually do. Having been that badly burned, I consider it an act of selfless public service to warn people off it.
If you're just using it for a personal blog that gets 10 views a day, sure it might be barely adequate for that. But I'd still use Postgres.
See, this is the sort of crazy thing I used to see people do and wonder why they had problems. MongoDB is a document database. You can't just take relational database tables, move them across, and expect it to behave the same. And frankly I don't feel sympathy for bad engineering practice. You don't do system migrations without fully testing and understanding all of the systems.
But for those of us who had document-oriented data models, it allowed for performance that was orders of magnitude faster than any SQL database.
If you never tracked what happened in production, it may have worked most of the time well enough that you never saw how bad it was.
But read https://aphyr.com/posts/284-call-me-maybe-mongodb for an idea of how the promises in MongoDB's documentation compared to the reality of the software under stress. And it wasn't just hypothetical either: there are plenty of horror stories floating around from people who ran into those problems in production for use cases that were supposed to be a fit for MongoDB.
And the performance argument didn't hold water either. As benchmarks like https://www.enterprisedb.com/node/3441 showed, decent relational databases consistently beat MongoDB on the same hardware. Yes, lots of people rewrote bad relational models and saw performance improve. But apples to apples, writing an application against a relational database in the same way you would against MongoDB resulted in a win for the relational database.
So yes, there were lots of people saying exactly what you are saying now. But the ones who actually tested their systems and ran performance tests came to a very, very different conclusion.
Again: I have been personally involved in the deployment and support of MongoDB clusters for very large datasets at very large companies. It does work if you use it for the right task, i.e. highly nested data, not relational data. And let's be clear: if MongoDB were unusable, the company wouldn't still be here, as successful as they are.
That EnterpriseDB link is completely ridiculous. Firstly, it predates WiredTiger, which replaced the entire storage layer. Secondly, doing one-for-one comparisons with relational systems doesn't make sense. MongoDB is a document database. Compare it with other document databases.
From your link, go to https://newbiedba.wordpress.com/2017/11/27/thoughts-on-postg... for the followup he wrote after those benchmarks. When he ran his benchmarks, he indeed got better throughput on MongoDB. But the 99th-percentile performance was massively worse, in fact slow enough to be unacceptable, to the extent that he concluded you'd be better off using PostgreSQL.
And he's right. As pages like http://latencytipoftheday.blogspot.com/2014/06/latencytipoft... make clear, a single page view involves a lot of calls back to the application. Users will notice the occasional slow load surprisingly quickly, and it is worth a lot to get rid of them.
So even your chosen source agrees. A relational database is not orders of magnitude slower. In fact, a relational database is probably a better fit.
> See, this is the sort of crazy thing I used to see people do and wonder why they had problems.
Financial time series data is exactly one of the use cases Mongo claimed to be for. Seems you’re the one who can’t tell good engineering practice from bad. And yes, they also pitched themselves as a direct replacement for Oracle. That was highly disingenuous.
> No it wasn't. This is something you heard from people who never really used it. It had its faults but it was never a pile of shit nor was it substantially worse than other databases.
This is FUD. I have used MongoDB; I even have a certification in MongoDB.
Unless you know precisely what you're doing, it's very easy to burn yourself. And Mongo markets itself as being "easy to use out of the box", which is not a good thing to do.
I consider MySQL's defaults to be unsafe (as in, it used to corrupt data silently), but it's a godsend compared to the data consistency in MongoDB.
There are countless promises it fails to deliver on too; I will not, ever, recommend it for a project. However, in recent months I've heard it got better. This means I will stop deriding developers who now use it, but it does not mean I will realistically allow its use in the environments I work in. I tend to care about data consistency in those.
Most people that “get it right” the first time around do not get any recognition whatsoever.
It is the people that screw up, release with big flaws that the customer then pressures the company about, that are heralded as heroes and bacon savers when they fix those flaws. After 3 years and as many releases.
That is true in life in general, not just the workplace.
Nobody cares about people who are healthy all their life. But someone who suddenly realizes they need to eat better and exercise, and does, is applauded. They are defended, too, if they go back to old ways. And so on...
> Most people that “get it right” the first time around do not get any recognition whatsoever.
I've heard similar complaints before. And I get it, too: at a glance, that person is playing the "superhero" by saving the project. But good management will insist on root-causing failures, and that's where this will unravel. If it's a recurring problem, you should bring it up with management.
My biggest gripe as a "lateral manager" (I don't manage engineers, I manage products) is that I see those things happen all the time, and I spend as much time as I can coaching developers to interact effectively with their managers. It's frustrating when I see people who should know better (because I know they heard me) not taking notes about serious issues they want to discuss with their superiors, not knowing how to escalate issues that threaten the well-being of the product or the team but that their direct superior doesn't believe are urgent, etc.
Developers complain about management but tend to forget that managers are people just like everyone else, and we need to apply some skill to our interactions if we are to get the results we desire.
This completely squares with my experience as well: a lot of complaints about management are hollow because developers aren't managing upward correctly. Their follow-up on their issues is missing, or non-actionable.
Do you have any resources you've found helpful for improving your skill at this?
> But good management will insist on root-causing failures, and that's where this will unravel.
You can have management who understand tech and will get to the bottom of the problem, and you can have management who don't understand tech. They won't.
Management who don't understand tech will either keep somebody on hand who they know and trust who does understand tech (e.g. a consultant) or, more likely, they'll just keep rewarding the faux superheroes who keep screwing up and bailing themselves out.
I'd say that good management needs to understand how their subordinates think and operate, even if they haven't played their exact role (e.g. engineer). The best managers that I've worked with, both lateral (e.g. PM) and direct (e.g. EM), take the time to get familiar with engineering processes if they don't know about them already and speak their language.
> It is the people that screw up, release with big flaws that the customer then pressures the company about, that are heralded as heroes and bacon savers when they fix those flaws. After 3 years and as many releases.
There's going to be a ton of survivorship bias even with them. It just goes to show that big marketing budgets are a competitive advantage that can outweigh not actually being any good.
I'd seriously like somebody with a passing knowledge of data integrity who believes the tech industry is meritocratic to explain what they think the success of Mongo is all about.
On the other hand, PostgreSQL is a very good example of a successful implementation of the opposite strategy, that is, "correctness first".
And since PostgreSQL fills that niche very well (correctness + real ACID + extensibility + decent performance), maybe it was really PostgreSQL that killed RethinkDB?
If you're playing the long game and not looking to make a profit that's fine, but PostgreSQL as a company would have been doomed a long time ago. You have to keep in mind the timelines of the business and what they need to do to keep the lights on.
MongoDB has identified a real pain point: many developers don't like to use SQL to interface with a transactional database. I'm not going into the merits of SQL vs. NoSQL, I'm just stating that it's clear there's a need or they wouldn't have gotten any traction.
Now that they are maturing the product to the point where it might be a safe bet for some use cases, it remains to be seen whether their approach to product development will pay dividends or whether the reputation they have created for themselves is a time bomb that will eventually kill them.
"PostgreSQL as a company would have been doomed a long time ago"
PG has astonishing feature throughput. With each yearly release, they add 1-3 wow features, 6-10 major features, and countless smaller features still worthy of the release notes.
That's really, really impressive for any database, commercial or otherwise.
There's a perception that postgres is slow to add features because sometimes the feature latency is high. The reason for that is they build a solid foundation first, and slowly build multiple major features on top of that foundation. Consider replication:
That's a lot of engineering work there, but they delivered value to users at each stage along the way. And during this time, they did a ton of other stuff -- did you notice that we got parallel query along the way? And logical table partitioning came along too, which means the parallel query can now do partition-wise parallel joins.
Not to mention all of the SQL features and tons and tons of other stuff.
Postgres has kept the lights on for a lot of companies for a long time. I absolutely reject the idea that good engineering is at odds with business success.
> Postgres has kept the lights on for a lot of companies for a long time. I absolutely reject the idea that good engineering is at odds with business success.
I don't think they're at odds, per se, but having been around through the original dotcom bubble, PostgreSQL (or "Postgres95," as I'm pretty sure it was still called when I was introduced to it!) was mostly known to, well, database nerds for at least the first decade of its life. One person's "solid foundation" is another person's "technically correct but practically crawling" -- a perception that, rightly or wrongly, PostgreSQL fought against for a very long time. And I think that's what OP was trying to get at: if PostgreSQL was being developed primarily by a single VC-funded company, they just might not have had the luxury to spend years building that solid foundation.
(I'll allow that as an ex-RethinkDBer, I may have some bias here: I loved many things about the product, but it's hard not to suspect we should have focused on speed and, y'know, revenue earlier than we did.)
MongoDB supports 1) 2) 6) 7). Not sure what 3 and 4 are, but you can just add a new node and new data will be copied over; no need to restore data from a snapshot, though you can restore from a snapshot too, which shortens the time until the replica becomes available.
Not sure what you mean by 5) though.
Anyway, replication is a strong point of MongoDB thanks to the oplog, and I don't think Postgres can beat it.
There are several companies that are leveraging PostgreSQL for their own businesses, but that doesn't seem to me to be a rebuttal of the OP's assertion that PostgreSQL couldn't survive as a company itself. Citus Data is not "PostgreSQL as a company," it is "a company that exists because PostgreSQL already existed."
Well, the majority of key Postgres contributors work for 2ndQuadrant, EnterpriseDB, Crunchy Data, Citus, etc.
It basically means PostgreSQL is a distributed company, and it would survive fine; being distributed, it seems able to innovate faster and be more resilient.
> It basically means PostgreSQL is a distributed company
I would argue that it means that different companies using PostgreSQL help fund PostgreSQL development. That's not the same thing as being a single company. It's a model which clearly works very well for PostgreSQL, but it doesn't really give us good data on whether the "single company doing closed source development" (e.g., Oracle) and "single company driving the bulk of open source development" (e.g., MongoDB) models would have worked as well for them.
There seem to be two points in this comment: one about the development of PostgreSQL and the other about its usability.
PostgreSQL remains one of the most mysteriously difficult common DBMSs to set up, which is unfortunate, but since the advent of MongoDB it has adopted all the ease-of-use features that are warranted from it. Developing a quick-and-dirty product prototype on Postgres is a breeze, and bootstrapping constraints and data integrity onto it afterwards is trivial. I really don't see any reason to start a new app on MongoDB exclusively at this point. Start off in a strong DBMS like Postgres, and if you end up needing MongoDB-style document storage you can always branch out to it later; using it initially is a case of premature optimization, and there is no need for it.
The problem isn't the schema; it's that you must have exactly one at all times. Sometimes you need zero, sometimes you need many. Having a fixed schema in production reduces unpredictability and provides optimization opportunities. The journey to get to that fixed schema, however, generally benefits from more flexibility.
I agree with the lessons in Worse Is Better, but I don't think that the author properly understood what he was observing. The result was a confused and confusing essay.
The way I understand it is that what is "Good" depends on how you measure it. When we measure in terms of technical quality, we get one answer. When we measure in terms of suitability for wide adoption, we get a different answer.
We tend to idealize technical quality, but popularity is what matters more. And once something is widely enough adopted, the technical inferiority tends to be fixable.
I was badly burned by Mongo hype back in the day, and as a result I won’t touch it with a 10-foot pole for the rest of my life, no matter how many times people say “No, really, it’s good now”. Falling for that was how I got into trouble in the first place. I know a lot of other devs like this.
If they can be successful despite us, more power to 'em, I suppose. I'm a little annoyed that their path to success was built on the flaming wreckage of so many products that fell apart because of Mongo, by using us as their beta testers instead of building a non-shitty product, and I'm at least going to get this comment in so we aren't completely forgotten amid the congratulations.
I agree here, but I'd go further: build things people want, not the idealistic someday version where we eventually get to a priority feature for a lot of people, like releasing fast, scalable software quickly (I'm not saying Rethink didn't do this, but they prioritised correctness and sharding, features fewer people need). For most apps built with Mongo, this transaction support isn't a problem (until it is).
The TCP/IP stack was built and used while the OSI model was being designed, and it won all the mindshare. Perhaps it would have been better to have separate presentation and session layers, but we don't; the application layer handles that stuff. It works well enough.
OTOH, this quote is wise:
> It is easier to optimize correct code than to correct optimized code (Bill Harlan)
I think this is doubly true for databases; at least with obfuscated code, you can recover the underlying meaning with work and exploration.
Losing or corrupting data is the worst thing a database can do. Given "this will be correct and hopefully we can scale it" vs "this will be fast and hopefully we can keep it correct", I'd choose the former for any "source of truth" data every time.
There are tricks for speeding up queries: indexes, caching (including materialized views), sharding, read replicas, etc.
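The first two are one-liners in Postgres, for instance (table and view names here are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection
    cur = conn.cursor()

    # An index speeds up lookups without touching correctness.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_user ON orders (user_id);")

    # A materialized view caches an expensive aggregate; refresh on a schedule.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_sales AS
        SELECT created::date AS day, sum(total) AS revenue
        FROM orders GROUP BY 1;
    """)
    cur.execute("REFRESH MATERIALIZED VIEW daily_sales;")
    conn.commit()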
> I think this is doubly true for databases; at least with obfuscated code, you can recover the underlying meaning with work and exploration.
True for databases, but not true for businesses.
> Losing or corrupting data is the worst thing a database can do.
Clearly people building simple CRUD websites with slick JS features didn't agree; otherwise Mongo would be gone and Rethink would be worth hundreds of millions of dollars.
> Clearly people building simple CRUD websites with slick JS features didn't agree
I doubt it's that they didn't agree, it's more likely that the thought simply never occurred to them.
Mongo's marketing is directed with laser-like focus at the beginner developer seeking out tutorials to build a website, etc. Questions about data consistency simply never arise in that context.
Later on that developer who was gently guided towards using mongo by all of the slick marketing will likely try to defend their decision when somebody attacks it ("their data consistency problems aren't that bad" or "data consistency isn't that important"), but that's something else.
> build things people want, not the idealistic someday version where we eventually get to a priority feature
FWIW, I don't think this is what happened. RethinkDB started out as an SSD-optimized database, quickly repositioned itself (due to "is this what people want") to something more generally useful, and was one of the most feature-rich databases at the time, I thought.
MongoDB, however, got first-mover advantage and the pile of cash that comes with it. They could afford to invest heavily in developer evangelism. Then they bought WiredTiger. If I sound bitter, I am a bit: not that Mongo did well in the end, but that RethinkDB went the way it did.
Clearly the evidence bears out the success of that strategy, but it's hard not to summarize it as, "Apparently a lot of developers, applications, and users don't need a database that works." But I don't know what that's really an indictment of exactly.
> I have to grant them a certain respect for executing their strategy so successfully.
But be careful not to conflate your respect as a business strategist with your judgment as a mindful developer. To speak clearly: by systematically exploiting a weak spot of ours [1], they have used countless small teams as stepping stones to sell their business contracts to large players, while hurting a lot of those small teams with a product that was (at the time) inappropriate for their needs. And even at these huge costs, they still made a product that is inferior to one that was designed properly.
As a community (both as startups and as a developer community) we should resent these tactics and try to find ways to protect ourselves against players that abuse the common good of mindshare. And lest you say that this is the price you have to pay to get a product like MongoDB at all in harsh business environments: we could also lobby for open source funds that are organized like research funds, producing fundamental technology that benefits everyone. Not every technology fits the model of for-profit startup innovation.
[1] Our community has very few defenses against marketing that comes from our midst, aiming to produce the (false) impression that a disproportionate number of our fellows have evaluated the product and found it to be excellent. See https://www.nemil.com/mongo/3.html for a discussion about MongoDB specifically (HN thread: https://news.ycombinator.com/item?id=15124306 )
With all its problems, I built a MEAN (MongoDB, Express, Angular, Node) app from zero knowledge to production 2 years ago far faster than this React, Apollo, GraphQL, and Postgres app I'm building from zero knowledge.
Speed isn't always a great thing... If it's 2x faster to build but requires 10x the support/maintenance after the fact, and eventually you need to migrate to Postgres anyway because of ACID features and stability... then the time/money loss > benefits.
Build something the right way first, even if it does take longer. Though I use an RDBMS (MySQL or Postgres) all the time with an ORM, and the ORM does most of the heavy lifting (Laravel/Eloquent in my case), so I still develop pretty rapidly. I'm sure if you use PG + React on multiple projects, eventually the speed to launch will increase...
It is honestly a nightmare. I had to decide which framework to invest in learning with very limited funds. At the time, the big choices were Angular, which was established and backed by Google, and React, which was still very new with a much smaller community. I went with Angular, and by the time I learned everything I needed, everyone wanted to hire React developers. Running out of money, I ended up selling all my belongings, moving to a new city, and doing Backbone.js development. I've been working on learning React for the last several months without earning money, and it is more difficult to learn than either Angular or Backbone because it isn't as opinionated, driving me to have to learn each tool to decide which is best. My mind craves structure. Meanwhile, most developers have 2 years of React experience on me. I figure if I had waited 6 months to learn JavaScript frameworks, React would have been the better choice and I would have been far better off. In a way, the MEAN stack screwed me.
The flip side of 'hype first, features later' is that if you are a user who got burned by MongoDB (or another solution), you'll recommend against using it for a long time. So there's a knife edge to balance on: hype enough, but not so much that too many people get burned.
While MongoDB might be a decent document store, I found that Elasticsearch is better at this job (as a secondary datastore). Its aggregation capabilities are just far better than MongoDB's, with the added bonus of being really good for all kinds of searches.
I also quoted Rethink's post in "The Marketing Behind MongoDB" in part 3 of my series on MongoDB:
> I sympathize with RethinkDB's team — they did what thoughtful engineers are trained to do. Engineering purity and humility is a tiny part of building a sustainable, venture-backed company.
Despite their claims to the contrary, RethinkDB also released a version of the product that was pretty broken and claimed it was 'ready for production use'. I used it heavily and hit numerous serious bugs. The RethinkDB devs did do a very good job of tracking them down and fixing them. Software is hard.
It was unfathomable to us why people would choose a system that barely does the thing it’s supposed to do (store data), has a big kernel lock, throws away errors at random, implements single node features that stop working when you shard, has a barely working sharding system despite it being one of the core features of the product, provides essentially no correctness guarantees, and exposes a hodge-podge of interfaces that have no discernible consistency or unity of vision.
I mean... that's unfathomable to me too. He explains it later: "MongoDB turned regular developers into heroes when people needed it".
I have a hard time understanding why devs choose / chose MongoDB. Postgres with JSON columns gets you so far; why would you go with MongoDB, given the issues it's had?
> I have a hard time understanding why devs choose / chose MongoDB. Postgres with JSON columns gets you so far; why would you go with MongoDB, given the issues it's had?
JSONB is a pretty recent addition to Postgres when compared with the MongoDB timeline. And even today Postgres still doesn't have the replication/failover story that made MongoDB pretty compelling. I know, it's coming, whatever, but the point is that there was a time when, if you wanted a JSON store that could stay alive through network issues, MongoDB was one of the only choices available, and Postgres simply didn't have what was needed.
The problem with that thinking is the assumption that the replication mattered. I'd argue it didn't; it was essentially a scam that people fell for. Who cares about failover when you're losing data due to a bad implementation? Who cares about replication when you can gain the same performance by using a performant database on a single node?
Did MongoDB truly allow anyone to really scale horizontally? Most places that need massive horizontal scaling use something like MySQL, as far as I know.
I loved working with RethinkDB, and the changefeed stuff was awesome. It gave me relational documents, which is all I wanted for most projects. Bummed that the project has basically slowed to nothing.
A database which utterly fails the Jepsen test should not be considered for production. It might be good enough for a cache, but trusting it with real data is reckless.
Most software developers have a negative impression of MongoDB, based on the many flaws it had back in 2010. Among the people who did the best job of documenting those flaws was Kyle Kingsbury, in his Jepsen series:
But it is important to realize that the team at MongoDB has actually been working with Kingsbury, for several years now, and they have slowly and patiently fixed the problems he identified. Consider how the situation had evolved by 2017:
MongoDB 3.4 Passes Jepsen – The Industry’s Toughest Database Test
Jepsen Evaluation Demonstrates MongoDB Data Safety, Correctness & Consistency
On February 7th, 2017, Kyle Kingsbury, creator of Jepsen, published the results of his tests against MongoDB 3.4.1. His conclusions:
"MongoDB has devoted significant resources to improved safety in the past two years, and much of that ground-work is paying off in 3.2 and 3.4. MongoDB 3.4.1 (and the current development release, 3.5.1) currently pass all MongoDB Jepsen tests….These results hold during general network partitions, and the isolated & clock-skewed primary scenario."
MongoDB has become an excellent document-store database. If you are still repeating FUD from 2010, then you are simply out of date. It's time to come up to speed on the reality of 2018.
I'm one of those people who come in and comment about MongoDB positively. My first "stack" was LAMP. In 2012, when I learnt how to use MongoDB, it was:
- a JSON store that works well with NodeJS
- a geospatial database (yea, it has what I need) that was "easier" to work with
- a database that made it easier for me to change my schema (if you just throw data at it, garbage in, garbage out).
Over the years, I've followed development, and adopted new features to make my life easier.
- I was one of the people who were excited about full GeoJSON support in the 2.* days, because that's something I depended on.
- I've tailed the oplog for as long as I can remember (never needed Redis), and have been learning about change streams (announced in 3.6; see the sketch after this list) with the hope of submitting a PR to Apache Beam to support them.
- I adopted a lot of the aggregation framework (Asya Kamsky from MongoDB personally helped me a lot)
- The last time I migrated data from MongoDB was when I turned on WiredTiger in 3.0.*
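For anyone who hasn't seen them, a change stream is basically a supported cursor over the same changes you used to get by tailing the oplog. A minimal sketch (database and collection names are hypothetical, and it needs a replica set):

    from pymongo import MongoClient

    client = MongoClient()  # must point at a replica set, even a single-node one
    with client.app.events.watch() as stream:
        for change in stream:
            # Each change document describes one insert/update/delete.
            print(change["operationType"], change.get("fullDocument"))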
My little replica set has been up for as long as I can remember; the only time I have downtime is when restarting my server.
When I start a new project these days, I still go to Mongo, because what held true in its early days still holds: it's quick to get something started. Yeah, there are $lookups now and transactions coming; I've just leveraged features as they become available. I have Postgres in my stack, which runs TimescaleDB and is a backing store for my GitLab instance.
Would I use MongoDB professionally? It depends on the use case. Over the past 2 years I've worked with Oracle, SAP HANA, Teradata, Hive+Impala, etc., from OLTP to OLAP, but once in a while, when it's the quickest option, I still use MongoDB, and when I have it my way, I don't later migrate elsewhere.
For good or bad, even though they've fixed these problems, I still don't have any trust in them as a company and will never recommend or look at their product. For me, their early days behaviour has defined their values as a company, and those are values I don't trust or agree with.
If they were willing to nearly scam people once, what is to stop them from doing it again in the future? Clearly their motive is money over quality, and without a serious change of management I don't see why anyone should believe that they've changed.
Next up will be SQL compliance, and we'll be back to a relational database. I'm curious what the impact on speed will be, and what the use cases for these types of databases are now that the major SQL players support JSON.
I beg to differ. (Disclosure: I work for MongoDB.) Using JSON as your data model, rather than relational tables, lets you build applications that don't need multi-document transactions as often, because the related data is already together in a single document. But when you do need multi-document transactions (a small percentage of applications do, and only a few use cases inside those applications), they are now available. There is no speed impact in the cases where you don't use them. And most of the time, you shouldn't use them; otherwise you wouldn't be capitalizing on the advantages of JSON. I think that's a game changer, but then again: I do work for MongoDB.
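To make the embedding point concrete, here's a minimal sketch with hypothetical collection and field names; the whole order, including its line items, is one document, so updating it is a single-document atomic write with no transaction:

    from pymongo import MongoClient

    orders = MongoClient().app.orders  # hypothetical collection

    orders.insert_one({
        "_id": 1001,
        "customer": "Ada",
        "status": "open",
        "items": [  # the 1:N relation is embedded, not joined
            {"sku": "A-1", "qty": 2, "price": 9.99},
            {"sku": "B-7", "qty": 1, "price": 24.50},
        ],
    })

    # One document, one atomic update -- no multi-document transaction needed.
    orders.update_one(
        {"_id": 1001},
        {"$set": {"status": "paid"},
         "$push": {"items": {"sku": "C-3", "qty": 1, "price": 4.25}}},
    )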
It's usually only after a while that you realize almost every piece of meaningful data is relational. It just didn't look that way when the project started. But now you're committed to the wrong database and it's very costly to switch back to SQL.
Literally every project I saw using MongoDB ended up going back to SQL within the first 2 years, after realizing the data is indeed very much relational and there's no clean way to model it using documents.
You always end up with either tons of duplication across documents, which is hell to maintain, or tons of multi-document queries with hacks to look ACID, which is also hell to maintain.
Sure, Mongo makes it easy to prototype applications, but it makes it very complex to build robust and maintainable software. It's especially bad if you think your data isn't relational, because it almost certainly is.
Disclaimer: I believe Datomic to be the game-changing database, because it values simplicity and composition, and those attributes drive the entire design.
Video games storing player data are a great example of nonrelational data. I intend to write a blog post after I finish my game detailing the structure of the data I store and why it was so perfect for MongoDB.
On the surface it sounds like you might have a case for Mongo, but look out for scenarios like...
* Trading in-game items between two users (needs multi-document atomic locks if you don't want duplicate or lost items), assuming your "schema" is a document per user
* You want to rename or restructure an attribute in the future; with no schema it's not possible to migrate data easily without writing ad hoc code (maybe you can use third-party tools) or changing queries to expect data in multiple "schemas", which quickly gets painful
> You want to rename or restructure an attribute in the future; with no schema it's not possible to migrate data easily without writing ad hoc code (maybe you can use third-party tools) or changing queries to expect data in multiple "schemas", which quickly gets painful
You can have schemas with MongoDB. There are various libraries to facilitate database design by schema specification.
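For example, since 3.6 the server itself can enforce a schema via a $jsonSchema validator; a minimal sketch with hypothetical collection and field names:

    from pymongo import MongoClient

    db = MongoClient().app  # hypothetical database
    db.create_collection("players", validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["name", "gold"],
            "properties": {
                "name": {"bsonType": "string"},
                "gold": {"bsonType": "int", "minimum": 0},  # no negative balances
            },
        }
    })
    # Inserts and updates that violate the schema are rejected server-side.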
Also, renaming or restructuring your data is not necessarily an easy task with SQL. The nature of a database dictates that how well it works for your application depends on how well thought-out your schema is. Having to change your schema around is taxing. One of the reported advantages of document stores when they were becoming trendy was that it was easy to change your schema, since your schema is essentially determined and regulated at the application layer.
Also, MongoDB has ACID transactions now (freaking finally), so if they're as advertised, then I feel like half of your argument is not really a strong one any more.
Yes, players can sell items to other players, so that's the one place so far where I've needed to worry about atomicity, but even the MongoDB docs give examples of how to deal with something like that: https://docs.mongodb.com/manual/tutorial/perform-two-phase-c...
So yes, it's annoying for a very small % of what I'm doing, but 99% of my updates/writes are within a single document, so I find it very nice for development.
You still need ACID transactions over multiple entries even if your data is nonrelational; otherwise there is the potential for item- and money-duping bugs.
A simple example would be a marketplace.
Player buys item X with Y gold from another player.
1. Server checks that item X exists.
2. Server checks that the buyer has at least Y gold.
3. Server removes the gold from the buyer.
4. Server gives the gold to the seller.
5. Server removes the item from the marketplace.
6. Server adds the item to the buyer's inventory.
What if someone maliciously crafts two requests in a way that step 2 of the second request happens before step 3 of the first request?
The money is deducted properly but the account can now have a negative balance and there are now two instances of the item.
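For what it's worth, the multi-document transactions announced for 4.0 are aimed at exactly this. A rough sketch with pymongo (3.7+ against a 4.0+ replica set; all collection and field names are invented), not a tested implementation:

    from pymongo import MongoClient

    client = MongoClient()  # transactions require a 4.0+ replica set
    db = client.game
    item_id, buyer_id = 42, 7  # hypothetical ids

    with client.start_session() as session:
        with session.start_transaction():
            item = db.marketplace.find_one({"_id": item_id}, session=session)
            buyer = db.players.find_one({"_id": buyer_id}, session=session)
            if item is None or buyer["gold"] < item["price"]:
                raise ValueError("item gone or insufficient gold")  # aborts the txn
            db.players.update_one({"_id": buyer_id},
                                  {"$inc": {"gold": -item["price"]}}, session=session)
            db.players.update_one({"_id": item["seller_id"]},
                                  {"$inc": {"gold": item["price"]}}, session=session)
            db.marketplace.delete_one({"_id": item_id}, session=session)
            db.players.update_one({"_id": buyer_id},
                                  {"$push": {"inventory": item["sku"]}}, session=session)

Either every step commits or none do, so the interleaving described above can no longer produce a negative balance or a duplicated item.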
Most meaningful data is in fact not relational. When you consider machine-generated data, logs, metrics, network event data, and all other types of data like this, you find there are no relations in it.
This data is meaningful because it allows you to analyze what's going on across massive systems, detect when problems will happen, and find bottlenecks in applications and infrastructure, among many other use cases.
Application domain data tends to be relational I'd agree. But in general, this makes up a very small percentage of meaningful data in the world.
> When you consider machine-generated data, logs, metrics, network event data, and all other types of data like this, you find there are no relations in it.
I find that hard to believe. Maybe the raw data isn't stored relationally, but that doesn't mean there are no relations, real or implied.
If logs contain info about 'things' and any of those things can be considered to be the 'same thing' for multiple entries, then there's a relation right there – entry to thing.
And even metrics and network event data I'd expect to be full of cryptic IDs that reference some 'thing', i.e. a typical 'code' for which it's really nice to have a table with at least a friendly description.
Admittedly some of this data – or maybe even most of this data – isn't very 'deeply relational', but it definitely seems that claiming that "there are [no] relations in it" isn't strictly true.
Well, it's all a point of view really. The data is in the form of an "event": a fact that occurred at a particular time and has data associated with it. So a "relation" as constructed in a relational database isn't appropriate. You aren't truly denormalizing the data when you repeat IDs or tags or labels in this type of data, because at that time, that was in fact the associated ID, tag, label, etc. Changing such a field after the fact would make the stored event false, since at the time of the event that field did not have that value.
But it's all semantics really at that point.
Anyways, a relational database is a poor solution for this type of data. The stored data gains little to nothing when stored relationally, and its integrity may even suffer (at time t, the event DID have this ID; it DID have this label). Each event is discrete, and there will be many of them, which optimizes better for scale than relational organization.
I guess my point was there is vastly more useful data appropriate for a non-relational database than there is for relational databases. You might say it still has a "relation" in an abstract sense but this data does not need relational semantics within the database it resides in.
As a developer: What? Almost everything is relational. I do appreciate Mongo's query language and ease of use (it was the first DB I learned), but your statement is ludicrous. Think about a basic blog system. You'll have relations between authors, posts, categories, and comments.
In my experience, Mongo is most often used with ORMs that emulate joins, like Mongoose. And the possibility of data inconsistency due to lack of transactions is ignored, or patched over with cleanup scripts after the fact.
How does MongoDB handle schema changes? For example, let's say I want to add a mobile phone field to a customer record type. How would I go about doing that in MongoDB?
The short answer is: just do it. You can add any field to any document at any time. That's the beauty of JSON documents without schema constraints. Then of course your application needs to understand that. But it turns out it's almost trivial to make an application display a phone number field if it finds one, and not display a phone number if there isn't one in the document.
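A minimal sketch of what that looks like with pymongo (collection and field names invented): writers just set the new field, and readers tolerate its absence.

    from pymongo import MongoClient

    customers = MongoClient().crm.customers  # hypothetical collection

    # Add the new field to one document (update_many would backfill them all).
    customers.update_one({"_id": 123},
                         {"$set": {"mobile_phone": "+1-555-0100"}})

    # Readers simply handle documents that don't have the field yet.
    doc = customers.find_one({"_id": 123})
    print(doc.get("mobile_phone", "no mobile number on file"))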
It hurts when the next requirement comes along, something like...
"As a user I want to have a home, work, and mobile phone number"
Now you have 3 "versions" of your implicit "schema" to contend with
1) No phoneNumber
2) phoneNumber and mapping it into / out of one of the three phone numbers in the UI
3) objects with three properties homePhoneNumber, workPhoneNumber, mobilePhoneNumber etc
Then the business comes up with "As a user I want to have arbitrary phone numbers that I can label" now the developers start to squeal
RDBMS + SQL is no panacea, but having DDL operations like the following (sketches, but you get the idea) out of the box is incredibly powerful.
ALTER TABLE user RENAME COLUMN phone_number TO home_phone_number;
ALTER TABLE user ADD COLUMN work_phone VARCHAR(32);
CREATE TABLE phone_number (id BIGINT PRIMARY KEY, user_id BIGINT NOT NULL REFERENCES user (id), name VARCHAR(64) NOT NULL, phone_number VARCHAR(32) NOT NULL);
I have had reasonable success using MongoDB as a store of "things that happened" that will never change.
And I would still claim that this is easier in MongoDB because several versions of the phone number field(s) can happily coexist in the same collection. Those variants are usually trivial to understand for someone who even just looks at the data, and the application can be written to either accept the different formats, or adjust the format on the fly when it encounters a document that still uses an old schema. Or you could indeed write a batch job that bumps all your phone numbers to a new format, and you could put a JSON schema constraint on your collection that enforces the new schema for every future document. All those possibilities exist, and I truly see that as a big advantage.
For any "real" system that is going to be in production for a long time, this becomes a real problem.
There are tools to "migrate" data, but they come with all the limitations of the Mongo isolation model.
Typically you either:
* Write ad hoc code (possibly using some tooling) to iterate over your old data, adding or mutating the field(s) in question - a minimal sketch of this follows below
* Write queries such that they can handle the data being present, absent, or in different forms for all time. As you'd expect, this is a large burden
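A minimal sketch of the first option in Python with pymongo, assuming a made-up collection where a flat phoneNumber string is being promoted to a labeled list. Note it runs under Mongo's usual isolation, so documents written while the job runs may still use the old shape:

    from pymongo import MongoClient

    users = MongoClient().app.users  # hypothetical collection

    # Promote the old flat field to the new labeled structure, one doc at a time.
    for doc in users.find({"phoneNumber": {"$exists": True}}):
        users.update_one(
            {"_id": doc["_id"]},
            {"$set": {"phones": [{"label": "home", "number": doc["phoneNumber"]}]},
             "$unset": {"phoneNumber": ""}})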
There's also https://www.torodb.com/stampede/docs/1.0.0-beta3/relational-... which tails the Mongo oplog and makes a fully-relational read replica, adding columns and indices as needed. An amazing (and free) shortcut to using analytics tools if Mongo's your main datasource.
To be honest, I've only used it for pretty trivial things; I didn't do any joins. From the docs [0] it looks like they only have self joins, so joining on children or something like that, but not across documents/collections.
Depending on how much graph-relation stuff you need, you might be better off just using the graph API. I have no experience with that, though. Or they support a MongoDB API, if that covers your needs too.
Like I said I've only done basic stuff, but I really liked it - it's performant and really easy to set up and use. I used the Python API and it was really easy, then I switched to the Node one to try using it in Azure functions (Python library imports aren't really supported there) and that's nice too - it uses promises and works great. It also doesn't feel like a giant lockin (IMO) - their APIs work anywhere and there's no magic in Azure AFAIK to make you put your compute there if you're using the Database.
Just remember, SQL is and always was "not only relational" - a play on the acronym NoSQL, which IMHO should be long dead and buried. Structured Query Language - I know for a fact from actually doing it (see the Caché database, Hadoop schemaless SQL, even a product I worked on that translated SQL to Mongo, heck the SQL standard itself) - guess what - it works with object/document stores too!
Mongo could have implemented SQL on top of their storage engine a long time ago, minus the joins. Instead they built their own query mechanisms. Mind you, MapReduce can't be reproduced explicitly in SQL, but SQL expressions can compile to MapReduce (see Apache Hive), so even that was not an excuse.
Edit: NoSQL served a purpose to remind people that there were other options other than relational databases (including those that predate the relational model and those that came after it), but man, what a terrible and misleading misnomer.
On this point, the datastores that have insisted on using a query language that isn't SQL have always seemed like attempts at setting up walled gardens. Plenty of people have extended SQL when their needs required it (MySQL, PostgreSQL, Oracle, TSQL/MSSQL), but intentionally rejecting the basic format of SQL queries just seems like an attempt to set up a barrier to ever moving off your DBMS.
Speaking as someone who ported an application from MySQL to MSSQL there is still a lot of work required to remove those custom extensions, but the core of what you're doing can remain the same.
I'm the Product Manager on the Core Server responsible for the multi-document transactions project. For those of you interested in learning more about how we're building transactions in MongoDB, I suggest checking out this video that discusses the creation of WiredTiger timestamps to enforce correctness in operation ordering across the storage layer. It is presented by Dr. Michael Cahill, the co-founder of the WiredTiger storage engine acquired by MongoDB. https://www.mongodb.com/presentations/wiredtiger-timestamps-...
In 4.0, transactions will just be across replica-sets. The following release will have transactions across the entire sharded cluster (across multiple primaries).
MongoDB can be quite a nightmare once you start requiring anything more than a 1:1 relationship, which describes pretty much any app that is doing anything meaningful. Having to resort to things like map/reduce for a simple GROUP BY / ORDER BY is not the way to go, IMO. I think you only truly realize the beauty of SQL once you've been far down that rabbit hole.
Initially I rode on Mongo's NoSQL bandwagon when I saw that you could just save a JSON hash, and I thought it was the coolest thing in the world. But ever since I tried out Postgres's JSONB, I just can't go back to Mongo any more. With Postgres, I have the best of both worlds - performance, relational data, and reliability - and I don't have to sacrifice any of it. Also, I don't know who codes using raw SQL; for years now, languages have had ORMs that make queries look just like a Mongo query.
Also, for anything else, like the super simple requirement of saving data (JSON), Firebase has fit that role perfectly.
Mongo is starting to look out of place in the ecosystem.
We currently use MongoDB 3.4. It's definitely much improved. The replication protocol that came along with 3.4 has been very reassuring. But aggregate queries really do not give you the power of SQL. They are great for transforming stuff for a report, but I would avoid using them for anything else unless 1) it can be cached (and therefore properly invalidated) and 2) doesn't need to be "real time"
Being a document, schema-less database, your "quality of life" as you scale with MongoDB is going to be heavily dependent on how you structure your documents and the types of their fields. Are you treating your collections like SQL tables? Have any many-to-many relationships in hot code paths? Welcome to hell. Its type system is also limited compared to modern SQL DBs. Storing IP addresses, and want to query them based on a given CIDR range? Postgres makes this easy; MongoDB has you writing code that does sub-queries.
I think you are referring to the 100MB RAM limit, but that’s not a hard limit, it’s more of a bad default. The `allowDiskUse` option lets MongoDB write intermediate results to the disk (which is exactly what SQL databases are doing).
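For example (pymongo, invented collection and fields):

    from pymongo import MongoClient

    events = MongoClient().app.events  # hypothetical collection

    # Stages that overflow the 100MB in-memory limit spill to disk instead
    # of failing when allowDiskUse is set.
    pipeline = [{"$group": {"_id": "$user_id", "n": {"$sum": 1}}},
                {"$sort": {"n": -1}}]
    results = list(events.aggregate(pipeline, allowDiskUse=True))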
> But aggregate queries really do not give you the power of SQL. They are great for transforming stuff for a report, but I would avoid using them for anything else unless 1) it can be cached (and therefore properly invalidated) and 2) doesn't need to be "real time"
I really don’t see a difference between what you can do with MongoDB and SQL. I can’t say much more without knowing specifically what impediments you have in mind, but I would certainly like to hear more. For example, why do you cite results caching and lack of real-time requirements?
> Being a document, schema-less database,
I guess if you’re on 3.4 you can’t take advantage of JSON Schema yet, but keep that in mind as a part of your upgrade plans. In the meanwhile you can still use document validation?
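For reference, once you're on 3.6+, attaching a JSON Schema validator to an existing collection is a single command. A sketch with pymongo (the collection name and schema are invented):

    from pymongo import MongoClient

    db = MongoClient().app  # hypothetical database

    # Enforce a shape for all future writes to the "customers" collection.
    db.command("collMod", "customers", validator={
        "$jsonSchema": {
            "bsonType": "object",
            "required": ["email"],
            "properties": {
                "email": {"bsonType": "string"},
            },
        },
    })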
> your "quality of life" as you scale with MongoDB is going to be heavily dependent on how you structure your documents and the types of their fields.
This is completely true, but couldn’t we say that just as much about any database? I’d put money on there being way more grief out there over bad tabular schema than over bad document schema. I mean, who’s worked on large-scale systems that hasn’t put off implementing great ideas, or had to hack up app code to compensate for a restrictive schema, because you can’t take the pain of ALTER TABLE?
> Are you treating your collections like SQL tables?
Ouch. Please don't!
> Have any many-to-many relationships in hot code paths? Welcome to hell.
That's probably fair, but if you're in hell to a vastly greater degree than you would be with Postgres, I'm pretty sure that's a modeling problem. Again, can you tell me more about the particular example?
> Its type system is also limited compared to modern SQL DBs. Storing IP addresses, and want to query them based on a given CIDR range? Postgres makes this easy; MongoDB has you writing code that does sub-queries
That's 100% legit. MongoDB needs to do a ton better with that... types rule.
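For comparison, the Postgres version really is a one-liner thanks to the inet type and its containment operator. A sketch with psycopg2 (the table name and DSN are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=netdata")  # hypothetical DSN
    cur = conn.cursor()

    cur.execute("CREATE TABLE IF NOT EXISTS connections (ip INET)")
    cur.execute("INSERT INTO connections VALUES ('10.1.2.3'), ('192.168.0.5')")

    # "<<=" means "is contained within": one predicate, no sub-queries.
    cur.execute("SELECT host(ip) FROM connections WHERE ip <<= inet '10.0.0.0/8'")
    print(cur.fetchall())  # [('10.1.2.3',)]
    conn.commit()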
> I think you are referring to the 100MB RAM limit, but that’s not a hard limit, it’s more of a bad default. The `allowDiskUse` option lets MongoDB write intermediate results to the disk (which is exactly what SQL databases are doing).
While technically true, this isn't an apples-to-apples comparison. MongoDB's memory limit, as best I can tell -- and please correct me if I'm wrong -- is based on the contents of all documents and operations in the pipeline. Documents in MongoDB tend to be larger, so you can run into that limit faster than one might anticipate.
In Postgres, you have the equivalent in "work_mem". It defaults to 4MB, but most production installations will bump this up. Regardless, this limit is per operation (a join, a sort, etc.), not per query. And oftentimes the operation is against specific fields, as opposed to the entirety of the record contents.
> I really don’t see a difference between what you can do with MongoDB and SQL. I can’t say much more without knowing specifically what impediments you have in mind, but I would certainly like to hear more. For example, why do you cite results caching and lack of real-time requirements?
This might be my own hangups or just us fighting our own specific problems, but I've never been happy with the latency I see come out of the aggregate pipelines we've created. I also don't like what they do to the server's memory.
> At its core, MongoDB is a document database and — almost by default — these kind of databases aren’t ACID compliant, especially when it comes to multi-document transactions. For the most part, that’s not a big deal for companies that use database systems like MongoDB because they are not trying to write to multiple documents at the same time.
No. At least in the open-source world, you can see many applications make multi-document "transactions". I don't see how it would be different in companies using MongoDB.
> Because of this, though, many MongoDB users still run relational databases in parallel with their document database.
No. For the most part, they do write to multiple documents without thinking about consistency.
At least the situation seems to be getting remedied. Better late than never.
Every time I hear about some NoSQL "breakthrough" that existed a while ago in SQL databases, I can't help but feel underwhelmed.
In general, I'm convinced SQL is like Constitutional Democracy; it's not perfect, but it's better than any alternative humans have come up with so far.
The consistency is really nice too. I've always hated the fact that column declarations are the first portion of a query and wanted them at the end but... I'd rather be slightly disappointed all the time than occasionally need to rewrite huge swathes of queries if we're changing DBMSs
We have been using Mongo for the last three years on our project, and it has been pretty smooth so far. However, lately I gave some thought to what our project would look like if we used PostgreSQL instead. I tried to figure out what problems Mongo solves that PostgreSQL doesn't.
I am far from being a database expert, I just know enough basics to query what I need, so feel free to correct/complete the following:
- Mongo has been built to store JSON objects -> Yes, but from what I understand, benchmarks indicate that PostgreSQL is faster at reading/storing/indexing json/jsonb content. I don't think that's a good reason to use it.
- Mongo is schemaless -> There might be some use cases, but I bet in most cases this can be worked around, especially in a database with JSONB support.
- MongoDB horizontal scaling is way easier than PostgreSQL's -> Yes, it seems that scaling Mongo horizontally is extremely easy compared to any other relational database.
And ... that's it. But there is probably more.
At the moment, here's how I would summarize MongoDB benefits if asked my opinion when starting a project:
- For small projects or a prototype: ease of use, ease of configuration, and not requiring too much thinking about my data model while I am experimenting
- For a bigger project: horizontal scaling should be easier
Does that sound accurate to you? Am I missing anything important?
Honestly, the existence of Mongo is mostly an indictment of how user-hostile conventional RDBMSs are. The fact that other DBs can do what Mongo does is not helpful when there is no easy workflow to do what Mongo does.
The fact that I could theoretically implement a web-based CMS in C, and that it might be more performant than all these web-language CMS products, doesn't mean that C is better for making a CMS.
> "Honestly, the existence of Mongo is mostly an indictment about how user-hostile conventional RDBMS is."
That's nonsense. I've taught people to use SQL before, even people with little to no programming experience. After the initial concepts were understood it was fairly easy to gradually expand knowledge over time.
The basic SQL keywords to start getting useful information out of a RDBMS are:
SELECT
FROM
WHERE
INNER JOIN
LEFT JOIN
ON
AS
AND
OR
Those 9 keywords give you a good starting point for exploring SQL. It's really not hard. You could probably learn enough to get started in a couple of hours, and it's easy to expand your knowledge as and when you need to once you've got the basics sorted.
That's after you set up users, schemas, tables, columns, oddly-named datatypes with unexpected behaviors, etc.
The RDBMS is a lie. It's a beautiful kernel of relational theory wrapped in a 60-foot ball of hacks, tweaks, duct-tape, bubble-gum, and hate. The document store doesn't lie. It doesn't pretend. It's honest that it's stupid and it's a glorified hashtable with a string in it. It doesn't make any ridiculous pretenses of having a Sufficiently Smart Query Optimizer that will inevitably let you down and leave you pulling your hair out trying to figure out why on earth a simple, straightforward query is running so goddamned slow.
Then you build a complicated model and have to figure out from the query plan why your query is slow and deal with indices and foreign keys and all that nonsense.
Meanwhile, an object DB may be inefficient and clumsy, but it gets all of that stuff out of the way. Also, if you don't want to join, you can work around that by duplicating the data all over the place. That's something you can't do with an RDBMS, because tables are fundamentally flat and so you can't stuff a parent-child relationship into a single table.
> That's after you set up users, schemas, tables, columns
Just as you have software engineers, you should similarly have a data engineer (i.e. a DBA). Nearly 50 years have passed and we still haven't found a better way to represent data, so perhaps it is the right model. The only difficult part is to bother enough to learn it.
> oddly-named datatypes with unexpected behaviors [...] 60-foot ball of hacks, tweaks, duct-tape, bubble-gum, and hate
That's only when using MySQL
> Meanwhile, an object DB may be inefficient and clumsy, but it gets all of that stuff out of the way. Also, if you don't want to join, you can work around that by duplicating the data all over the place. That's something you can't do with an RDBMS, because tables are fundamentally flat and so you can't stuff a parent-child relationship into a single table.
You absolutely can store data that way in an RDBMS. For example, in Postgres you can create a table with two columns, one named key and the other named data, where the data column's type is JSONB. Then you essentially have the equivalent of a Mongo collection (a sketch follows below).
But in that case you store data inefficiently, and if your application starts evolving and you need to make different queries, things will get more complex quickly.
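A sketch of that emulation with psycopg2 (the table name and DSN are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    cur = conn.cursor()

    # Two columns: a key and a JSONB document -- essentially a Mongo collection.
    cur.execute("CREATE TABLE IF NOT EXISTS things (key TEXT PRIMARY KEY, data JSONB)")
    cur.execute("INSERT INTO things VALUES (%s, %s)",
                ("user:1", '{"name": "Alice", "tags": ["admin"]}'))

    # JSONB operators still let you query inside the document.
    cur.execute("SELECT data->>'name' FROM things WHERE data @> %s",
                ('{"tags": ["admin"]}',))
    print(cur.fetchone())  # ('Alice',)
    conn.commit()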
> "That's after you set up users, schemas, tables, columns, oddly-named datatypes with unexpected behaviors, etc."
You shouldn't jump in at the deep end when learning this stuff; getting a feel for how an existing database works before creating your own is helpful, and you can download sample databases to play around with when you start learning if you don't have access to one already. Aside from this recommendation, if you're starting out learning about RDBMSs, then SQLite is a good starting point, and the setup of a SQLite database is fairly simple.
If you're starting from scratch with a brand new database, here are some of the SQL keywords you'll find useful:
CREATE TABLE
ALTER TABLE
DROP TABLE
TRUNCATE TABLE
INSERT INTO
UPDATE
SET
DELETE FROM
VARCHAR
INT
DECIMAL
PRIMARY KEY
FOREIGN KEY
REFERENCES
Furthermore, if you're using a decent DB GUI frontend, you don't even need to remember most of the above, aside from VARCHAR (for strings), INT (for whole numbers) and DECIMAL (for numbers with a fractional element). Reason being, you can do all of the database setup graphically. Tools like SQL Server Management Studio help in reducing friction.
I literally just finished making some code changes minutes ago to very carefully sequence a set of changes to some related documents to make sure that if write failures occur they'll have the least impact on our system. I've been very happy with our decision to use MongoDB because in the vast majority of cases I just don't need transactions, but there's that one place where using them will be a big win.
I looked into MongoDB a couple of years back, because it was the hot thing at the time. About fifteen minutes in, I tried to find out how to do transactions. That was strange, I thought; the manual said nothing about transactions. I asked a popular search engine and was a little shocked to find out there were no transactions.
That was a dealbreaker for me. If MongoDB has now grown support for transactions, that changes things. I think I am going to look at it again sometime.
While doing so, I recommend you think some about the meaning of embedding relationships vs. referencing them in other collections. MongoDB can do both (the aggregation framework provides joins in the form of the $lookup stage).
If you have a “one-to-many(some)” kind of relationship, embedding is a good option, and gets you ACID semantics without even resorting to multi-statement transactions.
If you have a “one-to-many(thousands or more)” kind of relationship, objectID references (a la foreign keys) are more likely what you want.
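As document sketches (all field names invented for illustration), the two shapes look like this:

    # Embedded, "one-to-many(some)": the values live inside the person document,
    # so a single-document write updates the person and emails atomically.
    person = {
        "_id": 1,
        "name": "Alice",
        "emails": ["alice@example.com", "a.smith@example.com"],
    }

    # Referenced, "one-to-many(thousands or more)": each order carries the
    # person's _id, resolved at query time (e.g. via a $lookup stage).
    order = {
        "_id": 5001,
        "person_id": 1,
        "total": 34.49,
    }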
It will be interesting to learn about database design in a non-relational environment. I guess you can replicate a relational structure in MongoDB, but then why not use an RDBMS in the first place?
As far as I understand it, MongoDB's claim to fame is a) handling huge amounts of data and b) clustering. Neither of these apply to me, so the only reason to use Mongo would be data that does not match the relational model well. I am still looking for a use case, any use case, that might make a valid excuse to learn it, but so far I have come up empty.
I am a little worried that I am facing a situation somewhat like the one when I tried to learn Lisp. Learning Lisp was very hard for me because I had absorbed the structured programming approach so deeply that the functional part of Lisp programming seemed downright alien to me. At first, at least. So for the time being, I cannot tell with any certainty if problems that are a good match for a document store are just so rare, or if I am just incapable of modeling my data in ways other than the relational model.
Think about it less in terms of relational/non-relational and more in terms of tabular/document. Tables can only model things in terms of relations. Documents can model them either as embedding or as external references, depending on access patterns.
A decent example is a person record with their email addresses and phones. In a relational DB, you would always and only model those as three separate tables, and you would quite frequently need a three-way join to assemble that person again.
In MongoDB, you would definitely have an array of emails and an array of phone numbers embedded in that person document, sparing the join on those queries. But in an ecommerce context, you would likely not embed an array of all past orders with all the line items into that person, instead you’d have an array of previous order numbers.
But an order document would have an embedded array of line items, sparing the DB a bunch more joins (and the attendant indices you’d need for joining line items to orders efficiently).
Getting the 10 most recent orders from a customer would involve joining the customers collection with the orders collection (MongoDB’s join is the `$lookup` aggregation stage), but it wouldn’t involve joining the line items to the order.
Is that replicating a relational structure? A little column A, a little column B.
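A sketch of the "10 most recent orders" query described above, with pymongo (collection and field names invented):

    from pymongo import MongoClient

    db = MongoClient().shop  # hypothetical database

    recent = db.customers.aggregate([
        {"$match": {"_id": 1}},
        # Join the customer's order numbers against the orders collection.
        {"$lookup": {"from": "orders", "localField": "order_ids",
                     "foreignField": "_id", "as": "orders"}},
        {"$unwind": "$orders"},
        {"$sort": {"orders.placed_at": -1}},
        {"$limit": 10},
    ])
    # Line items are embedded in each order document, so no second join runs.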
I find it telling of MongoDB's sales strategy that this is being covered by Techcrunch. I don't think you'd get the same coverage on Postgres (granted MongoDB is a company, vs PG is a project).
Triggers?? Where you add some application logic to the database, 10 years go by, and no one has any idea how the triggers work? Or even how to test them? I've never seen triggers used successfully in any production application (maybe they work at first, but give them time, and a few code changes).
> "I've never seen triggers used successfully in any production application"
Triggers have one or two good use cases, and plenty of bad use cases.
I would suggest the strongest use case is data validation. Databases don't have very sophisticated type systems, and custom data types can cause headaches. Database triggers allow you to ensure that the data stored in fields matches a set of criteria. To give a basic example, if you had a customer table with an email address field, you could validate that the email address has an @ symbol using code in an insert and/or update trigger. Of course, you should have similar checks within the software that sits on top of the database, but by putting the validation at a low level using database triggers, you can be more confident that data integrity will be protected.
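A sketch of that email check as a Postgres trigger, created from Python with psycopg2 (the table name and DSN are invented):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS customer (id SERIAL PRIMARY KEY, email TEXT)")

    # Reject any insert/update whose email has no '@'.
    cur.execute("""
    CREATE OR REPLACE FUNCTION check_email() RETURNS trigger AS $$
    BEGIN
        IF position('@' IN NEW.email) = 0 THEN
            RAISE EXCEPTION 'invalid email: %', NEW.email;
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS customer_email_check ON customer;
    CREATE TRIGGER customer_email_check
    BEFORE INSERT OR UPDATE ON customer
    FOR EACH ROW EXECUTE PROCEDURE check_email();
    """)
    conn.commit()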
As a second use case, if the triggers are being used to maintain data for audit or reporting purposes, that can be fine.
However, aside from the audit/reporting use case, I would recommend avoiding using triggers which span multiple tables (unless there are some exceptional circumstances where it's the best option). Things get messy when you have a chain of tables, each with their own triggers that can update other tables within that chain. If you spot lots of code like that, run to the hills!
We are using them for building aggregate tables from a raw table. We insert data into the raw table, and triggers propagate running min/max/avg into hourly/daily/monthly tables. After a couple of months we truncate the raw tables but keep the aggregate tables. Duplicates are easily removed on insert, and we get instantaneous results from our aggregate tables (no hourly batch job).
Third case (and most important IMO): schema migrations. When doing major shuffling of a schema, you can use triggers to "redirect" accesses to the old tables or columns into new ones. This way, the schema updates can be applied without bringing down the application.
(Such triggers of course are removed once the application is updated, along with any old schema bits.)
You set up your preconditions, you execute the action, you assert your postconditions. You can use something like http://pgtap.org/documentation.html to write your database tests.
Triggers have a place in app development, and sometimes are the best things to use.
Has anyone worked with MySQL JSON data types in production? For most projects, I prefer working in SQL via a query builder or ORM for abstraction, but find a few features that would benefit from denormalized JSON storage.
I've been using MySQL 5.7 in production since its release. I used to store JSON data in a blob, but when they released JSON support, I just couldn't wait to use it. It's working really, really well. JSON support solves a ton of problems we used to tackle with an EAV model (I won't get into details, the discussion will go the other way). I deal with a few hundred MySQL deployments ranging from 1GB dataset size to several terabytes, from multi-master and MySQL Cluster setups to in-app sharding. There are several notable pieces of software that have never caused any problems in our use scenario, and MySQL is one of them. The other is nginx. Can't really remember the third. I often wondered why anyone would use MongoDB or similar, but I always keep forgetting that developers are usually inexperienced hipsters who are looking for a magic unicorn to compensate for their lack of knowledge / experience.
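For anyone curious what that looks like in practice, a minimal sketch with mysql-connector-python (the table, column, and connection details are invented):

    import mysql.connector

    conn = mysql.connector.connect(user="app", database="app")  # hypothetical DSN
    cur = conn.cursor()

    # 5.7+: a native JSON column, validated on insert.
    cur.execute("CREATE TABLE IF NOT EXISTS events (id INT PRIMARY KEY, doc JSON)")
    cur.execute("""INSERT INTO events VALUES (1, '{"type": "click", "x": 10}')""")

    # Path operators query inside the document.
    cur.execute("SELECT doc->>'$.type' FROM events WHERE doc->'$.x' > 5")
    print(cur.fetchall())  # [('click',)]
    conn.commit()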
So what you're saying is that if I find something on Google, it's undoubtedly true and I should blindly believe it? For every one of those use cases, I can provide a counter-argument. And there are many more who can do the same.
I used RDBMSs for 15 years before ever using Mongo. I fell in love with Mongo the first time I used it with a pre-existing system. Being able to insert and retrieve my whole C# object graph with one statement was a thing of beauty.
I'm a Domain Driven Design true believer. The thought of just being able to store my entire domain model without the machinations of the object relational impedance mismatches was glorious.
I use C# so my document collections are strongly typed C# objects and thanks to the Mongo C# Linq provider, my queries are also strongly typed.
Thanks for the advice and vote of confidence in MySQL JSON support. I don't anticipate reaching that kind of scale, but it's great to hear about large, successful, production deployments. These days I try to stick with a very stable back-end stack and innovate more on the front-end.
We switched some non-essential data over to JSON. All of it was stuff that we don't index, join, or search via SQL (Elasticsearch still indexes it).
It is always just the display data on the details page.
Not sure if that's a good use case, but it cleaned up our tables, and given that the data fields can vary from customer to customer, it's nice flexibility. Previously these fields used an entity-attribute-value setup (which was very slow after we got into millions of rows), and we were considering a NoSQL option.
You're aware that the vendor will post gospel about their product on their own website? I'm not using Postgres, I use MySQL, but I can assert with 100% certainty that most of the "comparison" on that URL is just marketing bullshit. You're a prime example that this new-age bullshit works and that's what's worrying.
How's that relevant? You're neither the lead architect, nor does your employment at MySQL / Oracle mean anything. Many incapable developers score positions at big companies. Without facts, maths, and measurement, you could be Linus Torvalds for all I care. The article at the URL provided is simply a marketing ploy. I'm experienced enough to notice what's wrong with the comparison, but I won't take that article apart to prove anything. If you believe Mongo is a product that helps you - perfect. Now, if you're thinking you'll convince me that it's great by berating another great product - well, that raises a red flag. Who says I have to use a single software vendor anyway?
It was the reference to me as "You're a prime example that this new-age bullshit works and that's what's worrying". The same statements were leveled at anyone from MySQL 10-15 years ago. I'm glad MySQL works well for your use case.
Are you the Mat Keep who is Director of Product Marketing at MongoDB? If so then, FWIW, people don't often respond well to marketing links posted without identifying your connection to the company.
That changes nothing for ArangoDB (https://docs.arangodb.com/3.3/Manual/Transactions/), as they had this feature long before MongoDB. But MongoDB gets the headlines for the hype, despite being light years late to the party.
"Every NoSQL attempts to add features until it can do relational ops and then ends with a sql layer plugin on top. Those who can't are replaced by ones which can." - Rehash of Zawinski's Law by me
I think Percona's MongoDB distribution also supports RocksDB as a storage engine, but they are not keeping up with the latest releases, and hence the new feature offerings.
Correct me if my assumption is outdated as of today.
I never used MongoDB, but I have read many stories that a few years ago MongoDB's default configuration permitted data loss: the server returned a write-success response to the client without syncing data to a replica or to disk, and if your server died, your data was gone.
Additionally, some time ago MongoDB had single-threaded writes with a collection-level lock, which could cause poor write performance.
That's one thing I keep "hearing". The default configuration now is a WriteConcern of 1 - meaning at least the primary has to have written it successfully. You can choose 2 or majority among others.
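In pymongo terms (the collection name is invented), that choice looks like:

    from pymongo import MongoClient, WriteConcern

    db = MongoClient().app  # hypothetical database

    # Don't acknowledge a write until a majority of replica-set members
    # (and the journal, with j=True) have it.
    safe_orders = db.get_collection(
        "orders", write_concern=WriteConcern(w="majority", j=True))
    safe_orders.insert_one({"sku": "A-1", "qty": 2})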
The last analysis looks good, but note that it only holds with the strongest settings (meaning Mongo is configured with its slowest settings); also, that analysis did not cover node crashes or restarts.
I read it. From his analysis it seems like the loss of data will rarely happen in the real world if you have a WriteConcern of Majority and you don't have a large latency between nodes.
One of the comments said that, theoretically, this could happen with any system: if something happens between the time it writes to disk and the time it acknowledges the write, you may end up with extra data.
There are times, though, when you care more about speed than reliability - e.g. capturing string data from an IoT device, logging, etc.
There are even times when eventual consistency is good enough. But definitely choose the right tool for the job. I wouldn't trust Mongo with financial data where transactions are a must.
And in my preferred language, C#, if you write your code correctly you can switch your LINQ provider from Entity Framework to the Mongo driver, for instance, without any major code rewrite, so you aren't stuck with your choice.
From what I'm reading they were trying to do things with Mongo that it wasn't designed to do around transactions. If you have a schema that is well defined and doesn't change frequently, why use a nosql database?
I chose Mongo for a relatively large project that I was responsible for because I knew the schema would change frequently, we didn't have a need for transactions and honestly we hardly ever update documents. We insert and replace.
I also had enough sense to set my Write concerns appropriately.
My choice was also informed by a preexisting project that the company was using that had close to 150K new documents a day with basically the same semantics as the one we started.
No, it doesn’t work for that use-case. It would all need to happen within a single service in a single invocation.
To enable transactions across network boundaries, you need to enable the transactional safety at the application/API level.
And, for the record, one service per table is a terrible idea. It would have none of the benefits of a service-oriented architecture, while adding all the complexity and caveats of a distributed architecture.
Except I've implemented it that way in highly visible, high traffic internal web applications that were lauded as some of the most successful transitions in company history in a very large corporation. So your argument is really legacy opinion and not relevant to micro-service architectures.
None of the things you've just said make any difference to the fact that it doesn't make sense and is a terrible idea.
I fail to see in what way my statements are "legacy" or not relevant to micro-services. My company (Cuvva, insurance in the UK) runs a "micro" service architecture and deals with these problems every day.
It is not possible to use DB-level locking/transactions across network boundaries. It's just as simple as that. It's not an opinion - it's a fact.
I wasn't responding to the distributed transaction argument, though nothing in the MongoDB blog says it won't support it.
I was specifically responding to "one table, one service", which is the foundation of domain-driven microservice architecture: having a single domain siloed and tested within its own boundaries.
Whether you're firing events between services or just calling other services is a design decision, but the encapsulation of a specific domain is the point and the primary benefit.
I have no idea what you're doing, but this is what I worked on:
Just because something can be done, including the fashionable microservices "architecture", it does not mean that it should be done, or that it is a good idea.
Certainly, one can architect anything: use NoSQL for critical data storage, or use Kafka to synchronize two SQL databases with the same schema living on the same network (an actual proposal from a fan of microservices, btw). But whether it should be done is another question, and I have yet to see a single example of microservices being a good idea. Not "it works" -- obviously, you can get it to work -- but being better than the alternatives. Come to think of it, Mongo is the same...
The Accenture link I posted is a living example of a highly successful micro-services architecture in a critical internal application in a large corporation.
[0] http://www.defmacro.org/2017/01/18/why-rethinkdb-failed.html