Using the new working set analyser is going to make figuring out how much memory you need significantly easier. Giving MongoDB enough memory for your working set is the easiest way to get good performance, but until now it was quite difficult to figure out: you had to know both data and index sizes across your most common usage patterns.
Do you know of any resources that give some info on this topic in general? I've been looking but I haven't found anything that's sufficiently plainly-spoken that I fully grasp the contents.
pagesInMemory contains a count of the total number of pages accessed by mongod over the period displayed in overSeconds. The default page size is 4 kilobytes: to estimate the amount of data in memory, multiply pagesInMemory by 4 kilobytes.
If your total working set is less than the size of physical memory, over time the value of pagesInMemory will reflect your data size. Conversely, if your data set is greater than the size of the physical RAM, this number will reflect the total size of physical RAM.
Use pagesInMemory in conjunction with overSeconds to help estimate the actual size of the working set.
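A minimal shell sketch of that calculation, assuming the usual 4 KB page size (the workingSet fields are the ones documented for serverStatus in 2.4):

    // Ask serverStatus for the workingSet document (new in 2.4)
    var ws = db.serverStatus({ workingSet: 1 }).workingSet;
    // pagesInMemory * assumed 4 KB page size ~= data touched recently
    var approxMB = ws.pagesInMemory * 4096 / (1024 * 1024);
    print("~" + approxMB.toFixed(1) + " MB accessed over the last " +
          ws.overSeconds + " seconds");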
The best way to estimate resource usage is to run something and store metrics, though. I would expect better success scaling when you can see more of what's going on ...
I didn't realize this is the first MongoDB release for which you can even reasonably do capacity planning. How did people plan before if there were no metrics?
Yeah, you're right, MongoHQ hasn't brought anything positive to web development and there's no way they're going to make a business supporting their db. It definitely needs to be put in its place and we should all take the time to talk about just how cool we are for predicting its downfall. If only the whole world had enough insight to just stop before they tried to do something awesome. After all, someone may have tried it before.
I suppose answering snark with snark is fair play, but our industry has a real unhealthy preoccupation with the new. The early releases of MongoDB were great proof that the engineers had failed to learn many of the lessons of their predecessors, and I think this is why I and other database-y programmers distrust it to this day. Sexiness is not a prerequisite for reliability.
Your unspoken message reflects the widespread idea in the industry that a detailed knowledge of old things is incompatible with doing new and exciting things. I think this notion is naive and also contributes to ageism, and I hope our industry recovers from it in my lifetime.
I feel like you're more concerned with making sure people understand where they sit on the totem pole than encouraging progress or exciting developers. While learning from history is certainly paramount in software development, raw experimentation, naive excitement, and continuing in the face of nay-sayers is what pushes all industries forward.
People are allowed to make mistakes, tons of them, and the software industry is one of the best industries to make mistakes in. You get quality feedback almost instantly and can fail faster than anywhere else. No one is claiming that "detailed knowledge of old things" is on it's way out. We are engulfed in systems and code that are decades old. I don't see the point in nay-saying.
Actually I am claiming that detailed knowledge of old things has never been "in" to begin with. HN seems to harbor a great breadth and depth of knowledge and I don't think it represents normality in our industry.
Failing fast is nice, but a more measured approach might result in less of it. When it comes to storage and email, people are surprisingly unwelcoming of failure at any speed. If a little study can avert a few catastrophes it's worth it. I think ours is probably the only industry that would question the wisdom of study.
By positioning history as in conflict with raw experimentation, naive excitement and all the rest, I think you're carving that false dichotomy even deeper. Is a week or two of research sans coding really such a high price to pay? Personally, I find there are a lot of really inspiring ideas in old stuff, ideas that strike me as newer and more exciting than "embed Javascript in X."
Are young developers' egos really so fragile that they can't weather having someone with experience say "fsync should be on by default"? Is cracking open the source code to BerkeleyDB really such a deflationary moment that it threatens our industry's progress? I, personally, think they can handle it.
As a young and very "green" developer, I have to agree. The biggest thing I've learned is:
Someone else has probably already solved this problem 30 years ago, and they solved it in O(n) time.
However, that's never a bad thing, or it shouldn't be.
I think an unwillingness to look backwards is a sign of an unhealthy level of pride, and I'll freely admit that it's something I struggle with. However, I believe many (most!) developers would also benefit from trying to get past their egos.
I assume you're referring to Greg's comment about Pick and not my response to your response to it, because I think we've had a pretty good conversation.
But we can meaningfully compare and contrast newer database technologies as well. I think it's pretty clear that Mongo is the MySQL of NoSQL, and I do mean that in the most pejorative way imaginable. Architecture decisions like "let the kernel do all the swapping," a casual approach to durability, the very mindset that leads to inclusion of SSL without certificate validation... ick.
They'll gradually polish their stuff up so that it's quasi-usable and maybe make lots of money, and good for their pocketbooks, but there are plenty of alternative data stores out there whose characteristics make them more suitable for storing data for most if not all projects.
In a production environment with real applications and important data people are not allowed to make tons of mistakes. Experiment and learn all you want, but I hope you eventually understand that what professional software developers and database admins do has very little relationship to weekend hackathons.
I know my clients won't tolerate "tons" of mistakes. Not even a few. In fact I am usually hired because the person they just let go made one too many mistakes with important data.
I used to work at a big professional shop. They made lots of mistakes, and the people who made the most fundamental ones were often promoted, not "let go".
The main difference in startups isn't that they don't build software the "professional way" and thus make avoidable mistakes. Instead, the difference is that startups recognize that the mistakes are inevitable, so may as well embrace them and build a process / culture around prototyping and continual refinement toward actual business value.
Pie-in-the-sky posturing about The One Right Way is nice, and especially nice when your BigCo is footing the bill, but when you operate under real constraints (e.g. economic ones) prototyping speed and simplicity can outweigh all the other factors.
All that said, I still think MongoDB full-text search was a bad idea -- only because their implementation is pretty naive and this corner is satisfied nicely in F/OSS by a number of dedicated search engines like Solr, ElasticSearch, Sphinx, etc.
>While learning from history is certainly paramount in software development, raw experimentation, naive excitement, and continuing in the face of nay-sayers is what pushes all industries forward.
Citation needed.
This "continuing in the face of nay-sayers" meme comes from old wives tales of "courageous inventors" that "went against the tide" etc.
In the general history of science and technology, those wasn't that many or important. They just make for a good media story.
The other side is more clear cut: incrementally building and learning from each field's history has been paramount.
So your reply is to take it from one extreme to the other.
Not to mention that you didn't address the (wrong? right? I dunno) extreme the parent said: that Mongo is mostly similar to the Pick database of yore. Is it? Can anybody shed more light?
>If only the whole world had enough insight to just stop before they tried to do something awesome.
MongoDB is fairly decent for what it does (it used to be much worse pre 2.0). But awesome? Really?
While MongoDB was certainly built upon the shoulders of the giants that went before, so has been every product ever. To call it a reinvention of the Pick Database is flattering but highly inaccurate.
Perhaps more accurate is learn from the past, improve the future.
My point is that the Pick OS was stable, fast, and very good at maintaining data integrity. It ultimately lost favor (to relational database systems) because typical Pick applications had all of the database logic in application code, and that logic ended up all over the place. Simple things like enforcing referential integrity, transactions, and so on. RDBMSs do that in the database, where it belongs.
MongoDB and NoSQL are not new, they are backwards steps into the pre-relational world. They seem new to anyone who doesn't know how database management was done pre-MySQL. That isn't to say that everything has to be in a relational database, or that there's never a use case for a non-relational data store like key/value or flat file, but at least recognize that those are not new ideas -- they are the prehistory of database management.
Relational Databases are great, until you have downtime.
PostgreSQL has the slowest database restore procedure the world has ever seen. So backups are basically useless for anything bigger than "trivial". Master/Slave replication is death for HA.
MySQL has Master/Master, and is actually reasonably supported "in the cloud", so that's probably the only realistic option if you don't want to run your own hardware, need HA, and don't want to pay Oracle or MS.
AFAIK Oracle and MS both have decent HA solutions, but you're probably looking at $100K starting prices, and again, you're probably going to have to run your own hardware.
My #1 database of any type would be PostgreSQL. If it were HA. And if you could run a decent sized deployment (hundreds of GB) without having to run your own hardware.
I think this more than anything is probably why NoSQL is so popular with this (HN) crowd. Because in-memory "free" databases are a good fit for people who don't have $100K to throw at a vendor in licensing, and who don't want to lease their own hardware.
RDBMSs are way behind the curve on that. Even for people like me who genuinely like them. Unless you're willing to go with MySQL I guess. But even then, do you trust RDS to be dependable and fast? Without a dedicated DBA?
I can go to Cloudant and get a database that's very fast, has relatively modest costs, I don't have to run my own hardware, and BigCouch (derivative) has a far better HA story than any RDBMS I'm aware of outside of maybe VoltDB (which I don't have any practical experience with).
I make some compromises on ACID, which I'm learning to be OK with. I lose transactions (which really hurts some things, but rolling your own 2PC on a case-by-case basis works passably when you really need the semantics). And I have a mostly-pleasant, sometimes very fragmented set of tools to help me work with my data. It's a living.
But I'm not responsible for the hardware. It's much cheaper. And it does HA. When PostgreSQL steps up and gets that right, I'll switch back.
Not to mention, there's no excuse for the amount of tuning a decent PostgreSQL install takes. Speaking as a guy who's built several 80K-TPS systems, the barrier to entry is needlessly high. Try MS-SQL sometime and contrast. Then again, PostgreSQL is free.
Plus the free RDBMSs need enforced clustered indexes. This crontab stuff is for chumps. 99% of the time an enforced cluster would be preferable to increased insert performance for 99% of web-apps.
Good points, but I don't agree that slow database restores are "more than anything is probably why NoSQL is so popular with this (HN) crowd." The database has to be pretty big and your application has to be rather important before database restore time becomes a deciding factor. Companies with huge databases and mission-critical applications should have the resources to make their database highly available.
Every time this relational vs. NoSQL topic comes up it's clear that few, if any, of the developers experimenting with MongoDB et al. have mission-critical apps with huge databases. They didn't look at Oracle or even PostgreSQL and come to a reasoned decision to risk everything on MongoDB because it can restore from backup faster. They don't have the patience or inclination to master a relational database, they want something that appears to work right away, and they don't know the consequences down the road of abandoning ACID -- if they even know what ACID is or why they want it. And then there's the coolness factor -- no points in the cube farm for advocating some stuffy old technology.
I know there are exceptions, and I frequently use non-relational data management solutions too (though not MongoDB -- no beta testing with real data for me, thanks). But there's a difference between choosing an alternative tool because you've done the research and decided it's a reasonable solution, and choosing an alternative just to be alternative, or worse because you don't know what it's an alternative to.
I probably made that too confusing. I only mentioned the restores because the availability story for PostgreSQL is very poor and backups above the low tens-of-gigs are fairly pointless if you have SLAs to meet.
The reason why NoSQL would be popular with startups (IMO) is because it's easy to outsource the administration, they're relatively inexpensive, and they're typically much faster when faced with limited resources (both hardware and skill).
I like to think I know my way around your average RDBMS, but there's still no question that tuning your average database is non-trivial unless you're very familiar with the schema, data, production hardware and performance requirements.
I'd echo your sentiments on MongoDB, but those are my own personal prejudices, probably just from reading about so many horror stories and what sound to me like questionable implementation choices.
Four nines on a monthly SLA means you can suffer less than 5 minutes of outage without penalty (0.01% of a 30-day month's 43,200 minutes is about 4.3 minutes). I just wouldn't bet on PostgreSQL to deliver that.
I'm not entirely convinced that a third party can either, but I think BigCouch itself fundamentally can get you there in a way that PostgreSQL can't. PostgreSQL is extremely reliable, but if you try to automate failover you could really shoot yourself in the foot and be looking at a nightmare of an outage recovery. If you go with manual failover (my choice) you just have to accept that if/when you do need to fail over, you'll be able to do it consistently within minutes, but it's not going to be in less than 5 minutes.
IMO HA is just one of those things that really has to be baked in at a design level.
> PostgreSQL has the slowest database restore procedure the world has ever seen. So backups are basically useless for anything bigger than "trivial". Master/Slave replication is death for HA.
What's wrong with hot standby, log shipping and heartbeat?
> Plus the free RDBMSs need enforced clustered indexes. This crontab stuff is for chumps. 99% of the time an enforced cluster would be preferable to increased insert performance for 99% of web-apps.
> What's wrong with hot standby, log shipping and heartbeat?
It doesn't work outside the lab.
I know that sounds audacious, but there have been plenty of Heroku-based services with outages because of just this. Automated failover to standby systems is very risky and IMO just never a good idea when it comes to your database servers. You also need a bullet-proof STONITH solution to avoid the nightmare of split-brain on your database servers. I mean, who wants to have to manually consolidate customer data after a failure?
> I don't know how easy this would be with MVCC.
I didn't consider that. Good point. Still want it. ;-)
No experience with it. It's the only real cloud option I've seen while I was investigating. It promises everything you'd want though.
As long as you can get someone else to manage it, and it's on top of some fast SSDs, it looks like a good choice to me as long as it delivers on its promises and meets your performance goals.
Basically the guys from MongoDB are rebuilding databases as they existed in the 1980s with the motto "we will do better than relational databases!" or "Sybase, Oracle, prepare to die!".
It's not clear which problem MongoDB is trying to solve or if it is an improvement over existing technology (this is my personal opinion).
What comes into play for me is this: if you are working with data that in a traditional RDBMS would be split across several tables, where the majority of that data really isn't queried on, or even used other than as a whole, MongoDB works out better.
Example: where I work we have classified ads for cars, with a structure for a listing like { listinginfo1: ..., listinginfoN: ..., vehicleinfo: {...}, sellerinfo: {...} }
The above example stores sellerinfo as a duplicate, but the listing/vehicle info would be broken up into several tables in an ACID/SQL database, and it is mostly read. Querying against a single dataset, returning singular results shaped how you need them in the application, has huge performance advantages. (Note: the old database, still used for edits/updates and then exported to Mongo for search/display, requires over 20 joins and +N calls to get the shape of data for a search results display. I created a view with all the base data for easier export, but it's too slow to actually search against; there was a "search" table, updated every half hour, that was being used instead. The single record for each listing in MongoDB fits better, and searches as fast as the old version on a smaller server.)
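To make that concrete, here's a rough sketch of the kind of listing document and query I mean (the field names and values are invented for illustration):

    // A hypothetical listing stored as one document, roughly the shape above
    db.listings.insert({
        listingId: 12345,
        price: 7995,
        postedAt: new Date(),
        vehicleinfo: { make: "Honda", model: "Civic", year: 2004, mileage: 98000 },
        sellerinfo: { name: "Example Motors", phone: "555-0100" }  // duplicated per listing
    });

    // One query returns the listing already shaped for display:
    // no 20-way join, no +N follow-up calls
    db.listings.find({ "vehicleinfo.make": "Honda", price: { $lte: 10000 } });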
Another example is when you have to pull data from many different sources with similar shapes, for example tracking health issues and chemical exposure data: the core who/what/where data would be similar, but the additional details may vary. Allowing for this dynamic shaping of data is difficult in most RDBMS systems. Though Postgres does have inherited tables, you would still either want your structure broken out or end up with something too flat for a proper structure in an SQL db.
That doesn't even get into its support for sharding and replication, which are far easier to configure and use than what you get with other DBs, let alone with their free versions without commercial addons.
Is it perfect, absolutely not.. is it the best solution for everything, no. But the above are just a couple of examples where a non-relational database can be a better fit than SQL.
It depends on your data, and how you are using it... Worst case, it should perform similar to SQL or slightly worse. If you can reliably shard your data, and don't need result sets larger than 16MB (not sure if they raised this in 2.4), you can do pretty well. Again, it depends on your data, structure and use.
In some cases a traditional RDBMS is a better call. In others, for mongo, having an additional collection with your indexed subset may be the best approach. It really does depend. I don't consider it best for all cases, but do consider MongoDB a really good fit for a lot of web based cases.
I think it's pretty clear to anyone who's used it. It allows for rapid, low overhead changes to your data model in the very early stages of building something new. Of course, the further along you get those changes are no longer low overhead, but at the start of the new project it's very easy to get up and running.
I guess I just don't find schema that big a deal even in the early goings. A relational DB schema isn't necessarily hard to change and adapt on the fly. I don't feel like this is the best thing for Mongo to hang its hat on, and I'm not sure they really are. I think its users tend to point at this as a big feature more than they should.
I feel like the other reasons to use Mongo should show up more in these discussions than this "It's easy to make schema changes". Perhaps these are easier scaling/sharding, I don't know, I don't use Mongo. I just see the schema part of this as a side point (albeit an important one from an application design perspective). There are legitimate cases where supremely simple schema flexibility is desirable, but I suspect a lot of people who think they need this really don't (there are easy ways to do this with relational DBs), and are in fact making some things more complicated as a result.
While it IS very easy to change your schema with Mongo, let's not forget that you're going to need to enforce certain schema-related rules in your codebase instead of the DB now.
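For example, the sort of rule a relational schema would enforce for free ends up looking something like this in your code (a sketch only; the collection and field names are hypothetical):

    // Validation the DB no longer does for you has to live in the app layer
    function saveUser(doc) {
        if (typeof doc.email !== "string" || !/@/.test(doc.email)) {
            throw new Error("email is required and must look like an address");
        }
        if (doc.age !== undefined && typeof doc.age !== "number") {
            throw new Error("age must be a number");
        }
        db.users.insert(doc);
    }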
This. I don't understand why this isn't mentioned more often. Developer convenience and productivity are important things, but the future integrity of your application shouldn't be compromised for either.
There are legitimate usage cases for a non-relational document store like MongoDB, but I suspect there are a non-trivial number of people using it that would be better served by a relational DB. Or even something like Postgres that has both relational and non-relational capabilities.
It may be easy, but it's not easier than nothing, and you don't have to do ANYTHING in Mongo. In any reasonably sized web app with a well designed ERD, a change is going to result in you stopping coding and doing some ALTER TABLE, DROP INDEX, etc. commands before you can start back with the coding.
Also, when you want to create a report you can just write a script that aggregates the data and fills another collection on the fly, without creating the table first. You can even try out 6 different reports at the same time, without issuing separate create table statements.
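Something like this hypothetical shell snippet is all it takes; the report collection simply appears on the first upsert (collection and field names are made up):

    // Roll orders up into a per-day report collection on the fly
    db.orders.find().forEach(function (o) {
        var day = o.createdAt.toISOString().slice(0, 10);   // e.g. "2013-03-19"
        db.report_daily_orders.update(
            { _id: day },
            { $inc: { count: 1, total: o.amount } },
            { upsert: true }    // no CREATE TABLE equivalent needed
        );
    });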
It's easy to change a relational schema if you understand relational databases. If you don't then Mongo et al. look like good ideas because they solve an actual developer problem: ignorance.
The Wikipedia article on Pick[1] has some good background info. This stuff is pretty "old school" but is still in use in places.
Funny thing... I did an A.S. in Computer Programming back in the early 90's and the community college I attended had a Prime mainframe in the IT department, and ran Pick databases. I did some "work study" stuff with the IT manager there, but never really had a chance to get deeply into the Pick stuff. I was always curious about it though, so seeing this story (and this comment) brings back memories...
My understanding of Pick databases was that they were typeless. Mongo isn't. There's a difference between being typeless and schemaless, and Mongo is a good mix for my usecases.
You don't understand Pick databases very well then. They supported types and could enforce them. Some Pick-style databases were weaker in this regard than others but there's nothing inherently typeless about the Pick database.
I wouldn't call Pick databases schema-less, either. Pick databases called their schema a dictionary. In a Pick database the dictionary is less strictly enforced than in a SQL database but it's not schema-less.
And before anyone goes off, yes, I know that Pick systems usually stored everything as a string. That was done for reasons that made sense in the 1970s, and because the Pick OS was implemented on several different hardware architectures.
That isn't the same as type-less though it looks that way at first glance. Although a number might be stored as a string of digits on disk the dictionary allowed for strict validation on the way in (Input CONVersions) and formatting on the way out (Output CONVersions). In my extensive experience with Pick OS I found that most programmers preferred doing that stuff in application code, the features of the dictionary were not much used, and as a result Pick code was harder to work with and maintain than necessary. I find the same thing when working with SQL databases -- instead of putting validation and enforcement in the database schema the rules are enforced in application code, usually in more than one place.
Reminds me of Bill Gates' recent IAMA where he talked about how they were close but didn't make it to market with a similar product; he seemed to be saying "MSFT was there first...!" I guess not, if this has been around since 1980 :)
http://www.reddit.com/r/IAmA/comments/18bhme/im_bill_gates_c...
Pick-style databases were a dominant technology from around 1973 until relational databases (led by Oracle) marginalized them, more or less. Pick databases are still widely used and I run into legacy applications all the time, especially in publishing. There's a decent Wikipedia article on the Pick operating system if you're interested.
I was hoping for collection-level locking to be a part of the 2.4 release. I didn't see any mention of it in the release notes. Last I heard they were going to implement collection-level locking and then begin work on document-level locking. I'm still hoping document-level locking isn't too far off.
Collection level locking isn't in 2.4. Working on more granular locking for 2.6/2.8. May skip collection level entirely for something more granular. More details can be found in the collection level locking ticket https://jira.mongodb.org/browse/SERVER-1240
From an atomicity perspective, yes. From a performance perspective, no; other concurrent operations affecting anything else on the whole database wait on your write.
They are not fast, they are "normal speed" now. They were slow before. That was one of the problems which made me question the competency of the Mongo team.
The MongoDB staff have said they're mostly adding text search to satisfy a request that they've heard over and over, but they've also said that anyone really wanting to do text searching at scale should use something like Sphinx, Solr or ElasticSearch.
tl;dr: use it if a simple reverse index fits your needs; use Lucene for the grown-up stuff
> MongoDB text search is still in its infancy and we encourage you to try it out on your datasets. Many applications use both MongoDB and Solr/Lucene, but realize that there is still a feature gap. For some applications, the basic text search that we are introducing may be sufficient. As you get to know text search, you can determine when MongoDB has crossed the threshold for what you need.
Keep in mind that it is both new and experimental.
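If you want to try it, here's a minimal sketch of the 2.4 workflow as I understand it from the release notes (the collection and fields are invented; note the feature has to be switched on explicitly):

    // Text search is off by default in 2.4; enable it at runtime
    // (or start mongod with --setParameter textSearchEnabled=true)
    db.adminCommand({ setParameter: 1, textSearchEnabled: true });

    // Build a text index over a couple of string fields
    db.articles.ensureIndex({ title: "text", body: "text" });

    // In 2.4 you query through the "text" command rather than find()
    db.articles.runCommand("text", { search: "mongodb release", limit: 10 });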
If you are comparing them based on functionality and performance you are missing the point.
It is my opinion that this easily bests external implementations by reducing moving parts and removing the asynchronous nature of the updates to search indexes.
The text index is updated atomically and in real time. Best of all: it's not another part of your stack that needs to be set up and maintained.
I was hoping the security enhancements would include SSL certificate validation. Anyone know why they don't do that, or how a user should approach that limitation?
We'll be moving away from MongoDB because it doesn't support certificate validation. What is the point of SSL connections if you don't validate the certificate? It seems that you get all the drawbacks of encryption (overhead, throughput) with none of the benefits (security).
I can't actually find any details about how this works in practice. Are multiple maps and reduces in a map-reduce executed simultaneously within a single mongod process? If so, how many? Is it based on the number of cores in the server?
Edit: not asking you specifically, just generally curious.
Each map reduce job can still only use one thread.
Before, under SpiderMonkey, only one job could be executed per mongod instance. Now you can have many jobs executing in parallel on a single server instance, but each one of them still only uses a single core.
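In other words, several shells can each run a job like the sketch below at the same time against one mongod, but any single job still maps and reduces on one core (collection and field names here are purely illustrative):

    // One map-reduce job: counts events by type into a new collection
    db.events.mapReduce(
        function () { emit(this.type, 1); },                    // map
        function (key, values) { return Array.sum(values); },   // reduce
        { out: "event_counts" }
    );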
Do you know if this is likely to be permanent? In an application I work on, I do map-reduces on Mongo data using a third-party framework at the moment specifically to work around the inability to easily parallelize map-reduce jobs on a single machine.
Yeah, this is a big deal. It opens up very interesting opportunities to create a map-reduce engine that isn't awful and doesn't require Java.
People use Hadoop because they have to, not because it is great. Isn't it time for a better alternative now that the toothpaste is out of the tube re: map/reduce?
Java doesn't even have hash literal syntax! Why would you ever want to query document oriented data with it as your language of expression?
MongoDB's MapReduce implementation is not being deprecated. The primary beneficiary of the V8 implementation is MapReduce, so it should be seen as further investment in this area.