No shit, nmongo. Anyone with half a brain can go look at the MongoDB codebase an...

christkv · on Nov 6, 2011

I think the discussion here also misses an important aspect of the conversation which is about application data modeling. Mongo will sooner or later reach a "stable" level as it matures just as mysql, postgres and all other datastores have done. I picked mongo due to the good fit it had to the problems I needed solved not only from the server perspective but from the modeling perspective. The ease of ad-hoc queries and the schemaless nature of the db lent itself well to the kind of problems I wanted to solve.

So even if in 30 years it's got the same characteristics as our current dominant data storage models I consider it a net win that I will be able to use a document oriented database for development over a more traditional RDBMS for some off my applications.

The richer our toolset is the better we are off as not every problems is a nail to be hammered in with an RDBMS.

So a high five to all the people who dare go against convention and take a chance on a new approach to data modeling being it Mongo, Riak, CouchDb, Redis, Neo4j, Cassandra, HBase or any other awesome opensource project out there.

jacques_chester · on Nov 6, 2011

These are not new approaches to data modelling.

Document databases, network databases and hierarchical databases (IMS, CODASYL etc) predate relational databases by decades.

Relational is the universal default for a simple reason. When first introduced it proved to be far better, in every conceivable way, than the technologies it replaced.

It's as simple as that. Relational is a slam-dunk, no-brainer for 99.99% of use cases.

Still, if you really want a fast, proven system for one of the older models, you can get IBM to host stuff for you on a z/OS or z/TPF instance, running IMS. It'll have more predictable performance than AWS to boot.

Vivtek · on Nov 6, 2011

I agree entirely - I think when people rebel against "relational databases" they're actually just realizing that the normalization fetish can be harmful in many application cases.

You're better off with MySQL or PostgreSQL managing a key-value table where the value is a blob of JSON (or XML, which I've done in the past), then defining a custom index, which is pretty damn easy in PostgreSQL. Then you have hundreds of genius-years of effort keeping everything stable, and you still get NoSQL's benefits. Everybody wins.

einhverfr · on Nov 7, 2011

Normalization is a tricky thing. On one hand, highly normalized databases have better flexibility in reporting, IMHO. On the other, you lose some expressiveness regarding data constraints. High degrees of normalization would be ideal if cross-relation constraints were possible. As they are not, typically one has to normalize in part based on constraint dependencies just as much as data dependencies.

einhverfr · on Nov 6, 2011

First, the more I have looked, the more I have found that non-relational database systems are remarkably common and have been for a long time.

The relational model is ideal in many circumstances. However, it breaks down in semi-structured content, content where---parentheses for grouping---(hierarchical structure is important, data is seldom written and frequently read, and where read performance navigating the hierarchy is most important) and so forth.

So I'd generally agree, but not every problem is in fact a nail.

jacques_chester · on Nov 6, 2011

> However, it breaks down in semi-structured content, content where---parentheses for grouping---(hierarchical structure is important, data is seldom written and frequently read, and where read performance navigating the hierarchy is most important) and so forth.

Again, this problem is not new. Database greybeards call this OLAP and it's been around since the 80s.

There is nothing new under the sun in this trade.

einhverfr · on Nov 6, 2011

No. I am talking about something like LDAP, not OLAP. LDAP may suck badly in many many ways but it is almost exactly not like OLAP.

OLAP is typically used to refer to environments which provide complex reports quickly across huge datasets, so a lot of materialized views, summary tables, and the like may be used (as well as CUBEs and the like). Hierarchical directories are different. In a relational model you have to transfers the hierarchy to get the single record you want and you are not aggregating like you typically do in an OnLine Analytical Processing environment.

This is why OpenLDAP with a PostgreSQL backend sucks, while OpenLDAP with a non-relational backend (say BDB) does ok.

I am not saying anything new is under the sun, just that some of the old structures haven't gone away.

jacques_chester · on Nov 7, 2011

I was referring to the read/write preponderance. Normalisation optimises write performance, storage space and also provides strong confidence of integrity. But it means lots of joins, which can slow things down on the read side.

That's why OLAP came along. Structured denormalisation, usually into star schemata, that provide fast ad-hoc querying. I think part of the enthusiasm for NoSQL arises because most university courses and introductory database books will go into normalisation in great detail, but OLAP might only get name checked. So folk can get an incomplete impression of what relational systems can do.

If I had a purely K/V data problem -- a cache, for example -- I would turn to a pure K/V toolset. Memcache, for example.

Hierarchical datasets have long been the blindside for relational systems. Representable, but usually requiring fiddly schemes. But in the last decade SQL has gotten recursive queries, so it's not as big a problem as it used to be.

einhverfr · on Nov 7, 2011

Normalization is formally defined based on data value dependencies. However, because there is no way to set constraints across joins, in practice, the dependencies of data constraints are as important as the dependencies of data values.

As far as recursive queries, I am not 100% sure this is ideal either from a read performance perspective. There are times when recursive queries are helpful from a performance perspective, but I don't see a good way to index, for example, path to a node. Certainly most databases don't do this well enough to be ideal for hierarchical directories. For example indexing the path to a node might be problematic, and I am not even sure you could do this reliably in PostgreSQL because the function involved is not immutable.

jacques_chester · on Nov 7, 2011

Your replies so far are excellent. You're pointing out things I've overlooked, thanks.

> However, because there is no way to set constraints across joins, in practice, the dependencies of data constraints are as important as the dependencies of data values.

I don't follow your argument here. Could you restate it?

> As far as recursive queries, I am not 100% sure this is ideal either from a read performance perspective. There are times when recursive queries are helpful from a performance perspective, but I don't see a good way to index, for example, path to a node.

Poking around the Oracle documentation and Ask Tom articles, it seems to be more art than science; mostly based on creating compound indices over the relevant fields. Oracle is smart enough to use an index if it's there for a recursive field, but will struggle unless there's a compound index for other fields. I don't see an obvious way to create what you might call 'recursive indices', short of having an MV.

> Certainly most databases don't do this well enough to be ideal for hierarchical directories.

It'll never perform as well as a specialised system. But relational never will. An RDBMS won't outperform a K/V store on K/V problems, won't outperform a file system for blob handling and so on. This is just another example of the No Free Lunch theorem in action.

My contention is that we, as a profession of people who Like Cool Things, tend to discount the value of ACID early and then painfully rediscover its value later on. The business value of ACID is not revealable in a benchmark, so nobody writes breathless blog posts where DrongoDB is 10,000x more atomic than MetaspasmCache.

einhverfr · on Nov 7, 2011

> I don't follow your argument here. Could you restate it?

Sure.

Quick note, will use PostgreSQL SQL for this post.

Ok, take a simple example regarding US street addresses.

A street address contains the following important portions:

1) Street address designation (may or may not start with a digit). We will call this 'address' for relational purposes. 2) City 3) State 4) Zipcode

As for data value dependencies:

zipcode is functionally dependent on (city, state), and so for normalization purposes we might create two relations, assuming this is all the data we ever intend to store (which of course is always a bad assumption):

create table zipcode(zipcode varchar(10) not null primary key, city text not null, state text not null, id serial not null unique);

create table street( id serial not null, address text, zipcode_id int references zipcode(id), primary key(address, zipcode_id));

So far this works fine. However, suppose I need to place an additional constraint on (address) for some subset of (zipcodes), let's say all those in New York City. I can't do it declaratively, because all data constraints must be internal to a relation.

So at that point I have two options:

1) You can write a function which determines whether a zipcode_id matches the constraint and check on that, or

2) You can denormalize your schema and add the constraint declaratively.

I did some searching and determined strangely that although subqueries in check constraints are part of SQL92, the only "database" that seems to support them is MS Access. But while there are obvious issues regarding performance, I don't see why these couldn't be solved using indexes the same way foreign keys are typically addressed.

> Poking around the Oracle documentation and Ask Tom articles, it seems to be more art than science; mostly based on creating compound indices over the relevant fields. Oracle is smart enough to use an index if it's there for a recursive field, but will struggle unless there's a compound index for other fields. I don't see an obvious way to create what you might call 'recursive indices', short of having an MV.

No, there is an inherent problem here. Your index depends on other data in the database to be accurate. You can create an index over parent, etc. but you still end up having to check the hierarchy all the way down to find the path. You can't just index the path.

Consider this:

CREATE TABLE treetest (id int, parent int references treetest(id));

INSERT INTO treetest (id, parent) values (1, null), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 6);

The path to 7 is: 1,2,6,7. To find this, I have to hit 4 records in a recursive query. That means 4 scans.

So suppose we index this value, reducing this to one scan.

Then suppose we: update treetest set parent = 3 where id = 6;

and now our index doesn't match the actual path anymore.

With specialized hierachical databases, you could keep such paths indexed and make sure they are updated when any node in the path changes. There isn't a good way to do this in relational systems though because it is outside the concept of a relational index.

> My contention is that we, as a profession of people who Like Cool Things, tend to discount the value of ACID early and then painfully rediscover its value later on. The business value of ACID is not revealable in a benchmark, so nobody writes breathless blog posts where DrongoDB is 10,000x more atomic than MetaspasmCache.

No doubt about that. I think we are 100% in agreement there!

I'd also add that while RDBMS's aren't really optimal as backings for something like LDAP for a big directory, and while RDBMS's are horribly abused by dev's who don't understand them (ORM's and the like), they really are amazing, valuable tools, which are rarely valued enough or used to their fullest.

Later this week, I expect to write a bit of a blog post on http://ledgersmbdev.blogspot.com on why the intelligent database model (for RDBMS's) is usually the right model for the development of many business applications.

einhverfr · on Nov 7, 2011

In response to PostgreSQL's custom index types, taking a quick look at the API, I don't see a way of telling GiST indexes which entries need to be updated when a row's parent id is changed.

Consequently I don't believe there is a reasonable way to index this because there is no way to ensure the indexes are current and so you don't have a good way of testing that a row is in a path on the tree other than building the tree with recursive subqueries.

The thing is, unless you have a system which is aware of hierarchical relationships between the rows (which by definition is outside the relational model), you have no way of handling this gracefully. So here you have lots of reads, I really think dedicated hierarchical systems will win for hierarchical data.

Of course this wouldn't necessarily mean you couldn't store everything in the RDBMS and periodically export it to the hierarchical store.....

jacques_chester · on Nov 7, 2011

Informative replies, thank you.

> 1) You can write a function which determines whether a zipcode_id matches the constraint and check on that, or

Ah, this old chestnut. Been there, written the PL/SQL trigger, got the t-shirt. Agreed that there isn't a purely declarative approach here.

> You can't just index the path.

Postgres might be the winner here, if someone sufficiently motivated came along and wrote a custom index type for this use case.

gnaritas · on Nov 6, 2011

> When first introduced it proved to be far better, in every conceivable way, than the technologies it replaced.

That's not exactly true; what they did was offer a generic query and constraint model that worked well in all cases while offering reasonable performance. They were not generally faster in optimal cases, but they were much easier to query especially given new requirements after the fact because the queries weren't baked into the data model itself. That generic query ability and general data model always come at the cost of speed; always. Document databases have always been faster in the optimal use case.

SemanticFog · on Nov 6, 2011

You're absolutely right -- RDBMSes were designed to solve problems with the nosql-type approaches that preceded them. The nosql bandwagon is blindly rolling into the past, where it will crash into the old problems of concurrency and consistency under load.

BTW if you want nosql-style schema flexibility within an RDBMS, then a simple solution is to store XML or JSON in in a character blob. Keep the fields you need to search over in separate indexed fields. If you make incompatible version changes, then add a new json/xml field.

desas · on Nov 6, 2011

Another solution is to use the hstore feature in postgres to store key value data.

jacques_chester · on Nov 7, 2011

> BTW if you want nosql-style schema flexibility within an RDBMS, then a simple solution is to store XML or JSON in in a character blob.

In all sincerity, I would strongly recommend against this. If your problem absolutely defies normalisation, don't use a relational database.

christkv · on Nov 6, 2011

very true but it's a resurgence of modeling alternatives which can only help to enrich our ability to write interesting applications. yes you can model a social network in a RDBMS but it's not as efficient or as flexible as using neo4j. or yes you can model a key value document in a RDBMS but again it's not a good fit. The right tool for the right problem. You don't build a house with only a hammer so why should we build applications only on one storage concept ?

drdaeman · on Nov 6, 2011

I've opened the source code (at Github), but didn't really understood it. The code seems readable, though.

Do you care to provide some examples for those not familar with proper C++/Boost development practices, please?

itaborai83 · on Nov 6, 2011

I'm curious and I might be missing more than half of my brain. Would you be willing to show some examples of bad coding on their source tree?

krig · on Nov 6, 2011

I looked at using BSON in a project a while back, and ended up scrapping it mainly due to perceived poor code quality. Plenty of potential errors ignored, unclear error messages, unsafe practices.

I was also turned off by the sloppy use of memory. Heap allocated objects returned from functions with poor checks to see if anyone manages that memory on the other side. Lots of instances of strcmp, strcpy and similar unsafe string/buffer manipulation functions.

It's been a while since I looked at it so I don't have any particular examples at hand, but that was my impression.

mushishi · on Nov 6, 2011

I haven't ever used MongoDB but got interested, and first non-trivial source file I picked is this: https://github.com/mongodb/mongo/blob/master/db/btree.cpp

Take a look at for example: bool BtreeBucket<V>::find

Without even thinking about what it is doing, it's quite clear that it is not readable code, and it's not immediately obvious what the high level structure of the logic is. The function does not even fit into two screens so it's hard to reason about; your short-time memory is overused.

felix_krull · on Nov 6, 2011

this is the implementation of a b+ tree. the underlying logic has been very well researched since the 70s.

if there is a part of mongodb that I am sure does not contain bugs, it is that very file you link to.

if you want to know what it does, go out and read the relevant papers on data base technology. or graduate in CS.

jemfinch · on Nov 6, 2011

Clearly you didn't actually read the source file. I graduated in CS. I know B+ trees.

I also know that an 85-line, 7-argument method in a 1988-line file shouldn't depend on a global variable ("guessIncreasing") modified from several other, unrelated functions. I know that in bt_insert, which (apparently) assigns to "guessIncreasing" and then resets it to false just prior to exit, should be using an RAII class to do so instead of trying to catch every exit path, especially in a codebase that uses exceptions.

This code is amateur hour.

mushishi · on Nov 6, 2011

Thanks for attacking me personally. But I have no interest to pursue it more. I made claims that clearly hold true, and they have nothing to do with what you said (I did not say anything about bugs, for example)

pnathan · on Nov 6, 2011

That is characteristic of mathematical code, like btree. (ranty aside: being able to recognize this and find out information regarding btree for maintenance is(should be) one of the key reasons to get a CS degree)

I found the btree file relatively readable. Some macro stuff is not familiar to me, but I am sure I could figure it out in a few hours if I felt like. And I haven't yet rolled around to implementing a full-on btree, ever.