
These are not new approaches to data modelling.

Document databases, network databases and hierarchical databases (IMS, CODASYL etc) predate relational databases by decades.

Relational is the universal default for a simple reason. When first introduced it proved to be far better, in every conceivable way, than the technologies it replaced.

It's as simple as that. Relational is a slam-dunk, no-brainer for 99.99% of use cases.

Still, if you really want a fast, proven system for one of the older models, you can get IBM to host stuff for you on a z/OS or z/TPF instance, running IMS. It'll have more predictable performance than AWS to boot.



I agree entirely - I think when people rebel against "relational databases" they're actually just realizing that the normalization fetish can be harmful in many application cases.

You're better off with MySQL or PostgreSQL managing a key-value table where the value is a blob of JSON (or XML, which I've done in the past), then defining a custom index, which is pretty damn easy in PostgreSQL. Then you have hundreds of genius-years of effort keeping everything stable, and you still get NoSQL's benefits. Everybody wins.
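A minimal sketch of that pattern in PostgreSQL (table, field, and index names here are invented; jsonb is the modern choice, but plain json or even text works the same way):

    create table docs (
        key text primary key,
        doc jsonb not null
    );

    -- the "custom index": an expression index over one field inside the blob
    create index docs_email_idx on docs ((doc ->> 'email'));

    -- lookups on that field can then use the index instead of a full scan
    select key, doc from docs where doc ->> 'email' = 'alice@example.com';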


Normalization is a tricky thing. On one hand, highly normalized databases have better flexibility in reporting, IMHO. On the other, you lose some expressiveness regarding data constraints. High degrees of normalization would be ideal if cross-relation constraints were possible. As they are not, typically one has to normalize in part based on constraint dependencies just as much as data dependencies.


First, the more I have looked, the more I have found that non-relational database systems are remarkably common and have been for a long time.

The relational model is ideal in many circumstances. However, it breaks down with semi-structured content, with content where hierarchical structure is important, data is seldom written and frequently read, and read performance while navigating the hierarchy matters most, and so forth.

So I'd generally agree, but not every problem is in fact a nail.


> However, it breaks down with semi-structured content, with content where hierarchical structure is important, data is seldom written and frequently read, and read performance while navigating the hierarchy matters most, and so forth.

Again, this problem is not new. Database greybeards call this OLAP and it's been around since the 80s.

There is nothing new under the sun in this trade.


No. I am talking about something like LDAP, not OLAP. LDAP may suck badly in many many ways but it is almost exactly not like OLAP.

OLAP is typically used to refer to environments which provide complex reports quickly across huge datasets, so materialized views, summary tables, CUBEs, and the like may be used. Hierarchical directories are different: in a relational model you have to traverse the hierarchy to get the single record you want, and you are not aggregating the way you typically do in an OnLine Analytical Processing environment.

This is why OpenLDAP with a PostgreSQL backend sucks, while OpenLDAP with a non-relational backend (say BDB) does ok.

I am not saying anything new is under the sun, just that some of the old structures haven't gone away.


I was referring to the read/write preponderance. Normalisation optimises write performance, storage space and also provides strong confidence of integrity. But it means lots of joins, which can slow things down on the read side.

That's why OLAP came along: structured denormalisation, usually into star schemata, that provides fast ad-hoc querying. I think part of the enthusiasm for NoSQL arises because most university courses and introductory database books go into normalisation in great detail, while OLAP might only get name-checked. So folk can get an incomplete impression of what relational systems can do.
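To give a flavour of that structured denormalisation, a hedged sketch of a star schema (all table and column names invented): one wide fact table, small dimension tables, and ad-hoc rollups with a handful of joins:

    create table dim_date    (date_id int primary key, day date, month int, year int);
    create table dim_product (product_id int primary key, name text, category text);

    create table fact_sales (
        date_id    int references dim_date(date_id),
        product_id int references dim_product(product_id),
        quantity   int,
        amount     numeric
    );

    -- typical ad-hoc rollup: few joins, heavy aggregation
    select d.year, p.category, sum(f.amount)
      from fact_sales f
      join dim_date d    on d.date_id = f.date_id
      join dim_product p on p.product_id = f.product_id
     group by d.year, p.category;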

If I had a purely K/V data problem -- a cache, for example -- I would turn to a pure K/V toolset. Memcache, for example.

Hierarchical datasets have long been a blind spot for relational systems: representable, but usually requiring fiddly schemes. But in the last decade SQL has gained recursive queries, so it's not as big a problem as it used to be.


Normalization is formally defined based on data value dependencies. However, because there is no way to set constraints across joins, in practice, the dependencies of data constraints are as important as the dependencies of data values.

As far as recursive queries, I am not 100% sure this is ideal either from a read performance perspective. There are times when recursive queries are helpful from a performance perspective, but I don't see a good way to index, for example, path to a node. Certainly most databases don't do this well enough to be ideal for hierarchical directories, and I am not even sure you could do it reliably in PostgreSQL, because the function involved would not be immutable.
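To illustrate the immutability point (the nodes table and path_to() here are hypothetical stand-ins): a function that computes a node's path has to read other rows, so it can honestly be marked STABLE at best, and PostgreSQL only accepts IMMUTABLE functions in index expressions:

    -- placeholder body; a real path_to() would walk parent pointers to the root
    create table nodes (id int primary key, parent int references nodes(id));
    create function path_to(leaf int) returns int[]
        language sql stable
        as $$ select array[leaf] $$;

    -- rejected at index creation time:
    --   create index nodes_path_idx on nodes (path_to(id));
    --   ERROR:  functions in index expression must be marked IMMUTABLE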


Your replies so far are excellent. You're pointing out things I've overlooked, thanks.

> However, because there is no way to set constraints across joins, in practice, the dependencies of data constraints are as important as the dependencies of data values.

I don't follow your argument here. Could you restate it?

> As far as recursive queries, I am not 100% sure this is ideal either from a read performance perspective. There are times when recursive queries are helpful from a performance perspective, but I don't see a good way to index, for example, path to a node.

Poking around the Oracle documentation and Ask Tom articles, it seems to be more art than science; mostly based on creating compound indices over the relevant fields. Oracle is smart enough to use an index if it's there for a recursive field, but will struggle unless there's a compound index for other fields. I don't see an obvious way to create what you might call 'recursive indices', short of having an MV.

> Certainly most databases don't do this well enough to be ideal for hierarchical directories.

It'll never perform as well as a specialised system, but then relational never does in these head-to-head comparisons. An RDBMS won't outperform a K/V store on K/V problems, won't outperform a file system for blob handling, and so on. This is just another example of the No Free Lunch theorem in action.

My contention is that we, as a profession of people who Like Cool Things, tend to discount the value of ACID early and then painfully rediscover its value later on. The business value of ACID is not revealable in a benchmark, so nobody writes breathless blog posts where DrongoDB is 10,000x more atomic than MetaspasmCache.


> I don't follow your argument here. Could you restate it?

Sure.

Quick note, will use PostgreSQL SQL for this post.

Ok, take a simple example regarding US street addresses.

A street address contains the following important portions:

1) Street address designation (may or may not start with a digit). We will call this 'address' for relational purposes.

2) City

3) State

4) Zipcode

As for data value dependencies:

(city, state) is functionally dependent on zipcode, and so for normalization purposes we might create two relations, assuming this is all the data we ever intend to store (which of course is always a bad assumption):

    create table zipcode (
        zipcode varchar(10) not null primary key,
        city    text not null,
        state   text not null,
        id      serial not null unique
    );

    create table street (
        id         serial not null,
        address    text,
        zipcode_id int references zipcode(id),
        primary key (address, zipcode_id)
    );

So far this works fine. However, suppose I need to place an additional constraint on (address) for some subset of (zipcodes), let's say all those in New York City. I can't do it declaratively, because all data constraints must be internal to a relation.

So at that point I have two options:

1) You can write a function which determines whether a zipcode_id matches the constraint and check on that (sketched just below this list), or

2) You can denormalize your schema and add the constraint declaratively.
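A rough sketch of option 1 against the tables above (the helper zip_in_nyc() and the specific rule, that NYC addresses must start with a digit, are invented for illustration):

    create function zip_in_nyc(zid int) returns boolean
        language sql stable
        as $$
            select exists (select 1 from zipcode
                            where id = zid and city = 'New York' and state = 'NY');
        $$;

    -- Accepted by PostgreSQL, but note the usual caveat: the check is only
    -- re-evaluated on writes to street, never on later changes to zipcode.
    alter table street add constraint nyc_address_rule
        check (not zip_in_nyc(zipcode_id) or address ~ '^[0-9]');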

I did some searching and, strangely, found that although subqueries in check constraints are part of SQL92, the only "database" that seems to support them is MS Access. There are obvious performance issues, but I don't see why they couldn't be addressed with indexes, the same way foreign key checks typically are.

> Poking around the Oracle documentation and Ask Tom articles, it seems to be more art than science; mostly based on creating compound indices over the relevant fields. Oracle is smart enough to use an index if it's there for a recursive field, but will struggle unless there's a compound index for other fields. I don't see an obvious way to create what you might call 'recursive indices', short of having an MV.

No, there is an inherent problem here. Your index depends on other data in the database to be accurate. You can create an index over parent, etc. but you still end up having to check the hierarchy all the way down to find the path. You can't just index the path.

Consider this:

CREATE TABLE treetest (id int primary key, parent int references treetest(id));

INSERT INTO treetest (id, parent) values (1, null), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 6);

The path to 7 is: 1,2,6,7. To find this, I have to hit 4 records in a recursive query. That means 4 scans.
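The recursive query implied here looks roughly like this (CTE and column names are mine); it touches all four rows on the way up from 7 to the root:

    with recursive ancestry as (
        select id, parent, array[id] as path
          from treetest where id = 7
        union all
        select t.id, t.parent, t.id || a.path
          from treetest t join ancestry a on t.id = a.parent
    )
    select path from ancestry where parent is null;   -- {1,2,6,7}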

So suppose we index this value, reducing this to one scan.

Then suppose we: update treetest set parent = 3 where id = 6;

and now our index doesn't match the actual path anymore.

With specialized hierarchical databases, you could keep such paths indexed and make sure they are updated when any node in the path changes. There isn't a good way to do this in relational systems though, because it is outside the concept of a relational index.

> My contention is that we, as a profession of people who Like Cool Things, tend to discount the value of ACID early and then painfully rediscover its value later on. The business value of ACID is not revealable in a benchmark, so nobody writes breathless blog posts where DrongoDB is 10,000x more atomic than MetaspasmCache.

No doubt about that. I think we are 100% in agreement there!

I'd also add that while RDBMSs aren't really optimal as backings for something like LDAP for a big directory, and while RDBMSs are horribly abused by devs who don't understand them (ORMs and the like), they really are amazing, valuable tools, which are rarely valued enough or used to their fullest.

Later this week, I expect to write a bit of a blog post on http://ledgersmbdev.blogspot.com on why the intelligent database model (for RDBMS's) is usually the right model for the development of many business applications.


Regarding PostgreSQL's custom index types: taking a quick look at the API, I don't see a way of telling GiST indexes which entries need to be updated when a row's parent id is changed.

Consequently I don't believe there is a reasonable way to index this: there is no way to ensure the indexes stay current, so you have no good way of testing that a row lies on a path in the tree other than rebuilding the path with recursive subqueries.

The thing is, unless you have a system which is aware of hierarchical relationships between the rows (which by definition is outside the relational model), you have no way of handling this gracefully. So for read-heavy hierarchical data like this, I really think dedicated hierarchical systems will win.

Of course this wouldn't necessarily mean you couldn't store everything in the RDBMS and periodically export it to the hierarchical store.....


Informative replies, thank you.

> 1) You can write a function which determines whether a zipcode_id matches the constraint and check on that, or

Ah, this old chestnut. Been there, written the PL/SQL trigger, got the t-shirt. Agreed that there isn't a purely declarative approach here.

> You can't just index the path.

Postgres might be the winner here, if someone sufficiently motivated came along and wrote a custom index type for this use case.


> When first introduced it proved to be far better, in every conceivable way, than the technologies it replaced.

That's not exactly true; what they did was offer a generic query and constraint model that worked well in all cases while offering reasonable performance. They were not generally faster in optimal cases, but they were much easier to query, especially given new requirements after the fact, because the queries weren't baked into the data model itself. That generic query ability and general data model always come at the cost of speed; always. Document databases have always been faster in the optimal use case.


You're absolutely right -- RDBMSes were designed to solve problems with the nosql-type approaches that preceded them. The nosql bandwagon is blindly rolling into the past, where it will crash into the old problems of concurrency and consistency under load.

BTW if you want nosql-style schema flexibility within an RDBMS, then a simple solution is to store XML or JSON in a character blob. Keep the fields you need to search over in separate indexed fields. If you make incompatible version changes, then add a new json/xml field.
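A rough sketch of that layout (all names here are invented): the raw blob in one column, the searchable fields pulled out into ordinary indexed columns, and a second blob column added later if the format changes incompatibly:

    create table events (
        id          bigserial primary key,
        customer_id int,             -- extracted so it can be searched
        created_at  timestamptz,     -- extracted so it can be searched
        doc         text,            -- raw JSON or XML blob, original format
        doc_v2      text             -- added later for an incompatible format change
    );

    create index events_customer_idx on events (customer_id);
    create index events_created_idx  on events (created_at);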


Another solution is to use the hstore feature in postgres to store key-value data.
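For example, a minimal hstore sketch (table, index, and key names are mine; hstore itself is a stock contrib extension):

    create extension if not exists hstore;

    create table settings (
        id    serial primary key,
        attrs hstore
    );

    -- GIN index supports key-existence and containment lookups
    create index settings_attrs_idx on settings using gin (attrs);

    insert into settings (attrs) values ('theme => dark, lang => en');

    select * from settings where attrs -> 'theme' = 'dark';   -- value lookup
    select * from settings where attrs ? 'lang';              -- key exists?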


> BTW if you want nosql-style schema flexibility within an RDBMS, then a simple solution is to store XML or JSON in a character blob.

In all sincerity, I would strongly recommend against this. If your problem absolutely defies normalisation, don't use a relational database.


Very true, but it's a resurgence of modeling alternatives, which can only help to enrich our ability to write interesting applications. Yes, you can model a social network in an RDBMS, but it's not as efficient or as flexible as using neo4j. And yes, you can model a key-value document in an RDBMS, but again it's not a good fit. The right tool for the right problem: you don't build a house with only a hammer, so why should we build applications on only one storage concept?




