
I think if anything, all of it could've been put into a single indexed column since the query was AND ... AND ... not OR.

So you could've had an indexed "fingerprint" column, like

ip1_blahblah_evil@spammer.somewhere_victim1@our.domain

And indexed that column, with only a single WHERE clause in the query.
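Roughly like this (just a sketch; the table and column names are made up, not from the post):

  -- hypothetical table, only to illustrate the idea
  CREATE TABLE seen (
    fingerprint VARCHAR(255) PRIMARY KEY  -- ip + helo + from + to glued together
  );

  -- one indexed equality lookup instead of four ANDed conditions
  SELECT 1 FROM seen
  WHERE fingerprint = 'ip1_blahblah_evil@spammer.somewhere_victim1@our.domain';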

I don't understand at all how the multiple-tables thing would help compared to indices, and the whole post seemed kind of crazy to me for that reason. In fact, if I had to guess, multiple tables would've performed worse.

That is if I'm understanding the problem correctly at all.



This has been the advice given to me by Postgres experts in a similar scenario:

  "If you want to efficiently fuzzy-search through a combination of firstname + lastname + (etc), it's faster to make a generated column which concatenates them and index the generated column and do a text search on that."
(It doesn't have to be fuzzy searching; this applies to searching in general, since there's a single column to scan per row rather than several.)
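If I understood the advice right, it's something like this in Postgres 12+ (table and column names are placeholders):

  ALTER TABLE people
    ADD COLUMN full_name text
    GENERATED ALWAYS AS (firstname || ' ' || lastname) STORED;

  -- one index on the generated column instead of one per source column
  CREATE INDEX people_full_name_idx ON people (full_name text_pattern_ops);

  SELECT * FROM people WHERE full_name LIKE 'Alice Sm%';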

But also, yeah, I think just a compound UNIQUE constraint on the original columns would have worked.
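i.e. something like this (column names guessed, since the post doesn't show the DDL):

  ALTER TABLE tuples
    ADD CONSTRAINT tuples_quad_uniq UNIQUE (ip, helo, mail_from, rcpt_to);

The index backing that constraint would also serve the four-way AND lookup from the post.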

I'm pretty sure that the degree of normalization given at the end also goes well beyond 3rd Normal Form.

I think that is 5th Normal Form/6th Normal Form or so, almost as extreme as you can get:

https://en.wikipedia.org/wiki/Database_normalization#Satisfy...

The way I remember 3rd Normal Form is "Every table can stand alone as its own coherent entity/has no cross-cutting concerns".

So if you have a "product" record, then your table might have "product.name", "product.price", "product.description", etc.

The end schema shown could be described as a "Star Schema" too, I believe (see image on right):

https://en.wikipedia.org/wiki/Star_schema#Example
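For the schema in the post, the "star" would look roughly like this (names invented; the actual DDL in the post may differ):

  -- dimension tables, one per attribute
  CREATE TABLE ips     (ip_id   SERIAL PRIMARY KEY, ip        VARCHAR(45) UNIQUE);
  CREATE TABLE helos   (helo_id SERIAL PRIMARY KEY, helo      TEXT UNIQUE);
  CREATE TABLE senders (from_id SERIAL PRIMARY KEY, mail_from TEXT UNIQUE);
  CREATE TABLE rcpts   (to_id   SERIAL PRIMARY KEY, rcpt_to   TEXT UNIQUE);

  -- the "fact" table: one row per unique (ip, helo, from, to) combination
  CREATE TABLE quads (
    quad_id SERIAL PRIMARY KEY,
    ip_id   INT NOT NULL REFERENCES ips,
    helo_id INT NOT NULL REFERENCES helos,
    from_id INT NOT NULL REFERENCES senders,
    to_id   INT NOT NULL REFERENCES rcpts,
    UNIQUE (ip_id, helo_id, from_id, to_id)
  );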

This Rachel person is also much smarter than I am, and you can make just about anything work, so we're all bikeshedding anyways!


> I'm pretty sure that the degree of normalization given at the end also goes well beyond 3rd Normal Form.

> I think that is 5th Normal Form/6th Normal Form or so, almost as extreme as you can get

The original schema is also in 5NF, unless there are constraints we aren't privy to.

5NF means you can't decompose a table into smaller tables without loss of information, unless each smaller table has a unique key in common with the original table (in formal terms, no non-trivial join dependencies except for those implied by the candidate key(s)).

The original table appears to meet this constraint: you could break it down into, e.g., four tables, one mapping the quad id to the IP address, one mapping it to the HELO string, etc. However, each of these would share a unique constraint with the original table, the quad id; hence the original table is in 5NF.
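Concretely, that decomposition would be something like (sketch):

  -- each piece keeps the candidate key (quad_id) of the original table,
  -- so joining them back together is lossless
  CREATE TABLE quad_ip   (quad_id INT PRIMARY KEY, ip        VARCHAR(45));
  CREATE TABLE quad_helo (quad_id INT PRIMARY KEY, helo      TEXT);
  CREATE TABLE quad_from (quad_id INT PRIMARY KEY, mail_from TEXT);
  CREATE TABLE quad_to   (quad_id INT PRIMARY KEY, rcpt_to   TEXT);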

As for 6NF, I don't think the revised schema meets that: 6NF means you can't losslessly decompose the table at all. In the revised schema, the four tables mapping IP id to IP address etc. are in 6NF, but the table mapping quad id to a unique combination of IP id, HELO id, etc. is not: it could be decomposed into four tables similarly to how the original table could be.

(Interestingly, if the original table dropped the quad ID column and just relied on a composite primary key of IP address, HELO string, FROM address and TO address, it would be in 6NF.)


Would it not prevent some optimizations based on statistics regarding the data distribution, like using the most selective attribute to narrow down the rows that need to be scanned? I'm assuming there are 2 indexes, 1 for each column that gets combined.

Let's say you know the lastname (Smith) but only the first letter of the firstname (A). In the proposed scenario, only the first letter of the firstname helps you narrow down the search to records starting with that letter (WHERE combined LIKE "A%SMITH"); you will have to check all the rows where the firstname starts with "A" even if the lastname is not Smith. If there are two separate indexed columns, the WHERE clause will look like this:

WHERE firstname like "A%" AND lastname = "Smith"

so the search can be restricted to a smaller number of rows.
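With the two-index variant, that's something like (made-up names):

  -- two separate single-column indexes; the planner can start
  -- from the more selective lastname index
  CREATE INDEX people_lastname_idx  ON people (lastname);
  CREATE INDEX people_firstname_idx ON people (firstname);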

Of course, having two indexes will have its costs, like increased storage and slower writes.

Overall the blog post conveys a useful message, but the table example is a bit confusing: it doesn't look like there are any functional dependencies in the original table, so pointing to normalization as the solution sounds a bit odd.

Given that the query from the post always matches concrete values (as opposed to ranges), it sounds like the right way to index the table is to use a hash index (which might not have been available in MySQL back in 2002).
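In Postgres terms that would be something like the following (a sketch; as far as I know MySQL's InnoDB only does B-tree indexes, with hash indexes limited to the MEMORY engine):

  -- hash indexes only support equality lookups, which is all this query needs;
  -- in Postgres they are single-column, so one per looked-up column
  CREATE INDEX t_ip_hash   ON t USING HASH (ip);
  CREATE INDEX t_from_hash ON t USING HASH (mail_from);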


I won't pretend to be an expert in this realm, but see here:

https://www.postgresql.org/docs/current/textsearch-tables.ht...

Specifically, the part starting with the paragraph below; the explanation continues to the bottom of the page:

  "Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column. This example is a concatenation of title and body, using coalesce to ensure that one field will still be indexed when the other is NULL:"
I have been told by PG wizards that this same generated, concatenated single-column approach (with an index on each individual column, plus one on the concatenated column) is the most effective way to do things like ILIKE search as well.

But I couldn't explain to you why


Oh, so by fuzzy search you mean full text search, not just a simple 'LIKE A%' pattern - it's an entirely different kind of flying altogether (https://www.youtube.com/watch?v=3qNtyfZP8bE).

I don't know what kind of internal representation is used to store the tsvector column type in PG, but I think GIN index lookups should be very simple to parallelize and combine (I believe a GIN index is basically a map [word -> set_of_row_ids_that_contain_the_word], so you could perform lookups on all indexed columns at the same time and then compute the intersection of the results?). But maybe two lookups in the same index could somehow be more efficient than two lookups in different indexes; I don't know.

I'm still sceptical about the LIKE/ILIKE scenario though. "WHERE firstname LIKE 'A%' and lastname LIKE 'Smith' " can easily discard all the Joneses and whatnot before even looking at the firstname column, whereas "WHERE combined_firstname_and_lastname LIKE 'A%SMITH' " will only be able to reject "Andrew Jones" after reading the entire string.


Also, the "better" solution assumes IPv4 addresses and is not robust to a sudden requirements change to support IPv6. Best to keep IP address as a string unless you really, REALLY need to do something numerical with one or more of the octets.


Based on my experience I disagree. If there is a native data type that represents your data, you should really use it.

It will make sure that incorrect data is detected at the time of storage rather than at some indeterminate time in the future and it will likely be more efficient than arbitrary strings.

And in the case of IP addresses, when you are using Postgres, it comes with an inet type that covers both IPv4 and IPv6, so you will be safe from requirement changes.
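For example, the bad row here is rejected at INSERT time:

  CREATE TABLE senders (addr inet NOT NULL);

  INSERT INTO senders VALUES ('192.0.2.1');    -- fine
  INSERT INTO senders VALUES ('2001:db8::1');  -- fine, same column
  INSERT INTO senders VALUES ('999.1.2.300');  -- error: invalid input syntax for type inet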


Keeping IP addresses as strings isn't trivial. You can write them in many ways, even with IPv4; IPv6 just brings new variations. And when talking about a migration to IPv6, you have to decide whether to keep IPv4 addresses as such or prefix them with ::ffff:.

In the end you have to normalize anyway. Some databases have specific types; if not, you have to pick a scheme and then, ideally, verify it.


That way you can get false positives, unless you concatenate using a character that isn't valid in the protocol:

  foo@example.com + other@example.net
  foo@example.co + mother@example.net

Both pairs concatenate to the same string, foo@example.comother@example.net.

Btw, storing just the domains, or inverting the email strings, would have sped up the comparison.


Right, and when you do that, you don't even need an RDBMS. A key-value store would suffice. This essentially just becomes a set! Redis or memcached are battle-tested workhorses that would work even better than a relational DB here.

But this was also back in the early '00s, when "data store" meant "relational DB", and anything that wasn't an RDBMS was probably either a research project or a toy that most people wouldn't be comfortable using in production.


> this was also back in the early '00s, when "data store" meant "relational DB", and anything that wasn't an RDBMS was probably either a research project or a toy that most people wouldn't be comfortable using in production.

Indeed; her problem looks to have been pre-memcached.


Yeah, I'm not clear on how multiple tables fixed it other than allowing you to scan through the main table faster.

Multiple tables could be a big win if long email addresses are causing you a data size problem, but for this use case, I think a hash of the email would suffice.
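Something like this, if you were on Postgres with the pgcrypto extension (just a sketch, names made up):

  CREATE EXTENSION IF NOT EXISTS pgcrypto;

  -- store a fixed-size digest instead of the full address
  CREATE TABLE seen_rcpt (
    email_hash bytea PRIMARY KEY
  );

  INSERT INTO seen_rcpt VALUES (digest('victim1@our.domain', 'sha256'))
  ON CONFLICT DO NOTHING;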


Well she did say she had indexes on the new table. It could have been fixed with an index on the previous table, but a new table with indexes also fixed it.



