
One of the most useful lessons I ever learned about designing database schemas was the utter uselessness of indexing and searching on datetime fields. Since virtually every value is going to be different, indexing and searching on that field is (almost) no better than having no index on the datetime field at all.

It was a revelation to me when I decided to experiment with having an indexed date-ONLY field and using that to search instead. It improved performance by almost two orders of magnitude. In hindsight, this should have been totally obvious if I had stopped to think about how indexing works. But, as is stated in this article, we have to keep relearning the same lessons. Maybe I should write an article about how useless it is to index a datetime field...

EDIT: This was something I ran into about 10 years ago. It is possible there was something else going on at the time that I didn't know about that caused the issue I was seeing. This is an anecdote from my past self. I have not had to use this technique since then, and we were using an on-prem server. It's possible that the rest of the table was not designed well and the index I was trying to use was already inefficient for the resources that we had at the time.



This is a very strange, and incorrect, conclusion you have come to. It doesn't matter that datetime fields are unique, as most of the time you are not searching for a particular datetime but for a range, i.e. "show me all rows created between date X and Y". In that case, the ordering of the index makes that query efficient.

Furthermore, your query will often have an "ORDER BY dateTimeCol" clause on it (or, more commonly, ORDER BY dateTimeCol DESC). If your index is created correctly, the database can return rows with a quick index scan instead of needing a sort step in the execution plan.
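The point above can be sketched with SQLite's EXPLAIN QUERY PLAN. The table and index names here are made up for illustration; the idea is that a single B-tree index on the datetime column serves both the range filter and the descending ORDER BY, so no separate sort appears in the plan.

```python
import sqlite3

# Hypothetical events table with an index on its datetime column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_created ON events (created_at)")
conn.executemany(
    "INSERT INTO events (created_at) VALUES (?)",
    [(f"2024-01-{d:02d} 12:00:00",) for d in range(1, 29)],
)

# Range filter plus ORDER BY ... DESC on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM events "
    "WHERE created_at BETWEEN '2024-01-10' AND '2024-01-20' "
    "ORDER BY created_at DESC"
).fetchall()
detail = " ".join(row[3] for row in plan)
print(detail)  # the planner searches the index; no temp sort step shows up
```

Running this shows the index being used for the search, and no "USE TEMP B-TREE FOR ORDER BY" step, because the index can simply be scanned backwards.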


This entirely depends on what kind of index is used. A sorted index, such as a B-tree or B+ tree (used in many SQL databases), allows fast point and range lookups in a continuous value space. A typical inverted index or hash-based index only allows point lookups of specific values in a discrete value space.
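A minimal illustration of that distinction, not tied to any particular database: a hash structure answers only "is this exact value present?", while a sorted structure (the idea underlying a B-tree) answers range questions via binary search.

```python
import bisect

# Sorted timestamps stand in for a B-tree's ordered keys.
timestamps = sorted(["2024-01-03", "2024-01-07", "2024-01-12", "2024-01-20"])

# A hash-based structure supports only point lookups:
hash_index = set(timestamps)
assert "2024-01-07" in hash_index

# Sorted order supports range lookups: everything in [lo, hi).
lo, hi = "2024-01-05", "2024-01-15"
start = bisect.bisect_left(timestamps, lo)
end = bisect.bisect_left(timestamps, hi)
print(timestamps[start:end])  # ['2024-01-07', '2024-01-12']
```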


I was just going to make this comment.

Postgres as far as I know uses B-tree by default.

You can also choose a descending sort order for the index, I think, so "most recent" queries become more efficient.

Multi-column indexes also work: if you are just searching on the first column, Postgres can still use a multi-column index.


It was a sorted BTREE index in MySQL 5.x. I agree that it's supposed to be fast, but it just wasn't for some reason.


Are you sure it was actually using the index you expected? There are subtleties in index field order that can prevent the query optimizer from using an index that you might think it should be using.

One common misstep is having a table with columns like (id, user, date), with an index on (user, id) and on (date, id), then issuing a query like "... WHERE user = 1 and date > ...". There is no optimal index for that query, so the optimizer will have to guess which one is better, or try to intersect them. In this example, it might use only the (user, id) index, and scan the date for all records with that user. A better index for this query would be (user, date, id).
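The composite-index point can be demonstrated with SQLite (the table, column, and index names are hypothetical): with an index on (user, date), an equality on user plus a range on date is satisfied by a single index search.

```python
import sqlite3

# Hypothetical table matching the (id, user, date) example above,
# with the recommended index shape: equality column first, range column second.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, user INTEGER, date TEXT)")
conn.execute("CREATE INDEX idx_user_date ON t (user, date)")
conn.executemany(
    "INSERT INTO t (user, date) VALUES (?, ?)",
    [(u, f"2024-01-{d:02d}") for u in range(5) for d in range(1, 11)],
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM t WHERE user = 1 AND date > '2024-01-05'"
).fetchall()
detail = plan[0][3]
print(detail)  # a SEARCH using idx_user_date, not a full-table SCAN
```

With the index order reversed, or with two single-column indexes as in the misstep described above, the planner would have to fall back to scanning and filtering.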


One of the more bizarre things we'd found in MySQL 5.something was that accidentally creating two identical indexes significantly slowed down queries that used them.

I wouldn't be surprised if you hit some sort of similar strange bug.


> Since virtually every value is going to be different, indexing and searching on that field is (almost) no better than having no index on the datetime field at all.

Do you mean exact equality is rarely what you want because most of the values are different?

Or are you talking about the negative effect on index size of having so many distinct values?

I think the latter point could be quite database-dependent, e.g. B-tree de-duplication support was only added in Postgres 13. However, you could shave off quite a bit just from the fact that storing a date requires less space in the index than a datetime.


B-Tree indexes perform just fine on datetime fields. Were you using hash or inverted indexes maybe?


You’re maybe thinking of compression? A primary id is unique too, but an indexed search is still O(log n) compared to a full table scan at O(n).


There are more index types than just hashes (which do indeed only support equality).

E.g. B-trees allow greater-than/less-than comparisons.

postgres supports these and many more: https://www.postgresql.org/docs/9.5/indexes-types.html


I was using a BTREE index and we were doing greater/less than queries. It was not a hash index.


Did you also learn that certain indexes allow for range queries, and modern databases are quite efficient at those?

And yes, please write an article, it would be quite interesting. With data and scripts to reproduce, of course.


Your result is surprising: I suspect your wide table with an index wasn't in cache, and your timestamp-only table was.

The whole point is that indexes (can) prevent full table scans, which are expensive. This is true even if the column(s) you're indexing have high cardinality: but it relies on performant table or index joins (which, if your rdbms is configured with sufficient memory, should be the case).


I have no idea how you came to this conclusion, but indices on datetime fields are absolutely required to bring the seek time down from O(n) to O(log n), plus the length of the range. Massive parts of the businesses I've worked at would be simply impossible without this.

The cardinality of the index has other ramifications that aren't generally important here.


If you were doing something like this, then your story makes sense:

`WHERE start_date > x AND end_date < x`

You can't index two ranges at once in a B-tree!
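A quick SQLite sketch of that limitation (names are made up): even with a composite index covering both columns, the planner can only seek on the first range; the second becomes a row-by-row filter.

```python
import sqlite3

# Hypothetical table with two date columns and a composite index on both.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, start_date TEXT, end_date TEXT)"
)
conn.execute("CREATE INDEX idx_dates ON t (start_date, end_date)")
conn.executemany(
    "INSERT INTO t (start_date, end_date) VALUES (?, ?)",
    [(f"2024-0{m}-01", f"2024-0{m}-28") for m in range(1, 7)],
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM t "
    "WHERE start_date > '2024-01-01' AND end_date < '2024-06-01'"
).fetchall()
detail = plan[0][3]
print(detail)  # only the start_date range appears in the index search
```

The plan's search constraints mention only the first range column; the end_date condition cannot participate in the B-tree seek.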



