
Feels like you could just concatenate and hash the 4 values with MD5 and store the hash and time.

Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility into what's going on.
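Rough sketch of what I mean in Python (the field names and the retry window are my assumptions, not from the post):

    import hashlib
    import time

    SEEN = {}  # md5 hex -> first-seen timestamp; the real thing would be a DB table
    RETRY_WINDOW = 15 * 60  # seconds before a retry counts (assumed)

    def tuple_key(ip, helo, sender, recipient):
        # Delimiter avoids false positives like ("ab", "c") vs ("a", "bc").
        joined = "\x00".join((ip, helo, sender, recipient))
        return hashlib.md5(joined.encode()).hexdigest()

    def seen_long_enough(ip, helo, sender, recipient):
        now = time.time()
        first_seen = SEEN.setdefault(tuple_key(ip, helo, sender, recipient), now)
        return now - first_seen >= RETRY_WINDOW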



Yup, we do this at work for similar purposes and it works a-ok. We also have some use cases that follow the original “problem” schema and they work fine with the correct indices involved.

My guess is that in 2002 there were some issues making those options unappealing to the Engineering team.

When we do this in the realm of huge traffic, we run the data through a log stream and it ends up in either a KV store, Parquet with a SQL/query layer on top of it, or hashed and rolled into a database (and all of the above if there are a lot of disparate consumers. Weee Data Lakes).

This is also the sort of thing I’d imagine Elastic would love you to use their search engine for.


Isn't that exactly what adding an index would do internally?


2002 MySQL had some limitations on indexing variable-length string (varchar) columns. That's the gist of the linked story.
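If I understand the workaround right, you'd key on a fixed-width hash column instead of the varchars. A sketch (the DDL is illustrative MySQL held in a Python string, not the story's actual schema):

    # Instead of a composite index over four varchars, index one
    # fixed-width hash column. Schema is illustrative, not from the story.
    DDL = """
    CREATE TABLE greylist (
        tuple_md5  BINARY(16) NOT NULL,  -- md5 of the delimited tuple
        first_seen DATETIME   NOT NULL,
        PRIMARY KEY (tuple_md5)          -- fixed-width key, cheap to index
    );
    """

    LOOKUP = "SELECT first_seen FROM greylist WHERE tuple_md5 = UNHEX(%s)"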


Depending on the RDBMS, in most cases it does not.


For the 2021 version, you'd just generate a Bloom filter or cuckoo filter from all the data gathered by your spamhaus database periodically. Make a separate one for each value in your tuple, and your score would be the number of sub-filters that matched.
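Something like this, maybe (Python; the tuple fields and filter sizes are assumptions):

    import hashlib

    class BloomFilter:
        """Toy Bloom filter: k bit positions derived from salted MD5."""

        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, value):
            for salt in range(self.num_hashes):
                digest = hashlib.md5(f"{salt}:{value}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, value):
            for pos in self._positions(value):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, value):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(value))

    FIELDS = ("ip", "helo", "sender", "recipient")  # assumed tuple
    filters = {f: BloomFilter() for f in FIELDS}

    def rebuild(rows):
        """Periodic rebuild from the full dataset."""
        for f in FIELDS:
            filters[f] = BloomFilter()
        for row in rows:
            for f in FIELDS:
                filters[f].add(row[f])

    def score(msg):
        """Score = how many per-field sub-filters matched."""
        return sum(msg[f] in filters[f] for f in FIELDS)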


“Periodically” here would have to be quite frequent, as you have to rebuild the filter before the sender's next retry.


A couple times a day at most?


From the post, you'd have to be rebuilding it every 15 minutes:

> A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.


Still better than locking the tables every time you receive a piece of mail.


I'm not super familiar with this stuff, but I believe you could then use a key-value store with automatic expiry, like Redis, and get pruning and faster lookups for free.
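For what it's worth, a sketch with redis-py (the key layout and TTL values are guesses):

    import hashlib
    import time

    import redis

    r = redis.Redis()

    GREYLIST_TTL = 4 * 60 * 60  # let Redis expire tuples nobody retried (assumed)
    RETRY_WINDOW = 15 * 60      # minimum wait before a retry is accepted (assumed)

    def allow(ip, helo, sender, recipient):
        """True once the tuple was first seen at least RETRY_WINDOW ago."""
        raw = "\x00".join((ip, helo, sender, recipient)).encode()
        key = "greylist:" + hashlib.md5(raw).hexdigest()
        now = int(time.time())
        # SET NX records the first sighting; EX makes Redis prune it for us.
        if r.set(key, now, nx=True, ex=GREYLIST_TTL):
            return False  # first contact: tell the sender to try again later
        first_seen = r.get(key)
        return first_seen is not None and now - int(first_seen) >= RETRY_WINDOW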


Redis and memcached didn't exist in that timeframe.



