Feels like you could just concatenate and hash the 4 values with MD5 and store the hash and time.
Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility to what's going on.
Yup, we do this at work for similar purposes and it works a-ok. We also have some use cases that follow the original “problem” schema and they work fine with the correct indices involved.
My guess is that in 2002 there were some issues making those options unappealing to the Engineering team.
When we do this in the realm of huge traffic then we run the data through a log stream and it ends up in either a KV store, Parquet with SQL/Query layer on top of it, or hashed and rolled into a database (and all of the above of there are a lot of disparate consumers. Weee Data Lakes).
This is also the sort of thing I’d imagine Elastic would love you to use their search engine for.
For the 2021 version, you'd just generate a bloom filter/cuckoo hash from all the data gathered by your spamhaus database periodically. Make a separate one for each value in your tuple and your score would be the number of the sub-hashes that matched.
From the post, you'd have to be rebuilding it every 15 minutes:
> A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.
I'm not super familiar with this stuff, but I believe you could then use a key-value store with automatic expiry like Redis for automatic pruning and faster lookups.
Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility to what's going on.