
Is there a standalone c/c++ implementation of this string type anywhere?


It looks a lot like cedarDB's "german strings".

<https://cedardb.com/blog/german_strings/>

You could probably write a C++ implementation based on the article.

Note that it's not all that useful if you don't plan on searching for text based on their prefix. From my understanding, this is mostly a better way to store SStables in RAM/partially on disk if you mmap.


Hm, both the main article and your link are wasteful for strings of length 13-15, which are still pretty common. As a rule for SSO, if you already take up the bytes unconditionally, it's worth complicating your "size" calculation to squeeze a little more into the no-alloc case.

That said, note that there are a lot of strings around 20 bytes (stringified 64-bit integers or floats), so pushing the inline capacity to 24-1 = 23 bytes is a reasonable choice too.

I'd use 3 size classes:

* 0-15 bytes, stored in-line. If you need C-string compatibility (which is fairly likely somewhere in your application), ensure that size 15 is encoded as a zero at the last byte.

* up to 3.75GB (if my mental math is right), stored with a mangled 32-bit size. Alternatively you could use a simpler cutoff if it makes the mangling easier. Another possibility would be to have a 16-bit size class.

* add support for very large strings (likely with a completely different allocation method) too; a 4GB limit sucks and is easily exceeded. If need be this can use an extra indirection.

Honestly, with a 16-byte budget, I'd consider spending more of that on the prefix - you can get 8 bytes with enough contortion elsewhere.

Duplicating the prefix is probably worth it in more cases than you might think, since it does speed up comparisons. Just remember, you have to check the size to distinguish "a\0" from "a" too, not just from "a\0\0\0\0".


The point of a 12-char short string is that it fits, together with the size, into a 16-byte cache unit and a single memory transfer on current architectures. Anything else will involve some loss of performance.

As for short strings, they don't have a prefix at all - the 12 bytes simply follow the 4 bytes of the length.

Long strings need the pointer 64-bit aligned, so with a 4-byte length you'd have 4 bytes wasted to pointer alignment anyway, and you fill those with the preview 'prefix'. dword length + 4 prefix bytes + 64-bit address = 16 bytes, again. Both variants occupy the same space in the cache, and for long strings the data at the other end of the pointer only gets pulled into cache if you decide the prefix matches and you need to follow it.


Yeah, I tend to work with string data around the 15-to-30-ish char count, so I'm also sceptical of German strings when it comes to raw memory usage. What really interests me is that, in theory, an SSTable built from German strings that point into a larger text buffer further down could result in fewer page faults during a binary search? Maybe?


I mean, if you're doing binary searches, the first thing to do is to switch to Eytzinger searches (at least for the prefixes; it's unclear if Eytzing-sorting large strings matters) which are much more cache friendly. (incidentally, Eytzinger was Austrian, not German - same language though!)

Second, I'd probably switch to struct-of-array instead of array-of-struct, since the search part only needs the (head of the) key. Possibly some trie-like logic would also help (that is, avoid duplicating the common prefix, instead point to an array of continuations rather than a single one), but beware of costly indirections.

Third, I would make sure the user is able to specify the SSO limit, since different domains will have different requirements.

Fourth ... if you're column-oriented (or for simple data in the first place), consider implicit union of different `Set` (or whatever) implementations. For example, store all short strings in one array, and all long strings in a different array.


I did not know about Eytzing-sorting, I'll look into it after my vacation, thanks! And yeah our current system is column oriented, and I already tried most of your recommendations (including tries) but the biggest limitation we face is that different kinds of queries will be better served by different kinds of data structures but you can only store a few of them on disk before the cost to index becomes too big.


Whoops, I dropped the "er" the second time. I've also seen it labeled "cache-friendly binary sort", and sometimes the Breadth-First-Search aspect is played up (but that's unsearchable).

Logical-order (in-order DFS) iteration is obvious and trivial, though a lot of tasks are fine with physical-order (BFS) iteration. Beware that some tasks that nominally don't care about input order still perform much better on already-sorted input.

Conversion between logical index and physical index (if you need it - chances are you don't) can still be done efficiently, but requires some (fun!) thought especially for the non-power-of-2 tail.

Explicit prefetching is trivial to arrange for k-great grandchildren if needed (though the width of course grows exponentially). Using an n-ary rather than binary tree is possible but I'm not sure of the cache effects in practice. Switching to linear search just at the last level may be worth it? All of this means more complex logic though.


As an aside, what is "German" about these?


They were developed by the database group at TU Munich.



