
The graph constructed by using bloom filter-style hash functions supports a decoding process called "peeling" where you:

1. Find a batch with 1 missing element

2. Delete that element from its other assigned partitions

3. Repeat, as the modified batches may now be recoverable

This iterative process (surprisingly!) succeeds with very high probability as long as the number of partitions is at least 1.22x the number of missing elements, with k=3 hash functions.
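
A rough sketch of the peeling loop, written against IBLT-style cells (standing in for the partitions/batches above); the field names, hash functions, and cell layout are illustrative, not taken from any particular implementation:

  import hashlib

  K = 3  # number of hash functions

  def h(x, i, num_cells):
      # i-th index hash: which of the cells element x is assigned to (illustrative)
      return int.from_bytes(hashlib.blake2b(f"{i}:{x}".encode()).digest()[:8], "big") % num_cells

  def checksum(x):
      # per-element checksum, xor-ed into each assigned cell (illustrative)
      return int.from_bytes(hashlib.blake2b(f"chk:{x}".encode()).digest()[:4], "big")

  def peel(cells):
      """cells: list of dicts {'count', 'key_xor', 'check_xor'} built IBLT-style."""
      num_cells = len(cells)
      recovered = []
      progress = True
      while progress:
          progress = False
          for c in cells:
              # a "pure" cell holds exactly one element; the checksum confirms it
              if abs(c["count"]) == 1 and c["check_xor"] == checksum(c["key_xor"]):
                  x, sign = c["key_xor"], c["count"]
                  recovered.append((x, sign))
                  # delete x from every cell it was assigned to (including this one)
                  for i in range(K):
                      cell = cells[h(x, i, num_cells)]
                      cell["count"] -= sign
                      cell["key_xor"] ^= x
                      cell["check_xor"] ^= checksum(x)
                  progress = True
      # any cell left non-empty means decoding failed; retry with more cells
      ok = all(c["count"] == 0 and c["key_xor"] == 0 for c in cells)
      return recovered, ok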


You don't lose absolute guarantees, but the probabilistic nature means the process may fail (in a guaranteed detectable way), in which case you can try again with a larger parameter.

The "bloom filter" name is misleading in regard to this.


> (in a guaranteed detectable way)

To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode. You can make the probability of this as low as you like by using a larger checksum, but the initial HN example was IIRC a list of 32-bit integers, so using a (say) 128-bit checksum to make false decodes 'cryptographically unlikely' would come at a rather high cost, since that size is added to each bucket.

If your set members are multi-kilobyte contact database entries or something, then the overhead required to make false decodes practically impossible would be insignificant.

This limitation also applies somewhat to the alternative algebraic approach in my comments (an overfull sketch could be falsely decoded), except that the added size needed to make a false decode cryptographically unlikely is very small, and it shrinks relative to the size of the sketch as the sketch grows instead of being linear in the size of the sketch.

I haven't looked at your implementation, but it can be useful to have at least a 1-bit counter, or just make the LSB of your checksum always 1. Doing so prevents falsely decoding an overfull bucket with an even number of members in it, and since the distribution of members to buckets is binomial, 2 is an extremely common count for overfull buckets. You can use a counter bigger than 1 bit (and combine it with addition in its ring rather than xor), but the tradeoff vs just having more checksum bits is less obvious.
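
As a tiny illustration of the always-set-LSB variant (hash choice and names made up): the xor of an even number of such checksums always has LSB 0, so it can never verify, and a 2-member overfull bucket can never falsely decode.

  import hashlib

  def checksum(x, bits=32):
      c = int.from_bytes(hashlib.blake2b(str(x).encode()).digest()[:bits // 8], "big")
      return c | 1  # the forced LSB acts as a 1-bit member counter mod 2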

It's probably an interesting open question whether there exist checksums such that the xor of 2..N valid codewords is unlikely to be a valid codeword... the "always emit 1" function has perfect performance for even values, but are there schemes that still contribute distance even in cases where N isn't completely precluded?


> To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode

The false decodes can be detected. During peeling, deleting a false decode inserts a new element with the opposite sign of count. Later, you decode this second false element and end up with the same element in both the A / B and B / A result sets (as long as decode completes without encountering a cycle).

So, after decode, check for any elements present in both A / B and B / A result sets and remove them.
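
In code, that post-decode cleanup might look like this (set names hypothetical):

  def remove_false_decodes(a_minus_b, b_minus_a):
      # an element recovered on both sides is the signature of a false decode
      # that was decoded again later with the opposite sign, so drop it from both
      dupes = set(a_minus_b) & set(b_minus_a)
      return set(a_minus_b) - dupes, set(b_minus_a) - dupes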

--

Beyond that, you can also use the cell position as additional checksum bits in the decode process without increasing the data structure's bit size. i.e., if we attempt to decode an element x from a cell at position m, then one of the index hash functions h_i(x) should return m.
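
A sketch of that position check (K and the index hash are illustrative placeholders):

  import hashlib

  K = 3

  def h(x, i, num_cells):
      return int.from_bytes(hashlib.blake2b(f"{i}:{x}".encode()).digest()[:8], "big") % num_cells

  def position_ok(x, m, num_cells):
      # only accept decoding x out of cell position m if some h_i maps x there
      return any(h(x, i, num_cells) == m for i in range(K))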

There's even a paper about a variant of IBFs that has no checksum field at all: https://arxiv.org/abs/2211.03683. It uses the cell position among other techniques.


A neat trick to make the accumulator both collision-resistant and self-diagnosing.

  acc = 0
  for x in normalized_link_ids:      # normalized_link_ids: iterable of integer ids
      y = (x << k) | h(x)            # append a k-bit hash to the id
      acc ^= y
If acc is zero, all links are reciprocal (same guarantee as before).

If acc is non-zero, split it back into (x', h'):

* Re-compute h(x').

* If it equals h', exactly one link is unpaired and x' tells you which one (or an astronomically unlikely collision). Otherwise there are >= 2 problems.

This has collision-resistance like the parent comment and adds the ability to pinpoint a single offending link without a second pass or a hash table.
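
A sketch of the diagnosis step under the same assumptions (a k-bit hash appended to each id; the hash and bit width here are placeholders, not a specific implementation):

  k = 16                      # width of the appended hash (placeholder)
  MASK = (1 << k) - 1

  def h(x):
      # any k-bit hash of the id is fine for the sketch
      return ((x * 0x9E3779B1) >> 16) & MASK

  def diagnose(acc):
      if acc == 0:
          return "all links reciprocal"
      x_prime, h_prime = acc >> k, acc & MASK
      if h(x_prime) == h_prime:
          # or an astronomically unlikely collision
          return f"exactly one unpaired link: {x_prime}"
      return "two or more unpaired links"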


The recently released sqlite3_rsync utility uses a version of the rsync algorithm optimized for the internal structure of a SQLite database. It compares the internal data pages efficiently, then syncs only the changed or missing pages.

Nice tricks in the article, but you can more easily use the builtin utility now :)

I blogged about how it works in detail here: https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...


Also note:

sqlite3_rsync is now built into the rsync.net platform.

  ssh user@rsync.net sqlite3_rsync … blah blah …
… just added last week and not rolled out in all regions but … all initial users reported it worked exactly as they expected it to.


sqlite3_rsync can only be used in WAL mode. A further constraint of WAL mode is that the database file must be stored on a local disk. Clearly, you'd want WAL mode almost all the time, but for the times this is not possible this utility won't work.


I just checked in an experimental change to sqlite3_rsync that allows it to work on non-WAL-mode database files, as long as you do not use the --wal-only command-line option. The downside of this is that the origin database will block all writers while the sync is going on, and the replica database will block both readers and writers during the sync, because to do otherwise requires WAL-mode. Nevertheless, being able to sync DELETE-mode databases might well be useful, as you observe.

If you are able, please try out this enhancement and let me know if it solves your problem. See <https://sqlite.org/src/info/2025-05-01T16:07Z> for the patch.


Update: This enhancement is now on trunk and will be included in the 3.50.0 release of SQLite due out in about four weeks.


WAL mode works on many network filesystems provided it's being written from a single host at a time.


I'm not sure I understand your comment. Regardless of WAL or network filesystem usage, the sqlite file cannot be written to from multiple processes simultaneously. Am I missing something here, or did you misstate?


You can have multiple writer connections from multiple writer processes (on one host at a time). Their transactions will be safely serialized by SQLite.

So, sure, from a low-level point of view, the write transactions are not really simultaneous (they never are with SQLite3), but from the user perspective they are in the sense that you don't need to serialize them yourself.


It demands an increased page size if you sync frequently, to keep the bandwidth overhead down.


I'm building a new tool for end-to-end data validation and reconciliation in ELT pipelines, especially for teams replicating data from relational databases (Postgres, MySQL, SQL Server, Oracle) to data warehouses or data lakes.

Most existing solutions only validate at the destination (dbt tests, Great Expectations), rely on aggregate comparisons (row counts, checksums), or generate too much noise (alert fatigue from observability tools). My tool:

* Validates every row and column directly between source and destination

* Handles live source changes without false positives

* Eliminates noise by distinguishing in-flight changes from real discrepancies

* Detects even the smallest data mismatches without relying on thresholds

* Performs efficiently with an IO-bound, bandwidth-efficient algorithm

If you're dealing with data integrity issues in ELT workflows, I'd love to hear about your challenges!


This sounds interesting. Is this meant to run in pipelines or be used interactively?

Are you building something open source? Link to the repo?


It’s meant to run in or alongside the pipeline continuously.

Not planning to open source, working on a commercial offering but haven’t launched anything publicly yet.

Would love to hear any more thoughts on the concepts here; my email is in my bio.


+100000


That’s one of the cases where query-based CDC may outperform log-based (as long as you don’t care to see every intermediate change that happened to a row between syncs).


> Much of the FHE industry is focused right now on the subset of applications where legal hurdles make the two available options “slow, but compliant” and “fast, but incurs prison time.”

Anyone have any examples of these applications?


+1 for Metabase

For our team, using an ELT architecture (as opposed to ETL) [1] for managing our data warehouse has greatly reduced the complexity of our data processes. Instead of creating ETLs for every table we want to load into the data warehouse, we create the minimum necessary setup to copy the table into our data warehouse. Then, we write transforms, which are simply SQL statements, to generate wide-column tables that our non-technical users can use to explore data without worrying about joins or having to learn esoteric naming conventions.

Custom EL Scripts -> Redshift -> Transform Statements -> Redshift -> Metabase supports the data needs of all our departments with no dedicated data team members.

[1] https://www.dataliftoff.com/elt-with-amazon-redshift-an-over...


I had the same problem. In addition to regular exercise, try doing these stretches once or twice a day: https://www.youtube.com/watch?v=FdNS95hpL-o (they are fun too! Perhaps get a group together at your office to stretch together once a day)

These exercises work to move your body back toward perfect posture, undoing the damage caused by sitting, typing, etc.

I stopped using these in November due to travel. My back and shoulder pain returned. It took about 7 days of consistent stretching for the pain to go away again.

