How did you handle cleaning up aborted or rolled-back transactions after binaries restarted? Or did you just let the pending records hang around forever?
When a node rejoins after a restart, have all other nodes drop any pending transactions initiated by the node in question. I implemented something similar for a distributed filesystem and it worked pretty well.
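A rough sketch of that idea, assuming each node keeps an in-memory map from transaction ID to the node that initiated it. The `Node` class and its method names are made up for illustration and are not taken from any real system:

```python
class Node:
    """Hypothetical cluster member that tracks pending transactions."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.pending = {}  # tx_id -> ID of the node that initiated it

    def begin(self, tx_id, initiator):
        # Remember who started the transaction so it can be cleaned up later.
        self.pending[tx_id] = initiator

    def on_peer_rejoin(self, peer_id):
        # A restarted peer has lost its in-flight state, so anything it
        # initiated and never finished can be dropped on this node.
        dropped = [tx for tx, who in self.pending.items() if who == peer_id]
        for tx in dropped:
            del self.pending[tx]
        return dropped
```

Every surviving node would run `on_peer_rejoin` when it sees the restarted node come back.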
I get what you're saying, but this was not an issue. If one or more of the records in a 'transaction' failed to be inserted, none of the other records inserted as part of that transaction could be moved to the settled state. My settlement script looked at creation timestamps and changed the state to 'failed' for records that had been pending for too long (only a few minutes). The script settled records created by different accounts in parallel (though sequentially by timestamp within each account), so a user could not intentionally create incomplete transactions to hold up settlement of other users' records. A user could only delay settlement of their own records.
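Roughly, one settlement pass could look like the sketch below. This is not the actual script: the `Record` fields, the `tx_size` column (how many records a complete transaction should contain), the 300-second timeout, and the assumption that a transaction belongs to a single account are all invented for illustration. Accounts are handled in a plain loop here; since each account is processed independently, the loop body is the part that could be handed to parallel workers.

```python
from collections import defaultdict
from dataclasses import dataclass

PENDING_TIMEOUT_SECONDS = 300  # assumed; "a few minutes" in practice


@dataclass
class Record:
    id: str
    account: str
    tx_id: str
    tx_size: int        # hypothetical: records expected in the transaction
    created_at: float   # seconds since some epoch
    state: str = "pending"


def settle_account(records, now):
    """Settle one account's pending records in timestamp order."""
    by_tx = defaultdict(list)
    for r in records:
        by_tx[r.tx_id].append(r)
    for r in sorted(records, key=lambda rec: rec.created_at):
        if r.state != "pending":
            continue  # already handled together with its transaction
        group = by_tx[r.tx_id]
        if len(group) == r.tx_size:
            new_state = "settled"   # every record arrived
        elif now - r.created_at > PENDING_TIMEOUT_SECONDS:
            new_state = "failed"    # incomplete for too long
        else:
            break  # incomplete but recent: wait, delaying only this account
        for g in group:
            g.state = new_state


def settle_all(records, now):
    """Run one settlement pass over every account."""
    by_account = defaultdict(list)
    for r in records:
        if r.state == "pending":
            by_account[r.account].append(r)
    for account_records in by_account.values():
        settle_account(account_records, now)  # independent per account
```

The `break` is what keeps the ordering guarantee: an incomplete transaction stalls later records of the same account until it either completes or times out, while other accounts carry on.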
Because I used UUIDs as IDs and generated them on the front end, a user could re-submit the form (click submit again after seeing the 'Unexpected connection error please try again...' message), which gave them a second chance to complete the transaction. Records that had already been inserted into the db the first time were ignored the second time (due to the ID conflict), and those that had failed the first time were inserted as pending. The settlement script could then do its job and settle them on the next interval.
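To illustrate the re-submit behaviour, here is a minimal sketch using SQLite as a stand-in for the real db. The table layout, the `submit` helper, and the `fail_after` switch (which fakes a dropped connection partway through) are assumptions for the example, not my actual schema or code:

```python
import sqlite3
import uuid


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        " id TEXT PRIMARY KEY,"
        " account TEXT NOT NULL,"
        " amount INTEGER NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def submit(conn, records, fail_after=None):
    """Insert each record as pending, skipping IDs that already exist.

    fail_after simulates a connection error after that many records.
    Returns the number of rows actually inserted.
    """
    inserted = 0
    for i, (rec_id, account, amount) in enumerate(records):
        if fail_after is not None and i >= fail_after:
            raise ConnectionError("Unexpected connection error")
        cur = conn.execute(
            "INSERT OR IGNORE INTO records (id, account, amount)"
            " VALUES (?, ?, ?)",
            (rec_id, account, amount),
        )
        conn.commit()  # each record is persisted on its own
        inserted += cur.rowcount  # 0 when the ID already existed
    return inserted


# IDs are generated once by the client, so a retry reuses them.
form = [
    (str(uuid.uuid4()), "alice", -10),
    (str(uuid.uuid4()), "bob", 10),
]
```

Calling `submit(conn, form, fail_after=1)` leaves one pending row and raises; calling `submit(conn, form)` afterwards skips that row and inserts only the missing one, so the transaction ends up complete without duplicates.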