Transactions can be done "easily" by reusing virtual memory support to trap memory accesses (read/write/execute) and, in effect, running a coherence protocol in software. Distributed transactions can be implemented the same way, which lets you deploy a bunch of threads sharing a single address space across separate nodes in a cluster. It's not much harder to implement than checkpoint/restore, which they seem to have already.
If I understand this idea correctly, it would have a huge performance impact if every load and store had to be trapped. Or am I missing something here?
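Not every access, only those that have to be trapped in order to maintain coherence. Typically that means the first access to a page the local node doesn't currently hold with the right permissions; once the page is mapped, later loads and stores to it run at full speed. The cost of a trap might simply be in the same ballpark as other kinds of synchronization.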
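To make "not every access" concrete, here is a minimal sketch, assuming Linux and a single page. The names count_traps and on_fault are made up for illustration and this is not how any particular system does it. The page starts out PROT_NONE, the first write faults, and the SIGSEGV handler (standing in for the software coherence step) unprotects it; every later write goes straight to memory.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t trap_count = 0;
static char *region = NULL;      /* the one "shared" page in this sketch */
static size_t page_size = 0;

/* Fault handler: the place where a real system would run its coherence
 * protocol (fetch the page, invalidate other copies, ...). Here it only
 * counts the trap and grants access. Calling mprotect() from a signal
 * handler is not guaranteed async-signal-safe by POSIX; it works on
 * Linux in practice and is assumed acceptable for this sketch. */
static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr < region || addr >= region + page_size) {
        /* Not our page: restore the default action so the retried
         * access crashes the process normally. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    trap_count++;
    mprotect(region, page_size, PROT_READ | PROT_WRITE);
}

/* Hypothetical helper: performs n_writes stores to a page that starts
 * inaccessible and returns how many of them trapped. */
int count_traps(int n_writes) {
    struct sigaction sa = {0}, old;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, page_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(region != MAP_FAILED);

    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    trap_count = 0;
    volatile char *p = region;
    for (int i = 0; i < n_writes; i++) {
        /* Only the first store faults; after the handler returns the
         * instruction is retried and the page stays writable. */
        p[(size_t)i % page_size] = (char)i;
    }

    sigaction(SIGSEGV, &old, NULL);   /* restore previous handler */
    munmap(region, page_size);
    return (int)trap_count;
}
```

On Linux I'd expect count_traps(1000) to report a single trap for the thousand writes. In a real coherence protocol the page would get re-protected whenever another node needs it, so the trap rate follows how often pages change hands, not how often they are touched.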