Building a Curve25519 Hardware Accelerator

staticassertion · on July 15, 2021

> Wait, what? So many articles and journals I read on the topic just talk about prime fields, modular reduction and blah blah blah like it’s asking someone to buy milk and eggs at the grocery store.

No kidding, it's so frustrating! I have to read some articles/ papers like 20x and open a bunch of wikipedia tabs to understand wtf they're talking about. If they gave a simple, high level explanation it would save tons of time - while a wikipedia article is going to be very in depth, it's not tailored to the context of what I'm reading, so I often have to look at a whole bunch of things to start to build a good model in my head for wtf they're trying to convey.

This post, on the other hand, is perfect for me. Thank you so much for writing it I'm learning a ton.

bunnie · on July 15, 2021

Thanks! It's good to know I'm not alone in feeling this way.

This is actually an abridged version, if you click into the link for the "datasheet" (https://ci.betrusted.io/betrusted-soc/doc/engine.html) you can find all the gory details down to the register level...

staticassertion · on July 15, 2021

Awesome, much appreciated, I'll dig into it. I was actually just using the dalek crate myself quite heavily this week, so this is great timing.

amirhirsch · on July 15, 2021

I liked your struggle with the corner-case :)

Kalium · on July 15, 2021

This feels, to me, like the complaint of someone with a general technical grounding going into a narrow specialization and discovering they understand none of the terminology.

This is exactly how I felt when I was learning Haskell and hearing about eta conversions and pointfreeness and so on. Each new bit of terminology had a precise definition, but specialists writing primarily for one another were not inclined to offer definitions each time. They didn't need them and neither did their anticipated audience. I had to learn the terminology and jargon so I could understand what they were talking about, and high-level abstracted explanations were not going to be enough to participate usefully. I had to learn the basics.

Similarly, Watson & Crick's seminal paper might not have been the best of all possible times to stop and define an X-ray or how crystallography worked.

So yes, just instantiate the metaclass, implement the interface, and go buy milk. Easy. You just have to learn what each of those things actually means, especially if the details might be important (and in cryptography, that's always the case).

dragontamer · on July 15, 2021

https://www.youtube.com/watch?v=xE4jEKx9fTM

GF5 is the 3rd smallest finite field (GF2 and GF3 are smaller), but GF2 and GF3 are too small to really see where the beauty of all this math lies.

------

GF5 (which is simply "arithmetic modulo 5") has all the properties of Real-numbers "we care about", and none of the hassles of decimals or partial numbers.

GF5 is simply arithmetic modulo 5. Its really easy to do. 4 + 2 == 1. (Aka: 4 + 2 mod 5 == 6 mod 5 == 1). Addition and multiplication are easily proven to be the same.

The fun starts when you start doing roots, exponents and logarithms (!!!). Take the definition of addition / multiplication above (aka: addition / multiplication modulo 5) and redefine square-root, exponents, and logarithms to be consistent with it.

That is to say: 2 * 3 == 1. Which means that 3 == (1/2) (yes, "one half" is now 3 in GF5). Which means 3 is 2^(-1).

* 2^2 == 4.

* 2^2 * (2^-1) == 4 * 3 == 12 mod 5 == 2.

Which means 2^2 * 2^-1 == 2^(2-1) and look, exponents work as expected. Feel free to play around with all the numbers: you'll find that exponents work once you match this new "definition" of exponents.

---------

Because exponents work, logarithms work. Because everything works, we can have complicated ellipical curves defined over the numbers {0, 1, 2, 3, 4} (aka: the numbers mod 5) and have everything work out mathematically.

--------

The hard part is proving that all of this works in GF(5^2) aka GF25. It turns out that you can't just "mod 25" the stuff, you need to do an extension field (and basically make a complex number out of things).

Once you understand GF25 extension fields and its relationship with GF5, then it becomes easy to generalize that to GF2 and its extension fields GF(2^8) (aka: binary numbers of 8 bits) or GF(2^32) (aka: 32-bit numbers).

Turns out for extension fields (like GF(5^2) aka GF25) you need to redefine multiplication and addition as well and then rebuild all the math from the ground up off of this new definition of addition and new definition of multiplication. And again, this is most important for GF(2^32), aka the 32-bit number field that's common on today's computers.

Bonus points: GF(2^32)'s new addition operator is just an xor. Only multiplication is difficult.

But once you do that, it all __JUST WORKS__ and is amazing.

----------

Galois Fields are a way to "cheat" a definition of logarithm, exponents, square roots, multiplication, division, and addition (by redefining these operations).

It turns out that these operations continue to work "as expected" in the Galois Field... as long as you're cool with the new definitions of them.

------

EDIT: I'm slightly wrong. Seems like Curve25519 is a prime field and not an extension field. So this is more similar to GF5 than it is to GF(2^32). Still, other elliptical curves are in GF(2^32) or other extension fields, so the study of extension fields is useful for other curves.

Curve25519 can be understood with just the GF(5) description I gave earlier in this post. Prime fields are really simple: just mod(the prime number) and you're set. There's some weird properties that can accelerate speed (Any division by Foo is equivalent to multiplying by (1/Foo), which might be faster).

nullc · on July 16, 2021

Nice comment! Some nits.

> Any division by Foo is equivalent to multiplying by (1/Foo), which might be faster

that's not just a weird property you can use for performance, standard grade-school division doesn't work in fields because the results are outside of the field.

So division is redefined using the modular inverse-- which has a natural definition the modular inverse a of element x is the value that makes the product ax equal the identity.

The "cool optimization" though is that since modular inverses are kinda expensive, you can rewrite your arithmetic to keep things in fractional form as long as possible and get a big speed-up.

[This also applies to ordinary numbers with division: E.g. use newton's method to compute square roots but don't do any division, keep everything as fractions -- with that trick you can compute pretty accurate square roots in your head! ... seems people don't get taught this kind of thing any more because of access to calculator :) ]

> And again, this is most important for GF(2^32), aka the 32-bit number field that's common on today's computers.

This makes it sound like you're saying that regular computer arithmetic is GF(2^32). :)

In my experience GF(2^32) arithmetic is fairly uncommon! Work in GF(2^8) is extremely common, e.g. it's used all over in error correcting codes. The field size is common because it's perfectly reasonable to implement multiplication using a log and inverse log table.

For 2^32 you really want a CLMUL instruction if you want high performance. Though once you have one you can easily go to 64-bits if it's useful for your application.

You won't find a lot of binary extension field (GF(2^n)) stuff in crypto these days except sometimes hidden deep inside a block cipher (like the AES sbox) because the extensions create a lot of additional and surprising algebraic structure. E.g. in these fields (x + y)^2 == (x^2 + y^2)! ( https://en.wikipedia.org/wiki/Freshman%27s_dream ) Unless you're doing something where the structure is useful, it's better to not have it.

Error correcting codes, CRCs, hashes, and related stuff-- however, has lots of it.

fogof · on July 15, 2021

> GF5 is the 3rd smallest finite field (GF2 and GF3 are smaller)

There is also a field of 4 elements, since 4 is a power of a prime.

dragontamer · on July 15, 2021

Fair. I guess I was implicitly staying away from extension fields because people's attention span will start drifing away the minute I start to attempt to explain how to perform modulo polynomial arithmetic over irreducible polynomials.

Prime fields (GF2, GF3, GF5...) are just easier to explain than extension fields (GF (2^2), GF(3^2), GF(5^2)... GF(2^8)...)

sowbug · on July 15, 2021

This work is associated with Precursor (https://www.bunniestudios.com/blog/?p=5921), which is Bunnie's project to build a mobile hardware platform from the ground up.

OJFord · on July 15, 2021

> It’s a weird thing about academics — they like to write papers and “share ideas”, but it’s very hard to get source code from them.

I suppose at least part of the reason, especially for things like crypto and statistics, is that they don't want to get bogged down in plausibly-correct claims of fatal flaws, and be perhaps repeatedly (over years hence) put in a defensive position of having to prove the criticism incorrect or have everyone assume the work was wrong?

I can sort of understand that. The same's true for the papers themselves of course, but I can see it being more annoying for code. ('But it's never going to actually be in that state', etc.)

bunnie · on July 15, 2021

You're probably right. And assuming you are, then I think maybe the core problem may be that academics even tolerate the petty undercutting of reputation based not on the merit of the big ideas, but rather minor flaws in the proofs of concept.

There's also a problem in allowing papers to get away with making arbitrary claims that are by definition irreproducible. Every paper always includes some section about how many gates a design took up and how fast it ran -- but there aren't entirely consistent standards for reporting these things; it's often too easy to hide significant setup costs when running benchmarks.

This leads to an incentive to overstate results, because without any threat of checks it would be disadvantageous to not cherry-pick the absolute best numbers you could get through a peer review committee.

I get the desire to escape petty criticism: but it seems like a weak argument against the myriad of reproducibility and integrity hazards such a lack of transparency begets, especially for a line of work that ostensibly distinguishes itself on high standards for merit and integrity.

It also would have been nice to just receive a polite response of the form of "sorry, I don't support this anymore, I've moved on..." from anyone I had reached out to. Before starting on this project, I emailed four or five groups that had all published various papers on Curve25519 accelerators hoping to get even a broken non-compiling repo that could have helped me sanity-check my understanding of the thornier concepts presented in the papers.

Not even a single response...hence the motivation to write up notes and share them. Seems a waste of funding to have so many implementations, but no way to actually use any of them...

nullc · on July 16, 2021

One reason you sometimes don't get responses on this sort of thing is that the claims in the paper are sometimes just fake.

Not totally fake-- e.g. they'll do part of the work so that they feel like they can extrapolate to the rest. Like they might implement a field multiply and benchmark it, then just count how many of those they need for the whole circuit.

Then no one wants to waste the column inches explaining what they actually did or dealing with reviewers that want to nitpick it ... at least not if they can help it. I know I've personally regretted including experimental details for that reason. ("oh great, a reviewer wants me to re-run 20 CPU years of number crunching because they've taken issue with some irrelevant detail that I could have just left out. Fine.")

The "more approximate than they let on" doesn't just come from sketchy and obscure research groups but this kind of work happens under well known names too. If you read a lot of the lit you can get an impression on which groups/people are actually competent implementers because they publish them or at least also solve sub-problems that you'd only encounter when creating a full implementation.

There is a lot of pressure to go this route because of time constraints and because the engineering to actually implement this stuff is often too far outside the authors expertise-- the people who really know about implementations can get highly paid in industry and many do.

For anyone who isn't faking their research this should be all the more incentive to publish your code! But so far it seems the incentives haven't lined up that way yet.

OJFord · on July 15, 2021

Actually thinking about it, another aspect is almost certainly 'commercial sensitivity' for want of a better phrase? The university will own the copyright; they might have a policy against just handing it out when they might want to be able to later decide to commercialise it (because it's some whizz bang faster than everyone else's 25519 accelerator)?

I spent my first internship at Arm benchmarking various libraries' performance on Cortex-M. Arm subsequently acquired PolarSSL, now MbedTLS (I don't recall my results - no claims here, about PolarSSL nor the influence of my lowly undergrad intern work!) for I assume some non-trivial amount. You could imagine a similar company acquiring WhizzBang25519AccelCorp from the university but perhaps less likely/for less if there's a convenient WhizzBunnie25519AccelOSS re-implementation?

bunnie · on July 16, 2021

You might be right about that concern, however, a lot of this work is done through public funding. Seems a little bit like double-dipping to get the public to pay for you to make IP that you then later sell to the highest bidder, with no obligation to share to the public.

Of course, if the professor was say, the NXP chair for crypto core excellence and they endowed her department with the understanding that all IP would belong exclusively to NXP, then sure, that would make more sense. I did not see anything like that, though, in the research groups' bio.

JJMcJ · on July 15, 2021

Donald Knuth once wrote a paper with some source code.

He warned his readers about just copying the code, perhaps half jokingly.

He said, approximately, "this code has been proven correct, but not tested."

staticassertion · on July 15, 2021

Are you saying this based on experience? My assumption was that it was a number of factors.

1. Researchers are often weak coders, and they may be self conscious about their code.

2. My own experience is that researcher code is totally undocument, or broken. I've repeatedly found that the code linked to from a paper has no straightforward way to be built, or appears to have compiler errors etc.

3. It's more work, and researchers have the incentive to put out a compelling paper, not to put out code (because we don't hold them to that standard).

My own experiences have led to me to mostly dismiss papers where I can't find the code or can't get it to build.

OJFord · on July 15, 2021

> Are you saying this based on experience?

No, just observation & supposition.

> My assumption was that it was a number of factors.

I'm sure! I did say I suppose part of the reason.

Didn't mean to imply any experience/authority, OP just made me think about it. Sorry if it sounded like that.

staticassertion · on July 16, 2021

Oh, my question was genuine, I wasn't trying to be like "you DONT have experience". I was just curious.

Rereading my post it sounds like I'm trying to disagree, I'm no expert either.

kzrdude · on July 15, 2021

Given all that, it's understandable that it's not released and completely reasonable - a support structure is needed if the code should be released.

staticassertion · on July 15, 2021

I'm setting the bar super low. A `./configure` and a readme with basic info about the project and how to build/run it would be quite exceptional. I don't need them to be triaging github issues and merging in PRs.

joppy · on July 15, 2021

In a field like crypto you also have the concern that someone might copy your implementation and actually use it - this is fine if your specialty is crypto implementation (resistant to side-channel attacks and so on), but if your specialty is instead coming up with faster algorithm, your implementation is going to be much more proof-of-concept. In the theme of “don’t write your own crypto” is “don’t release your own crypto if you look authoritative (Eg a professor) and are not absolutely sure of it “

gww · on July 15, 2021

In the biological sciences there is a huge push for the code to be released alongside the paper. Many publications (but not enough yet) include a complete pipeline in a workflow language that downloads, processes the data and generates the figures used in the paper.

toomuchtodo · on July 15, 2021

What’s the language de jour typically leveraged to declare these workflows?

gww · on July 15, 2021

Nextflow and Snakemake are quite common. I prefer the later as I have a lot of Python experience.

semi-extrinsic · on July 15, 2021

Since GP mentions biology, I would guess R is the first choice.

jazzyjackson · on July 15, 2021

TIL 255 is divisible by 17

Edit: thanks to my replies I know several more things now lol big day for me

Sesse__ · on July 15, 2021

2^p - 1 can only be prime (a so-called Mersenne prime) if p is prime. 2^8 - 1 thus cannot be a prime, since 8 is composite.

(Note that is it not _sufficient_ that p is prime, but it is necessary.)

pfortuny · on July 15, 2021

(a^2-1)=(a+1)(a-1).

pavpanchekha · on July 15, 2021

256 = 16 * 16

255 = 256 - 1 = 16 * 16 - 1 = 15 * 17

jandrese · on July 15, 2021

It is also divisible by 5.

upofadown · on July 15, 2021

>The “double ratchet” algorithm is integral to modern end-to-end-encrypted chat apps, such as Signal, WhatsApp, and Matrix.

Signal protocol uses a hash ratchet for the message to message forward secrecy. So you don't need to do the expensive key agreement stuff unless that hash ratchet falls out of synchronization.

It isn't really clear if any messaging application really needs message to message forward secrecy in the first place. You really only need to do that expensive key agreement operation when you want to get rid of some old messages that someone has captured off the wire. That can be reasonably done at the end of a session or even weekly. Very few people need messages that have not even scrolled off the screen to be forward secret, assuming they need it at all.

admax88q · on July 15, 2021

I'm not sure you gain much by moving the forward secrecy to a weekly or session based level, and I think you lose a lot in terms of simplicity.

It is simpler to design and analyze if it applies at every message. It's a design for the best case. No longer need to make judgement calls on if 1 week is enough, or maybe 2, or maybe some middle ground.

sliken · on July 15, 2021

Say it takes a $5,000 machine 24 hours to break the code. Clearly it's much more attractive to an attacker if they can break a weeks worth of messages than 10 seconds worth.

tptacek · on July 15, 2021

You really only need to do that expensive key agreement operation when you want to get rid of some old messages that someone has captured off the wire.

Could you perhaps share more of your analysis here? I'd like to be able to better follow what you are trying to say.

upofadown · on July 15, 2021

If you are using a temporary shared secret then after you forget that shared secret you will need a new secret which you will have to generate. I mean nothing more than that.

tptacek · on July 15, 2021

Right, why would anyone ever be better off holding that temporary secret over many messages and many days?

upofadown · on July 15, 2021

My entire point is that it is not really necessary to do shared secret generation at any but a very slow rate. Even if you don't so it the Signal protocol way it is still perfectly reasonable to only do it once in a while. That might be a short while but it could reasonably be quite long.

tptacek · on July 15, 2021

Why is it reasonable to only do shared secret generation once in awhile? What's the win to doing that? The win to not doing it is that it narrows the blast radius of a single key compromise to a small number of messages (or, in full ratchets, to a single message). Why would I ever, ever want to increase the blast radius of a key?

lmm · on July 16, 2021

On the assumption that it's expensive (I don't know if it's true or not, but upofadown said so explicitly and you didn't contradict them), the win is saving that expense. Obviously if you can reduce the blast radius for free then you do so, but if it's not free then it's reasonable to ask how big the difference between one message/several messages/a week's worth of messages is in your threat model.

tptacek · on July 16, 2021

It is not expensive.

a1369209993 · on July 16, 2021

> What's the win to doing that?

The win is that you spend less processing power on elliptic-curve cryptographic operations. Just like (even if RSA didn't have all its other problems) there's a win to using ECC instead of X-thousand-bit RSA.

toast0 · on July 15, 2021

> That can be reasonably done at the end of a session or even weekly.

At the end of a session might be reasonable cryptographically, but isn't really reasonable in terms of application development; you don't consistently get to have use CPU and network when a session ends and you may not get access again until a new session begins. A weekly trigger is more likely to be practical, but that may not work consistently either. Better to do the ratchetting as you're doing the messaging to guarantee it gets done.

ChuckMcM · on July 15, 2021

This is an excellent read[1]. It reminded me of the process I went through putting together an FFT engine in an FPGA although I think the small execution engine is just stellar, and not something I had considered.

[1] To be fair I find most of the stuff Bunnie writes about to be worth reading! :-)

dreamcompiler · on July 15, 2021

I learned Montgomery multiplication from the Handbook of Applied Cryptography (Menezes, Oorschot and Vanstone), and coded it directly from the pseudocode therein. Not to mention lots of other crypto algorithms. This book is a gold mine.

IfOnlyYouKnew · on July 15, 2021

Excellent deep dive on this beast that had me similarly question my intelligence in the past.

As a former academic: the reason sharing source code is/was rare is that it tends to be extremely ugly, and that you don’t get any credit for publishing it where it matters (I. e. tenure committees). To go from something that works to something that you would allow people to see takes about as much time as the original work.

(As a side note: I feared for the worst when I saw, among tags like “hacking” and “open source”, the tag “feminism” in the side bar. I enjoyed being wrong.)

vlovich123 · on July 15, 2021

I don’t think that’s a good reason. Production code is also ugly.

jlrubin · on July 15, 2021

Hi Bunnie!

Curious if you also have a power usage comparison, which I'm guessing is also relevant in the mobile context.

bunnie · on July 16, 2021

Hi Jeremy!

So far only some anecdotal characterizations. Running the crypto core full-out on our wiggle-everything-to-see-that-it-works test vector suite adds 247mW +/- 5mW of power over the baseline idle condition.

This is fairly comparable to the power delta of loading our RISC-V core fully with a software task versus idling it (around 200mW).

So compared to the alternative of running the math on a RISC-V CPU core, we consume about 20% more power, but we complete the operation in 1/15th the time -- overall, it's a significant net savings of energy out of the battery pack.

Also, we do have the ability to completely gate the clock to the crypto blocks when they are not in use, so they don't contribute to any dynamic power during normal run time.

Static leakage then becomes the main concern. In the case of an FPGA, I suspect we're paying the static leakage penalty whether we use the logic or not, as they don't do VDD switching of configured domains on this chip. If this were to go to silicon, you'd either have to include circuitry to power off the block, bias the back side to reduce leakage, or just be willing to pay the static leakage penalty for the extra logic.

korethr · on July 15, 2021

This is neat. The ever-present admonition against rolling one's own crypto lurks in the back of my mind, though. It is not my intent here to impugn Bunnie's competence, though I do wonder if he's run the design or implementation past any cryptographers to try to verify that the design or implementation are sound, and don't subtly break critical assumptions vital to the security of the algorithm.

bunnie · on July 16, 2021

The admonition runs through my mind all the time, too. I'm not qualified to implement Curve25519 entirely from scratch.

But, the good news is that this accelerates certain primitives within a Rust crate that was written by a cryptographer I personally know and trust; I think she's done a good job with the validation and test vectors, and she's been meticulous on documentation.

The theory is that as long as I don't break her tests, I'm probably OK. I did bump into some corner cases for correctness that might be extremely hard to catch in a random vector test bench, which I've noted in the post, but I do have some synthetic vectors that try to make sure I don't have a problem. I also tried hard to make sure I didn't introduce any new timing side channels; since all the ops are constant-time, hopefully I'm safe there.

So, I'm fully aware that I'm not perhaps the best qualified person in the world to build this. However, as I had mentioned elsewhere, my attempt to get source code from several research groups who have built such blocks met with no success. And also knowing how the silicon industry works, any "secure microcontroller" group peddling you some secret sauce closed source cryptoblocks likely were not designed to any level of scrutiny I would be satisfied with. Given the options of no acceleration, dubious closed source blocks, or trying to roll my own, I went with the latter, but with copious amount of documentation so that hopefully people smarter than me will have an easy time to review it for errors.

As a side note, in terms of things that I did implement for this system that should be cause for worry, I'd personally say the TRNG is the most worrisome, because TRNGs are full of scary footguns that are very exploitable. Then I'd worry about the AES instructions in the RISC-V core, the Curve25519 engine, and the SHA hardware accelerator in that order. I didn't write the AES instructions or the SHA accelerator, but I also haven't done a thorough review of them for things like timing side channels. I only rank the AES instructions as slightly more worrisome because I know those were implemented by a CPU architect without much input from cryptographers, whereas the SHA block comes from Google's OpenTitan project, which I would /hope/ was thoroughly reviewed but then again I'm just relying on the G-brand.

There is something comforting, though, to be able to know the provenance of your crypto primitives.

pinewurst · on July 15, 2021

But he’s not rolling his own crypto - in an algorithm sense - purely trying to speed the execution of an existing algorithm which can be easily validated against other existing implementations

korethr · on July 15, 2021

Sure, but is it not possible that a hardware acceleration, if not also proven correct, could subtly undermine an otherwise valid software implementation?

I do not assert that this is the case generally, nor particularly here -- I would be quite happy to be wrong about this. But if there's one thing I've taken from my admittedly minor study of crypto is that the most trivial of details can be critically important, and failure to get a detail exactly correct can compromise an entire algorithm and anything built thereupon.

I would love to see hardware acceleration for ed25519, just like there's been hardware acceleration for AES. It could help drive adoption of an algorithm that is less fiddly and easy to get wrong than RSA, and maybe in another 15 years, I'd be able to purchase IT equipment that supports ed25519 and not just hardware accelerated RSA and AES.

not2b · on July 16, 2021

Correctness alone is not sufficient. A crypto implementation also has to avoid introducing side channels. He describes how he avoids introducing a timing side channel (can't have any conditional branches that depend on the data) so it seems he took care to get that part right.