Here's the "how so fast" explanation: https://github.com/kostya/benchmarks/pull/...

Tinyyy · on Oct 22, 2015

I still don’t get it. ELI5?

grayrest · on Oct 22, 2015

The JSON looks like this:

    {"coordinates": [
       {"x": 0.65, "y": 0.23, "z": 0.91, "name": "fwgzd", "opts": {'1': [1, true]},
       {"x": 0.45, "y": 0.78, "z": 0.22, "name": "alfsj", "opts": {'1': [1, true]},
       ...
    ],
     "info": "some info"}

The benchmark code [1] (very readable) is reading an array of structs containing x,y,z from 'coordinates'.

[1] https://github.com/kostya/benchmarks/blob/master/json/test_f...

I haven't read the code but the algorithm would look at this roughly like:

   See `{` process? yes
   See `"coordinates"` process? yes
   See `[` process? yes
   See `{` process? yes
   See `"x"` process? yes => new Coord, Coord.x = 0.65
   See `"y"` process? yes => Coord.y = 0.23
   See `"z"` process? yes => Coord.z = 0.91
   See `"name"` process? no, next token is `"`, skip to `"`
   See `"opts"` process? no, next token is `{`, skip to `}`

The tradeoff is that he's completely ignoring the contents of `name` or `opts` or `info` and those values could potentially be invalid JSON but this processor doesn't care.

The code is also picking up efficiencies from being a static C-like language. The "new Coord" isn't actually doing anything, the alloc happened for the array as a whole so the assignments just write a known size value to a known offset from the start of the array. He's also using SIMD instructions to process multiple bytes at a time and some other tricks but the skipping is the main difference.

I think it's also interesting that the Rust code implemented value skipping in the benchmark file itself. The relative slowness there is likely that the library used (serde_json) is the JSON plugin for a generic serialization/deserialization lib and that Rust doesn't have a way to do SIMD yet.

erickt · on Oct 22, 2015

I wrote serde_json, and wrote the rust benchmark here a few months ago. Interestingly when I wrote this benchmark, I had my implementation as equivalent to RapidJSON on my Mac, but for some reason Kostya couldn't replicate it:

Ohttps://github.com/kostya/benchmarks/pull/44

I'm guessing gcc just has some optimizations llvm doesn't.

Rust does have some experimental SIMD, but I'm not using it yet because I want the serde libraries to be safe to use on byte streams, and reading 16 bytes ahead could block if at the end of a socket stream. Hopefully we will get specialization soon, which would let me use SIMD when I know I have at least X bytes in a buffer.

grayrest · on Oct 22, 2015

I enjoyed your blog series on serde perf.

One thing I noticed in this example was that the D example worked pretty much exactly like I want serde to work in that it was able to deserialize a subset of the overall document and the Coord struct didn't need to exhaustively cover the individual json data objects. If there's a way to do this in serde, an example in the docs would be really helpful.

erickt · on Oct 22, 2015

Thanks! I need to get back into writing on it.

I just pushed up a rust pull parser version here: https://github.com/kostya/benchmarks/pull/54. Is that what you were thinking of?

grayrest · on Oct 23, 2015

My wishes are much more prosaic. It's not clear to me just from reading your docs how I can extract data from a JSON file using the pattern this benchmark shows (top level key containing the data, other keys containing metadata about the request) without having to create otherwise useless struct to cover the outer wrapper object.

I see you have a reply to Gankro for a non-exhaustive flag and that'd work. As for the default, the current behavior is what I'd expect from a Rust lib given the correctness first mindset of the community but I will always be opting for non-exhaustive because I think most people providing JSON APIs consider additional keys to be backwards compatible (they are in dynamic languages) and I'd prefer my apps to not break in production for no apparent reason.

Gankro · on Oct 22, 2015

I'm pretty sure this is the only way that Serde works?

If you tell it do deserialize to a `Point { x: u32, y: u32 }`, it will ignore any additional fields.

erickt · on Oct 22, 2015

It doesn't quite yet. By default it errors on unknown fields, and my plan is to add an annotation to ignore it instead. Not quite sure if I got the default behavior correct though. I'm considering flipping it.

Gankro · on Oct 25, 2015

Huh. I tested it out on serde-toml; does it have a bug, then?

Tobani · on Oct 22, 2015

The parser finds the start/end of sub-structures and doesn't necessarily process them all. So for large structures of which you only need a subset, you're only doing the work you need to do. For a large structure in which you need all of the data, there is probably less of gain.

valarauca1 · on Oct 22, 2015

That plus the parsing is done in SIMD so even in UTF-8 you're processing 4 unicode points per CPU cycle [1], instead of 1 in most traditional parsers.

[1] This is a board generalization not necessarily true for all SIMD opcodes.

thechao · on Oct 22, 2015

Hm. I guess when thinking "SSE parsing" I didn't go with 4/8-wide parsing. I was thinking that they'd be grabbing 16/32-bytes, doing a compare against a fixed constant literal of 16/32-copies of, say, '{', or '}', then extracting the index of the match, exactly.

Something like this:

    d := _mm_loadu_ps <json>
    b := _mm_loadu_ps <token: '{'>
    n := _mm_cmpistrc(d, b)

You'd have to be clever to skip around strings ("... \"sonuva ... "), but once that's handled, you'd have significant speed ups to scan for ',', '{', '}', etc.

I think the double-quote escape might look something like this:

    d := _mm_loadu_ps <json>
    q := _mm_loadu_ps <token: '"'>
    e := _mm_loadu_ps <token: '\\'>
    n := _mm_cmpistrc(d, q)
    m := _mm_cmpistrc(d, e)
    if m+1 == n:
        branch-to-top
    ... process ...

Looks like cmpistrc has 1/2 reciprocal throughput. If you unrolled the loop 8 deep, you're probably looking at 10c per 16bytes scanned.