
"It’s unrealistic to expect to have the entire input in memory" -- wrong for most applications


Most applications read JSON from the network, where you have a stream. Buffering and fiddling with the whole request in memory increases latency by a lot, even if your JSON is smallish.
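For concreteness, a minimal Go sketch (assuming the standard encoding/json; the Payload type and the URL are made up) of decoding straight off the wire instead of slurping the whole body first:

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // Payload is a placeholder for whatever the endpoint actually returns.
    type Payload struct {
        ID   int    `json:"id"`
        Name string `json:"name"`
    }

    func main() {
        resp, err := http.Get("https://example.com/api/item") // placeholder URL
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        // Decode reads from the stream directly; the caller never has to
        // buffer the whole body into a []byte first (contrast
        // io.ReadAll followed by json.Unmarshal).
        var p Payload
        if err := json.NewDecoder(resp.Body).Decode(&p); err != nil {
            panic(err)
        }
        fmt.Println(p.Name)
    }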


Most (most!) JSON payloads are probably much smaller than typical buffer sizes, so they end up entirely in memory anyway.


On a carefully built WebSocket server you would ensure your WebSocket messages all fit within a single MTU.


Yes, but for applications where you need to do ETL-style transformations on large datasets, streaming is an immensely useful strategy.

Sure, you could argue Go isn't the right tool for the job, but I don't see why it can't be done with the right optimizations, like this effort.
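Not the library in question, just a sketch of the ETL-style pattern with Go's standard encoding/json, decoding one element of a large top-level array at a time (Record and transform are placeholders):

    package main

    import (
        "encoding/json"
        "fmt"
        "strings"
    )

    // Record and transform stand in for whatever the pipeline actually does.
    type Record struct {
        ID int `json:"id"`
    }

    func transform(r Record) Record {
        r.ID *= 10
        return r
    }

    func main() {
        // Imagine this reader is a multi-gigabyte file or network stream.
        src := strings.NewReader(`[{"id":1},{"id":2},{"id":3}]`)
        dec := json.NewDecoder(src)

        if _, err := dec.Token(); err != nil { // consume the opening '['
            panic(err)
        }
        for dec.More() {
            var r Record
            if err := dec.Decode(&r); err != nil {
                panic(err)
            }
            // Only one element is decoded and held in memory at a time.
            fmt.Println(transform(r))
        }
        if _, err := dec.Token(); err != nil { // consume the closing ']'
            panic(err)
        }
    }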


If performance is important why would you keep large datasets in JSON format?


Because you work at or for some bureaucratic MegaCorp that does weird things with no real logic behind them other than clueless Dilbert managers making decisions based on LinkedIn blogs. Alternatively, desperate IT consultants trying to get something to work with too low a budget and/or no access to do things the right way.

Be glad you have JSON to parse, and not EDI, some custom delimited data format (with no or outdated documentation), or, shudders, you work in the airline industry with SABRE.


sometimes it's not your data


Usually because the downstream service or store needs it


If you're building a library you either need to explicitly call out your limits or do streaming.

I've pumped gigs of JSON data, so a streaming parser is appreciated. Plus, streaming shows the author is better at engineering and is aware of the various use cases.

Memory is not cheap or free except in theory.


People here confidently keep repeating "streaming JSON". What do you mean by that? I'm genuinely curious.

Do you mean an XML SAX-like interface? If so, how do you deal with repeated keys in "hash tables"? Do you first translate JSON into intermediate objects (i.e. arrays, hash tables) and then transform them into application-specific structures, or do you try to skip the intermediate step?

I mean, streaming tokens is kind of worthless on its own. If you are going for a SAX-like interface, you want to be able to go all the way with streaming (i.e. no layer of the code that reads JSON should "accumulate" data, especially not possibly indefinitely, before it can be sent to the layer above).
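To make the question concrete, this is roughly what I picture by "streaming tokens" (a sketch with Go's standard encoding/json, not any particular library's API):

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "strings"
    )

    func main() {
        dec := json.NewDecoder(strings.NewReader(`{"a": 1, "b": [true, "x"]}`))
        for {
            tok, err := dec.Token()
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            // Each token (delimiter, key, value) is handed over as it is
            // read; no intermediate object tree is built.
            fmt.Printf("%T: %v\n", tok, tok)
        }
    }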


> If so, how do you deal with repeated keys in "hash tables"?

Depending on the parser, behaviour might differ. But looking at https://stackoverflow.com/questions/21832701/does-json-synta... , it seems like the "best" option is to have 'last key wins' as the resolution.

This works fine under a SAX-like interface in a streaming JSON parser - your 'event handler' code will execute for a given key, and a second time for the duplicate.
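Roughly like this, sketched with Go's standard token stream (the key name and the "handler" logic are made up, and it assumes a flat object with string values; real parsers expose this differently):

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "strings"
    )

    func main() {
        input := `{"bumblebee": "first", "bumblebee": "second"}`
        dec := json.NewDecoder(strings.NewReader(input))

        var last string
        expectValue := false
        for {
            tok, err := dec.Token()
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            s, ok := tok.(string)
            if !ok {
                continue // skip delimiters and non-string tokens
            }
            if expectValue {
                last = s // the "handler" runs for each occurrence; the last one wins
                expectValue = false
            } else if s == "bumblebee" {
                expectValue = true // next string token is this key's value
            }
        }
        fmt.Println(last) // prints "second"
    }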


> This works fine

This is a very strange way of using the word "fine"... What if the value stored under that key triggers some functionality in the application that should never happen, given the semantics you just botched by executing it?

Example:

    {
      "commands": {
        "bumblebee": "rm -rf /usr",
        "bumblebee": "echo 'I have done nothing wrong!'"
      }
    }
With the obvious way to interpret this...

So, you are saying that it's "fine" for an application to execute the first followed by the second, even though the semantics of the above are that only the second one should have an effect?

Sorry, I have to disagree with your "works fine" assessment.


You're layering the application semantics into the transport format.

It's fine, in the sense that a JSON object with duplicate keys is already invalid - but the parser might handle it, and I suggested a way (just from reading the Stack Overflow answer).

It's the same "fine" that you get from undefined behaviour in a C compiler.


Why do you keep inventing stuff... No, JSON with duplicate keys is not invalid. The whole point of streaming is to be able to process data before it has completely arrived. What "layering semantics" are you talking about?

This has no similarity with undefined behavior. This is documented and defined.


A JSON object with duplicate keys is neither valid nor invalid: the spec (RFC 8259) only says object names SHOULD be unique, and when they aren't, the behavior is left up to the individual implementation to decide.


Last key wins is terrible advice and has serious security implications.

See https://bishopfox.com/blog/json-interoperability-vulnerabili... or https://www.cvedetails.com/cve/CVE-2017-12635/ for concrete examples where this treatment causes security issues.

RFC 7493 (https://datatracker.ietf.org/doc/html/rfc7493) defines a stricter format where duplicate keys are not allowed.


Last key wins is the most common behavior among widely-used implementations. It should be assumed as the default.
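For example, Go's standard encoding/json keeps the last value when unmarshalling into a map (a quick check, nothing library-specific):

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        var m map[string]string
        data := []byte(`{"bumblebee": "first", "bumblebee": "second"}`)
        if err := json.Unmarshal(data, &m); err != nil {
            panic(err)
        }
        fmt.Println(m["bumblebee"]) // prints "second": last key wins
    }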


I guess it's all relative. Memory is significantly cheaper if you get it anywhere but on loan from a cloud provider.


RAM is always expensive no matter where you get it from.

Would you rather do two hours of work or force thousands of people to buy more RAM because your library is a memory hog?

And on embedded systems RAM is at a premium. More RAM = more cost.


If you can live with "fits on disk", mmap() is a viable option? Unless you truly need streaming (early handling of early data, like a stream of transactions/operations from a single JSON file)?
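A minimal sketch of what I mean, on a Unix system, using Go's syscall.Mmap ("data.json" and the Doc type are placeholders):

    //go:build unix

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
        "syscall"
    )

    // Doc is a placeholder for whatever the file actually contains.
    type Doc struct {
        Name string `json:"name"`
    }

    func main() {
        f, err := os.Open("data.json") // placeholder path
        if err != nil {
            panic(err)
        }
        defer f.Close()

        st, err := f.Stat()
        if err != nil {
            panic(err)
        }

        // Read-only, shared mapping of the whole file; the OS pages the
        // data in as the parser touches it.
        buf, err := syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
            syscall.PROT_READ, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(buf)

        var d Doc
        if err := json.Unmarshal(buf, &d); err != nil {
            panic(err)
        }
        fmt.Println(d.Name)
    }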


In general, JSON comes over the network, so mmap() won't really work unless you save it to a file first. But then you'll run out of disk space.

I mean, you have a 1k, 2k, 4k buffer. Why use more, because it's too much work?


Is the HTTP request body part of the input?




