JSON Schema Store (schemastore.org)
278 points by WallyFunk on Aug 20, 2023 | 146 comments


Lots of comments here about XML vs. JSON... but there are areas where these two don't collide. I'm thinking about text/document encoding (real annotated text, things like books, etc).

Even though XML is still king here (see TEI and other standards), some of its limitations are a problem. Consider the following text:

    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Now say you want to qualify a part of it:

    Lorem ipsum <sometag>dolor sit amet</sometag>, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Now say you want to qualify another part, but it's overlapping with previous part:

    Lorem ipsum <sometag>dolor sit <someothertag>amet</sometag>, consectetur</someothertag> adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Of course, this is illegal XML... so we have to do dirty hacks like this:

    Lorem ipsum <start someid="part1"/>dolor sit <start someid="part2"/>amet<end someid="part1"/>, consectetur<end someid="part2"/> adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Which means rather inefficient queries afterwards :-/


A strategy I've seen for dealing with XML's inability to handle overlapping tags is to treat the tagging as an annotation layer on top of the node with the data:

  <doc>
  <data type="text">
    This is some sample text.
  </data>
  <annotations>
    <tag1 start="1" end="3" comment="foo"/>
    <tag2 start="2" end="4" type="bar" />
  </annotations>
  </doc>
The start and end are usually byte offsets from the start of the text content in the data node. It still sucks, but at least you could apply the same general strategy to more than just text data - I've seen it used with audio/video where the offsets are treated as time offsets into the media.


Good idea. You would have to edit your annotation layer in case the text data changes though.


Now you've lost human editability/readability and could just as well encode that in a non-XML format.


Yep.

In my experience, the traditional solution for editing with these kinds of hacks is to write a buggy piece of shit custom GUI so people can edit documents. That way, the complaints shift away from your lousy data format to your lousy UI. Problem solved!


It seems klutzy and yet fully in the spirit of XML.


I would argue that the inline way of annotating things in XML is actually ok-ish if one absolutely needs human editability, but otherwise bad design. Something like this works better:

  {text: "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
   annotations: [{tag: "sometag", ranges: [{from: 12, to: 26}]},
                 {tag: "someothertag", ranges: [{from: 21, to: 39}]}]}
Note that this also removes the limitation that annotations have to be consecutive.
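A rough sketch of resolving those ranges back to the text (treating the offsets as zero-based, end-exclusive character indexes - an assumption, not something the example above specifies):

  const doc = {
    text: "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
    annotations: [
      { tag: "sometag", ranges: [{ from: 12, to: 26 }] },
      { tag: "someothertag", ranges: [{ from: 21, to: 39 }] },
    ],
  };

  // Resolve each annotated range back to the substring it covers.
  for (const { tag, ranges } of doc.annotations) {
    for (const { from, to } of ranges) {
      console.log(tag, JSON.stringify(doc.text.slice(from, to)));
    }
  }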


crafty, but for your consideration: that places the burden upon every library author to be "accounting accurate" to any edits, and the only way anyone would know that it's not correct is to visually inspect the output text

also, as I get older I have a deeper and deeper appreciation that "offset" and "text" are words that are fraught with peril


What about using inline "floating" checkpoints for the ranges, instead of character indexes?

  {text: "Lorem ipsum {{1}}dolor sit {{2}}amet{{3}}, consectetur{{4}} adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.",
   annotations: [{tag: "sometag", ranges: [{from: 1, to: 3}]},
                 {tag: "someothertag", ranges: [{from: 2, to: 4}]}]}


Fair critique, but shouldn't strings be immutable anyways? Once you bring editing into play, you'll probably either want something like a rope or some CRDT and then you have more effective means of tracking positions, than manual offset computations, as part of the data-structure.


You are absolutely right that XML is better for document structures.

My current theory is that Yjs [0] is the new JSON+XML. It gives you both JSON and XML types in one nested structure, all with conflict free merging via incremental updates.

Also, you note the issue with XML and overlapping inline markup. Yjs has an answer for that with its text type: you can apply attributes (for styling or anything else) via arbitrary ranges, and they can overlap.

Obviously I'm being a little hyperbolic suggesting it will replace JSON (the beauty of JSON is its simplicity), but for many systems, building on Yjs or similar CRDT-based serialisation systems is the future.

Maybe what we need is a YjsSchema...

https://github.com/yjs/yjs/


Yjs isn’t a document structure is it? It seems to be a library for collaborative editing, but I’m not seeing something suitable for marking up a document, or am I missing something obvious?


You can add metadata to ranges inside the text.


This is actually one of the things processing instructions are useful for - but you would need to define the data within the PI, since they don't have attributes.


JSON Resume uses a defined schema. (listed on schemastore.org)

It has made writing resumes with Copilot super powerful.


Do you have an example of how you've done this?


Can you share your general process for that? Trying to do more AI for this type of thing.


(2020)?

Some previous discussion: https://news.ycombinator.com/item?id=23988269


JSON is the version of XML we deserve.


Nobody deserves XML! In all seriousness I get the idea behind XML and I have used a couple of SOAP services which were absolutely brilliant, but as someone who has spent a decade “linking” data from various sources in non-tech enterprise… Well… let’s just say that I’m being kind if I say that 5% of the services which used XML were doing it in a way that was nice to work with.

Which is why JSON’s simplicity is such a win for our industry: it’s always easy to handle. Sure, you can build it pretty terribly, but you’re not going to do this: <y value=x> and then later do <y>x</y>, which I’m not convinced people didn’t do in XML just because they’re chaotic evil. And you’re not going to run into an issue where some Adobe LiveCycle schema doesn’t work with .Net for reasons I never really found out, because why wouldn’t an XML schema work in .Net? Anyway, I’m sure the intentions behind XML were quite brilliant, but in my anecdotal experience it just doesn’t work for the Thursday-afternoon-programmer mindset, and JSON does.


>In all seriousness I get the idea behind XML

followed by

>and I have used a couple of SOAP services which were absolutely brilliant

makes me doubt the first part of the statement.

If I were to guess, what it means is that you understand the point of SOAP, and also the limitations and problems, especially as they relate to the use of XSD in SOAP and the various Web Service specs. But you probably have not had much experience with non-XSD-based validation of XML, with document formats as opposed to data formats, with larger international standards like UBL, or with XML formats that are not so much data- or document-oriented - SVG, XSL-FO (which admittedly sucks more than is reasonable), GraphML and so forth...

A lot of the commenters here are standing up for the value of XML, and I'm not actually doing that with this comment - there are a lot of benefits to using JSON, especially when you are using JavaScript all over the place. But saying XML sucks because XSD and SOAP suck indicates a potential lack of knowledge about the whole subject (perhaps only caused by infelicitous phrasing).


I’m simply trying to say that I think XML sucks because the people who implement it suck, and not because of any technical reasons. Hell, I’m saying that I think XML sucks because it allows people to suck when they use it, myself included.


A catchy but meaningless phrase. JSON is a dumpster on fire. Probably in even more ways than XML is. Maybe you deserve it... I feel like I'm being punished by the stupid people who make me use it, in a way similar to the sham court hearings from Planet of the Apes.


>I feel like I'm being punished by the stupid people who make me use it

What's the use-case and what alternative would you prefer?


There are multiple contradictory requirements to different things you could want from communication formats. Below are some examples:

* You could want a universal tool that can examine and understand the contents of the message (for debugging purposes), but you could also want not to send meta-information about the message (such as types, sizes, etc.) that is essential for parsing it. And you cannot have both at the same time.

* You could want a message format that maps onto the primitive types of a particular language well, but at the same time you may want it to be universal and map to many other languages well. But this is impossible because different languages will have different primitive types and the need to be generic will act against the need of being specific.

* You may want to be able to stream data, but this works against hierarchic data organization.

* You may want to be able to write messages into pre-allocated memory buffers w/o having to re-calculate the amount of memory necessary to encode a message, but this makes it very hard / impossible to add custom fields and types.

---

Given all this, I don't think that JSON is a good match for any use-case it's currently commonly used for. If I want data transfer, I'd go for something like SQL. If I want configuration, I'd go for Datalog. But then I see value in optimizing the transfer of multiple similar records, whereas someone else may see value in optimizing the transfer of hierarchically structured data, which isn't necessarily repeating. I've tried many formats of this kind, and am yet to find a good one. I'm inclined to think that maybe trying to arrange hierarchical data with different constraints on its organization is just a bad approach to data transfer, and that the organization and constraints of such data shouldn't be encoded as part of the format, but interpreted by the users of the format. But if I really had to do this the best way I can imagine, I'd still go for Datalog.


Whenever two or more are gathered together, they shall argue about JSON vs XML.

Personally I like the simplicity of JSON and also the expressive power of XML. But then I tend to only use each for the task it was primarily intended: application data-on-the-wire in JSON and "documents" in XML. It seems like a lot of the recurrent discussion around these technologies happens when they're pushed to do things outside their comfort zone. And I wonder if some of this is down to siloing of developer knowledge.

There was a comment on HN a few days ago (not by me, and I can't find it now) to the effect that web development has historically attracted self-taught developers or those who have come to it by routes like bootcamps. It went on to say that they perhaps consequently lack some knowledge of existing techniques and solutions, and therefore tend to recreate solutions that may already exist (and not always well). And this drives the well-known churn in webdev tech: of which bolting schemas onto JSON is arguably an example.

I wonder what people think of this? Personally I think it has some merit, but that the "churn" has also generated (along with much wheel-reinvention) some great innovations. And I say that as someone who works mainly on back-end stuff.

Thoughts?


I'd extend this "X developers are mostly self-taught" onto all of computer development. They say, "Every developer Of a Certain Age's first programming language was BASIC", and my experience of (eventually) getting a CS degree is that there is an expectation that students already know how to do the thing that they are trying to teach; a certain level of "self taught" is expected. To that end, I can see how in The Age of Teh Internets the standard of self-taught has moved on from BASIC to HTML/CSS/JS (or Unity or whatever sparked the young mind's attention).

What I'm not certain of is that "self taught" means work will be duplicated because the self-taught developer doesn't know the technology that exists. I think that someone who is extremely online will very likely be more abreast of what technologies exist. I think that a formal education is better at establishing what the fundamentals underlying a programming method or paradigm are... but not necessarily at exposing new programmers to what the state-of-the-art is.


I wish it had comments. And for that reason, I prefer yaml.


I prefer JSON's strictness. A Boolean cannot be confused for a string.

In yaml:

    country: no
Now your country is Boolean(false)

Now, I still prefer yaml overall.

Also, I hate that GitHub Actions doesn't support anchors.


That's a false dichotomy. JSON could have comments and not be ambiguous like YAML.

In fact: there exists a specification called JSON5 which does include comments.

But I agree that it would've been nice if it had comments before it achieved a critical mass of adoption.
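For example, with the json5 npm package (a sketch; assumes the package is installed):

  const JSON5 = require("json5"); // assumes the json5 npm package

  const config = JSON5.parse(`{
    // comments are allowed in JSON5
    retries: 3,  // unquoted keys are fine
    timeout: 30, // so is a trailing comma
  }`);

  console.log(config.retries, config.timeout); // 3 30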


I've read the opinion that comments were omitted from JSON in order to forestall hacky round-trip conversions to/from other formats (like XML).

Can anyone confirm/deny ?


The primary reason why JSON does not support comments is that its creator, Douglas Crockford, deliberately removed them from the format to prevent misuse and keep it as a pure data-only format.

Crockford observed that some people were using comments to store parsing directives, which could break compatibility between different systems. Hence, the decision to remove comments to maintain the simplicity and consistency of the format across various programming languages and environments.

https://www.freecodecamp.org/news/comments-in-json/

I vote for the minimalist heretic.

https://www.infoq.com/presentations/Heretical-Open-Source/


I don't think they were saying that's an issue of comments, just an issue of yaml vs json.


In YAML 1.2 that gives the string "no".


Huh, that's a breaking change. yes/y/no/n/on/off are no longer boolean.

https://perlpunk.github.io/yaml-test-schema/schemas.html


Yes, it is. However, it's a breaking change that happened 14 years ago.


Except for many yaml implementations either not supporting 1.2 at all (pyyaml, ruby stdlib) or being a weird mix (goyaml) so as to keep working with older files.

So when you’re dealing with objective reality, this is still an issue today.


So that's been the spec since 2009???


This is the only example everyone points out against YAML.

For JSON, there are several such syntactic problems: unnecessary double quotes (and only double quotes) everywhere, and no trailing commas.


These are different kinds of problems.

JSON is (arguably) too strict. YAML is (arguably) too loose. One is better for machines, the other is (usually) better for writing by humans by hand. There's no perfect compromise for every use case.


cries in TOML


There’s always JSON5


True, but the multi-line comments give me the ick, and there's still no standard for bigints.


YAML is a fine implementation of JSON with comments.


Is there an on-premise alternative? Not necessarily speaking of schemastore.org on-prem, but a service comparable in spirit.


The 'service' is basically hosting this file: https://www.schemastore.org/api/json/catalog.json - you could host that locally and point your software to it, modifying the other URLs where needed

It's a pity the catalog format doesn't support an 'import' or relative URLs for schemas - would have made local extensions a bit easier.
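A rough sketch of doing exactly that in Node (assumes Node 18+ for global fetch, and that catalog entries carry a "url" field as the public catalog does today - worth verifying against the live file; the internal mirror domain is hypothetical):

  // Sketch: mirror the catalog locally and rewrite schema URLs to an internal host.
  const fs = require("node:fs/promises");

  async function mirrorCatalog() {
    const res = await fetch("https://www.schemastore.org/api/json/catalog.json");
    const catalog = await res.json();

    for (const entry of catalog.schemas ?? []) {
      if (entry.url) {
        entry.url = entry.url.replace(
          "https://json.schemastore.org/",
          "https://schemas.internal.example/" // hypothetical internal mirror
        );
      }
    }

    await fs.writeFile("catalog.json", JSON.stringify(catalog, null, 2));
  }

  mirrorCatalog();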


We (JSON Schema) did a case study/interview with the guy behind it https://www.youtube.com/watch?v=-yYTxLZZk58&list=PLHVhS4Tj1Y...


It is interesting that people love json (now with schema), but hate XML while loving HTML at the same time. It is all pretty boring and largely the same imo.


The absolute worst bit of XML is the confused implementations. What should be an attribute on a tag, and what should go between tags? Even worse, nothing is sanely typed without an xsd. Different systems will treat the following differently:

    <some>true</some>
versus

    <some>1</some>
Some systems require the token "true", others will only treat 1 as the boolean true.

For example, MS claims that for Exchange's XSD, boolean values must be integer 1 or 0 [0], but then links to a W3C spec that allows the tokens true and false [1]

At least with JSON and HTML, you don't need a separate definition file for basic, primitive data types.

[0] https://learn.microsoft.com/en-us/openspecs/exchange_server_...

[1] https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#boolean


XML is only concerned with whether a document is well-formed, not its conformity to a given schema. Schemas like XSD, DTD, etc can be plugged in later. Many systems just have an ad hoc schema.

> At least with JSON and HTML, you don't need a separate definition file for basic, primitive data types.

Unless I’m missing your meaning, this seems like an apples-to-oranges comparison. HTML is not a general-purpose format like JSON. It’s a very complicated document format that is validated with reference to an external spec.

I think XML is a great fit for a document format that can become arbitrarily complex yet still easy to author and validate. It’s obviously a really poor fit for a wire transport protocol.


I don't see XML and json as different at all - they are both markup languages that describe trees. As long as what you are describing is a tree, either is fine. Of course, I would never do an XML endpoint on a rest http service since no one would expect that and they would assume that you don't know what you are doing and your service sucks.


How is your example any better in JSON? { some: true, someOther: 1, another: "true" }


Only the first one will be treated as a boolean in JSON. The second one is a number, and the third is a string.


Well, it's the same in XML, more or less. [1] The difference is that XML cleanly separates types and data, all the type information is in the schema and there is no type information in the data, so without a schema you can not properly type the data.

JSON on the other hand does not separate types and data, the types are implicitly contained in the data. So you can get type information from the data without a schema, at least up to the point where JSON's simple type system is no longer expressive enough, then you need - just as with XML - a schema to get the correct type information, for example to distinguish actual strings from dates.

If you really need this, nobody stops you from including type information in XML - <have type="boolean">true</have> attributes on elements, or quoting <quote>"strings"</quote> but not numbers <numbers>123</numbers> - and using that. You will of course have to do this on your own; that is just not the way XML is supposed to be used.

[1] Let me clarify this a bit. If you handle XML, you usually have a schema and therefore the type information. If you have a non-trivial JSON, you also need a schema for the types. You can only get away without a schema for simple JSONs where the implicit type information is good enough. But then you could do almost the same with XML: just parse the content and see what type it looks like. You will not get quite to what JSON can do in a sane way, but it might be good enough, just as JSON without a schema is sometimes good enough.


> a schema to get the correct type information, for example to distinguish actual strings from dates.

Which, in practice, is a terrible oversight. I've honestly never seen a JSON store/transport/serde in practice without dates and/or times in them. There's always some updated_at or captured_on somewhere in the API or dataset.

Of all the data types needed, I'd say dates are amongst the most important. At least more important in practice than floats, which JSON does support for odd reasons. Especially with dates being ambiguous at best and inconsistent at worst.

Pubquiz: when was, or will be, how much paid? { currency: "THB", paid_at: "04-03-2566", amount: 13.37 }¹ - JSON is neat for simple use-cases, but utterly impractical when precision and correctness are required. Yet here we are, building around and on top of it to get that correctness and precision.

¹I'm messing a bit, 'cause this calendar isn't used in practice for such use-cases anymore. Hardly. But I've seen this with Hijri calendars. And those silly US date-formats. I've seen it "solved" with complex structs like { created_at: { year: 2022, day: ... , timezone: xxx } }. I've myself "fixed" floating-point precision issues in financial applications that used JSON by using all-strings: { currency: "USD", amount_in_cents: "1337" } and such.


You're complaining about people encoding dates in string and at the same time you encode numbers in string. That's funny.

There's a standard for encoding dates as strings. It's called ISO-8601 and it's supported everywhere.

Also, JSON poses no particular limits that would force anyone to encode numbers as strings.
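For what it's worth, the round trip in JavaScript is short (a minimal sketch; the paid_at field name is just borrowed from the sibling comment):

  // Serializing: Date#toISOString always emits an ISO-8601 UTC timestamp.
  const payload = JSON.stringify({ paid_at: new Date().toISOString() });
  // e.g. {"paid_at":"2023-08-20T12:34:56.789Z"}

  // Parsing: the Date constructor accepts ISO-8601 directly.
  const parsed = JSON.parse(payload, (key, value) =>
    key === "paid_at" ? new Date(value) : value
  );
  console.log(parsed.paid_at instanceof Date); // true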


> You're complaining about people encoding dates in string and at the same time you encode numbers in string. That's funny.

No. I'm complaining that JSON is too limited. And that its "type system" is lacking so much that I have to resort to hacks like encoding numbers in strings. Which I think is embarrassing for an industry.

> There's standard to encode dates to string. It's called ISO-8601 and it's supported everywhere.

It's not. Too many servers and services use formats other than ISO-8601. Should I call Visa and tell them their export formats suck? Or that random API and tell them their JSON date fields should be changed to ISO-8601? It's supported in most languages. But something as widely used as Google Sheets doesn't support it: if you get JSON or CSV with ISO-8601 into Google Sheets, a lot of string parsing and even regexes are needed to turn it into a proper date.

Saying "we use ISO-8601 and that solved everything" only works if you never need any service outside of yours and never interop or exchange data with other services. Which in practice is never for anything remotely successful.


> Which in practice is never for anything remotely successful.

I have yet to see Time be easy. Anywhere. At all. From daylight savings being state-dependent, system times resetting to rand, right down to CPU monotonic timing.

Using Time handling as a criticism to JSON's architecture doesn't hold water.


We are serializing our dates to ISO8601, works great. Have never had any issues.


> The second one is a number

This would've been useful if you knew what kind of number it was...

As for what goes into separate elements and what goes into attributes: a typical answer to this is that simple types (as per XSD) go into attributes, complex types go into elements.

Compare this to JSON's screwed-up definition of "hash-tables" (the things in curly braces) which doesn't require that "keys" be unique.

XML wasn't perfect. But JSON isn't really better. It sucks in a slightly different way because people keep inventing these formats w/o much thinking, and once discover problems, don't fix them.


> This would've been useful if you knew what kind of number it was...

JSON isn’t ambiguous when it comes to this. Numbers are arbitrary precision decimal numbers[1].

I’m guessing your issue is with how JavaScript interprets JSON numbers as 64-bit floats. But that is a (mis-)feature of JavaScript, and switching your serialization format to XML would not help, because JavaScript represents all numbers as 64-bit floats.

[1] https://www.json.org/json-en.html


> The second one is a number

You mean a float, right?


A double-precision 64-bit binary format IEEE 754, if you will!


fun fact: that's JavaScript. JavaScript only supports double-precision 64-bit binary format IEEE 754.

But JSON doesn't disallow arbitrary-precision numbers; that's up to the parser implementation. The JSON grammar just says:

    number = integer fraction exponent
In fact not all implementations support IEEE 754 doubles, and, from my experience, when dealing with money and rounding errors, many decide to serialize numbers as exact strings and use custom code for deserialization.
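A minimal sketch of that pattern (the amount_in_cents field name is borrowed from a sibling comment; a reviver is just one way to do the conversion):

  // Above Number.MAX_SAFE_INTEGER, a plain JSON.parse silently loses precision:
  JSON.parse("9007199254740993"); // -> 9007199254740992

  // A common workaround: transport exact values as strings and convert explicitly.
  const msg = '{"currency":"USD","amount_in_cents":"1337"}';
  const order = JSON.parse(msg, (key, value) =>
    key === "amount_in_cents" ? BigInt(value) : value
  );
  console.log(order.amount_in_cents + 1n); // 1338n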


> when dealing with money and rounding errors, many decide to serialize numbers as exact strings and use custom code for deserialization

That's exactly what I'm doing. And indeed another reason why I feel embarrassed by JSON. I mean, we -the industry- have been doing financial data transport over computer networks, for how long now? fifty years? And we keep "inventing" transport formats that unsolve issues that have long been solved and done. XML had this solved[1]. Hell, even the ancient MT940[2] had this solved.

[1] https://web.archive.org/web/20200618100100/https://deutscheb... (pdf warning) [2] e.g. https://financialdataexchange.org/FDX/About/OFX-Work-Group.a...


> The absolute worst bit of XML is the confused implementations. What should be an attribute on a tag, and what should go between tags?

XML is a language for marking up text. SVG uses attributes for all vector data, because the vector points are not meant to be presented to a user as raw data.

If I embed an SVG into an XHTML document and the browser does not understand SVG, the text within the graphic is still presented to the user.

> Even worse, nothing is sanely typed without an xsd. Different systems will treat the following differently:

This is not a responsibility of XML, which deals in a common well-formed markup format for various document formats.

It sounds like you are dealing with a tool that has defined an XML-based data interchange format, and that they may have inconsistent tooling for their format.


> you don't need a separate definition file for basic, primitive data types.

unless you need something different from JavaScript primitive data types.

For example integers.

Or null means nothing to you.

Or you want a faithful representation of input

   Welcome to Node.js v20.5.1.
   Type ".help" for more information.
   > JSON.stringify(undefined)
   undefined

   > JSON.stringify([undefined])
   '[null]'
but then

   jq "." <<< "[null]"     
   [
     null
   ]
   
   jq "." <<< "undefined"
   parse error: Invalid numeric literal at line 2, column 0


I'm not certain I understand the point you're trying to make regarding `undefined`, or indeed the expectation that `jq` and `JSON.stringify()` follow the same rules.

`JSON.stringify()` is documented to behave exactly as you demonstrate, so there's no surprises:

undefined, Function, and Symbol values are not valid JSON values. If any such values are encountered during conversion, they are either omitted (when found in an object) or changed to null (when found in an array). JSON.stringify() can return undefined when passing in "pure" values like JSON.stringify(() => {}) or JSON.stringify(undefined).

Expecting `jq` to somehow understand that its input came from Javascript's `JSON.stringify()` and so should be parsed on that basis seems ... odd? I don't see any problem with what `jq` is doing there, but anyway I don't see a problem with JSON itself in these examples.


> undefined, Function, and Symbol values are not valid JSON values

that's the point.

The official JSON serializer from every browser vendor and every Node installation produces invalid JSON.

Which, for JavaScript Object Notation, is kinda hilarious.

> Expecting `jq` to somehow understand that its input came from Javascript's `JSON.stringify()`

I would expect `JSON.stringify` to give an error if trying to serialize something that naturally does not map to JSON, like many other libraries do.

You have to provide a manual override for those situations.
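You can bolt that stricter behavior on yourself with a replacer, of course - a minimal sketch, not a drop-in replacement:

  // Sketch of such an override: a replacer that refuses to silently drop
  // non-JSON values instead of omitting them or turning them into null.
  function strictStringify(value) {
    return JSON.stringify(value, (key, val) => {
      if (val === undefined || typeof val === "function" || typeof val === "symbol") {
        throw new TypeError(`value at "${key}" is not representable in JSON`);
      }
      return val;
    });
  }

  strictStringify({ a: 1 });    // '{"a":1}'
  strictStringify([undefined]); // throws instead of producing '[null]'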

But JavaScript and ECMA (`JSON.stringify` is defined in the standard) decided that no, they can ignore the specs for some reason.

Problem is they can't fix it now, because too many applications rely on those wrong assumptions.

Here is the reason why you can find <flag>true</flag> and <flag>1</flag>.

The difference being that XML was born to standardize document formats, while JSON aspired to be a data format but failed miserably at it, even at the most basic level, like telling an int from a float. The spec is simply too vague and ambiguous to give any guarantee of interoperability beyond numbers and strings.

Maybe we should all switch to MessagePack


> The official JSON serializer from every broswer vendor and every Node installation produce invalid JSON.

I don't think I agree with this. `JSON.stringify` isn't producing JSON when it returns `undefined`. Instead...

> I would expect `JSON.stringify` to give an error if trying to serialize something that naturally does not map to JSON, like many other libraries do.

> You have to provide a manual override for those situations.

... `undefined` is an error. As in, there's no meaningful difference between catching an exception and providing "a manual override" for `undefined`, is there?

> But JavaScript and ECMA (`JSON.stringify` is defined in the standard) decided that no, they can ignore the specs for some reason.

What part of what spec is being ignored? `JSON.stringify` conforms to its own spec, as you say; and when it returns JSON, the JSON is valid. Meanwhile the JSON spec itself is very explicit about not declaring rules for serialisation/deserialisation:

The goal of this specification is only to define the syntax of valid JSON texts. Its intent is not to provide any semantics or interpretation of text conforming to that syntax. It also intentionally does not define how a valid JSON text might be internalized into the data structures of a programming language. There are many possible semantics that could be applied to the JSON syntax and many ways that a JSON text can be processed or mapped by a programming language. Meaningful interchange of information using JSON requires agreement among the involved parties on the specific semantics to be applied. Defining specific semantic interpretations of JSON is potentially a topic for other specifications.


For what it's worth, you misread the output. The returned value was not a string, it was an error indicator:

    > JSON.stringify(42)
    '42'
    > JSON.stringify(undefined)
    undefined
    > typeof JSON.stringify(42)
    'string'
    > typeof JSON.stringify(undefined)
    'undefined'


>> What should be an attribute on a tag, and what should go between tags?

Are you ok with <a href="..">link</a>?

That was kind of my original point, people are fine with html but don't like XML. I think the real reason people don't like XML is it reminds them of Steve Ballmer.


> What should be an attribute on a tag, and what should go between tags?

I think a good rule of thumb is that attributes are for key/value pairs that are probably not user-visible and definitely not directly user-editable.

Carried to a logical conclusion, this would simplify the auto-creation of form GUIs.


JSON is a much better serialization format since XML was designed as a document format. For example, there is no standardized way to serialize a string with a null character even if you escape it (this is allowed in many programming languages). JSON just says do “\0” and calls it a day. I’m not sure if it’s better for users, but it’s certainly easier to work with as a dev.

HTML isn’t trying to serialize abstract data and is doing what XML does best in being a document/GUI format. It doesn’t matter all that much that it can’t represent null characters in a standard way because it isn’t a printable character.


> JSON just says do “\0” and calls it a day

nope

   jq "." <<< "\0"
   parse error: Invalid numeric literal at line 2, column 0

   jq "." <<< '{"name": "\0"}'
   parse error: Invalid escape at line 1, column 13
maybe you mean null, which has a lot of different issues though.

   jq "." <<< "null"          
   null

   jq "." <<< '{"name": null}'
   {
     "name": null
   }


"\0" is not a valid JSON string escape sequence.

However, "\u0000" is.


not according to jq or firefox

   >> JSON.parse("\u0000")
   Uncaught SyntaxError: JSON.parse: unexpected character at line 1 column 1 of the JSON data
       <anonymous> debugger eval code:1
at that point "null" looks like a better more compatible option.


they're not comparable at all. you can't embed a null into a string

\u0000 works fine with firefox with the proper syntax

  JSON.parse(`"\\u0000"`)
  "\u0000"
and jq supports it too

  printf '{"null":"\u0000"}' | jq
  {
    "null": "\u0000"
  }


> the proper syntax

of course! I forgot to quote those quotes! (facepalm)

that works. and uses a single byte too.

TIL.


Try using UTF-8 encoding for XML, and your problems with zero byte encoding will go away.

Your understanding of "easier" is oversimplified to the point that it's wrong. It's easier to do the wrong thing in JSON, it's harder to do the right thing in JSON (compared to XML).

JSON is a poorly thought-out format. Its problems become progressively more difficult to deal with the more you expect of your program.


JSON and XML both support UTF-8. Neither supports embedding arbitrary binary data directly, especially if that data is not valid in your current character set


You wanted to send a zero byte. This is how you send a zero byte. You are making unrelated claims now that have nothing to do with your original claim.


You can't safely send zero bytes over XML even with UTF-8 encoding. Not in practice:

  echo '<zero>&#0;</zero>' | xmllint -
  -:1: parser error : xmlParseCharRef: invalid xmlChar value 0
  <zero>&#0;</zero>
            ^

  printf '<?xml version="1.0" encoding="utf-8"?><zero>\0</zero>' | xmllint -
  -:1: parser error : Premature end of data in tag zero line 1
  <?xml version="1.0" encoding="utf-8"?><zero>
                                            ^
And not in theory: https://www.w3.org/TR/2006/REC-xml11-20060816/#sec-well-form...

  Character Range

  [2]    Char    ::=    [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
  [2a]    RestrictedChar    ::=    [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]


I explained how to do it, and yet you didn't follow instructions, have done something irrelevant, and are now complaining that it doesn't work... well sucks being you, I guess.


JSON is far simpler because it has no namespaces and no entities.

But I think complexity is always 90% culture. It's pretty arbitrary what kind of culture grows around a particular technology.


There are many ways in which something can be simple. I believe that the most relevant metric for simplicity of something like JSON isn't the number of language elements it has (this would mean that, e.g., Brainfuck is simpler than JavaScript), but the amount of work necessary to produce a correct program. JSON is an endless pit of various degrees of difficulty when it comes to writing real-world programs. It's far from simple in that latter sense.

I.e. learning about namespaces would take a programmer a couple of hours, including a foosball match and a coffee break, but working around JSON's bad decisions when it comes to number serialization or sequence serialization will probably take days in the best case, with the side effect that this work will most likely have to be done on an existing product after a customer complains about corrupting or losing their data...


>I.e. learning about namespaces would take a programmer couple of hours, including a foosball match and a coffee break

It's not about the time it takes to learn about namespaces. I'm talking about the complexity that namespaces and entities add to the data model and the requirement to actually handle them throughout the entire stack.

You can normalise and compare arbitrary pieces of JSON using only information available locally in that same sequence of UTF-8 bytes. You cannot do that with XML. You have to consider the whole document context and resolve all namespaces and entities before actually comparing anything.

The JSON specification is ~5 pages and most of that is diagrams. The XML specification is ~40 pages long and it imports ~60 pages of URI specification.

I'm not saying that it's impossible to use only the simple parts of XML unless and until you actually need what namespaces have to offer. But that's culture, and you have no control over other people's culture.


> I'm talking about the complexity that namespaces and entities add to the data model

I've worked a lot with XML, and I have no idea what complexity you are talking about. This just wasn't complex / difficult. Once you'd learned what it was about, it was second nature. E.g. I spent a lot of time working with MXML - an XML format for Adobe Flex markup, similar to XAML and a bunch of others of the same kind. It used XML namespaces a lot. But that was the least of my problems using it...

Again, I've never had anyone who learned how and why to use XML namespaces complain about it. All complaints about this feature were coming from people discovering it for the first time.

> You can normalise and compare arbitrary pieces of JSON

Dream on. No, you cannot. It depends on the parser implementation. For example, you have two 20-digit numbers where the 15 most significant digits are the same. Are these the same number or different numbers in JSON?

The fact that it's 5 pages means nothing... it's 5 pages that define a bad language that creates a lot of problems when used. So what if it only took 5 pages to write it? You can probably squeeze the Brainfuck definition into half a page - so what, it's still a lot harder to use than JavaScript.


I worked with XML extensively for many years starting back in the 1990s. When I'm saying that namespaces add complexity to the data model I'm not complaining about them being difficult to use or understand.

>Dream on. No, you cannot. It depends on parser implementation. For example, you have two 20-digit numbers where 15 most significant digits are the same. Are these numbers the same number or a different number in JSON?

That's just a mildly interesting interoperability edge case that can be worked around. I agree that it's not good, but it is a problem on a wholly different level. XML elements not being comparable without non-local information is not an edge case and not an oversight that can be fixed or worked around. It's by design.

I'm not criticising XML for being what it is. XML tries to solve problems that JSON doesn't try to solve. But in order to do that, it had to introduce complexity that many people now reject.

Edit: I think we're talking past each other here. You are rightly criticising the JSON specification for being sloppy and incomplete. I don't dispute that. I'm comparing the models as they are _intended_ to work. And that's where XML is more complex because it tries to do more.


> XML namespaces .. All complaints about this feature were coming from people discovering it for the first time.

"XML Namespaces: Giving developers the vapors since 1999."


I don't understand the problem you're describing. When would using JSON lead to data loss/corruption?


Here's a thing that happened in the wild. Neo4j database encodes ids of stored entities as 128-bit integers and it has a JSON interface. When queried from Python, the Python client interprets digit sequences longer than what could possibly fit into 2^32 as floats (even though the native kind of integer in Python is of arbitrary size).

So, for a while there weren't too many objects, ids appeared to be all different... until they weren't. It's easy to see how this led to data corruption, I suppose?

---

Here's a hypothetical example: few people are aware that JSON allows key duplication in "hash-tables". Also, even if they consider such a possibility, they might not know that JSON doesn't prescribe which key should win should there be several. They might assume that the definition requires that the first one chronologically wins, or the last, or maybe some other rule, but they hope that it's going to be consistent across implementations.

Obviously, to screw with developers, JSON doesn't define this. So, it's possible that two different parsers will parse the same JSON with the same fields differently. Where could this theoretically explode? Well, some sort of authentication which sends a password along with other data that can be added by the user, and the user intentionally or accidentally adds a "password" field, which may or may not later be overridden, and may or may not later be interpreted on the other end as the actual password.
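For example (the exact behavior is parser-specific; this is just what current JavaScript engines do):

  // JavaScript's own parser keeps the last duplicate...
  JSON.parse('{"password":"hunter2","password":"letmein"}');
  // -> { password: 'letmein' }

  // ...but that "last one wins" rule comes from ECMAScript, not from the JSON
  // spec, so another parser in the pipeline is free to keep the first one.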

---

There are many other things, like, for example, JSON has too many of the "false" values. When different languages generate JSON they may interpret things like "missing key" and "key with the value null" as the same thing or as a different thing. Similarly, for some "false" and "null" are the same thing, while for others it's not.


>few people are aware that JSON allows key duplication in "hash-tables"

I would say it's the other way around. Many people seem to think that duplicate keys are allowed in JSON, but the spec says "An object is an unordered set of name/value pairs". Sets, by definition, do not allow duplicates.

https://www.json.org/json-en.html

>There are many other things, like, for example, JSON has too many of the "false" values. When different languages generate JSON they may interpret things like "missing key" and "key with the value null" as the same thing or as a different thing. Similarly, for some "false" and "null" are the same thing, while for others it's not.

I don't see how this is a JSON issue. There's only one false value in JSON. If some application code or mapping library is hellbent on misinterpreting all sorts of things as false then there is no way to stop that on a data format level.

What I do agree with is your criticism of how the interpretation of long numbers is left unspecified in the JSON spec. This is just sloppy and should be fixed.


If you are making JSON data to use in your own language and application, you will probably not have any problem. But as with anything, there can be interoperability issues between implementations and programming languages - especially if your JSON is being generated and consumed by your JavaScript site.

some potential issues https://bishopfox.com/blog/json-interoperability-vulnerabili...

on edit: not parent commenter of course, just what I think they might have meant.


That's very interesting! Thanks for the link.


Neither did XML originally. XML schema was sort of bolted on via some conventions of defining a schema in the root element. The XML 1.0 spec doesn't mention those. XML Schema is a separate standard that came later. Likewise namespaces are a separate specification as well and not part of the XML specification.

The XML specification does have Document Type Definitions (DTD), which were sort of inherited from SGML. This is an optional declaration with its own syntax that defines a DTD. I don't think they were that widely used. XML Schema started out as an attempt to redefine those in XML.

The nice thing with XML Schema was that you could usually ignore them and just use them as documentation of stuff that you might find in a document. Typically, schema urls wouldn't even resolve and throw a 404 instead. More often than not actually. My go-to tool was xpath in those days. Just ignore the schema and cherry pick what comes back using xpath. Usually not that hard.

The culture around Json is that it emerged out of dynamic language communities (Javascript, Ruby, Python, etc.) with a long tradition of not annotating things with types and a natural aversion against using schemas when they are not needed. Also, they had the benefit of hindsight and weren't looking to rebuild the web services specs on top of json but were actively trying to get away from that.


>XML schema was sort of bolted on via some conventions of defining a schema in the root element.

I know, and I'm not talking about XML Schema at all (partly because it hurts my brain to even mention the absolute worst specification ever written).

I mean just the complexity of the XML data model itself, including namespaces, entity references and the ridiculously convoluted URI spec. That's more than enough to make XML far more complex than JSON.

To be fair, XML solves problems that JSON doesn't solve. JSON is not a better XML. JSON's creators simply decided that many of the problems that XML solves don't need solving, or should not be solved by a data format specification.


I like DTDs. For all their weirdness, they solve a problem rather simply. Reminiscent of BNF. (Altho admittedly they are clueless about namespaces and other fancy bolt-ons.)


Yeah, culture is a big one. See dotnet vs Java. The latter picked up many C# features over the years but is still much more verbose, e.g. because their developers still abhor var.


Hey, I work with Java, at work no one has ever complained about a var. Must be your inner circle only?


Quite possibly. We would need to look through a few OSS projects to determine what's a common style.


It's very easy to understand why people prefer JSON. 95% of developers know exactly what JSON is without ever having read anything technical about it. It's obvious.

XML on the other hand... Who here can say they actually know anything substantial about XML besides the syntax? My guess is <10%.


XML suffers from too many options and useless bells and whistles. E.g. the attribute vs. element question is a source of confusion without adding much value, especially if the source and target are object-oriented and/or a relational DB. What's the point?

Then there are namespaces; sure, there are probably lots of places where you need to use them, but I never encountered a place where they were really needed. Yet because they are the default, you need to work with them or your queries do not work. Super confusing for beginners and annoying as heck.


Why is "how hard it is for beginners to understand a concept without reading a reference" a useful metric for measuring anything? So what if it's hard? -- Spend an hour with the reference document, and your problems will go away.

In the days when XML was popular, I was more active in several Web forums that helped novice users with particular technologies (and that included XML). Not a single confusion about XML namespaces came from someone who had read the reference. Quoting the reference was also a very efficient way to clear up the confusion.

Bottom line: it's not a problem worth mentioning. In the grand scheme of things, an hour you'd have to spend reading the specification is a drop in the bucket compared to all the time you'd have to work with XML. It's a fixed-size effort that you have to make once. Compare this to having to deal with bad "number" serialization, which you have to deal with in JSON every time in a new program that handles JSON.


> Why is "how hard it is for beginners to understand a concept without reading a reference" a useful metric for measuring anything?

Two reasons:

1) Because it's unnecessary complexity. When you add unnecessary complexity into fundamental technology that everything uses, you've now made everything worse. It's like polluting the lake, and then ignoring the fact that beginners need to learn how to boil the water properly before drinking it.

2) Because that prevents the technology from being adopted. Whether you think it's justified or not, beginners will choose the tech that's easier to use, and it will succeed.

The market of technology adoption forces us to make things simple for beginners, and in the end, that's good for all of us.


The issue there is that the complexity is necessary for some cases which don't arise in trivial ones, e.g. when including element names from two sources. That is not a common need in JSON, but with schemas you will come across it.


And that could not be handled by making an optional namespace which can be used when it is actually needed?

Even without namespaces, it's trivial to handle { "NamespaceA": {....}, "NamespaceB": {....}, }

There is seldom need to mix it in the same object, and if you need that, you should think long and hard if you are on the right track.


Isn't that what XML did - you only put the namespace in if it was needed.

As for multiple ones in the same object it makes sense if you want to reuse a definition used elsewhere e.g. to add an address using a predefined address type. It is like using structures/records in programming languages but with no pointers for composition.


> Why is "how hard it is for beginners to understand a concept without reading a reference" a useful metric for measuring anything? So what if it's hard? -- Spend an hour with the reference document, and your problems will go away.

Or spend 0 hours reading the JSON reference to reach the same result.


I manage a team of reporting analysts who look at XSLT transforms all day. None of them have programming backgrounds and they have never found XML namespaces to be a problem.


Which is better xml design for a pure data payload (not textual content)?

    <foo>something</foo>
Or

    <foo value="something"/>

When you get back with a coherent universal argument, we'll revisit the json vs xml question.


If these are the sort of questions that are tripping you up, just use <foo>something</foo> for everything.


Isn't this very easy? A short, succinct, simple no-frills string? Put it in the attribute. Big, long, arbitrary-length data? Put it in the content of the element. Now gimme money.


Look, with JSX you can put anything as props.

JSON is just pure data.


A repository of over 700 JSON schemas for various file types. Quite useful.


Oh my god. This Semantic Web stuff is going live!


Took me an embarrassingly long time to figure out you could scroll that list


Your comment made me realize that you can scroll past the list on mobile. I didn’t see the Autocompletion section yesterday.

Wonders of modern “clean” design.


Same, I like how clean it looks, but it needs some visual indicator that it scrolls.



I see a scroll bar on my browser


..which only appears when you scroll.


Sounds like a problem in your browser. Like rajamaka, I can see scrollbars when they're present.


i think he can see them when they are present, as well


Same... but hey, what do I know, I'm not a "web designer".


JSON Schemas are great.

These are actually what IntelliJ uses to validate all sorts of config files behind the scenes.

For work, we even do code generation off of the Meltano (ETL tool) spec and use it to validate reads and writes to the file (which we edit at application runtime) to catch errors as close as possible to when they actually occur.


Does anyone know of a typescript translation for each of those validation models?

Or maybe even a way to discover related statically typed definitions based on the validation rules?

It would be really nice to not define parts of a data model that provide little to no business value - but where you can easily “stub your toe”.


use quicktype: https://quicktype.io/


Awesome tool, thank you for sharing. I made use of it already!


Most languages have some code generation tool requiring a compile step, but most of the specs in here change infrequently enough that you can just do it once and commit to VC. I personally have a use case where I modify the Meltano (ETL tool) spec at runtime and use a generated schema to validate reads and writes to the file, helping catch bugs early.


You could use this[0] package, but you would need to download the schema first into a folder, say "schemas", and then add a build step as a script in your package.json - '"compile-schemas": "json2ts -i schemas -o types"' - to export to a "types" folder.

[0] json-schema-to-typescript


If it's a one-off you can just use http://borischerny.com/json-schema-to-typescript-browser/ or https://transform.tools/json-schema-to-typescript (they both use the same library).


I've been asking chatgpt to do it for me.


There are some YAML-based schemas there too. How does this work: is there a canonical YAML->JSON transformation, or does the JSON Schema spec have explicit YAML support?

edit: skipping the theoretical foundations, there seems to be at least this tool that claims to validate yaml against json schema: https://github.com/json-schema-everywhere/pajv


YAML is effectively a superset of JSON although the syntax used in YAML is often different. So you can't translate all YAML to JSON, but all JSON can be represented as YAML.
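In practice, tools that validate YAML against JSON Schema just load the YAML into a plain object and validate that. A minimal sketch of the idea (assumes the js-yaml and ajv npm packages):

  // Sketch: validate a YAML document against a JSON Schema by loading it
  // into a plain object first.
  const yaml = require("js-yaml");
  const Ajv = require("ajv");

  const schema = {
    type: "object",
    properties: { country: { type: "string" } },
    required: ["country"],
  };

  const data = yaml.load("country: NL\n"); // -> { country: 'NL' }

  const validate = new Ajv().compile(schema);
  console.log(validate(data)); // true, or false with details in validate.errors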


Does this have any relation to https://jsonapi.org/ ?


No.


Heh, I remember we did a similar registry for XML things.


But XML allowed an easy way to have distributed schemas without needing a central place: the schema URI could be made a resource that existed at the URI.

The wording gets complex, as the URI does not need to exist on the web, and that need for exact wording is, I suspect, a reason XML is perceived as complex.


See, Steve Ballmer was right after all.


Can JSON Schema be used to describe, say, the schema of an RDBMS table?

Is there some standardization here, so I might use a JSON schema that already covers a lot of the fields needed to describe columns, constraints, etc.?

Can JSON Schema capture relations between fields?


I would check out the OpenAPI specification[0]. Specifically, look at the "Components" section. It might help you out.

[0]: https://swagger.io/specification/


Is there a good tool or library to create a JSON Schema manually/programmatically?


Its autocomplete story is a mess, but https://github.com/Kong/insomnia#readme at least allows one to visualize any schema authored in the document (it generates examples as well as a schema browser). It's possible that other OpenAPI tools behave similarly; I just happen to have the most hands-on experience with Insomnia.

for example:

    openapi: 3.0.0
    info:
      title: this is my title
      description: a long description goes here
      version: v1
    servers:
    - url: http://127.0.0.1:9090
      description: the local server
    paths:
      /thingy:
        get:
          responses:
            "200":
              description: ok
              content:
                application/json:
                  schema:
                    $ref: '#/components/schemas/Thingy'
    components:
      schemas:
        Thingy:
          type: object
          properties:
            alpha:
              type: boolean
as for the "create automatically," I'd guess that's a genuinely hard problem although if your example documents are simple/homogeneous enough you may get away with it

ok:

  [{"alpha": true}, {"alpha":false}]
problematic:

  [
   {"alpha":{"beta": ["charlie", 3.1415, null]}},
   {"alpha":[{"beta":null}]},
   {"alpha": null}
  ]



You can find a suitable tool on the official website https://json-schema.org/implementations.html#schema-generato...



