I must disagree here. Once you define complex enough text protocol, there are wa...

MichaelGG · on April 3, 2012

This is really more of a problem with the horrible, horrible SIP specification. In fact, look at RFC4475[1], which seems to delight in all the strange ways a message can still be valid.

It even uses the word "infer" several times. "An element may attempt to be liberal in what it receives and infer the missing quotes."

SIP was based on HTTP because at one point they thought it would interop with HTTP. Of course, that didn't happen, and they still kept HTTP's crazy parsing rules, while adding all sorts of their own things.

The date format, for instance, really annoys me. Who would ever look at "Sat, 13 Nov 2010 23:29:00 GMT" and say "yea, that makes sense for a protocol"? Why is it done that way? Oh, that's defined in another RFC. If you trace it back, you end up decades ago in RFC500-something, and the justification is basically "well there's some mail programs out there, and people seem to be writing their headers like this, so here's sort of a standardised version of what they're doing". And since then, they just carry it forward, instead of thinking "wait, does this make sense"?

So, that legacy thinking, combined with a stubborn insistence that Postel's Law makes sense (instead of just making things much worse), and denial of reality (such as the hilarious attempts at dealing with NAT), yea, that's why SIP is horrible. If they had their own binary format, well it would have been much worse. (And SIP's problems are still deeper, even after you finish writing the horribly convoluted code required to parse such a "simple" format -- so ASN.1 would have just helped a bit; you can bet there'd also be embedded strings, with precisely the same problems.)

1: http://tools.ietf.org/rfc/rfc4475.txt
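For reference, the date string quoted above is the same shape that mail headers use, so a mail-oriented parser can read it. A minimal sketch using Python's standard `email.utils` module (the choice of Python is an assumption for illustration, not something a SIP stack necessarily does):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

header_value = "Sat, 13 Nov 2010 23:29:00 GMT"

# Parse the mail-style date into a timezone-aware datetime.
parsed = parsedate_to_datetime(header_value)

# Format it back; usegmt=True emits the literal "GMT" zone name.
round_tripped = format_datetime(parsed, usegmt=True)

print(parsed.isoformat())
print(round_tripped)
```

The day name and the zone name are redundant for a machine, which is roughly the complaint: the format was standardised from what people were already typing, not designed for parsers.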

adavies42 · on April 3, 2012

> combined with a stubborn insistence that Postel's Law makes sense (instead of just making things much worse)

has anyone written a definitive "Postel's Law considered harmful" yet? we're dearly in need of one....

d0mine · on April 3, 2012

Your examples demonstrate underspecified text protocols. Surely it leads to problems, but if they were underspecified binary protocols then debugging/reverse-engineering would be much harder.

So your comment proves the opposite of what you've intended.

viraptor · on April 3, 2012

This is an opinion of course. I always found binary protocols to be harder to tweak in a way that a shortcut in coding provides a valid-ish result (working for your specific implementation). It was always much easier, even without intending to, to remove some whitespace in a text protocol, only to discover later that someone somewhere relied on it being there.

YMMV... maybe you had a different experience in the past.

jonhohle · on April 3, 2012

The example XML you included is from a property list [0]. While there is more than one implementation, there is essentially a de facto implementation. XML is only one format, and data is relatively strongly typed. All of the types provide documentation regarding exactly how they are serialized and deserialized.

In this case, the string value in the dictionary is specific to OmniGraffle. Its format is proprietary, but the only producer is also the consumer. Omni may or may not have a formal spec indicating the format of this value. You'd have to work there to find out.

[0] http://en.wikipedia.org/wiki/Property_list

lilyball · on April 3, 2012

The value appears to be the output of the NSStringFromRect() Foundation function, which has the corresponding NSRectFromString() to go the other way.
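To make the parsing questions concrete, here is a rough sketch of reading a string of the shape `{{x, y}, {w, h}}`, which is how `NSStringFromRect()` writes a rect. This is not Foundation's implementation; `rect_from_string` is a hypothetical helper, and being strict about missing tuples while tolerant of whitespace and exponents is my own assumption about what a careful third-party reader might choose.

```python
import re

# One decimal number, optionally signed, optionally with an exponent.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

# "{{x, y}, {w, h}}" with arbitrary whitespace around every token.
_RECT = re.compile(
    r"\s*\{\s*\{\s*(%s)\s*,\s*(%s)\s*\}\s*,\s*\{\s*(%s)\s*,\s*(%s)\s*\}\s*\}\s*"
    % ((_NUM,) * 4)
)


def rect_from_string(text):
    """Hypothetical helper: return ((x, y), (w, h)) or raise ValueError."""
    match = _RECT.fullmatch(text)
    if match is None:
        raise ValueError("not a rect string: %r" % (text,))
    x, y, w, h = (float(group) for group in match.groups())
    return (x, y), (w, h)


print(rect_from_string("{{25.4278, 76.3008}, {104.75, 91.8751}}"))
print(rect_from_string("{{1e2,2},{3,4}}"))
```

Every decision baked into that regex (spaces, exponents, what to do with a missing tuple) is something a second implementation could decide differently.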

zwp · on April 3, 2012

> JSON is a bit less-wrong than XML here, because you know what's a number, what's a string, what's a list

http://www.w3.org/TR/xmlschema-2/#typesystem

viraptor · on April 3, 2012

True. But since it's optional, what is the number of people who actually use it, or parsers that care? It's a bit like with xml namespaces.

luriel · on April 3, 2012

XML Schema is so incredibly complex that even XML advocates avoid it.

adavies42 · on April 3, 2012

> So how long until someone comes up with implementation that doesn't use spaces between the numbers? What's the supported precision? What happens when one tuple is skipped? Do you have to parse exponential numbers correctly?

why doesn't omnigraffle store "Bounds" as

    <dict>
        <key>Bounds</key>
        <array>
            <array>
                <real>25.4278</real>
                <real>76.3008</real>
            </array>
            <array>
                <real>104.75</real>
                <real>91.8751</real>
            </array>
        </array>
    </dict>

?
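A quick sketch of what the structured form buys you, using Python's standard `plistlib`. One assumption to flag: the XML plist format spells floating-point values as `<real>`, so that is the element used here. The `Bounds` key and the numbers are taken from the snippet above; the surrounding `<plist>` wrapper is added so the parser accepts it.

```python
import plistlib

# The XML declaration must be the very first bytes for plistlib to
# recognise the data as an XML property list.
document = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
    <key>Bounds</key>
    <array>
        <array>
            <real>25.4278</real>
            <real>76.3008</real>
        </array>
        <array>
            <real>104.75</real>
            <real>91.8751</real>
        </array>
    </array>
</dict>
</plist>
"""

# No custom string parsing: the nesting and the number types come
# straight out of the generic plist parser.
bounds = plistlib.loads(document)["Bounds"]
print(bounds)
```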

jonhohle · on April 3, 2012

I can't verify right now, but that might be the serialized representation of an NSRect. It might not be a format Omni created.

pak · on April 3, 2012

As a nitpick, your JSON example uses single quoted keys, and it is mandatory for keys (and all strings) in JSON to use double quotes. Some parsers don't care, for instance if they are part of a full JS engine that can interpret beyond JSON's subset of JS object literal notation, but others will choke.

I guess this actually helps prove your point since text formats do tend to invoke intuitions that may cause one to stray outside of the strict specification.
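The single-quote point is easy to check against a strict parser. A small sketch with Python's standard `json` module; `parses_as_json` is a hypothetical helper name and the sample keys are made up for illustration:

```python
import json

strict = '{"name": "value"}'  # double-quoted key and string: valid JSON
loose = "{'name': 'value'}"   # single quotes: a JS object literal, not JSON


def parses_as_json(text):
    """Hypothetical helper: True if json.loads accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(strict))
print(parses_as_json(loose))
```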

gbog · on April 3, 2012

So, according to you Web pages should have been some binary blobs instead of the beautiful soup we all love to hate? Then we would not have seen this eruption of amateur pages with animated gifs? And all the Web would be clean and straight as early Frontpage HTML? Can't agree less.

viraptor · on April 3, 2012

No, maybe I should've been more specific - I mean only the formats which are well-defined and which have their own structure. Especially things like image file formats, communication protocols, etc.

For example, I wouldn't mind it if HTTP was a binary format. I don't think HTML has to be... it's a markup language, not a strict protocol. It was a bit arbitrary by design. Same thing with Markdown. They were supposed to work with existing text as much as possible. HTTP doesn't need that - no one's writing HTTP by hand, and even if they kind-of do that, they use things like curl which provide an abstraction anyway.