I must disagree here. Once you define a complex enough text protocol, there are ways to mess it up or be incompatible with it. Say what you want about ASN.1 and similar formats, but if you have a correct parser, you know the values are exactly the ones you expect.

A real-life example I kept running into was SIP implementations (think: someone who knew only HTTP decided to create internet telephony). First, you have line length when parsing: some implementations will limit it, some won't. If it's limited, some implementations will allow you to wrap the lines according to the protocol. But then others will say it's too complicated and unlimited line length is the way to go. Then you have alternative forms for SIP URIs: you can say "me" <number@ip>. Or just <number@ip>. Or number@ip. Or <number@ip>;some_parameter. Or someone else decides to go with <number@ip;some_parameter>. Some parameters have associated values, some don't - guess how many implementations don't support both ways...
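
To make the address variants concrete, here is a rough Python sketch of a lenient splitter for the five forms listed above. Everything in it (`parse_address`, `split_params`, the regex) is a hypothetical helper written for illustration; the real SIP grammar is considerably richer, and this is not how any particular stack does it. It treats parameters inside the angle brackets as belonging to the URI and parameters after them as belonging to the header, and it accepts parameters with or without a value.

```python
import re

# Hypothetical, deliberately lenient pattern; not the full SIP grammar.
_ADDR = re.compile(r'''
    ^\s*
    (?:"(?P<display>[^"]*)"\s*)?                   # optional quoted display name
    (?:<(?P<bracketed>[^>]*)>|(?P<bare>[^;\s]+))   # <uri> or a bare uri
    (?P<tail>(?:\s*;[^;]*)*)                       # parameters after the address
    \s*$
''', re.VERBOSE)


def split_params(text):
    """Split ';a;b=c' into {'a': None, 'b': 'c'} (value-less parameters map to None)."""
    params = {}
    for item in filter(None, (part.strip() for part in text.split(";"))):
        name, sep, value = item.partition("=")
        params[name] = value if sep else None
    return params


def parse_address(text):
    match = _ADDR.match(text)
    if match is None:
        raise ValueError(f"unrecognised address: {text!r}")
    inner = match.group("bracketed")
    if inner is None:
        inner = match.group("bare")
    # Anything after the first ';' inside the brackets is treated as a URI parameter.
    uri, _, uri_tail = inner.partition(";")
    return {
        "display": match.group("display"),
        "uri": uri,
        "uri_params": split_params(uri_tail),
        "header_params": split_params(match.group("tail")),
    }
```

Even this toy version has to make choices that other implementations could make differently, which is the interoperability problem in miniature.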

Before you know it, there are 1000 conventions and everyone supports some minimal common core, but fails against at least one other specific implementation.

So you say - there's always structured text - json/xml/... But look at the example in the blog post:

    <dict>
      <key>Bounds</key>
      <string>{{25.4278, 76.3008}, {104.75, 91.8751}}</string>

So how long until someone comes up with an implementation that doesn't use spaces between the numbers? What's the supported precision? What happens when one tuple is skipped? Do you have to parse exponential numbers correctly? How do you deal with duplicate keys in the dict?
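
A minimal Python sketch of what a reader for that string has to decide. `parse_bounds` is a hypothetical helper, and every tolerance in it (optional whitespace, exponents accepted, exactly two tuples required) is a choice made by this sketch, not documented behavior of the format:

```python
import re

# Hypothetical reader for strings such as "{{25.4278, 76.3008}, {104.75, 91.8751}}".
# Decimal numbers with an optional sign, fraction and exponent.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# One "{a, b}" tuple; whitespace around the numbers is optional.
_PAIR = r"\{\s*(" + _NUM + r")\s*,\s*(" + _NUM + r")\s*\}"
# Exactly two tuples wrapped in an outer pair of braces.
_RECT = re.compile(r"\s*\{\s*" + _PAIR + r"\s*,\s*" + _PAIR + r"\s*\}\s*")


def parse_bounds(text):
    match = _RECT.fullmatch(text)
    if match is None:
        # A skipped tuple, a stray character or a third tuple all end up here.
        raise ValueError(f"not a two-tuple bounds string: {text!r}")
    x, y, width, height = (float(group) for group in match.groups())
    return (x, y), (width, height)
```

A second implementation that answers any of those questions differently will still look compatible until the first odd input arrives.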

I see the appeal in text formats and then remember... no - it's not the right way. JSON is a bit less-wrong than XML here, because you know what's a number, what's a string, what's a list. You'd probably do {'Bounds': [[25.4278, 76.3008], [104.75, 91.8751]]}.

Until someone comes out with a popular implementation that does case-insensitive matching for keys... and switches from 'Bounds' to 'bounds' in some version.
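
For what it's worth, a short Python sketch of how quietly this goes wrong. The standard `json` module keeps 'Bounds' and 'bounds' as two separate keys and lets the last of two identical keys win, so a stricter reader has to be written on purpose. `reject_ambiguous_keys` and `strict_loads` are hypothetical names invented for this example:

```python
import json


def reject_ambiguous_keys(pairs):
    """object_pairs_hook that refuses keys equal to an earlier key, ignoring case."""
    seen = {}
    for key, _value in pairs:
        folded = key.casefold()
        if folded in seen:
            raise ValueError(f"ambiguous keys: {seen[folded]!r} and {key!r}")
        seen[folded] = key
    return dict(pairs)


def strict_loads(text):
    return json.loads(text, object_pairs_hook=reject_ambiguous_keys)


# Default behaviour: both of these parse without complaint.
case_variants = json.loads('{"Bounds": 1, "bounds": 2}')
duplicates = json.loads('{"Bounds": 1, "Bounds": 2}')
```

Here `case_variants` ends up with two entries and `duplicates` silently keeps only the second value, while `strict_loads` raises on either input.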



This is really more of a problem with the horrible, horrible SIP specification. In fact, look at RFC4475[1], which seems to delight in all the strange ways a message can still be valid.

It even uses the word "infer" several times. "An element may attempt to be liberal in what it receives and infer the missing quotes."

SIP was based on HTTP because at one point they thought it would interop with HTTP. Of course, that didn't happen, and they still kept HTTP's crazy parsing rules, while adding all sorts of their own things.

The date format, for instance, really annoys me. Who would ever look at "Sat, 13 Nov 2010 23:29:00 GMT" and say "yea, that makes sense for a protocol"? Why is it done that way? Oh, that's defined in another RFC. If you trace it back, you end up decades ago in RFC500-something, and the justification is basically "well there's some mail programs out there, and people seem to be writing their headers like this, so here's sort of a standardised version of what they're doing". And since then, they just carry it forward, instead of thinking "wait, does this make sense"?
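
The mail heritage shows up in tooling too. As a small illustration (assuming Python, which is not something the thread itself uses), the standard-library parser for this date format lives in the `email` package:

```python
from email.utils import format_datetime, parsedate_to_datetime

stamp = "Sat, 13 Nov 2010 23:29:00 GMT"

# Parse the mail-style date into a timezone-aware datetime.
parsed = parsedate_to_datetime(stamp)

# The same instant in a form that is easier for machines to compare and sort.
iso = parsed.isoformat()

# Going back to the header form; usegmt=True emits "GMT" instead of "+0000".
round_trip = format_datetime(parsed, usegmt=True)
```

Here `iso` is `2010-11-13T23:29:00+00:00` and `round_trip` reproduces the original header string.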

So, that legacy thinking, combined with a stubborn insistence that Postel's Law makes sense (instead of just making things much worse), and denial of reality (such as the hilarious attempts at dealing with NAT), yea, that's why SIP is horrible. If they had their own binary format, well it would have been much worse. (And SIP's problems are still deeper, even after you finish writing the horribly convoluted code required to parse such a "simple" format -- so ASN.1 would have just helped a bit; you can bet there'd also be embedded strings, with precisely the same problems.)

1: http://tools.ietf.org/rfc/rfc4475.txt


> combined with a stubborn insistence that Postel's Law makes sense (instead of just making things much worse)

has anyone written a definitive "Postel's Law considered harmful" yet? we're dearly in need of one....


Your examples demonstrate underspecified text protocols. Surely it leads to problems, but if they were underspecified binary protocols then debugging/reverse-engineering would be much harder.

So your comment proves the opposite of what you've intended.


This is an opinion, of course. I always found binary protocols harder to tweak in a way where a coding shortcut still produces a valid-ish result (one that works for your specific implementation). With a text protocol it was always much easier, even without intending to, to remove some whitespace only to discover later that someone somewhere relied on it being there.

YMMV... maybe you had different experience in the past.


The example XML you included is from a property list [0]. While there is more than one implementation, there is essentially a de facto implementation. XML is only one of the formats, and data is relatively strongly typed. All of the types provide documentation regarding exactly how they are serialized and deserialized.

In this case, the string value in the dictionary is specific to OmniGraffle. Its format is proprietary, but the only producer is also the consumer. Omni may or may not have a formal spec indicating the format of this value. You'd have to work there to find out.

[0] http://en.wikipedia.org/wiki/Property_list


The value appears to be the output of the NSStringFromRect() Foundation function, which has the corresponding NSRectFromString() to go the other way.


> JSON is a bit less-wrong than XML here, because you know what's a number, what's a string, what's a list

http://www.w3.org/TR/xmlschema-2/#typesystem


True. But since it's optional, what is the number of people who actually use it, or parsers that care? It's a bit like with xml namespaces.


XML Schema is so incredibly complex that even XML advocates avoid it.


> So how long until someone comes up with implementation that doesn't use spaces between the numbers? What's the supported precision? What happens when one tuple is skipped? Do you have to parse exponential numbers correctly?

why doesn't omnigraffle store "Bounds" as

    <dict>
        <key>Bounds</key>
        <array>
            <array>
                <real>25.4278</real>
                <real>76.3008</real>
            </array>
            <array>
                <real>104.75</real>
                <real>91.8751</real>
            </array>
        </array>
    </dict>

?
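
A nested layout like this is expressible in a property list. As a quick check, and assuming Python's standard `plistlib` as a stand-in for Apple's own serializer, the sketch below writes the same numbers as nested arrays and reads them back; the element `plistlib` emits for a Python float is `<real>`:

```python
import plistlib

data = {"Bounds": [[25.4278, 76.3008], [104.75, 91.8751]]}

# Serialize to the XML property list format (the default for dumps).
xml_bytes = plistlib.dumps(data)

# Reading it back yields typed values: a dict of lists of floats, with no
# application-specific string parsing needed.
restored = plistlib.loads(xml_bytes)
```

With this layout the questions about spacing and skipped tuples are answered by the plist parser instead of by each application.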


I can't verify right now, but that might be the serialized representation of an NSRect. It might not be a format Omni created.


As a nitpick, your JSON example uses single quoted keys, and it is mandatory for keys (and all strings) in JSON to use double quotes. Some parsers don't care, for instance if they are part of a full JS engine that can interpret beyond JSON's subset of JS object literal notation, but others will choke.
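
A small Python demonstration of that difference, using the standard `json` module as the strict parser and `ast.literal_eval` as a stand-in for a more permissive reader that understands a superset of the syntax:

```python
import ast
import json

single_quoted = "{'Bounds': [[25.4278, 76.3008], [104.75, 91.8751]]}"
double_quoted = '{"Bounds": [[25.4278, 76.3008], [104.75, 91.8751]]}'

# The strict JSON parser rejects single-quoted strings outright.
try:
    json.loads(single_quoted)
    strict_accepts_single = True
except json.JSONDecodeError:
    strict_accepts_single = False

# A reader for a larger language (here: Python literals) accepts both spellings.
lenient_value = ast.literal_eval(single_quoted)
strict_value = json.loads(double_quoted)
```

Both successful parses produce the same dictionary, which is exactly why the single-quoted spelling tends to go unnoticed until a strict parser meets it.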

I guess this actually helps prove your point since text formats do tend to invoke intuitions that may cause one to stray outside of the strict specification.


So, according to you Web pages should have been some binary blobs instead of the beautiful soup we all love to hate? Then we would not have seen this eruption of amateur pages with animated gifs? And all the Web would be clean and straight as early Frontpage HTML? Can't agree less.


No, maybe I should've been more specific - I mean only the formats which are well-defined and which have their own structure. Especially things like image file formats, communication protocols, etc.

For example, I wouldn't mind it if HTTP was a binary format. I don't think HTML has to be... it's a markup language, not a strict protocol. It was a bit arbitrary by design. Same thing with markdown. They were supposed to work with existing text as much as possible. HTTP doesn't need that - no one's writing HTTP by hand, and even if they kind of do, they use things like curl, which provide an abstraction anyway.



