In praise of... text files and protocols (jgc.org)
176 points by jgrahamc on April 3, 2012 | 88 comments


So... for the one occasion out of a million where somebody needs to "debug" a piece of data, it's necessary to suffer the bloat of a text format for every other piece of data we transmit?

How about we just standardize on a binary data representation (eg. MessagePack), and use common tools to export/import to/from a human-readable format? Best of both worlds.
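
For instance, a rough sketch of that round trip with the msgpack-java library (method names as I recall them from msgpack-java's core API - treat this as illustrative, not authoritative):

    import org.msgpack.core.MessageBufferPacker;
    import org.msgpack.core.MessagePack;
    import org.msgpack.core.MessageUnpacker;

    public class MsgPackDemo {
        public static void main(String[] args) throws Exception {
            // Pack [1,2,3]: four bytes on the wire (0x93 0x01 0x02 0x03).
            MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
            packer.packArrayHeader(3);
            packer.packInt(1);
            packer.packInt(2);
            packer.packInt(3);
            byte[] wire = packer.toByteArray();

            // Export to a human-readable form only when a human needs to look.
            MessageUnpacker unpacker = MessagePack.newDefaultUnpacker(wire);
            System.out.println(unpacker.unpackValue().toJson()); // [1,2,3]
        }
    }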

And, as an aside - why are we using XML? It's ok as a markup language, I guess, but as a container for data? We could hardly have picked a worse format:

It's crazy verbose - even when compared to other text formats (eg. JSON). Compare:

  values: [1,2,3]
with:

  <values>
    <value>1</value>
    <value>2</value>
    <value>3</value>
  </values>
Its verbosity makes it hard to read, and hard to edit.

It has a poor mapping to the structures we actually use while programming - it has no built-in notion of arrays. It has superfluous node "attributes" that don't map well to common run-time constructs.


plaintext + gzip is preferable to binary.

The early adopters of XML saw their choices as semistructured text vs nice parseable XML. The third choice, creating grammars, was ignored.

Grammars for most configuration or data transfer or protocols are trivial. Certainly since ANTLR 2.x. Much more trivial than any equivalent XML-based parse and validation tool stack.
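
For instance, a complete grammar for a simple key=value config format is only a handful of lines. A sketch in ANTLR 3 style, from memory and untested:

    grammar Config;

    file   : pair* EOF ;
    pair   : ID '=' value NL ;
    value  : ID | INT | STRING ;

    ID     : ('a'..'z'|'A'..'Z'|'_')+ ;
    INT    : '0'..'9'+ ;
    STRING : '"' (~'"')* '"' ;
    NL     : '\r'? '\n' ;
    WS     : (' '|'\t')+ { $channel = HIDDEN; } ;

The equivalent XSD plus parsing/validation glue would be several times this size.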

FWIW, ASN.1 is worse than XML. Not a defense of XML; all uses of XML are incorrect.

For my own work, I use a format descended from VRML that I call ARON (A Righteous Object Notation).

It concisely describes groves (trees annotated with key/values) and supports most commonly used datatypes. So it's a bit more concise than JSON or YAML and a lot more strongly typed.

Here's an example test file:

http://code.google.com/p/aron/source/browse/trunk/test/cronk...

I use ARON for all my own projects; as you can see, it's not really polished enough for others (yet). As this example shows, I mostly use it to loft Java object graphs. I haven't reimplemented VRML's prototyping (DEF / USE) functionality in this branch (yet).


XML has XPath. JSON and YAML do not have XPath. They have those lame language-specific constructs that aren't recursive, aren't traversable, and throw null pointer errors when they can't match.

You can format XML with xmllint and query it with xmlstarlet. You can't do that for JSON - good luck if you've got unformatted JSON.
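
And XPath needs no extra dependencies on the JVM, either - a rough sketch with the JDK's built-in javax.xml.xpath, assuming a values.xml shaped like the example upthread:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathDemo {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("values.xml"));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select every <value> under <values>, however the file is formatted.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/values/value", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                System.out.println(nodes.item(i).getTextContent());
            }
        }
    }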

Tooling is everything.


I added lightweight path expressions to LOX (Lightweight Objects for XML). It's a modern XML object model that's easy to inspect, with built-in support for XPath-like expressions (globbing, really), without all the suckage.

http://code.google.com/p/lox/

Here are some examples of path expressions.

http://code.google.com/p/lox/source/browse/trunk/test/lox/te...

Works fantastically. I use LOX for all my own XML work.

My ARON project doesn't support path expressions in the same way. ARON's grammar only supports drill down dot notation, like "parent.child.grandchild". And, thus far, I've only used ARON to loft Java object graphs, so I haven't needed an object model with path expressions.

I'd love to see a high-level, statically typed language with built-in path expressions. Groovy's GPath is the closest to the mark that I've seen.

(I really need to polish these open source side projects. And publicize them.)


Making your own is cool, but where's the value added compared to XPath?


Thanks! It was a lot of fun and has proven very useful.

Biggest reason is conciseness (less code), followed by ease of debugging.

When using JXPath or Jaxen, you have to use contexts. Pseudo code (from memory):

  Node n = new Node( "ugh" );
  // add some children here
  JXPathContext context = JXPathContext.newContext( n );
  List list = context.selectNodes( "child" );
  for( int i = 0; i < list.size(); i++ ) {
    Node m = (Node) list.get( i );
    ...
  }

Whereas LOX does it like this:

  Element root = new Element( "ugh" );
  // add some children here
  for( Element e : root.find( "child" )) {
    ...
  }

I do a lot of ETL work (XML -> SQL). The ability to debug (interactive inspection) speeds development. I'm sure you've tried to debug XPath expressions. Not easy.

Happily, all of LOX's objects implement a toString() method that renders their XML content. And debugging the evaluation of path expressions, while not easy, is feasible. Whereas with JXPath or Jaxen, it's damned near impossible. (I've written a few object adapters for JXPath; getting them right is black magic.)


No you don't. Please look up Dom4j, which simply lets you do

  List<Element> nodes = element.selectNodes( XPATH );
  Element singleNode = element.selectSingleNode( XPATH );

It also has .valueOf(XPATH) and even .numberValueOf(XPATH).


But then I'd have to use DOM4j. More seriously, I didn't even think to look at DOM4j again, so didn't know it wrapped Jaxen.

I had been using XOM for a while, a HUGE improvement over JDOM and DOM4j. Elliotte Harold's FAQ and slides explain his improvements.

LOX has an even simpler object model, uses JDK 1.5 idioms, and is MUCH easier to interactively debug. (LOX also doesn't recognize namespaces, which keeps it simple.)

When it came time to do path expressions, I initially used JXPath and then Jaxen. Both are very hard to debug.

After a hard rethink, wanting an API that supports how I like to work, I dropped XPath and came up with something a lot simpler, a lot smaller, and closer to how I work.

LOX's path expressions have only one axis (to use an XPath-ism), parent to child. I've never used any of the other axes, so no big loss.

LOX also supports globbing (wildcarding) syntax, which is a huge win over XPath.


>>> plaintext + gzip is preferable to binary.

I'm not sure I agree with you on that. You're imposing a load of extra CPU (and, I suspect, some bandwidth) overhead where it's not necessary for all except the most infrequent cases, and you're inheriting all the weaknesses of text as a data format.

Plus, since your data is now gzipped, it's no longer human-readable on the wire. In order to read it, you need to pipe it through a decoder (gunzip) - why not use a sensible binary protocol, and pipe it through the decoder for that?
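
The round trip is cheap to write, but the point stands that the wire bytes are opaque. A rough sketch with java.util.zip:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipText {
        public static void main(String[] args) throws Exception {
            String payload = "values: [1,2,3]";

            // Compress: what travels is binary, not human-readable.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            byte[] wire = buf.toByteArray();

            // Decompress: one standard, ubiquitous decoder gets the text back.
            GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(wire));
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }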


It is human readable if you use Firebug or vim. And that's what you use.


> In order to read it, you need to pipe it through a decoder (gunzip) - why not use a sensible binary protocol, and pipe it through the decoder for that?

I don't have your decoder. I have gunzip.

gunzip is not threatened by a patent. gunzip doesn't cause a Drama Meltdown. gunzip won't be a proven attack vector for remote execution exploits. gunzip does not require a contract in my hand or money in my bank. While your decoder is being debugged, gunzip will be live.

(To the tune of "The Revolution Will Not Be Televised")


Oh, I'm just advocating using a standard binary serialization format like MessagePack that is far more efficient, can easily represent binary data, and has a far more obvious mapping to runtime data structures.


In most parsing comparisons I have seen, tnetstrings[1] is probably one of the faster textual formats to parse (and rather "safe" too). I have started using tnetstring[2] (python module) for some backend message formats where I commonly use json, and it has been pretty great so far. (esp with gzip, snappy, bzip, lzma, or whatever compression fits your needs/tradeoffs)

[1]: http://tnetstrings.org/
[2]: http://pypi.python.org/pypi/tnetstring
[3]: https://github.com/j2labs/cerealization
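
The encoding itself is tiny - length prefix, payload, one type tag. A minimal encoder sketch (type characters per the tnetstrings.org spec; assumes ASCII payloads, since SIZE should really count bytes):

    import java.util.List;

    public class TNetstrings {
        // SIZE ":" DATA TYPE, where TYPE is ',' for strings, '#' for
        // integers, '!' for booleans, '~' for null, ']' for lists.
        static String encode(Object v) {
            if (v == null) return "0:~";
            if (v instanceof String) return wrap((String) v, ',');
            if (v instanceof Integer || v instanceof Long) return wrap(v.toString(), '#');
            if (v instanceof Boolean) return wrap(v.toString(), '!');
            if (v instanceof List) {
                StringBuilder body = new StringBuilder();
                for (Object item : (List<?>) v) body.append(encode(item));
                return wrap(body.toString(), ']');
            }
            throw new IllegalArgumentException("unsupported: " + v.getClass());
        }

        static String wrap(String payload, char type) {
            return payload.length() + ":" + payload + type;
        }

        public static void main(String[] args) {
            // Prints 16:5:hello,5:12345#]
            System.out.println(encode(List.of("hello", 12345)));
        }
    }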


>"standard binary serialization format"

Mmmm, yes, good luck with that :)


A man can dream.


I think there are two issues with XML/JSON/etc vs. roll-your-own format/grammar. The first issue is the difficulty of creating it, which you describe as "trivial". I think it's an "easy once you know how", but not easy beforehand, not easy for everyone, and not every aspect of it is easy even then. Another way of looking at this is that designing and writing and testing and debugging and rewriting and documenting a grammar is always going to be more work than using pre-existing code. For example, although you've put in enough work to be able to use ARON in several projects, it's not ready for others. IMHO it's a non-trivial undertaking. (BTW: does ANTLR 2.x automatically take care of left-recursive grammars?) My experience with parsing is that there be gotchas; this is one reason I'm very impressed with the principles behind XML Schema. (It's a shame it's so soul-destroyingly horrific to actually use. Also XML namespaces: righteous concept, diabolical realization.)

The second issue is familiarity/standardization vs. specific-use formats (a kind of DSL), purely in terms of ease-of-use and adoption. I think familiar formats really are easier to read - your eye and brain have internalized short-cuts for interpreting them, so you can quickly work on them. Similar to your finger-memory for your editor. An alternative must be significantly better to overcome this unfair advantage of the incumbent.

However, counter-example: for both these reasons (the first is a kind of mechanized version of the second), I thought it would be impossible to displace XML, much as ASCII seemed impossible to displace (there used to be competing character encodings; the present standard, Unicode, can be seen as a superset of ASCII). Although JSON seemed wonderfully better (I first saw it in the ActionScript version of ECMAScript), I thought it couldn't overcome XML's incumbency. Yet JSON is making inroads, and not just in web clients, but in APIs. It's interesting that JSON lacks analogs to XML Schema, XSLT, XPath; although proposals have been made, they aren't adopted. I wonder if this is because people are now figuring out how to do without these extras? Or if, when they do need them, they just use the XML stack...

I think new formats/grammars have the best chance of adoption in niches, when they are so specialized to the task that their benefits far outweigh the advantages of the standardized incumbent - in that particular use-case. For example, most programming languages define their own grammars e.g. python, ruby, java, etc.

  BTW: in your other comments with code, just indent each line
  by two spaces for nicer formatting. :-)


(Love your metaphors/adjectives.)

Grammars aren't easy, but currently they're no harder than XSD. Further, DSLs are all the rage. Personally, I prefer to use a pro tool like ANTLR to some in-language voodoo with Scala (or some such).

LALR (lex/yacc) kicked my ass. Just could not understand it.

LL(k) (ANTLR 2.x, JavaCC) was a struggle. But I slogged thru it.

LL(*) (ANTLR 3.x) is a pleasure to work with. I really enjoy it. For example, the delta between SQL's (idealized) BNF and my SQL ANTLR grammar is very small.

ANTLR 4.x, currently in progress, will do left recursion, which I haven't played with yet.

I'm not too worried about adoption. My tools make me more productive. If my competitors choose to use rocks for hammering, I'm okay with it.

Lastly, I indent with tabs. Old habits die hard. :)


I think there is a key distinction here between messages we pass between applications and data we store on disk. These problems have different tradeoffs, and a conflict probably occurs when the same bits are used for both purposes.

I think the fundamental difference is that messages seem to have these sorts of properties:

- temporary and transient; exist primarily to move information from one program to another

- may be merely copies/representations/serializations of information that already exists in a running program

- may specify implementation ("the following is a hashmap of strings to strings consisting of k1,v1;k2,v2;...")

- may use a common format/language (e.g. XML) to encode arbitrary data structures

- therefore, can use commonly available parsers

- not usually read/edited by humans

It seems like binary data is very often a good choice here, especially since performance seems to matter.

But I think the author's use case is very different. S/he's dealing with code. (Data, code, tomato, tomato.) It has different properties:

- permanent: stored to disk, possibly copied to other disks, but not "created and destroyed"

- is the real true representation of the actual object in question; as the object is changed, the data is edited/overwritten on disk

- usually descriptive of problem rather than giving implementation details ("line (7,2) -> (8,4)" rather than "<class>[line] <member>[start] <value>[<pair>[7,2]],...")

- is usually domain-specific rather than an arbitrary common language like XML

- thus, tends to require a custom parser to read

- primitive values transparently map to changes in the object

In this latter case (think HTML, LaTeX, and all programming language files), we get tons of good benefits from plain text. It can be read in any editor, it's universally and trivially portable, it can be manipulated with tools like grep and find/replace, it can be generated or altered with simple scripts and programs, etc. And finally, its primary purpose is to be compiled into a representation which is presented to the user (such as a webpage or PDF document or so on).

Those are the two extremes, I think. Data vs messages. So we can argue about which option is better for cases that seem somewhere in between, but this is the landscape as I understand it.


Your comparison of data vs. messages is interesting, although I think what you really should contrast is messages/data vs. markup.

HTML and LaTeX are clearly markup languages (or "code"). It's entirely reasonable to expect somebody to edit HTML and LaTeX in a plain old text editor. Totally agree with you on this.

Where I disagree, however: the OmniGraffle file in the article isn't code - it's a serialization of an internal data structure. The primary method for editing this data is not a text editor, nor should it be. While it's cool (I guess) that the author was able to hack around inside it, I don't think XML is a good choice here, for the reasons I described earlier. Nor do I think it's a good choice for most object serializations (including both transient messages and persistent data).

Now, the OmniGraffle file in question is pretty simple, so you could argue that maybe XML isn't so terrible. But, consider cases like http://en.wikipedia.org/wiki/COLLADA . Storing 3d object vertex data in XML is, if you ask me, insane. If you have an object with tens of thousands of vertices, you will never edit this file by hand. What is XML gaining anybody here? Yay - you can use an off-the-shelf XML parser! But you then have to copy XML's graph into your own vertex structures in order to do anything useful! So, you really haven't gained anything except vastly increased memory and CPU usage.

See http://collada.org/public_forum/viewtopic.php?f=12&t=25&... for some discussion on this.


Cool info, and thanks for the links!

I agree about XML. I think one reason XML is annoying is that it lives in both worlds -- human-readable, but able to encode arbitrary machine structure. But with OmniGraffle, for example, it seems like XML is a suboptimal choice because you shouldn't need all that XML structure getting in the way -- if you know what data to put where in the document, why throw all these tags around it?

So like you say, XML's main advantage seems to be that you can use an off-the-shelf parser, but it doesn't really seem worth it. However, I still think plain text can be a good choice; I'd just be less afraid to define my own specification. But I feel like I would rather use an extreme -- binary data (which would require custom serialization) or plain text (which would require custom parsing) -- over something general-purpose and verbose.

Disclaimer: I am not a professional software engineer and do not have experience with large systems. My opinion might change once I get my hands dirty.


There's no excuse for XML - sexps and json are just as general as XML, and much less verbose.


The biggest problem with XML is that people don't understand how to implement it properly and end up with stuff like you talk about above. Namespaces in particular can help with the problem, but I see tons of APIs failing to implement them in ways that actually alleviate the bloat; instead they opt for insanely verbose Java-style naming conventions - technically implementing namespaces and attributes, but in a way that ultimately just makes concise XPath a nightmare.


We use XML as a container because it supports namespaces and XPath / XQuery. There are no comparable JSON standards which are widely interoperable across different languages and systems.

Advanced database engines like IBM DB2 can store XML documents internally in a compressed binary format suitable for fast searches. So there's no extra overhead for tags.


This reminds me of the angst that came when Windows shifted from ".ini" files with clear name=value lines to the "registry", which paid the bills of many consultants, utility programmers, and "fixit guys" via the fun that is Regedit.

+1 for more simple text files... and hey, devs: if you don't need a nested object format, perhaps even leave out the JSON or XML and just make a simple file...


JSON looks pretty good even if it's a flat list, so I don't see a problem there.


One annoyance with JSON as a config-file format is that you apparently can't put comments in it. At least not out-of-band comments:

http://stackoverflow.com/questions/244777/can-i-comment-a-js...

You can put in in-band comments, by defining a JSON format that has a bunch of '_comment' keys sprinkled through it, but that's annoying because you've got to design that in, and decide which elements deserve comments and which don't, and that's just a big hairy yak which sits there daring you to shave it, practically begging you to do a bunch of YAGNI up-front design of the comment protocol, and then even when you're done you still have to fret about how those keys might get misinterpreted by clients from now until the end of time. Even human readers will have to intuit the semantics of your '_comment' key ("oh, the computer never reads this, this is just for me!"), whereas humans generally recognize the significance of real, standard comment syntaxes like # or ;; or // or what have you.

Or you could use a JSON preprocessor that strips the comments, but that's a little trap that you build for yourself, because now your JSON is no longer universally-parseable by any language's JSON library. You could strip the comments at build time, but now your production JSON files on the server don't have the comments, and this is likely to be just where you need them - you put them there for the troubleshooter, who will find them at 3am when the server just crashed.
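
The stripper itself is only a few lines, which is part of the trap's appeal. A naive sketch (whole-line // comments only; a real one would have to tokenize so it doesn't mangle string values containing "//"):

    // Drop whole-line // comments, then hand the rest to any JSON parser.
    static String stripComments(String text) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n", -1)) {
            if (!line.trim().startsWith("//")) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }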

But I've sometimes used JSON for these things anyway. Life is too short for perfectionism, and it is a nice clean format. Almost too clean.


You could try YAML as a superset of JSON with comments.


Indeed. This is pretty much my plan for the future.


> can't put comments in it.

Exactly! With 25 years of sysadmin / devops / whatever kids call it these days, I can attest to the fact that config files are for comments/documentation. That they happen to also configure some software is a minor side-effect.


Since JSON is (a subset of) YAML, and YAML was actually designed to be human-readable, I'd say it's the better choice.

http://yaml.org/


YAML's significantly more complex (and probably harder and more expensive to parse) though.


I agree, and I wouldn't suggest it as your general-purpose data serialization language. However, for configuration files, which are loaded either once or at most infrequently, the cost is negligible (IMO) compared to the benefit to human readability (and comments).


Agreed! I write a lot of fuzzers and often need to represent a large API with lots of interconnections. YAML makes this trivial.


One drawback to JSON files, and XML files and a number of other text file formats as well, is that they allow variable formatting. That makes them difficult to work with using tools like diff, which takes away a lot of the power of version control.

For example, you can take a JSON file and format it two ways: all on one line with little or no whitespace, and pretty-printed with multiple lines and indenting. Semantically, the two files are identical because they'll be parsed into the same data structure when you read it in. Textually they're completely different, which becomes a problem when two people with two different editors that use the two different formats when writing the file try to work on it. If person A commits the file pretty-printed, and person B makes a small change then commits it minimized, how does person A determine what's changed? It's a hassle.

XML suffers from this as well, and can be even worse. In some XML DTDs/schemas elements can be reordered without changing the semantic meaning of the document.

I've also had this problem with CSS files, which are partially order-dependent. If you have a big CSS file with lots of rules that cover different parts of your site, there are clumps of rules that are related and must be in the right order to cascade properly, but the clumps are mostly independent and their order doesn't matter.

In the past I've written tools that canonicalize formats like these just for doing diffs. It's a hassle, but it was necessary to allow my team to manage changes in shared resource files, and it worked well. Today, I'd use a tool like that as a git commit hook so that the repository always contained the format I wanted (assuming it's still usable in that format, which wasn't always the case) but back then I didn't have that option.
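
For the JSON case, a canonicalizer is mostly just "parse, sort object keys recursively, pretty-print". A sketch using Gson (any JSON library with a tree model would do):

    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import com.google.gson.JsonArray;
    import com.google.gson.JsonElement;
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class CanonicalJson {
        static JsonElement canonicalize(JsonElement e) {
            if (e.isJsonObject()) {
                JsonObject src = e.getAsJsonObject();
                List<String> keys = new ArrayList<>(src.keySet());
                Collections.sort(keys);            // objects are unordered: sort them
                JsonObject out = new JsonObject();
                for (String k : keys) out.add(k, canonicalize(src.get(k)));
                return out;
            }
            if (e.isJsonArray()) {                 // arrays are ordered: keep order
                JsonArray out = new JsonArray();
                for (JsonElement item : e.getAsJsonArray()) out.add(canonicalize(item));
                return out;
            }
            return e;
        }

        public static void main(String[] args) {
            Gson pretty = new GsonBuilder().setPrettyPrinting().create();
            JsonElement doc = JsonParser.parseString("{\"b\":1,\"a\":[3,2,1]}");
            System.out.println(pretty.toJson(canonicalize(doc)));
        }
    }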


> Textually they're completely different, which becomes a problem when two people with two different editors that use the two different formats when writing the file try to work on it. If person A commits the file pretty-printed, and person B makes a small change then commits it minimized, how does person A determine what's changed? It's a hassle.

Granted, but the exact same problem exists for most programming languages. I don't bring this up to dismiss your complaints, but to admit I'm surprised you found it to be a problem. While you mention a technical solution at the end of your post, I figured most people handled this with a non-technical solution: agreed upon formatting guidelines.


My source code editors rarely reformat my code to suit their tastes, unless I explicitly ask for them to do that.

Configuration file editors will often reformat the output as they like, because "you're not editing the config file directly anyway". A real-world example I've been dealing with lately is the sln, csproj, and xml files used by Visual Studio. VS is constantly rearranging entries in these files when I modify my projects. Add a dependency to one project, and VS will re-sort all of the dependency GUIDs for all of the other projects in the solution. When I check my change in, the sln file looks like I've made hundreds of changes (I've got a lot of projects in my solutions) when I've only made one.

Of course, VS shouldn't be doing that. But it's probably not even intentional; I'll bet the VS developers thought "Hey, we should use XML for our sln files because it's textual and readable", and they chose an XML serialization library that was already part of the platform. Then the completely separate group responsible for the XML serialization library probably thought "Hey, when we're serializing this particular type of data structure, let's sort the list elements by something arbitrary, like their memory address", and now your sln file gets re-arranged every time you modify it. This can happen to any XML-based configuration file that doesn't have strict ordering as part of the schema.


Ah, that's the answer: automated tools are making the changes, not human developers.


I don't really think it's the format of the registry that causes issues. It's still a pretty simple key value at each node, and you have multiple nodes with access control much like having multiple .ini files in different places in the file system.


There are some benefits to the registry including fine grained control of permissions and a common model for all applications.


I've seen the value of text formats personally.

I work mainly on an iPhone project. In order to handle customers requests for this or that custom UI feature, I invented a tool that generates something like a simplified XIB file from an image and a chunk of CSV. I made the end result of the tool a text file that the iPhone code parses.

Working in text saved a lot of time. Since I invented my own tool, naturally there were things to debug and tweak in the results, but I could do that with a text editor and commit the changes as easily diff-able deltas in git. Although I work on iPhone code, everything also has to be implemented in Android - but it's no problem for the Android developer to use the text file since it's just text.

I've also written tools for migrating chunks of customer data from an old back-end system to a new one using the new server's customer-facing API. This was partly for dogfooding purposes, since the API was new and had very suspect stability.

I wrote the tools to generate text files where each line is a JSON payload that would be sent to the server. It made everything easier to debug - examining the payloads lets you distinguish between errors in the export tool and errors in the API. The text files themselves could then become test cases if there was a bug in the API, or be quickly hacked to contain correct payloads via find-and-replace if the export tool was wrong but we still needed the migration to finish right away.


I'm building a tool very similar to what you describe for Android right now. It involves a simplified XML description of the GUI which is sent over a network to an Android device, then "inflated" to actual GUI code.

Could you elaborate about your system a bit?


Also important - you can version control text files easily.

One of the more brain-damaging environments I've worked in (Oracle Forms) uses a binary format for source code. There is some utility app to change the binary files into source code, but the primary files development works with are binary.

It means there's no way to merge changes, no easy way to see what changed in which commit, and simple text search across the whole project is hard - you have to open all the files at once in the IDE and search from there, which is slow and awkward.

If text files are mediocrity, let's wait until something better comes along - because binary formats are not better.


What can be even worse than using binary files for development artefacts is storing code in the underlying database with no straightforward way to map to/from files.


One of the main goals of the Redis protocol was to try to find a point in the middle between text and binary, so the protocol is completely text based, but designed to also handle binary payloads without problems. So far we have never seen the protocol as a bottleneck in Redis performance.
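
For example, SET key hello on the wire looks like this (each line terminated by \r\n):

    *3
    $3
    SET
    $3
    key
    $5
    hello

The *3 announces an array of three bulk strings, and each $N announces exactly N bytes of payload, so a value can contain any bytes at all - newlines included - without the parser ever having to scan the payload for delimiters.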


I must disagree here. Once you define a complex enough text protocol, there are ways to mess it up, or be incompatible with it. Say what you want about ASN.1 and similar formats, but if you have a correct parser, you know the values are exactly the ones you expect.

Real-life example I kept running into was a SIP implementation. (Think someone who knew only HTTP and decided to create internet telephony.) First, you have line length when parsing - some implementations will limit it, some won't. If it's limited, some implementations will allow you to wrap the lines according to the protocol. But then others will say it's too complicated and unlimited line length is the way to go. Then you have alternative names for SIP URIs: you can say "me" <number@ip>. Or just <number@ip>. Or number@ip. Or <number@ip>;some_parameter. Or someone else decides to go with <number@ip;some_parameter>. Some parameters have associated values, some don't - guess how many implementations don't support both ways...

Before you know it, there are 1000 conventions and everyone supports some minimal common core, but fails for at least one other specific implementation.

So you say - there's always structured text - JSON/XML/... But look at the example in the blog post:

    <dict>
      <key>Bounds</key>
      <string>{{25.4278, 76.3008}, {104.75, 91.8751}}</string>
So how long until someone comes up with an implementation that doesn't use spaces between the numbers? What's the supported precision? What happens when one tuple is skipped? Do you have to parse exponential numbers correctly? How do you deal with duplicate keys in the dict?

I see the appeal in text formats and then remember... no - it's not the right way. JSON is a bit less-wrong than XML here, because you know what's a number, what's a string, what's a list. You'd probably do {'Bounds': [[25.4278, 76.3008], [104.75, 91.8751]]}.

Until someone comes out with a popular implementation that does case-insensitive matching for keys... and switches from 'Bounds' to 'bounds' in some version.


This is really more of a problem with the horrible, horrible SIP specification. In fact, look at RFC4475[1], which seems to delight in all the strange ways a message can still be valid.

It even uses the word "infer" several times. "An element may attempt to be liberal in what it receives and infer the missing quotes."

SIP was based on HTTP because at one point they thought it would interop with HTTP. Of course, that didn't happen, and they still kept HTTP's crazy parsing rules, while adding all sorts of their own things.

The date format, for instance, really annoys me. Who would ever look at "Sat, 13 Nov 2010 23:29:00 GMT" and say "yea, that makes sense for a protocol"? Why is it done that way? Oh, that's defined in another RFC. If you trace it back, you end up decades ago in RFC500-something, and the justification is basically "well there's some mail programs out there, and people seem to be writing their headers like this, so here's sort of a standardised version of what they're doing". And since then, they just carry it forward, instead of thinking "wait, does this make sense"?

So, that legacy thinking, combined with a stubborn insistence that Postel's Law makes sense (instead of just making things much worse), and denial of reality (such as the hilarious attempts at dealing with NAT), yea, that's why SIP is horrible. If they had their own binary format, well it would have been much worse. (And SIP's problems are still deeper, even after you finish writing the horribly convoluted code require to parse such a "simple" format -- so ASN.1 would have just helped a bit; you can bet there'd also be embedded strings, with precisely the same problems.)

1: http://tools.ietf.org/rfc/rfc4475.txt


> combined with a stubborn insistence that Postel's Law makes sense (instead of just making things much worse)

has anyone written a definitive "Postel's Law considered harmful" yet? we're dearly in need of one....


Your examples demonstrate underspecified text protocols. Surely it leads to problems, but if they were underspecified binary protocols then debugging/reverse-engineering would be much harder.

So your comment proves the opposite of what you've intended.


This is an opinion, of course. I always found binary protocols to be harder to tweak in a way where a coding shortcut still provides a valid-ish result (working for your specific implementation). With text protocols it was always much easier, even without intending to, to remove some whitespace, only to discover later that someone somewhere relied on it being there.

YMMV... maybe you had different experience in the past.


The example XML you included is from a property list [0]. While there is more than one implementation, there is essentially a de facto implementation. XML is only one format, and data is relatively strongly typed. All of the types provide documentation regarding exactly how they are serialized and deserialized.

In this case, the string value in the dictionary is specific to OmniGraffle. Its format is proprietary, but the only producer is also the consumer. Omni may or may not have a formal spec indicating the format of this value. You'd have to work there to find out.

[0] http://en.wikipedia.org/wiki/Property_list


The value appears to be the output of the NSStringFromRect() Foundation function, which has the corresponding NSRectFromString() to go the other way.


> JSON is a bit less-wrong than XML here, because you know what's a number, what's a string, what's a list

http://www.w3.org/TR/xmlschema-2/#typesystem


True. But since it's optional, how many people actually use it, and how many parsers care? It's a bit like XML namespaces.


XML Schema is so incredibly complex that even XML advocates avoid it.


> So how long until someone comes up with implementation that doesn't use spaces between the numbers? What's the supported precision? What happens when one tuple is skipped? Do you have to parse exponential numbers correctly?

why doesn't OmniGraffle store "Bounds" as

    <dict>
        <key>Bounds</key>
        <array>
            <array>
                <float>25.4278</float>
                <float>76.3008</float>
            </array>
            <array>
                <float>104.75</float>
                <float>91.8751</float>
            </array>
        </array>
    </dict>

?


I can't verify right now, but that might be the serialized representation of an NSRect. It might not be a format Omni created.


As a nitpick, your JSON example uses single quoted keys, and it is mandatory for keys (and all strings) in JSON to use double quotes. Some parsers don't care, for instance if they are part of a full JS engine that can interpret beyond JSON's subset of JS object literal notation, but others will choke.

I guess this actually helps prove your point since text formats do tend to invoke intuitions that may cause one to stray outside of the strict specification.


So, according to you, Web pages should have been some binary blobs instead of the beautiful soup we all love to hate? Then we would not have seen this eruption of amateur pages with animated GIFs? And all the Web would be clean and straight as early FrontPage HTML? Couldn't agree less.


No, maybe I should've been more specific - I mean only the formats which are well-defined and which have their own structure. Especially things like image file formats, communication protocols, etc.

For example, I wouldn't mind if HTTP were a binary format. I don't think HTML has to be... it's a markup language, not a strict protocol. It was a bit arbitrary by design. Same thing with Markdown. They were supposed to work with existing text as much as possible. HTTP doesn't need that - no one's writing HTTP by hand, and even if they kind-of do, they use things like curl, which provides an abstraction anyway.


Sadly, we're moving away from text protocols. SPDY and WebSockets are binary.

I've telneted many times to debug HTTP issues, and I wonder what I am going to do with SPDY problems.


I don't think that binary is necessarily a bad thing. It doesn't work in all cases, but one of the major reasons that people don't want binary is because of proprietary formats. If the binary format is open, it can be easy to write open-source/well-polished tools for debugging/viewing them. Maybe not as simple and ubiquitous as netcat or telnet, but it works.

The biggest downside is the lack of easy transparency, which can result from a format/protocol being proprietary, or just obscure to the point that no one remembers the details and the documentation is lost.


note:

> Only use binary protocols where the performance is so sensitive that it's worth the implementation and debugging downside

SPDY was created explicitly to improve performance.


There are surely huge wins from having a single TCP/IP stream and gzipping all of it, including headers, but I wonder how much difference the binary format makes on top of that?

Maybe you save 100B per request? On a large page that makes 100 requests, that's 10KB. When you've already got one stream that eliminates round-trip time and TCP slow start, that's simply 80 milliseconds on basic 1Mbit broadband.


At Google's scale, you might want to multiply that by a trillion or two.


Most of SPDY's "speediness" comes from fundamental design properties, not its regrettably atextual implementation.

From Google's perspective, this level of optimization is appropriate -- they have scaling problems most of us could only dream of. From everyone else's perspective, however, it's very over-engineered.


I used to worry too, but then I learned to use Wireshark and all was well again.


For HTTP, Fiddler is great too. It can already handle parsing a bunch of formats into an easier-to-read display, so I imagine it'll handle SPDY and WebSockets too if it doesn't already.


> Sadly, we're moving away from text protocols. SPDY and WebSockets are binary.

Nothing prevents your WS payload from being textual, as far as I know.


You can use a textual payload, but the WS protocol itself (aside from the pretend-HTTP handshake) is binary, and your payload may be masked on the wire too.

http://tools.ietf.org/html/rfc6455#section-5.2

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-------+-+-------------+-------------------------------+
    |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
    |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
    |N|V|V|V|       |S|             |   (if payload len==126/127)   |
    | |1|2|3|       |K|             |                               |
    +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
    |     Extended payload length continued, if payload len == 127  |
    + - - - - - - - - - - - - - - - +-------------------------------+
    |                               |Masking-key, if MASK set to 1  |
    +-------------------------------+-------------------------------+
    | Masking-key (continued)       |          Payload Data         |
    +-------------------------------- - - - - - - - - - - - - - - - +


For performance-critical applications, I think binary encoding is warranted. After all Ethernet/TCP/IP are binary as well.


In some cases, yes, text is very nice to be able to read. But the only reason we find it easier than binary data is that we don't have built-in arbitrary binary parsers for the data.

When you read a text format or protocol, your brain is doing the job that a good debugger or parser should be doing. Because these tools don't exist for your protocol, you think "boy, how handy that I can just parse this file with my brain! It's a good thing I learned <INSERT NATURAL LANGUAGE> and that this format/protocol was written with it, and that I have decades of experience grokking it." If your only option for debugging is to open the raw format or protocol and pick at it by hand, or try to eyeball it looking for some anomaly, you're just lacking a real tool to help you do the job quicker and more effectively.

Consider two separate formats: ELF and PostScript.

ELF is a binary format which is flexible and extensible. The data it comprises is almost entirely non-human-specific; you aren't going to read it, because it's not for you, it's for the computer only. Yet a wide range of platforms have adopted it as their code file format. It could have easily been implemented in text, but what would be the point? Easier debugging? Plenty of tools exist to examine, dump, and compare the properties of these files, making an inherent textual representation redundant.

Now look at PostScript. A (relatively) simple, readable text control language to determine the output of a complex document. The problem is PostScript is more of an interpreted programming language than a file format. Instead of using a method to craft a pre-processed document for the printer, the printer had to add a costly interpreter for the document format in order to produce documents on the fly. Early laser printers had microprocessors faster than the Macintosh computers that connected to them. Was it easy to grok, edit, debug? Yes. But it also made for a more costly device to handle all that built-in textual flexibility.

To me, the best way to tell if you need a text format/protocol or not is determining the humanity of your program. Will I need to interface with the internals of this on a regular basis? Might it become so complex and large in the future that a tool to debug it might become necessary? And how much work would it really be to just write it once and enjoy it forever?


See also "The Power of Plain Text" in "The Pragmatic Programmer." Why people think binary formats are the be-all-end-all (or even the correct solution for the majority of problems) continues to confound me. But then again, I don't get NIH syndrome either, and many who insist that binary is better usually want to invent their own binary format.


One thing that seems to be missed here is that text formats tend to be self documenting. That is, if I'm handed a blob of text vs a blob of binary, I am quite a bit more likely to be able to hack the text than the binary. This is usually couched in terms of being able to process the text form using standard tools, but it goes way, way beyond that. It matters most in the situation outlined above -- when third parties need to get at the format without having to rely on some provided tool.

Most protocols and file formats are not documented sufficiently. Encoding data in binary (unnecessarily) is unfriendly because there is a huge difference between "ACK" and 0x06 when you're a third party looking at the data with no reference. Sure, you can probably figure it out given enough time, trial and error. Or perhaps beg the developer for specs, but it's not particularly efficient. Most developers don't have the time or the inclination to publish a public spec for all binary formats used in their product.

You can make illegible text formats, of course, but I'd argue that then you're simply making a binary format that's confined to the range of 7-bit ASCII. Similarly, when the goal is to obfuscate (e.g. algorithm IP), binary formats work well to dissuade casual investigations.


Text files are great, especially if one has the Unix shell skills to quickly manipulate and query them without having to write a program. But XML is often overkill compared to a simpler text format.


Of all the improvements on which we spend our newly gained processor power and memory bandwidth each year, textual formats are probably one of the most worthwhile priorities.

The window where binary formats are absolutely required has shrunk down to the lowest levels. With data formats we're basically where we've been with "scripting" languages for years: we can afford to start at the highest possible level and trickle down towards compiled code only where necessary.

Further, whenever I've ever had to create a binary format I've written a translator first. The translator is a program that can read the binary format and write the same information out in editable text form, and that can also parse the text and write out the equivalent binary. Then I only work and debug using the text format and just convert to binary when needed.

Usually this approach implies defining an API that you can use to construct the text or binary message. The API gets tested automatically because I start by working with the text representation, and when I finally move on to writing binary natively (for performance reasons, obviously), I can trust that it works as well as it did during development. And I can use the API to generate text, too, so I can easily compare and see what's going on.
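
In code, the shape of the approach is roughly this (all names made up; the point is one message API behind two interchangeable encodings):

    import java.io.*;

    class Message { int id; String body; }

    interface MessageCodec {
        void write(Message m, OutputStream out) throws IOException;
        Message read(InputStream in) throws IOException;
    }

    // Line-oriented text form: develop, debug, and diff against this one.
    class TextCodec implements MessageCodec {
        public void write(Message m, OutputStream out) throws IOException {
            out.write((m.id + " " + m.body + "\n").getBytes("UTF-8"));
        }
        public Message read(InputStream in) throws IOException {
            BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            String[] parts = r.readLine().split(" ", 2);
            Message m = new Message();
            m.id = Integer.parseInt(parts[0]);
            m.body = parts[1];
            return m;
        }
    }

    // Binary form: swapped in later, purely for performance.
    class BinaryCodec implements MessageCodec {
        public void write(Message m, OutputStream out) throws IOException {
            DataOutputStream d = new DataOutputStream(out);
            d.writeInt(m.id);
            d.writeUTF(m.body);
        }
        public Message read(InputStream in) throws IOException {
            DataInputStream d = new DataInputStream(in);
            Message m = new Message();
            m.id = d.readInt();
            m.body = d.readUTF();
            return m;
        }
    }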


The "text" part is important and has been under attack by all proprietary formats for a long time, but right now it is the "file"part that is under attack, by the cloudy services we are using more and more. I think it may be an even greater threat under individuals' control over their own properties.


I have a great big hammer in the *nix text tools, and I try very hard to buy nothing but text nails.


Off-topic, but still: Probably the reason JGC is having problems with CMYK colours is that he specifies them in RGB. Those are not the same.


No. Those are examples from the samples that OmniGraffle give away. I didn't want to show the actual file I was working on.


I think about how great text-based protocols are every day, and every time I use them (multiple times per day.)

This is my biggest turn-off to Microsoft products and the Windows world. I actually like XML sometimes over undocumented JSON because it's so easy to figure out what everything does (in the case of succinct XML, which is admittedly rare.)


Disclaimer: I am not a programmer

I've written little scripts to produce diagrams in DXF and in PS format for astronomical maps and in maths teaching. Often quicker, and neater, than plotting with a mouse.

Your typical end-user hacks - nothing that would be of any use for published software.


100% agreed! Currently I'm trying to get access to an ITSM tool that uses HTTP as its transport mechanism. It uses ActiveX controls to gather data into a grid-like mechanism.

As I particularly hate ActiveX, I have started reverse engineering what these controls do. So far so good, except that the format used for the data that the controls receive is an application/octet-stream binary format.

Now I've worked out how the format works, and by using JDataView I'm parsing the format. But you know what? Internet Explorer chokes on null characters in strings and will not go any further, even though ECMA-262 states that:

"The String type is the set of all finite ordered sequences of zero or more 16-bit unsigned integer values... All operations on Strings (except as otherwise stated) treat them as sequences of undifferentiated 16-bit unsigned integers; they do not ensure the resulting String is in normalised form, nor do they ensure language-sensitive results."

If they had passed the data back in something saner like JSON, or heck, even XML!, then things would have been fine. As it is, I've decided to skip Internet Explorer, as it's just not worth my time to get around this issue, and every other browser works fine with JDataView.


Let's hear it for 1970s technology. May we ever be stuck with Unix mediocrity!


In the 1970s, when a megabyte cost many thousands of dollars, there was much more reason to use binary formats than there is now. If text made sense then, it makes much more sense now.


I've been waiting for something better to come along, for a long time.

Still waiting.


Everything old is new again in this industry. Your "1970's technology" might as well be "2020's technology" for all we know. The only thing constant in this world seems to be how cyclical it is.


This is what I actually believe.



