I have minimal experience with YAML (all of it configuring CI environments, because apparently that's what all the modern ones prefer). I don't really know the ins and outs of the format, and try to stick to extremely simple representations.
Today I learned that what I perceived as somewhat complicated syntax is actually overwhelmingly complex.
I know that my preference for s-expressions is not shared by everyone, but the more complex a syntax I encounter, the more I wonder if simpler alternatives were even considered.
Genuine question: apart from inertia, and apart from recursive references (noted in comments before I wrote this), is there a use case for YAML that isn't solved by simpler and less ambiguous formats like EDN or Transit?
Just one data point, but I had never heard of transit or edn before today. Both look awesome.
On the other hand I picked up YAML in less than a day (many years ago) and despite its complexity can't remember that ever causing me any issues. Not 100% sure if missing from your suggested formats (which do support multi line strings) but I enjoy use of yaml's folded strings.
I think I'd also fail all the cited bugs in a code review. YAML is great, provided you pretend most of the spec doesn't exist and treat it as JSON with some syntactic sugar. I don't ever want to see !! even in output for machines.
The big problem is when the useless features that you should never, ever use (and if you do you are doing it wrong) bite you on the arse. Because it tried to do too much, like XML, even when we knew that was a bad idea. But it is still the best 'JSON with better layout' out there, which is what we want when we want readable documents and maintainable code. EDN, Transit and TOML all fail in that somewhere, IMO, and of course popularity is actually a hugely important feature for any data interchange format.
Human writable? I have little experience with it, but if what Amazon uses for its CodeDeploy configuration files is representative for the format, I disagree.
Yes, part of that is CodeDeploy, but saying "tab characters must not be used in indentation", for me, disqualifies a standard as "human writable".
CodeDeploy makes that horrendous by requiring five or so clicks to figure out what causes an error (try forgetting to add an appspec.yml file, or saving a file with a BOM).
You're hearing about it now because 99+% of people who use YAML have no idea about how horrible the spec is. They just assume that it's like what strictyaml strives to be and don't understand why "such an elegant, simple and readable language isn't used more".
In other words, people don't understand the need for it. Those that do use other formats - json or toml, namely.
Writing movie scripts in YAML using unquoted strings? That's pretty contrived. Using literal style is easy when it is potentially needed (e.g. programmatic output), and any decent editor can highlight inferred types in helpful ways. I've used YAML in a variety of contexts and never been bitten by this one, and I don't think that any of his examples are still problems in YAML 1.2 (from 2009).
The Ruby security problem they reference is also absurdly misattributed. The problem there is with trusting serialized data to mark its own types, and having no limits on what types can be deserialized into. That's a depressingly common security problem in many web frameworks, and YAML as an interchange format isn't a unique source of vulnerability. Any data format is dangerous on the web if you trust it to create arbitrary types.
What I find more interesting is that YAML allows you to define self-referential / circular data structures:
foo: &foo
  bar: *foo
In PyYAML, this will give you a self-referential dictionary. Powerful but pretty catastrophic if you use (naive) recursion to analyze a user-submitted data structure.
I would imagine this feature is primarily useful if you want to serialize a whole bunch of objects that all reference each other, but it makes me feel a bit icky. It feels like it breaks an intuitive assumption you have about hierarchical formats, which is that there should be no cycles.
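To make the hazard concrete, here is a minimal Python sketch. It builds the structure by hand instead of loading it through PyYAML, so it assumes the loader hands back a dict whose "bar" value is that very same dict. The function name max_depth is made up for illustration. It remembers which containers are on the current path, so a cycle ends the walk instead of recursing until the stack blows up.

```python
def max_depth(node, _on_path=None):
    """Nesting depth of dicts/lists; a container already on the
    current path (a cycle) is treated as a leaf."""
    if _on_path is None:
        _on_path = set()
    if not isinstance(node, (dict, list)):
        return 0  # scalars add no depth
    if id(node) in _on_path:
        return 0  # we have come back to an ancestor: stop here
    _on_path.add(id(node))
    children = node.values() if isinstance(node, dict) else node
    depth = 1 + max((max_depth(c, _on_path) for c in children), default=0)
    _on_path.discard(id(node))
    return depth


# Hand-built stand-in for what the anchor/alias example is described
# as producing: the inner dict contains itself under "bar".
foo = {}
foo["bar"] = foo
data = {"foo": foo}

print(max_depth(data))
```

Drop the id check and the same call never reaches a leaf, which is the failure mode for any naive walker fed user-submitted input.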
Maybe that's just my personal bias, but I feel like the relative simplicity of JSON is a strong feature in its favor. As a developer, I have very clear understanding of the data I'm reading, and with that I can more easily make safer and more stable code.
That's certainly true, but at least that's not going to cause an infinite loop while wandering down the structure, it's just gonna end at a leaf "ref" or "id".
In my view this is more a question of simple interface vs. simple implementation. Having cross-references at the language level gives a uniform and simplified interface at the cost of having to develop more careful tools.
Not saying refs/id are bad, but at least with a uniform syntax and language support, you don't have to reimplement custom cross-reference resolvers.
You are right for most use cases - which is why json is so popular. Sometimes however there is very good reason to have non-hierarchical data - and YAML fits this use case.
I'm not sure what your search query is supposed to prove. That something is rare doesn't mean it doesn't exist. I have personally used this YAML feature, albeit very rarely, obviously.
> YAML is what people turn to who need these kind of advanced features
might be literally true, it is also true that
YAML is what many people turn to even if they don't use those features
So the original concern of
> Powerful but pretty catastrophic if you use (naive) recursion to analyze a user-submitted data structure.
is probably reasonable.
There is a pretty good likelihood that many YAML users (i.e. the app developers) are unaware of the power exposed in the format, and the possible consequences of parsing & walking user input.
It brings back all sorts of memories of XML entity attacks, etc. When developers build simple data interchange methods on top of complex formats that they don't fully understand, all sorts of issues emerge.
It's not YAML's problem per se - it's very helpful that there is a commonly used, complex data format for people who need such features. It would be far worse if everyone was inventing their own solution for this - but the reality on the ground is still pretty messy.
It does seem to be a rarely used feature though. Furthermore it confuses non-programmers who otherwise are happy with YAML (I've tested this, it really does).
Given all of this, and given that what it does can be easily achieved in other ways that don't require the feature, I think it's a net negative overall to have it in the spec.
It may be rarely used, generally speaking, but it's used a lot in CI configs. The linked example is the typical use case, and it's documented in both CI services' config section.
I'm certain this feature is a major reason why YAML was chosen for CI config over TOML or JSON, and the way it's used in those config files is, as I said, _not_ recursive, so the issue discussed here isn't relevant. Rejecting recursive definitions can be a sensible improvement.
What other ways do you have in mind that would be a viable replacement?
>What other ways do you have in mind that would be a viable replacement?
Where there is not very much repetition required, simply use repetition instead (e.g. in the above example I would just copy & paste use-db to the relevant locations). The increased readability makes it worth the repetition where there's not a lot of it.
Where there is a lot of repetition I'd consider that a bug or feature required in the schema and would refactor the YAML schema and the parsing code so that less repetition is necessary for most configs.
I don't consider the above CI config particularly easy to understand, and even with the node anchors and references there's a bunch of repetition (e.g. spinach 1, spinach 2, spinach 3, etc.). That's partly because of the node anchors and references. It would look better if they refactored the schema.
Without a looping or sequence construct, this will be hard to avoid, and once you go that route, you might as well adopt a Lisp as your config syntax.
Before there was XML, there was SGML, and SGML has DSSSL, which was a Scheme dialect, and that was a brilliant idea which has been ditched for XML and its surrounding specs. S-Expressions are natural for the task. There are still prominent uses of S-Expressions, even in the OCaml world, where Jane Street uses it for configuration and serialization.
The full power of a Lisp may be too much and risky, but you can restrict the spec and allow only certain constructs, so that you can ensure it will evaluate to a result in a quick and deterministic way.
The little "templating" feature used in gitlab-ci.yml is a huge benefit, and as long as parsers limit the feature, it's a good compromise between complexity and comfort.
There is a lot of experience with using Lisp for configuration, and that experience can be leveraged to come up with a good S-Expressions based config format that's flexible, deterministic, and known to terminate in constant time. If you're careful with the spec, you could even validate the config fully.
So, I'm actually surprised no Lisp dialect has gained popularity in the modern web stacks for configuration. Maybe if Clojure were more popular and less XML-influenced from the JVM world. Just thinking out loud.
That's certainly my use case. Things like easy multi-line strings make it very useful, and readable by less tech-inclined folks. Though for that audience ArchieML is better still:
This is an opportunity to plug my private project "WSL" [1] which is a clean text serialization format for relational databases. The scope is somewhat different and it's not really released (but beginning to stabilize), but I'll be happy to hear what you think.
But it's completely unusable for the Real World (tm) because strings cannot contain the '[' and ']' characters AND YET there is no mechanism for escape sequences. What if my data legitimately contains those two characters???
That's just the default string type. I started out with escaping but noticed it's a lot of complexity that is rarely needed (not for my own use case, which is accounting, inventory, and some web apps which don't need it).
The advantage of not having escaping is easier seds and greps which don't miss the field boundaries.
The important concept though is that arbitrary datatypes elements can be added by the user of an API implementation (the python library already offers that). The datatypes define their lexical syntax, like in perl6. I will also declare more "default" datatypes and might include a C-like string after enough consideration.
This can be useful but to avoid confusion, I wouldn't call it a string type at all. Maybe it's an identifier or a symbol or a label or something like that? If you're excluding square brackets, there are probably other special characters you want to exclude too?
Thanks for the feedback, and I'm glad to see other people worry about these details, too!
ASCII control characters are forbidden in the entire WSL file. Then apart from [] everything is allowed.
For practical applications, by far the most important requirement of strings is being able to include space characters to make a short sequence of words, like [Trinidad and Tobago].
I don't know a better word for "sequence of words" than "String". Maybe "Words", but technically it really is a string (containing an arbitrary sequence of the allowed characters). Even approaching the enforcement of more structure would be a lot of work with little returns. And you can't include a literal newline in a C string literal, and you can't even have a NUL character in the interpretation (memory layout) of it, right?
I actually started out with a C-like string as default type (so named it "String") but noticed
- A big problem with string culture is that "" strings use identical start and end markers. A problem which Joe Armstrong mentioned as well.
- Escaping means significantly higher complexity of parsing out the interpretation from the literal, while it's not really needed for most applications.
Both these problems make data unnecessarily hard to process with dirty one-off scripts. So after considering some other options I'm now with [this style] because [] are not too often needed or when needed can often be substituted with (), and are very pleasing on the eye in most fonts.
In conclusion I guess it will stay "String", and other less frequently needed types will be called "CString", "Base64", and "BinaryString". Or optional parameterization will be created for "String" to declare the escaping style without needing a separate metatype.
If you're designing a language it's probably okay to have string literals that can't contain certain characters, so long as there is some other way to do it. It's a bit different for a serialization format.
The question is what you do when you're converting some data from some other format (for example, dumping a database) and there are strings that actually contain these special characters. Even if it's a bit ugly, it's good to have some way to represent the data so that it can be read back in again without any loss. In this kind of tool, you can't just say "don't do that" because the data has already been saved - you're just converting it. (So my idea of not calling it a string type probably doesn't make sense, on second thought, if you want to be able to interoperate.)
Square brackets are an interesting choice. If you just want to do simple lossless escaping, it doesn't seem that hard:
\\ means a literal backslash
\] means a literal ']'
Anything else gets written as-is. (But, what if the database actually does contain control characters in some of its strings?)
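A minimal Python sketch of that two-rule scheme, to show how little code it takes. The function names are invented for illustration and have nothing to do with WSL's actual library. The decoder is stricter than it needs to be (it rejects any escape other than the two listed), which is an assumption on my part.

```python
def escape(value):
    """Encode a string so it can sit between '[' and ']'."""
    # Backslashes first, so the ones added for ']' are not doubled again.
    return value.replace("\\", "\\\\").replace("]", "\\]")


def unescape(escaped):
    """Inverse of escape(); rejects escapes the scheme does not define."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)
        if nxt not in ("\\", "]"):
            raise ValueError("invalid escape sequence")
        out.append(nxt)
    return "".join(out)


def write_field(value):
    """Wrap an escaped value in square brackets."""
    return "[" + escape(value) + "]"


print(write_field("Trinidad and Tobago"))
print(write_field("array[0] = C:\\temp"))
```

The cost is the one already mentioned upthread: once escapes exist, a one-off sed or grep can no longer find the end of a field just by looking for the next ']'.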
You can't create a serious serialization format that fails its primary job of serializing data.
What you implemented there isn't a serialization format, it's a SQL DSL. Any time you implement a DSL, the first question should always be: "Have you considered S-Exprs?". If s-exprs do the job, then you just saved yourself having to implement a ton of stuff, including 90% of a spec.
Well, the primary job of WSL isn't serializing arbitrary binary data. And it doesn't fail there either, it only encourages clean syntax by default. But it can very easily be made to serialize anything.
The conventional wisdom is "All problems in computer science can be solved with another level of indirection".
Nevertheless, as stated I will very likely add C-style string literals to the required set of datatypes.
As to "it's a SQL DSL". It's very much not. You missed the point. And S-Exprs is completely out of scope (no schema support, no nice and canonical syntax, etc.)
Piling on to the general theme of the rest of the comments here:
I really wish there was a more popular middle-ground between YAML and JSON. People use YAML because it's the next step up from JSON if you need comments, etc., but I think most purposes would be better served by the likes of JSON5, HJSON, or TOML (for example) if only any of those were as popular as JSON and YAML.
I implemented a YAML parser from spec last month. It (YAML) goes to great lengths to provide human-friendly features, trading off computer-friendliness to a fairly extreme extent.
Eliminating 'plain' scalars (unquoted strings-as-values), folded multiline literals, tags, anchors/aliases, and possibly directives, as a sort of reduced yaml would make the language a lot less silly for the kinds of things a lot of people end up using it for.
Right. My hypothesis here is that, if there were, instead of JSON and YAML to choose from, three standards, one of which were something slightly less human-friendly but much simpler than YAML, I think it would be widely adopted.
I think YAML is more widely-used than it would otherwise be, in a broader variety of domains, simply because the only popular alternative doesn't have comments.
I'm not drawing from this any sort of conclusion that we should try to push one of these alternatives as viable, just lamenting the state of things.
I use YAML as human-editable JSON - especially for non-programmers (although it's surely more pleasant to write even for programmers).
I would like a clearly documented YAML subset that left out some of the more complex stuff and avoided a few 'more than one way to do it' features. That would go a long way to removing some of the criticism of YAML with very little cost in functionality.
My experience with YAML is that it's very temperamental with respect to /whitespace/ and that your editor might try to get too smart and damage the document.
JSON, if you see a pattern and follow the pattern, is likely to work.
JSON CAN be stored in a 'pretty' way, with extra whitespace, which makes it even more obvious how to format a document; that's frequently how I write out small bootstrap config files (IE the database connection string to get the main config from).
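For instance, a small sketch with Python's standard json module. The key names and the connection string are placeholders made up for illustration.

```python
import json

# Hypothetical bootstrap config: just enough to find the real config.
config = {
    "database": {
        "connection_string": "postgresql://localhost/app_config",
    },
    "environment": "staging",
}

# indent=2 puts each key on its own line; sort_keys keeps diffs stable.
text = json.dumps(config, indent=2, sort_keys=True)
print(text)
```

The extra whitespace changes nothing for the parser; loading the text back gives the same data.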
You see, as much as you have strong feelings about whitespace (to the extent of reaching for your caps-lock key), so do I. I personally feel it's a great loss to code style and readability that Python became an outlier in terms of arguing for significant whitespace.
But this isn't the time for that tired old debate. I won't convince you any more than I'll convince someone about the One True Brace Style or Vim vs EMACS (actually - hold on. I can't stand either of them).
Python's mistake is not including a default-coding-style formatting system with the language (or if it does, making it so obscure I don't know about it).
The formatter should ALSO, when cleaning up the code, see if it compiles with four spaces equal to a tab, and then if not, if eight spaces is a tab.
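Roughly what I mean, as a Python sketch. The function name is made up, and it is cruder than a real formatter would be: str.expandtabs also rewrites tabs inside string literals, and "compiles" here only means the built-in compile() accepts the result.

```python
def expand_tabs_if_compiles(source, widths=(4, 8)):
    """Return source with tabs expanded at the first width that
    compiles, or None if no candidate width produces valid code."""
    for width in widths:
        candidate = source.expandtabs(width)
        try:
            compile(candidate, "<formatter-input>", "exec")
        except SyntaxError:
            # IndentationError and TabError are SyntaxError subclasses.
            continue
        return candidate
    return None


# One line indented with a tab, the next with eight spaces.
mixed = "if True:\n\tx = 1\n        y = 2\n"
print(repr(expand_tabs_if_compiles(mixed)))
```

With this input the four-space guess gives inconsistent indentation and is rejected, so the eight-space expansion is the one that comes back.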
I'd prefer that levels of indent be a tab (not space; if you're going to make presentation of indent important make it something the client CAN modify without changing the meaning of the code).
Can you give an example of where an editor has damaged your documents? I've never had that experience when editing YAML with Vim, Emacs, or TextMate… which editor are you using?
Focusing on the editor won't help because then you're adding an unexpected constraint on the user.
I think I happened to experience it using kwrite, or maybe also notepad++ when I had to edit something from a non-default computer once. The point still remains that editors which /try/ to be smart and helpfully correct, convert, or automatically indent things, won't always do it the way you might have wanted. What works in one context won't work in others, and having context sensitive whitespace handling can bite you when the program isn't aware of which context it's supposed to be crafting the document in (EG when you start a new one or copy an example from the web).
"Unexpected constraint" is a very curious choice of words. In my experience, when you teach novice programmers, braces and semicolons are the unexpected constraints. Line breaks and indentation is a bit more expected to the novice programmer. ("Optional semicolons" like in JavaScript seem to be the worst of both worlds.)
Programmers with experience will expect things to behave like the languages they know. If you are experienced with Python, Haskell, or YAML, they won't seem like surprises.
Funny that Notepad++ and Kwrite have given you problems, those are usually fairly good editors and should be able to write Python / Haskell / YAML just fine. I wonder what happened to cause those problems.
I hate it when languages and formats care about whitespace beyond just whether there is some between two tokens.
Not being able to use tabs in Haskell is annoying for example. And I find having to indent stuff so it is off the page to the right to get stuff to compile unsatisfactory.
Same with YAML. I detest a format where indenting matters. It shouldn't matter.
I get that you hate Haskell, Python, and YAML. Could you explain why you are sharing this hate with everyone here? I mean, what is the point? There are lots of things I hate, but I don't revel in that hatred.
server
  path: /core/www/
  host: example.com
  port: 80
  service: true
  proxy
    host: proxy.example.com
    port: 8080
    authentication: plain
  description
    :Primary web-facing server
    :Provides commerce-related functionality
server
  ...
  proxy host="proxy.example.com" port=8080
    authentication: plain
There's no character escaping, brackets, distinction between attributes and elements, data types (up to the one parsing the document how to decode values), etc. BOM-free UTF-8 is mandatory.
It's the bare minimum for an extensible tree-based data structure, but I threw in three ways to assign values to nodes to help craft smaller documents. Turns out it's hugely important to be able to assign multiple child nodes on the same line as the parent for certain classes of documents.
My implementation of BML uses my string library, so I can't offer a portable parser. But it shouldn't take more than an hour or two to write a BML parser.
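To give a feel for the size of the job, here is a rough Python sketch that handles only part of the syntax: bare node names, "name: value" lines, and ":text" continuation lines, with space-only indentation. The inline attribute form (name key=value on one line) is deliberately left out, and the dict layout of the nodes is my own choice for illustration, not anything taken from the real implementation.

```python
def parse_bml_subset(text):
    """Parse an indentation-based tree into a list of root nodes.
    Each node is {"name": str, "value": str or None, "children": list}."""
    root = {"name": None, "value": None, "children": []}
    stack = [(-1, root)]  # (indent width, node), innermost last
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        stripped = raw.lstrip(" ")
        indent = len(raw) - len(stripped)
        # Close every node that is not an ancestor of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if stripped.startswith(":"):
            # ":text" lines add to the parent's (multi-line) value.
            line = stripped[1:]
            if parent["value"] is None:
                parent["value"] = line
            else:
                parent["value"] += "\n" + line
            continue
        name, sep, value = stripped.partition(":")
        node = {
            "name": name.strip(),
            "value": value.strip() if sep else None,  # values stay strings
            "children": [],
        }
        parent["children"].append(node)
        stack.append((indent, node))
    return root["children"]


sample = """\
server
  path: /core/www/
  proxy
    host: proxy.example.com
    port: 8080
  description
    :Primary web-facing server
    :Provides commerce-related functionality
"""

nodes = parse_bml_subset(sample)
print([child["name"] for child in nodes[0]["children"]])
```

Values come back as plain strings, in keeping with leaving the decoding to whoever reads the document.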
> Downside is that I have no clue how to market it, and I hate writing documentation, so there's only a few people using it.
You could always write up a quick demo program (something anyone can just clone, compile, and execute), throw in a few example files, and submit your repository as a Show HN. That might get some attention at least.
>I would like a clearly documented YAML subset that left out some of the more complex stuff and avoided a few 'more than one way to do it' features.
I looked for one some months ago and the closest I found was https://github.com/crdoconnor/dumbyaml, which is a parser implementation, not a standard. Out of projects approaching the issue from the opposite angle ("smarter JSON"), I think HOCON (https://github.com/typesafehub/config/blob/master/HOCON.md) is the most promising. Compared to Hjson (http://hjson.org/), it is less likely to confuse the user with how trailing commas are treated in unquoted strings and generally appears to not compromise on strictness for convenience.
That said, I agree with TillE: as soon as a configuration file format I work on needs any logic or expressions I will strongly consider an embedded scripting language like Lua or Tcl in order to avoid instantiating Greenspun's tenth rule for configuration.
I completely agree with this. I use YAML as a way to write structured data that may be serialized into JSON at another endpoint (mostly because JSON is much more universal and what does it matter to me how exactly my data is serialized by another machine). I know of its other features but haven't found the need to try them, yet every security hole those features make possible ends up limiting the acceptance of YAML.
We need this so much. YAML is by far the best choice for human-readable and editable serialization, but the bells and whistles are really unnecessary and hold back innovation among the parser packages.
Not TillE, but I suspect the point is you can exchange data with Lua tables just as easily as JSON and because you would do it by just embedding the Lua interpreter you can also send Lua functions around with the data and/or process the data in Lua before making it accessible from the actual program.
Thanks. I was looking at the process to process data sharing that you would use YAML or JSON for and how Lua would be used. It's never occurred to me to serialize a table and send it. I'll put that on the stack of things to research.