I used to be like you. I believed in the proper correctness of markup: proper closing tags, proper nesting. But I've come to see the light. The WWW succeeded and flourished because of its faults and its lazy error checking. Thousands of non-technical people were writing their own HTML. Thankfully it didn't have to be perfect, and it worked.
I still like tidy, clean code, but I don't agonize over its perfection.
I hear that repeated, but I don't find it convincing. A simple grammar would make it easy to find errors and kick them out immediately. Instead, we ended up with shitty ambiguous standards (common in "friendly" text-based protocols) and still have to deal with cross-browser compatibility.
If HTML had had error checking and kicked out unspecified/ambiguous syntax, people might have left off tags (decided not to bold something or make a list), omitted some images or something.
It's hard enough writing a spec - there will be unforeseen combinations resulting in conflicting behaviour. The answer isn't to give up and make the spec loose.
HTML5 isn't loose -- it has a well-defined procedure for handling errors.
Which is worlds better than XML's "every error is a fatal error" approach, since real-world XML is often non-well-formed (and, when validity checking is possible, invalid), and tools ignore that to varying degrees and recover or ignore just like they do with older versions of HTML.
(my favorite example of all time, with that, is the ability of XHTML documents to have their well-formedness status depend entirely on the HTTP Content-Type header, and at the time none of the major toolchains actually handled it)
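To make the contrast between "every error is a fatal error" and recover-and-continue concrete, here is a minimal sketch using only Python's standard library. The snippet, the function names and the `TagLogger` class are made up for illustration. Note that `html.parser` is a lenient tokenizer, not an implementation of the HTML5 parsing algorithm, so it only shows the general "keep going" behaviour, not the exact recovery steps the HTML5 spec defines.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: the <b> element is never closed.
SNIPPET = "<p>Hello <b>world</p>"


def parse_as_xml(text):
    """Return 'ok', or the fatal error the XML parser reports."""
    try:
        ET.fromstring(text)
        return "ok"
    except ET.ParseError as exc:
        # An XML parser must stop at the first well-formedness error.
        return f"fatal: {exc}"


class TagLogger(HTMLParser):
    """Records every event the lenient parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_as_html(text):
    """Return the list of events; mis-nesting does not abort parsing."""
    logger = TagLogger()
    logger.feed(text)
    logger.close()
    return logger.events


if __name__ == "__main__":
    print(parse_as_xml(SNIPPET))
    print(parse_as_html(SNIPPET))
```

The XML path gives up at the mismatched `</p>`, while the HTML path still reports every tag and text run it saw, which is roughly the behaviour the two camps in this thread are arguing over.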
Can you detail this often non-well-formed XML? I've not seen any XML parsers that handle invalid XML. Except for people who wrote their own XML parser and think a simple regex is enough.
Validation is another issue, and I don't think you'll find anyone saying that the myriad XML addons are simple or easy :).
The mixing of HTTP and HTML also seems like a bit of a strange hack to me. And let's not start talking about well-formed HTTP; I'd be surprised to find many real-world clients or servers actually following the inane HTTP spec. Just like mail clients don't always handle comments in email addresses.
Well, the classic example is XML + rules about character encoding. Suppose I send you an XHTML document, and I'm a good little XML citizen and in my XML prolog I mention that I've encoded the document as UTF-8. And let's say I'm also taking advantage of this -- there are some characters in this document that aren't in ASCII.
So I send it to you over HTTP, and whatever you're using on the other end -- web browser, scraper, whatever -- parses my XML and is happy. Right?
Well, that depends:
* If I sent that document to you over HTTP, with a Content-Type header of "application/xhtml+xml; charset=utf-8", then it's well-formed.
* If I sent it as "text/html; charset=utf-8", then it's well-formed.
* If I sent it as "text/xml; charset=utf-8", then it's well-formed.
* If I sent it as "application/xhtml+xml", then it's well-formed.
* If I sent it as "text/xml", then FATAL ERROR: it's not well-formed.
* If I sent it as "text/html", then FATAL ERROR: it's not well-formed.
Or, at least, that's how it's supposed to work when you take into account the relevant RFCs. This is the example I mentioned in my original comment, and as far back as 2004 the tools weren't paying attention to this:
These are the kinds of scary corners you can get into with an "every error is a fatal error" model, where ignorance or apathy or a desire to make things work as expected ends up overriding the spec, and making you dependent on what are actually bugs in the system. Except if the bug ever gets fixed, instead of just having something not quite look right, suddenly everyone who's using your data is spewing fatal errors and wondering why.
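The six Content-Type cases listed earlier in this comment can be written down as a small lookup. This is only a sketch of the rules as the comment describes them: the function names are hypothetical, and the defaults table (us-ascii for `text/xml`, iso-8859-1 for `text/html` when no charset parameter is sent) is an assumption based on my reading of the RFCs of that era, not something to rely on without checking them.

```python
# Assumed transport-level defaults when the header carries no charset.
# These override whatever the XML prolog says for text/* types.
TRANSPORT_DEFAULTS = {
    "text/xml": "us-ascii",
    "text/html": "iso-8859-1",
}


def parse_content_type(header):
    """Split a Content-Type header into (media type, charset or None)."""
    media_type, _, params = header.partition(";")
    charset = None
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"').lower() or None
    return media_type.strip().lower(), charset


def effective_charset(content_type, prolog_encoding):
    """Encoding a spec-following consumer would have to assume."""
    media_type, charset = parse_content_type(content_type)
    if charset:
        return charset  # an explicit charset parameter wins
    if media_type in TRANSPORT_DEFAULTS:
        return TRANSPORT_DEFAULTS[media_type]  # the prolog is ignored
    return prolog_encoding.lower()  # e.g. application/xhtml+xml


def verdict(content_type, prolog_encoding="utf-8"):
    """Verdict for a document with non-ASCII characters whose prolog
    declares prolog_encoding."""
    assumed = effective_charset(content_type, prolog_encoding)
    if assumed == prolog_encoding.lower():
        return "well-formed"
    return "FATAL ERROR"


if __name__ == "__main__":
    for header in (
        "application/xhtml+xml; charset=utf-8",
        "text/html; charset=utf-8",
        "text/xml; charset=utf-8",
        "application/xhtml+xml",
        "text/xml",
        "text/html",
    ):
        print(f"{header!r}: {verdict(header)}")
```

The point of the sketch is that the bytes of the document never change between the six cases; only the header does, and that alone flips the result.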
Meanwhile, look at things like Evan Goer's "XHTML 100":
HTML has strict implementation requirements and loose authoring requirements. I recall that it is a goal of HTML that a significant percentage of "anyone" can create useable documents with it, but the closest I can come to a citation at the moment is this: http://wiki.whatwg.org/wiki/FAQ#Why_does_this_new_HTML_spec_...
One of the things I really like about HTML5, actually, is that it recognizes that real-world HTML is not perfect... and then specifies exactly how parsers should deal with imperfections.
It worked because the rendering engines picked up the slack: Gecko, Trident, and WebKit are all orders of magnitude more complex for having to reinterpret pages to reach some nebulous notion of correctness.
One of the big differences between HTML4 and HTML5 is that implicit closing tags are defined in the spec, and not just a consequence of browser implementations. So "error handling" in HTML4 has essentially become a feature in HTML5.
For XHTML, one of the big ideas was that you could use an XML parser, and embed custom XML. Since an XML parser errors on invalid input, it can be smaller and faster. Having an XML parser also means embedded XML is easy to deal with. However, all this falls down when you consider that nearly all XHTML was sent as HTML, so the XML parser never kicked in. All it meant in practice was that you were required to write properly formatted files.
Sadly, Microsoft deserve a fair amount of blame for this, for not ever really supporting XHTML in IE back when it was so dominant. Oh, I mean, they "supported" it in that it would render, but they didn't support the application/xhtml+xml content-type, which meant that, in turn, nobody served their XHTML as application/xhtml+xml, and so on.
I won't say the lack of widespread adoption of XHTML was all Microsoft's fault, but they definitely played a role.
Plenty of things do. Such as web analytics tools & plugins that can only work in a non-XHTML-compliant way.
My favourite was a Google tool (can't remember what it was - Google Website Optimizer?) that required you to use some godawful <script> construction that was necessarily broken. And you'd have thought Google would know better.
That space before the closing slash is actually not allowed in XML, but was required for browsers that couldn't interpret XHTML. XHTML was broken from the get-go; the only virtue it had was that it taught a generation of web developers to be consistent in their markup.
(By the way, since sibling nodes have no specified order in XML, there's no reason why one paragraph should have followed another on a web page consistently, and the <ol> was an oxymoron.)
Because maybe we would not have to reinvent the wheel (making it oval, by the way) for each and every "new" feature that comes along with HTML5 (I'm looking at you, Web Components).