Malformed HTML? None. Browsers attempting to be lenient in what they accept? Loads.
To take this article as an example, according to the HTTP specification, the `Content-Type` header is supposed to have the final say in what media type is being served. Internet Explorer decided it would be better to use heuristics. I think the idea was that if a web host was misconfigured, rather than have the web developer fix their bug, it would try to guess its way out of the error.
Which kinda worked. The problem was, it opened it up to abuse. If you had a web host that allowed untrusted people to upload images (e.g. profile photos), you could construct an image that tricked Internet Explorer into thinking that it was an HTML document, even if the server explicitly told clients that it was an image. The main difference between images and HTML, of course, is that HTML can contain JavaScript, which would now execute in the security context of your web page.
So all of these web hosts, thinking they were only giving people the ability to upload images, were now letting people execute JavaScript on their domain – simply because Internet Explorer tried to be lenient.
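To make that concrete, here's a rough sketch of the sort of upload that pulled this off – the exact bytes IE's sniffer looked for varied by version, so treat the payload as illustrative rather than a working exploit:

```typescript
// Illustrative only: a "GIF" that is mostly HTML. Several old IE versions
// sniffed the first couple of hundred bytes of a response and, on finding
// tag-like content, rendered it as HTML despite the image/* Content-Type.
import { writeFileSync } from "node:fs";

const payload =
  "GIF89a" + // the magic bytes a naive upload check might look for
  "<html><body><script>alert(document.cookie)</script></body></html>";

// Uploaded as e.g. "avatar.gif" and served back as image/gif, a sniffing
// browser could still decide this was HTML and run the script in the host
// site's origin.
writeFileSync("avatar.gif", payload);
```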
The workaround ended up being forcing downloads with `Content-Disposition` headers instead of displaying inline. That's why, for example, visiting the URL of an image on Blogger directly triggers a download instead of showing the image.
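For completeness, the defence ended up being a combination of headers. A minimal sketch, assuming a Node-style server and a made-up file path:

```typescript
import { createServer } from "node:http";
import { createReadStream } from "node:fs";

createServer((req, res) => {
  // Say exactly what the file is...
  res.setHeader("Content-Type", "image/png");
  // ...tell the browser not to second-guess it (nosniff arrived in IE8,
  // largely because of this class of bug)...
  res.setHeader("X-Content-Type-Options", "nosniff");
  // ...and force a download rather than inline rendering – the
  // Content-Disposition workaround described above.
  res.setHeader("Content-Disposition", 'attachment; filename="avatar.png"');
  createReadStream("./uploads/avatar.png").pipe(res); // hypothetical path
}).listen(8080);
```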
Other examples that spring to mind:
Netscape interpreting certain Unicode characters as less than signs. People were correctly escaping `<` as `&lt;`, but the Unicode characters slipped through and caused XSS vulnerabilities in that browser.
Browsers ignoring newlines in pseudo-protocols. Want to strip `href="javascript:…"` out of comments? No problem… except some browsers also executed JavaScript when an attacker placed a newline anywhere within the `javascript` token.
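To make the second one concrete, here's a sketch of the kind of filter that looked right but wasn't – the regex and function name are invented for illustration:

```typescript
// A naive comment filter: strip anything that looks like a javascript: href.
// (Hypothetical filter, purely to illustrate the bypass.)
function stripJavascriptHrefs(html: string): string {
  return html.replace(/href\s*=\s*(["'])\s*javascript:[^"']*\1/gi, 'href="#"');
}

// The token "javascript" never appears contiguously, so the filter lets it pass...
const comment = '<a href="java\nscript:alert(document.cookie)">click</a>';
console.log(stripJavascriptHrefs(comment) === comment); // true – untouched

// ...but lenient browsers of the era stripped the whitespace back out of the
// scheme before following the link, so it still executed as JavaScript.
```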
Being lenient in what you accept has caused security vulnerabilities over and over again and there's no reason to think that it will stop now.
> To take this article as an example, according to the HTTP specification, the `Content-Type` header is supposed to have the final say in what media type is being served. Internet Explorer decided it would be better to use heuristics. I think the idea was that if a web host was misconfigured, rather than have the web developer fix their bug, it would try to guess its way out of the error.
> Which kinda worked. The problem was, it opened it up to abuse. If you had a web host that allowed untrusted people to upload images (e.g. profile photos), you could construct an image that tricked Internet Explorer into thinking that it was an HTML document, even if the server explicitly told clients that it was an image. The main difference between images and HTML, of course, is that HTML can contain JavaScript, which would now execute in the security context of your web page.
> So all of these web hosts, thinking they were only giving people the ability to upload images, were now letting people execute JavaScript on their domain – simply because Internet Explorer tried to be lenient.
This is an interesting example, though I think this is a fair bit different than being lenient on HTML interpretation.
The topic was strict HTML. Accepting malformed HTML doesn't seem to pose much of a problem. Blindly executing a non-executable file seems like a much different problem.
> Netscape interpreting certain Unicode characters as less than signs. People were correctly escaping `<` as `&lt;`, but the Unicode characters slipped through and caused XSS vulnerabilities in that browser.
This doesn't sound like being lenient. This just sounds like a bug.
> Browsers ignoring newlines in pseudo-protocols. Want to strip `href="javascript:…"` out of comments? No problem… except some browsers also executed JavaScript when an attacker placed a newline anywhere within the `javascript` token.
Huh? I don't understand the scenario being described here. It again sounds like a bug rather than lenient acceptance of data, though.
> I think this is a fair bit different than being lenient on HTML interpretation.
It's not. There are two areas where the leniency was a problem here. Firstly, the leniency in rendering one media type as a completely different media type because the browser heuristic thought it was being lenient. Secondly, the leniency in parsing HTML out of an image file – you can't do that with valid HTML.
> Accepting malformed HTML doesn't seem to pose much of a problem.
I've literally just given three specific examples of it causing security vulnerabilities.
> This doesn't sound like being lenient. This just sounds like a bug.
No, it was intentional. It was specifically Unicode characters that looked like less than and greater than signs, but weren't.
> I don't understand the scenario being described here.
Somebody noticed that `href="java\nscript:…"` wasn't being parsed as JavaScript, and it was causing some malformed pages to fail to work properly. Rather than let it fail, they tried to fix it by stripping out the whitespace, and caused a security vulnerability.
If these three examples aren't enough, take a look at OWASP's XSS filter evasion cheat sheet. There are plenty of examples in there of lenient parsing causing security problems.
> It's not. There are two areas where the leniency was a problem here. Firstly, the leniency in rendering one media type as a completely different media type because the browser heuristic thought it was being lenient. Secondly, the leniency in parsing HTML out of an image file – you can't do that with valid HTML.
I think you can argue the first is a problem. You have an example demonstrating as much. Arguing that the second is a problem is much harder. Lenient HTML acceptance has been hugely advantageous to the adoption of the web. There may have been some issues from this, but it's valuable enough that the effort to "fix" it was abandoned and the W3C and WHATWG returned to codifying what leniency should look like.
> I've literally just given three specific examples of it causing security vulnerabilities.
Well, at least one example. Coercing a file served as an image to HTML isn't an issue of accepting malformed HTML, nor would I agree that the JS example is a problem with leniency.
> No, it was intentional. It was specifically Unicode characters that looked like less than and greater than signs, but weren't.
Okay, I reread your last comment. I initially thought you were saying that Netscape was treating '&lt;' as '<'. So Netscape decided to treat some random Unicode chars that happen to look kind of like the less-than symbol (left angle bracket: ⟨, maybe?) as if they're the same as the less-than symbol? This seems amazingly short-sighted and pointless. How was this issue not seen, and was this even solving a problem for someone?
> Somebody noticed that `href="java\nscript:…"` wasn't being parsed as JavaScript, and it was causing some malformed pages to fail to work properly. Rather than let it fail, they tried to fix it by stripping out the whitespace, and caused a security vulnerability.
So the issue here is incompetent input sanitization. I don't think the browsers being lenient here is the issue.
A few of these are interesting in the context of browsers being lenient. For example, this one requires lenience as well as poor filtering:
`<IMG """><SCRIPT>alert("XSS")</SCRIPT>">`
Most of these are just examples of incompetence in filtering, though, and a great example of why you 1) should not roll your own XSS filter if you can avoid it, and 2) should aggressively filter everything not explicitly acceptable instead of trying to filter out problematic text.
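As a sketch of what point 2 looks like in practice – escape everything by default instead of hunting for known-bad substrings (the helper is my own, not from the cheat sheet):

```typescript
// Allow-list thinking: treat every character as data unless it is explicitly
// safe, instead of trying to enumerate dangerous substrings.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The OWASP payload above comes out inert, because nothing it relies on
// survives as markup.
console.log(escapeHtml('<IMG """><SCRIPT>alert("XSS")</SCRIPT>">'));
```

And if you genuinely need to let some markup through, a maintained sanitizer library is still a better bet than either approach.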
> Arguing that the second is a problem is much harder. Lenient HTML acceptance has been hugely advantageous to the adoption of the web.
Wait, that's a completely different point. The argument here is that it caused a security vulnerability, and it did. If the lenient HTML parser didn't try to salvage HTML out of what is most certainly not valid HTML, then it wouldn't be a security vulnerability.
> This seems amazingly short-sighted and pointless. How was this issue not seen, and was this even solving a problem for someone?
You could ask the same of most cases of lenient HTML parsing. It's amazing the lengths browser vendors have gone to to turn junk into something they can render.
> So the issue here is incompetent input sanitization.
No, it isn't. That code should not execute JavaScript. The real issue is that sanitising code is an extremely error-prone endeavour because of browser leniency – because you don't just have to sanitise dangerous code, you also have to sanitise code that should be safe, but is actually dangerous because some browser somewhere is really keen to automatically adjust safe code into potentially dangerous code.
Take the Netscape less than sign handling. No sane developer would think to "sanitise" what is supposed to be a completely harmless Unicode character. It should make it through any whitelist you would put together. Even extremely thorough sanitisation routines that have been worked on for years would miss that. It became dangerous through an undocumented, crazy workaround some idiot at Netscape thought of because he wanted to be lenient and parse what must have been a very broken set of HTML documents.
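To illustrate the point (the exact lookalike characters Netscape folded aren't documented here, so take the fullwidth forms U+FF1C and U+FF1E as stand-ins):

```typescript
// An escaping routine that correctly handles the real '<' and '>'...
function escapeHtml(untrusted: string): string {
  return untrusted.replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// ...does nothing to fullwidth lookalikes (U+FF1C, U+FF1E), because to any
// sane parser they are just ordinary text.
const payload = "＜script＞alert(1)＜/script＞";
console.log(escapeHtml(payload) === payload); // true – passes through untouched

// A browser that "helpfully" folds lookalike characters into '<' and '>' turns
// that harmless text back into markup after the escaping has already happened.
```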
This is not a problem with incompetent sanitisation. It's a problem with leniency.
You have some compelling examples of problems from leniency. I think in some cases the issues are definitely magnified by other poor designs (bad escaping/filtering) but you've demonstrated that well-intentioned leniency can encourage and even directly cause bugs.
Malformed HTML may escape sanitization on input in a vulnerable web app, and still render in the victim's browser because that browser wants to be helpful.
(Yes, the output should have been escaped, but that is sadly not always the case)
I don't see how this has anything to do with malformed HTML or lenient rendering rules. In the scenario you're describing, well-formed but malicious HTML could also escape sanitization.
How many security flaws are the result of malformed HTML?