Why aren't browsers strict about HTML? [closed]

It's a well-known fact that browsers will accept invalid HTML and do their best to make sense of it. If you create a web page containing only the following code:

<html>
    <head>
        <title>This is bad HTML</title>
    <body>
        <h1>Bad HTML</h2>
        <p>This is a paragraph
    </body>

then the page will still be parsed and rendered in an acceptable way. Whether the result is what you meant depends on each browser's interpretation of your mistakes.

This, to me, is the same as if JavaScript could be written like this:

if (some_var == 1) {
    say_something("some text');
else {
    do_something_else();
// END OF CODE

which a JavaScript compiler, written with the same effort to make sense of invalid code, could probably parse as you meant, or interpret in its own way and run anyway.

I've seen several articles and questions asking "Is it even worth writing valid HTML?", which present several opinions on the pros and cons of writing valid HTML. However, what this really makes me wonder is:

Why are browsers accepting invalid HTML in the first place?

NOTE: The following are not additional questions, but a way to give context to the only question I'm asking here:

  • Why aren't browsers strict?

  • Why don't they reject invalid code with errors, just like any programming language does? (Not that I'm calling HTML a programming language, but you get the point.)

  • Wouldn't that force all developers to write HTML code that will be interpreted exactly the same in any browser?

  • If browsers refused to parse invalid markup, wouldn't that effectively result in valid markup everywhere, from anyone wanting to publish content on the web?

  • If this comes from historical reasons and backward compatibility, isn't it time to change, now that we already see sites like adsense.google.com refusing compatibility with IE < v10?

EDIT: Those voting to close this question, please reconsider. This is neither a broad question nor an opinion-based one. It's a very specific question on a very specific subject, completely related to the programming world, and it can definitely be answered with a real answer by those who actually know it. Thanks.

Sibie answered 29/8, 2014 at 0:33 Comment(24)
Oh geez. Waaayyy too broad and impaired by history (it ain't pretty, brutha, it's downright quirky). And, uh, early Javascript (first probably 10 years worth) was tragically awful.Dishpan
Honestly that is a question that we developer types have been wondering for years...Labe
@JaredFarrish: I honestly don't see what is "too broad" about it. It is a very specific question on a very specific subject.Sibie
Actually, browsers (at least Firefox) are strict for XHTML.Intermingle
@SLaks: No, they are not. They won't refuse to parse invalid code, they will still parse it and try to make sense out of it, even if you specify a strict DOCTYPE.Sibie
@FranciscoZarabozo: If you send as application/xhtml+xml, I believe it is strict.Intermingle
I'd be really interested to learn some of the historical context for this, but I have to agree with Jared. It's not that the question is broad, but the scope of any reasonable answer would be too much for this formatSheri
It's too broad because one, it's five questions, and two, it's a history that includes SGML, XML, XHTML, multiple browser rendering systems, the browser wars, quirks mode, word processing, desktop publishing, et cetera. It's a real history, so it's not opinion (unless someone's guessing), but it is long and tedious and much too much for this type of forum. Might be a book somewhere on it.Dishpan
If you need an answer, it's because it ultimately doesn't matter. The browser needs to display whatever it can so it can be consumed. Strictness as a rule only implies that the format of the data is meaningful, and in many (vastly most, probably all) of the cases of the webpage-content-driven world, that just matters far less than giving the user something other than a stack trace. It's more a policy, meaning it's somewhat politically driven, and hence who cares, unless you're into that kinda thing. Content online is dirty. So it has to be dealt with as so.Dishpan
@JaredFarrish: The only reason "it doesn't matter" is all the work behind the parsers to make sense of invalid code. It doesn't matter because "it will be fixed for you". Whatever you feed the browser, it will become a strictly organized structure internally after parsing. That cannot be a good reason to say it doesn't matter.Sibie
I would like to answer that question, but unfortunately I don't have enough time to write the whole complete answer for that. That's why I agree it's too broad, although it would be a great addition for the community.Steddman
I'm not arguing with you (nor am I agreeing). You're tilting at windmills. The content online is not all clean and neatly structured (and will never, ever be), the world has dealt with it, it's perfectly imperfect, the world carries on. If you want to make semantically perfect and syntactically strict and valid websites, go for it. Nobody is stopping you.Dishpan
Here's an interesting (if you're into sports) corollary on how perfect symmetry in complex systems is not always desirable, and can instead be detrimental.Dishpan
Not only too broad but the actual example given seems to be incorrect, as the HTML spec actually allows the HTML tag to be omitted, see w3.org/TR/REC-html40/intro/sgmltut.html and w3.org/TR/REC-html40/sgml/dtd.htmlSapless
@rfernandes: The example omits closing head, closing p and closes h1 with an h2.Sibie
@JaredFarrish: It's not five questions, it's one question and I'm trying to give context to what I'm asking. I edited the question to reflect that. I really don't think it's very fair to have this question closed as "too broad", because it's really not. Saying it's difficult to answer is not the same as the question being too broad.Sibie
@Francisco If you look at the documents I linked, you will see that HEAD can also be omitted, and P can have its end tag omitted. Read section 3.2.1.Sapless
When browsers were first written and were trying to win the browser war, would you want to use the browser that displays that page, or the one that says invalid HTML? That's why. It's not a program that needs perfect instructions, it was just trying to display a documentKapor
while it's a very broad question, it's a very good one IMO.Twoedged
@Intermingle is correct. If you're not sending it as application/xhtml+xml, you're never sending XHTML; assuming the default is text/html you'll always be sending tag soup.Squishy
For the record, the only thing invalid about the example you have given is the h1/h2 tag mismatch. If you fix that, everything else is valid HTML 4 and HTML5. Yes, including the missing end tags for html, head and p.Squishy
This is off-topic since it calls for speculation and discussion. It is also unclear what you are asking, i.e. it would not even be a suitable start for a discussion. But perhaps foremost, it’s far too broad.Consolute
@JukkaK.Korpela: I honestly don't see how it's unclear what's being asked - it's a very specific question and it probably has a very specific answer. I agree it calls for speculation, but that's more in the nature of people wanting to always say something just because it involves a subject they are familiar with, even when they don't know the real answer, not because the question itself is a bad one.Sibie
@FranciscoZarabozo, to begin with, “why” questions are generally unclear. Does it ask for a cause (in the causal sense), or about the motivation of browser vendors, or the purpose, or a reason that would be acceptable (to some people), or something else? (And it’s hard to find any interpretation that would make it a practical programming question suitable for SO.)Consolute
P
6

"Why are browsers accepting invalid HTML in the first place?"

For compatibility reasons, and in the case of newer browsers, because HTML5 dictates an algorithm for parsing even invalid documents.

Earlier HTML specifications were ambiguous in many situations, such as what should happen when an unexpected tag is seen or when tags are nested inconsistently, as in <b><i></b></i>. Even so, many documents "just worked" because earlier browsers ignored unexpected tags or even "corrected" incorrect nesting.
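
As a rough illustration of that kind of tolerance, here is a minimal Python sketch. It assumes Python's standard html.parser as a stand-in for a forgiving tokenizer, and the `repair` helper is a hypothetical, naive recovery strategy invented for this example; it is not what any browser or the HTML5 specification actually does.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records tags and text in document order without validating nesting."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def repair(events):
    """Hypothetical, naive recovery: close open elements down to the match
    and drop end tags that match nothing. Not the HTML5 algorithm."""
    open_tags, out = [], []
    for kind, value in events:
        if kind == "start":
            open_tags.append(value)
            out.append(f"<{value}>")
        elif kind == "data":
            out.append(value)
        elif value in open_tags:
            # End tag with a match: implicitly close everything opened after it.
            while open_tags:
                top = open_tags.pop()
                out.append(f"</{top}>")
                if top == value:
                    break
        # Otherwise it is a stray end tag, which is silently ignored.
    return "".join(out)


logger = EventLogger()
logger.feed("<b><i>text</b></i>")  # misnested, yet no exception is raised
logger.close()
repaired = repair(logger.events)
print(logger.events)
print(repaired)
```

The tokenizer reports the tags in the order they appear and leaves the decision about nesting to whoever builds the tree. Different recovery rules give different trees, which is one way to picture why browsers used to disagree on the same invalid input.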

But now the HTML5 specification includes a much less ambiguous algorithm for parsing HTML documents. Note that the algorithm defines points where "parse errors" can occur. These parse errors usually don't stop a modern browser from displaying an HTML document, although the browser is free to report them in its developer tools, and the specification only permits, rather than requires, aborting:

[U]ser agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification. [Emphasis added.]

But again, no modern browser, to my knowledge, aborts parsing a document this early because of parse errors (barring extraordinary situations, such as running out of memory).
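
To make the contrast with a strict parser concrete, the sketch below feeds the markup from the question to two Python standard-library parsers. Both are assumptions chosen for illustration: xml.etree.ElementTree stands in for XML-style handling that aborts at the first error, and html.parser stands in for a tolerant reader. Neither is a browser engine or an implementation of the HTML5 algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The invalid page from the question (h1 closed by h2, missing end tags).
BAD_HTML = """<html>
    <head>
        <title>This is bad HTML</title>
    <body>
        <h1>Bad HTML</h2>
        <p>This is a paragraph
    </body>
"""


def strict_parse(markup):
    """XML-style parsing: the first well-formedness error aborts everything."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as err:
        return f"aborted: {err}"


class TextCollector(HTMLParser):
    """Tolerant reading: keeps whatever text it can find, whatever the tags say."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def lenient_parse(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.text


strict_result = strict_parse(BAD_HTML)    # an "aborted: ..." message
lenient_result = lenient_parse(BAD_HTML)  # the readable text of the page
print(strict_result)
print(lenient_result)
```

The strict path gives the reader nothing but an error message, while the tolerant path still recovers the title, the heading and the paragraph. That difference is roughly what a user would experience if a browser chose to abort at the first parse error.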

On the adsense.google.com situation: This probably has nothing to do with invalid HTML. More likely, the DOM support in IE9 and earlier is not sufficient for adsense.google.com's needs.

Paramedical answered 29/8, 2014 at 13:45 Comment(0)
G
5

I don't know why they allowed it from the start, but here is why they can't switch now: legacy support. If a browser enforced strict HTML, huge parts of the internet would just break. Yes, some people would update their code, but some pages would simply be lost. There is no incentive for browsers to do this, because to the consumer it would seem that the browser just doesn't work on some pages, and they would switch to another browser that still supports less-than-optimal HTML.

Basically, because it was allowed from the beginning, it has to be allowed now.

Goalie answered 29/8, 2014 at 0:50 Comment(6)
The standards groups could devise a horribly strict version of HTML that kneecaps any wayward content not toeing the line of its parse rules, but ironically the powers that be made HTML 5 less strict than HTML 4(.1) precisely because strictness doesn't matter in the context of the web browser space.Dishpan
@falsarella: Browsers could start placing a big red bar stating there are errors in the markup because it was poorly written, and still show the contents. That would make the user aware of it, take responsibility away from the browser itself, and put a lot of shame on bad HTML writers, which would quickly lead to coders making an effort to write valid markup and actually feel proud of their correctness.Sibie
@FranciscoZarabozo: Users don't care if the HTML is valid or not.Melodics
@thirtydot: They don't care only because it's fixed by the browser. They care about websites not working properly and wonder if it's their browser's or computer's fault.Sibie
@Jared Farrish: You can always challenge yourself by writing strict polyglot markup and serving it as application/xhtml+xml ("XHTML5").Squishy
@Squishy - I'll get back to you on that. ;)Dishpan
S
2

To avoid opinion-based answers, this type of question requires an answer based on an authoritative reference with credible and/or official sources.

The following excerpts are quoted from the W3C Validator Help & FAQ. They address the question "Why are browsers accepting invalid HTML in the first place?" and some related concerns raised here.


About Markup

Most pages on the World Wide Web are written in computer languages (such as HTML) that allow Web authors to structure text, add multimedia content, and specify what appearance, or style, the result should have.

As for every language, these have their own grammar, vocabulary and syntax, and every document written with these computer languages are supposed to follow these rules. The (X)HTML languages, for all versions up to XHTML 1.1, are using machine-readable grammars called DTDs, a mechanism inherited from SGML.

However, Just as texts in a natural language can include spelling or grammar errors, documents using Markup languages may (for various reasons) not be following these rules.

[...]


Concepts

One of the important maxims of computer programming is: "Be conservative in what you produce; be liberal in what you accept."

Browsers follow the second half of this maxim by accepting Web pages and trying to display them even if they're not legal HTML. Usually this means that the browser will try to make educated guesses about what you probably meant. The problem is that different browsers (or even different versions of the same browser) will make different guesses about the same illegal construct; worse, if your HTML is really pathological, the browser could get hopelessly confused and produce a mangled mess, or even crash.

That's why you want to follow the first half of the maxim by making sure your pages are legal HTML.

[...]


Validity might not mean quality, and invalidity might not mean poor quality

A valid Web page is not necessarily a good web page, but an invalid Web page has little chance of being a good web page.

For that reason, the fact that the W3C Markup Validator says that one page passes validation does not mean that W3C assesses that it is a good page. It only means that a tool (not necessarily without flaws) has found the page to comply with a specific set of rules. No more, no less. This is also why the "valid ..." icons should never be considered as a "W3C seal of quality".


Unexpected browser behavior might mean that they actually don't accept invalid markup

While contemporary Web browsers do an increasingly good job of parsing even the worst HTML “tag soup”, some errors are not always caught gracefully. Very often, different software on different platforms will not handle errors in a similar fashion, making it extremely difficult to apply style or layout consistently.

Using standard, interoperable markup and stylesheets, on the other hand, offers a much greater chance of having one's page handled consistently across platforms and user-agents.

[...]


Compatibility problems

Checking that a page “displays fine” in several contemporary browsers may be a reasonable insurance that the page will “work” today, but it does not guarantee that it will work tomorrow.

In the past, many authors who relied on the quirks of Netscape 1.1 suddenly found their pages appeared totally blank in Netscape 2.0. Whilst Internet Explorer initially set out to be bug-compatible with Netscape, it too has moved towards standards compliance in later releases.

[...]


Relying too much on 3rd party tools

The answer to this one is that markup languages are no more than data formats. So a website doesn't look like anything at all! It only takes on a visual appearance when it is presented by your browser.

In practice, different browsers can and do display the same page very differently. This is deliberate, and doesn't imply any kind of browser bug. A term sometimes used for this is WYSINWOG - What You See Is Not What Others Get (unless by coincidence). It is indeed one of the principal strengths of the web, that (for example) a visually impaired user can select very large print or text-to-speech without a publisher having to go to the trouble and expense of preparing a separate edition.

Steddman answered 29/8, 2014 at 13:42 Comment(4)
I feel like this doesn't really answer the question "Why aren't browsers strict" but "Why should we follow W3C rules", which is kind of different.Dogger
@ClémentMalet: Although I almost feel the same as you, the second paragraph gets very close to answering it. Far closer than any other comment or answer in the question so far: Specifically the part that says: One of the important maxims of computer programming is: "Be conservative in what you produce; be liberal in what you accept." Browsers follow the second half of this maxim by accepting Web pages and trying to display them even if they're not legal HTML.Sibie
@ClémentMalet I have updated my question to be focused on the 'on-topic' part of it.Steddman
@FranciscoZarabozo That 2nd paragraph is debated a lot anyway: W3C is very slow to validate new tools, tags, methods... so much so that nobody wants to try to validate their pages, because they would have to remove their super-new-features HTML 14.0Dogger
