Why aren't browsers strict about HTML? [closed]

Asked 29/8, 2014 at 0:33 Answered 29/8, 2014 at 13:45

It's a well-known fact that browsers will accept invalid HTML and do their best to make sense of it. If you create a web page containing only the following code:

<html>
    <head>
        <title>This is bad HTML</title>
    <body>
        <h1>Bad HTML</h2>
        <p>This is a paragraph
    </body>

then you will get a web page parsed in a way that shows an acceptable view. Whether it is what you meant depends on each browser's understanding of your mistakes.

This, to me, is the same as if JavaScript could be written like this:

if (some_var == 1) {
    say_something("some text');
else {
    do_something_else();
// END OF CODE

which a JavaScript compiler written with the same effort to make sense of invalid code could probably parse as you meant, or make its own sense of, but run after all.
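
For contrast, here is a minimal check of how a real JavaScript engine treats that snippet, assuming Node.js or any engine that provides the standard `Function` constructor. `tryCompile`, `brokenSource`, and `fixedSource` are names made up for this sketch. The constructor only compiles the source and never runs it, so the undefined functions do not matter.

```javascript
// The snippet from above, unchanged: mismatched quotes and missing braces.
const brokenSource = `
if (some_var == 1) {
    say_something("some text');
else {
    do_something_else();
// END OF CODE
`;

// The same logic with the quotes and braces repaired (illustrative only).
const fixedSource = `
if (some_var == 1) {
    say_something("some text");
} else {
    do_something_else();
}
`;

// Hypothetical helper: compile without executing and report the outcome.
function tryCompile(source) {
  try {
    new Function(source);
    return "compiled";
  } catch (error) {
    return error.name;
  }
}

console.log(tryCompile(brokenSource));
console.log(tryCompile(fixedSource));
```

The engine reports a `SyntaxError` for the broken version and makes no attempt to guess what was meant; only the repaired version compiles.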

I've seen several articles and questions regarding the question "Is it even worth writing valid HTML?", which present several opinions on the pros and cons of writing valid HTML. However, what this really makes me wonder is:

Why are browsers accepting invalid HTML in the first place?

NOTE: The following questions are not more questions, but a way to give context to the only question I'm asking here:

Why aren't browsers strict?
Why don't they reject invalid code with errors, just like any other programming language? (not that I'm calling HTML a programming language, but you get the point)
Wouldn't that force all developers to write HTML code that will be interpreted exactly the same in any browser?
If browsers refused to parse invalid markup, wouldn't that effectively result in valid markup everywhere and from anyone wanting to publish content in the web?
If this comes from historical reasons and backward compatibility, isn't it time already to change when we already see sites like adsense.google.com refusing compatibility with IE < v10?

EDIT: Those voting to close this question, please reconsider. This is not a broad question, nor is it an opinion-based one. It's a very specific question on a very specific subject, completely related to the programming world, and it can definitely be answered with a real answer by those who actually know it. Thanks.

Sibie answered 29/8, 2014 at 0:33 Comment(24)

Oh geez. Waaayyy too broad and impaired by history (it ain't pretty, brutha, it's downright quirky). And, uh, early Javascript (first probably 10 years worth) was tragically awful. – Dishpan 29/8, 2014 at 0:35

Honestly that is a question that we developer types have been wondering for years... – Labe 29/8, 2014 at 0:35

@JaredFarrish: I honestly don't see what is "too broad" about it. It is a very specific question on a very specific subject. – Sibie 29/8, 2014 at 0:37

Actually, browsers (at least Firefox) are strict for XHTML. – Intermingle 29/8, 2014 at 0:37

@SLaks: No, they are not. They won't refuse to parse invalid code, they will still parse it and try to make sense out of it, even if you specify a strict DOCTYPE. – Sibie 29/8, 2014 at 0:38

@FranciscoZarabozo: If you send as application/xhtml+xml, I believe it is strict. – Intermingle 29/8, 2014 at 0:40

I'd be really interested to learn some of the historical context for this, but I have to agree with Jared. It's not that the question is broad, but the scope of any reasonable answer would be too much for this format – Sheri 29/8, 2014 at 0:40

It's too broad because one, it's five questions, and two, it's a history that includes SGML, XML, XHTML, multiple browser rendering systems, the browser wars, quirks mode, word processing, desktop publishing, et cetera. It's a real history, so it's not opinion (unless someone's guessing), but it is long and tedious and much too much for this type of forum. Might be a book somewhere on it. – Dishpan 29/8, 2014 at 0:51

If you need an answer, it's because it ultimately doesn't matter. The browser needs to display whatever it can so it can be consumed. Strictness as a rule only implies that the format of the data is meaningful, and in many (vastly most, probably all) of the cases of the webpage-content-driven world, that just matters far less than giving the user something other than a stack trace. It's more a policy, meaning it's somewhat politically driven, and hence who cares, unless you're into that kinda thing. Content online is dirty. So it has to be dealt with as so. – Dishpan 29/8, 2014 at 0:56

@JaredFarrish: The only reason "it doesn't matter" is all the work behind the parsers to make sense out of invalid code. It doesn't matter because "it will be fixed for you". Whatever you feed the browser, it will become a strictly organized scheme internally after parsing. That cannot be a good reason to say it doesn't matter. – Sibie 29/8, 2014 at 0:59

I would like to answer that question, but unfortunately I don't have enough time to write the whole complete answer for that. That's why I agree it's too broad, although it would be a great addition for the community. – Steddman 29/8, 2014 at 1:1

I'm not arguing with you (nor am I agreeing). You're tilting at windmills. The content online is not all clean and neatly structured (and will never, ever be), the world has dealt with it, it's perfectly imperfect, the world carries on. If you want to make semantically perfect and syntactically strict and valid websites, go for it. Nobody is stopping you. – Dishpan 29/8, 2014 at 1:2

Here's an interesting (if you're into sports) corollary on how perfect symmetry in complex systems is not always desirable and can instead be detrimental. – Dishpan 29/8, 2014 at 1:6

Not only too broad but the actual example given seems to be incorrect, as the HTML spec actually allows the HTML tag to be omitted, see w3.org/TR/REC-html40/intro/sgmltut.html and w3.org/TR/REC-html40/sgml/dtd.html – Sapless 29/8, 2014 at 2:15

@rfernandes: The example omits closing head, closing p and closes h1 with an h2. – Sibie 29/8, 2014 at 2:24

@JaredFarrish: It's not five questions, it's one question and I'm trying to give context to what I'm asking. I edited the question to reflect that. I really don't think it is very fair to have this question closed as "too broad", because it's really not. Saying it's difficult to answer is not the same as the question being too broad. – Sibie 29/8, 2014 at 2:29

@Francisco If you look at the documents I linked, you will see that HEAD can also be omitted, and P can have its end tag omitted. Read section 3.2.1. – Sapless 29/8, 2014 at 2:31

When browsers were first written and were trying to win the browser war, would you want to use the browser that displays that page, or the one that says invalid HTML? That's why. It's not a program that needs perfect instructions, it was just trying to display a document – Kapor 29/8, 2014 at 2:31

while it's a very broad question, it's a very good one IMO. – Twoedged 29/8, 2014 at 2:55

@Intermingle is correct. If you're not sending it as application/xhtml+xml, you're never sending XHTML; assuming the default is text/html you'll always be sending tag soup. – Squishy 29/8, 2014 at 3:14

For the record, the only thing invalid about the example you have given is the h1/h2 tag mismatch. If you fix that, everything else is valid HTML 4 and HTML5. Yes, including the missing end tags for html, head and p. – Squishy 29/8, 2014 at 3:16

This is off-topic since it calls for speculation and discussion. It is also unclear what you are asking, i.e. it would not even be a suitable start for a discussion. But perhaps foremost, it’s far too broad. – Consolute 29/8, 2014 at 7:38

@JukkaK.Korpela: I honestly don't see how it's unclear what's being asked - it's a very specific question and it probably has a very specific answer. I agree it calls for speculation, but that's more in the nature of people wanting to always say something just because it involves a subject they are familiar with, even when they don't know the real answer, not because the question itself is a bad one. – Sibie 29/8, 2014 at 8:54

@FranciscoZarabozo, to begin with, “why” questions are generally unclear. Does it ask for a cause (in the causal sense), or about the motivation of browser vendors, or the purpose, or a reason that would be acceptable (to some people), or something else? (And it’s hard to find any interpretation that would make it a practical programming question suitable for SO.) – Consolute 29/8, 2014 at 10:20

"Why are browsers accepting invalid HTML in the first place?"

For compatibility reasons, and in the case of newer browsers, because HTML5 dictates an algorithm for parsing even invalid documents.

Earlier HTML specifications were ambiguous about many situations, such as what happens when the wrong tag is seen or when tags are nested inconsistently, as in <b><i></b></i>. Even so, many documents "just work" because some earlier browsers ignore unexpected tags or even "correct" incorrect nesting.

But now the HTML5 specification includes a much less ambiguous algorithm for parsing HTML documents. Note that the algorithm includes points where "parse errors" can occur. But these parse errors usually don't stop a modern browser from displaying an HTML document, although the browser is free to display parse errors in its developer tools if it chooses to:

[U]ser agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification. [Emphasis added.]

But again, no modern browser, to my knowledge, aborts parsing a document this early because of parse errors (barring extraordinary situations, such as running out of memory).
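
The difference between aborting at the first error and recording the error while carrying on can be illustrated with a toy tag matcher. This is only a sketch under stated assumptions: `tokenize`, `parseStrict`, and `parseLenient` are hypothetical names, the regex handles only simple tags, and the recovery rule is a simplification, not the tree-construction algorithm that the HTML5 specification defines.

```javascript
// Toy tokenizer: extracts start and end tags only (no comments,
// void elements, or attribute values containing ">").
function tokenize(markup) {
  const tokens = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let match;
  while ((match = tagPattern.exec(markup)) !== null) {
    tokens.push({ closing: match[1] === "/", name: match[2].toLowerCase() });
  }
  return tokens;
}

// Strict mode: throw on the first stray, mismatched, or unclosed tag.
function parseStrict(markup) {
  const open = [];
  for (const token of tokenize(markup)) {
    if (!token.closing) {
      open.push(token.name);
      continue;
    }
    const expected = open.pop();
    if (expected === undefined) {
      throw new Error(`unexpected </${token.name}>`);
    }
    if (expected !== token.name) {
      throw new Error(`expected </${expected}> but found </${token.name}>`);
    }
  }
  if (open.length > 0) {
    throw new Error(`unclosed <${open[open.length - 1]}>`);
  }
  return true;
}

// Lenient mode: record each problem as a "parse error" and keep going.
function parseLenient(markup) {
  const open = [];
  const errors = [];
  for (const token of tokenize(markup)) {
    if (!token.closing) {
      open.push(token.name);
      continue;
    }
    const index = open.lastIndexOf(token.name);
    if (index === -1) {
      errors.push(`stray </${token.name}> ignored`);
      continue;
    }
    // Implicitly close anything opened after the matching start tag.
    while (open.length > index + 1) {
      errors.push(`<${open.pop()}> closed implicitly by </${token.name}>`);
    }
    open.pop();
  }
  while (open.length > 0) {
    errors.push(`<${open.pop()}> closed at end of input`);
  }
  return errors;
}

const misnested = "<b><i></b></i>";
console.log(parseLenient(misnested));
try {
  parseStrict(misnested);
} catch (error) {
  console.log("strict:", error.message);
}
```

For the misnested example, the lenient version returns two recorded errors and still finishes, while the strict version throws as soon as it reaches `</b>`.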

On the adsense.google.com situation: this probably has nothing to do with invalid HTML. More likely, the DOM support in IE9 and earlier is not sufficient for adsense.google.com's needs.

Paramedical answered 29/8, 2014 at 13:45 Comment(0)

I don't know why they allowed it from the start, but here is why they can't switch now: legacy support. If a browser enforced strict HTML, huge parts of the internet would just break. Yes, some people would update their code, but some pages would simply be lost. There is no incentive for browsers to do this, because to the consumer it would seem that the browser just doesn't work on some pages, and they would switch to another browser that still supports less-than-optimal HTML.

Basically, because it was allowed from the beginning, it has to be allowed now.

Goalie answered 29/8, 2014 at 0:50 Comment(6)

The standards groups could devise a horribly strict version of HTML that kneecaps any wayward content not toeing the line of its parse rules, but ironically the powers that be made HTML 5 less strict than HTML 4(.1) precisely because strictness doesn't matter in the context of the web browser space. – Dishpan 29/8, 2014 at 0:54

@falsarella: Browsers could start placing a big red bar stating there are errors in the markup because it was poorly written, and still show the contents. That would make the user aware of it, take responsibility away from the browser itself, and put a lot of shame on bad HTML writers, which would quickly lead to coders making an effort to write valid markup and actually feel proud of their correctness. – Sibie 29/8, 2014 at 1:4

@FranciscoZarabozo: Users don't care if the HTML is valid or not. – Melodics 29/8, 2014 at 1:6

@thirtydot: They don't care only because it's fixed by the browser. They care about websites not working properly and wonder if it's their browser's or computer's fault. – Sibie 29/8, 2014 at 1:10

@Jared Farrish: You can always challenge yourself by writing strict polyglot markup and serving it as application/xhtml+xml ("XHTML5"). – Squishy 29/8, 2014 at 3:21

@Squishy - I'll get back to you on that. ;) – Dishpan 29/8, 2014 at 12:2

To avoid opinion-based answers, this type of question requires an answer based on an authoritative reference with credible and/or official sources.

The following excerpts are quotes from the W3C Validator Help & FAQ that address "Why are browsers accepting invalid HTML in the first place?" and some other related concerns raised in the question.

About Markup

Most pages on the World Wide Web are written in computer languages (such as HTML) that allow Web authors to structure text, add multimedia content, and specify what appearance, or style, the result should have.

As for every language, these have their own grammar, vocabulary and syntax, and every document written with these computer languages are supposed to follow these rules. The (X)HTML languages, for all versions up to XHTML 1.1, are using machine-readable grammars called DTDs, a mechanism inherited from SGML.

However, Just as texts in a natural language can include spelling or grammar errors, documents using Markup languages may (for various reasons) not be following these rules.

[...]

Concepts

One of the important maxims of computer programming is: "Be conservative in what you produce; be liberal in what you accept."

Browsers follow the second half of this maxim by accepting Web pages and trying to display them even if they're not legal HTML. Usually this means that the browser will try to make educated guesses about what you probably meant. The problem is that different browsers (or even different versions of the same browser) will make different guesses about the same illegal construct; worse, if your HTML is really pathological, the browser could get hopelessly confused and produce a mangled mess, or even crash.

That's why you want to follow the first half of the maxim by making sure your pages are legal HTML.

[...]

Validity might not mean quality, and invalidity might not mean poor quality

A valid Web page is not necessarily a good web page, but an invalid Web page has little chance of being a good web page.

For that reason, the fact that the W3C Markup Validator says that one page passes validation does not mean that W3C assesses that it is a good page. It only means that a tool (not necessarily without flaws) has found the page to comply with a specific set of rules. No more, no less. This is also why the "valid ..." icons should never be considered as a "W3C seal of quality".

Unexpected browser behavior might mean that they actually don't accept invalid markup

While contemporary Web browsers do an increasingly good job of parsing even the worst HTML “tag soup”, some errors are not always caught gracefully. Very often, different software on different platforms will not handle errors in a similar fashion, making it extremely difficult to apply style or layout consistently.

Using standard, interoperable markup and stylesheets, on the other hand, offers a much greater chance of having one's page handled consistently across platforms and user-agents.

[...]

Compatibility problems

Checking that a page “displays fine” in several contemporary browsers may be a reasonable insurance that the page will “work” today, but it does not guarantee that it will work tomorrow.

In the past, many authors who relied on the quirks of Netscape 1.1 suddenly found their pages appeared totally blank in Netscape 2.0. Whilst Internet Explorer initially set out to be bug-compatible with Netscape, it too has moved towards standards compliance in later releases.

[...]

Relying too much on 3rd party tools

The answer to this one is that markup languages are no more than data formats. So a website doesn't look like anything at all! It only takes on a visual appearance when it is presented by your browser.

In practice, different browsers can and do display the same page very differently. This is deliberate, and doesn't imply any kind of browser bug. A term sometimes used for this is WYSINWOG - What You See Is Not What Others Get (unless by coincidence). It is indeed one of the principal strengths of the web, that (for example) a visually impaired user can select very large print or text-to-speech without a publisher having to go to the trouble and expense of preparing a separate edition.

Steddman answered 29/8, 2014 at 13:42 Comment(4)

I feel like this doesn't really answer the question "Why aren't browsers strict" but "Why should we follow W3C rules", which is kind of different. – Dogger 29/8, 2014 at 13:48

@ClémentMalet: Although I almost feel the same as you, the second paragraph gets very close to answering it. Far closer than any other comment or answer in the question so far: Specifically the part that says:

One of the important maxims of computer programming is: "Be conservative in what you produce; be liberal in what you accept."  Browsers follow the second half of this maxim by accepting Web pages and trying to display them even if they're not legal HTML.

– Sibie 29/8, 2014 at 13:52

@ClémentMalet I have updated my question to be focused on the 'on-topic' part of it. – Steddman 29/8, 2014 at 14:2

@FranciscoZarabozo That 2nd paragraph is being discussed a lot anyway: W3C is very slow to validate new tools, tags, methods... so much so that nobody wants to try to validate their pages, because they would have to remove their super-new-features HTML 14.0 – Dogger 29/8, 2014 at 14:5
