Doctypes and HTML Versions

Standard Generalised Markup Language (SGML) is a language for defining markup languages that was developed many years ago. One of the tags that SGML uses is the DOCTYPE tag, which identifies the particular markup language that an SGML document uses.

HyperText Markup Language (HTML) was not originally defined using SGML, but this was rectified with version 2.0. Every HTML document written for a version of HTML that is based on SGML can use an SGML DOCTYPE tag at the start of the document, both to mark it as an SGML document and to identify that the particular type of SGML used is HTML.

There are multiple parts in a doctype tag, some of which are optional. Simply to identify that an SGML document is HTML, all you need is:

<!DOCTYPE HTML>

The optional parts of a doctype identify which version of HTML you are using. To identify the document as HTML 2.0 you would use:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

To identify the document as HTML 3.2 you would use:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

And to identify the document as HTML 4.01 you would use:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

There were so many changes in HTML between versions 3.2 and 4.01 that it was considered unreasonable to require people to convert their pages completely in one go. A transitional doctype therefore exists for pages that mix 3.2 and 4.01 markup, covering pages that are part way through being converted from 3.2 to 4.01. The doctype for pages that contain both HTML 3.2 and HTML 4.01 is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd">

These SGML doctypes tell anyone examining the page source which version or versions of HTML a given web page uses. Validators also use the doctype: they check the page content against the corresponding SGML definition to determine whether the document actually complies with the standard it claims to be written in.
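
The first step a validator has to take, working out which version a page claims to be, can be sketched in a few lines of JavaScript. This is an illustration only and not how any real validator works: the function name claimedHtmlVersion and the lookup table are hypothetical, and the table holds just the four public identifiers quoted above.

```javascript
// Hypothetical lookup table: only the public identifiers quoted in this article.
const KNOWN_PUBLIC_IDS = new Map([
  ["-//IETF//DTD HTML 2.0//EN", "HTML 2.0"],
  ["-//W3C//DTD HTML 3.2 Final//EN", "HTML 3.2"],
  ["-//W3C//DTD HTML 4.01//EN", "HTML 4.01 strict"],
  ["-//W3C//DTD HTML 4.01 Transitional//EN", "HTML 4.01 transitional"],
]);

// Report which version of HTML the page source claims to use.
function claimedHtmlVersion(source) {
  // The doctype must be the first thing in the page (leading whitespace allowed).
  // The optional capture group holds the public identifier, if there is one.
  const pattern = /^\s*<!DOCTYPE\s+HTML(?=[\s>])(?:\s+PUBLIC\s+"([^"]*)")?[^>]*>/i;
  const match = pattern.exec(source);
  if (!match) {
    return "no doctype";
  }
  if (match[1] === undefined) {
    return "HTML (version not stated)";
  }
  return KNOWN_PUBLIC_IDS.get(match[1]) || "unrecognised public identifier";
}

console.log(claimedHtmlVersion('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">'));
```

The function only reports the claim. Checking whether the page lives up to that claim is the much larger job that a validator does next.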

The earliest browsers, created before HTML became a type of SGML, did not expect to receive a doctype at all, so a web page with a doctype ended up displaying that doctype in the page itself. Once HTML became a type of SGML, browsers were amended to ignore the doctype tag if it was present. The way a browser interprets HTML was (and still is) built into the browser itself: it processes the markup as its own proprietary equivalent of HTML rather than as whatever version of HTML the page is actually written in. Since the HTML that current browsers implement includes both HTML 4.01 and HTML 3.2 as subsets, pages written in either version can generally be processed by all current browsers without too many quirks in tag recognition.

The earliest browsers supported a form of CSS that was supplied by the person using the browser to determine how they wanted the web pages they viewed to look. As the web became more popular this way of defining page appearance was considered impractical, and work began on a new CSS language that could be attached to web pages to tell everyone how the author thought the page should look. Development of this language was slow, and in the meantime browsers implemented their own appearance tags such as <font> and <center>. The adoption of these tags into the standard led to HTML 3.2, and their removal again once CSS was finished led to HTML 4.01.

Browsers started implementing support for CSS even before it was finalised as a standard. Unfortunately this meant that in some cases the way certain CSS was meant to be interpreted changed between the way some browsers originally implemented it and the way the final standard said they should implement it. This would not have been a problem except that some web pages had been developed using the browser proprietary version of CSS and would break if the standard CSS rules were applied, while other web pages had been written to follow the standard and would break if the proprietary CSS rules were applied.

The browser writers needed a solution to this CSS problem. The one thing that distinguished the people writing web pages for the proprietary CSS from those writing for standard CSS was that people writing to the standards also tended to include a doctype at the top of their document. While the presence or absence of a doctype wasn't a perfect indication of which CSS a particular page expected, using it as the flag would keep to a minimum the number of web pages broken by the browser's choice between the two CSS variants.

Browsers today therefore check whether a web page has a doctype when deciding how to process the CSS for the page. If there is no doctype present then the page is considered to be an old one that expects the browser's proprietary implementation of CSS. This is called quirks mode, and it generally means that the page will appear differently depending on which browser it is viewed in. Most such pages were written for Internet Explorer 5 and so generally only display correctly in Internet Explorer. That wasn't a problem when IE had almost all of the market, but it does mean that a growing number of old pages appear broken now that the majority of people are using other browsers. The presence of a doctype, in the simplest case any of those I have listed above, means that the browser applies the standard CSS rules to the page and it should generally look the same in all browsers.

Note that this use of the doctype tag for CSS mode switching is not what the doctype tag is there for; it is just that the presence of the tag was considered to be a useful indication that the standards were expected to be applied.

Browsers vary in how they implement their CSS mode switch. Some may just look for the presence or absence of the tag. Others look at the complete content of the tag, and may implement a third intermediate mode where a doctype is present but isn't an exact match for any of those I have listed (such as by leaving out some but not all of the optional parts). Browsers that look at the complete content may also keep pages with the oldest doctypes, such as those for HTML 2.0 and 3.2, in quirks mode.
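
Because the switch varies between browsers, it can be easier to ask the browser which mode it actually chose than to predict it from the doctype. The following is a minimal sketch, assuming a browser that exposes the standard document.compatMode property (current browsers do). The helper name renderingMode is hypothetical.

```javascript
// document.compatMode is "BackCompat" in quirks mode and "CSS1Compat"
// otherwise. The intermediate mode also reports "CSS1Compat", so this
// check cannot tell it apart from full standards mode.
function renderingMode(doc) {
  return doc.compatMode === "BackCompat" ? "quirks" : "standards";
}

// In a browser, pass the real document. Outside a browser this is skipped.
if (typeof document !== "undefined") {
  console.log("This page is rendered in " + renderingMode(document) + " mode");
}
```

Running this in two copies of the same page, one with and one without a doctype, shows the switch in action.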

Internet Explorer not only implemented some of the CSS the wrong way before the standard was finalised, it also implemented some of the Document Object Model (DOM) standards in JScript too soon, so that its JScript didn't treat the DOM the way it was supposed to. Microsoft decided to use the doctype as a switch for the DOM processing in JScript as well. (IE is the only browser to run JScript, with all other browsers running JavaScript instead. The two languages are similar enough that you can combine them to support all browsers, using feature sensing to detect most differences and JScript conditional compilation comments to detect the rest.) The DOM switch does not depend on the presence or absence of the doctype; it depends on whether the page is considered to be HTML 4.01 or not. IE considers a page to be expecting the standard DOM when either an HTML 4.01 strict doctype or the short doctype that doesn't specify an HTML version is used. In all other situations JScript uses the proprietary DOM instead.
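
Feature sensing means testing whether a method exists before calling it, rather than testing which browser is running. The following is a minimal sketch using event registration as the example: addEventListener is the standard DOM method and attachEvent is the proprietary one found in older versions of Internet Explorer. The helper name addListener and the strings it returns are hypothetical, chosen only to show which branch was taken.

```javascript
// Attach an event handler using whichever method the target supports.
function addListener(target, type, handler) {
  if (typeof target.addEventListener === "function") {
    // Standard DOM method.
    target.addEventListener(type, handler, false);
    return "standard";
  }
  if (target.attachEvent) {
    // Proprietary method in older Internet Explorer; event names take an "on" prefix.
    target.attachEvent("on" + type, handler);
    return "proprietary";
  }
  // Last resort: assign the handler property directly.
  target["on" + type] = handler;
  return "fallback";
}

// In a browser, register a load handler. Outside a browser this is skipped.
if (typeof window !== "undefined") {
  addListener(window, "load", function () {
    console.log("page loaded");
  });
}
```

The same script then runs unchanged in both kinds of browser, with no need to know which one it is in.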

Those involved in developing a new HTML 5 standard have decided to abandon having HTML as a type of SGML document and to do their own thing instead. This means that they no longer have an SGML doctype to identify which SGML language the page is written in. As far as the real purpose of the doctype is concerned this doesn't matter: if HTML 5 isn't SGML, it makes no sense to have an SGML tag at the front identifying what type of SGML it is. Removing the doctype tag, though, would break the CSS and DOM switching that browsers have built in to compensate for having implemented the standards incorrectly by rushing in before they were completed. The people working on HTML 5 have therefore had to make the doctype tag a part of their new HTML so that it can still be used to force browsers to apply the standards. Since it is there purely as a switch to tell browsers to follow the standards and does not identify a type of SGML, they have simply adopted the shortest form of the tag. Should the HTML 5 standard ever be adopted, it will be impossible to tell from the short version of the doctype whether a page is using a version of HTML that is properly defined using SGML or is using HTML 5. The development time for new standards means that the date when this confusion becomes official is so many years in the future that they will hopefully redefine HTML 5 to remove the confusion before then.

The important thing to remember is that browsers don't care what version of HTML you claim to be using, and they don't care whether your HTML matches the standard you claim to be following. Whatever version of HTML you use, browsers will interpret it using their own proprietary HTML standard and will only use the doctype to determine which mode to use for CSS and JScript.

 

This article written by Stephen Chapman, Felgall Pty Ltd.
