There are articles about XML in every major tech journal and at every web developer's site. It has already been said that XML will revolutionize the web and change the way the internet works. So, what is XML and why all the fuss?

It may be easier to define what XML isn't rather than what it is. XML is not a markup language. Yeah, I know they call it that; it is two thirds of the name. Yet other than some syntax rules it shares with SGML and HTML, there is nothing about XML that defines it as a language, unless you use the term meta-language. If you're expecting to learn a new language, XML will surprise you. XML wants you to define your own language using XML's rules. XML isn't a display language either. It doesn't display content like HTML; it describes the content and relies on another language to render it.

XML, rather than being a language, is better thought of as a way of storing data, something like a database. Now I know you are wondering, "A database, all this fuss over a database?" Well, yes. But rather than having to make the data conform to a given set of rules, with XML you define the rules the data conforms to. This allows for some freedoms that just aren't possible with HTML.

Okay, okay, get on with the XML...

Before jumping into the XML structure there are a few basics that must be addressed.

Tags

XML tags MUST be paired. Recall the img tag from HTML? There's no </img> tag. In XML, tags that don't come in pairs must take the empty form. A <br> tag in XML must be expressed as <br/>. Tags must also be nested properly.

<p><b>Incorrect Nesting</p></b> is incorrect and must be expressed as:

<p><b>Correct Nesting</b></p>
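
The pairing and nesting rules are easy to check with any XML parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed is made up for this illustration; it is not part of XML or of the library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <b> closes before <p> does.
good_nesting = is_well_formed("<p><b>Correct Nesting</b></p>")

# Improperly nested: <p> closes while <b> is still open.
bad_nesting = is_well_formed("<p><b>Incorrect Nesting</p></b>")

# A tag without a partner must use the empty form <br/>.
empty_form = is_well_formed("<p>line one<br/>line two</p>")
unclosed_tag = is_well_formed("<p>line one<br>line two</p>")
```

In this sketch the parser should accept the correctly nested sample and the <br/> sample, and reject the other two.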

XML is case sensitive

A <TITLE> tag is seen as an entirely different tag than a <title> tag.
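
A short sketch of what case sensitivity means in practice, again using Python's standard xml.etree.ElementTree module. The <doc> wrapper element is an assumption added only so the sample has a single root.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<doc><TITLE>Upper</TITLE><title>lower</title></doc>")

# The two spellings are looked up as two unrelated elements.
upper_text = doc.find("TITLE").text
lower_text = doc.find("title").text

# A third spelling matches neither of them, so find() returns None.
mixed_case = doc.find("Title")

# An opening tag cannot be closed by a tag in a different case.
try:
    ET.fromstring("<title>Mismatched case</TITLE>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
```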

Whitespace

XML, unlike HTML, treats whitespace in content as intentional and leaves it intact. XML browsers may collapse it for display, but XML itself does count returns and spaces as significant.
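
To see that the parser hands whitespace back untouched, here is a small sketch with Python's standard xml.etree.ElementTree module. The <note> element name is invented for the example.

```python
import xml.etree.ElementTree as ET

# Leading spaces, doubled spaces and a return inside the content.
note = ET.fromstring("<note>  two  spaces\n  and a return  </note>")

# The text comes back exactly as written, nothing collapsed.
content = note.text
```

Whether those spaces show up on screen is then up to whatever renders the document.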

The Types of XML Documents

There are two types of XML documents: valid ones and well-formed ones. A well-formed document contains the bare essentials while adhering to the XML syntax rules. A valid document is well-formed and also spells out everything you need to know about its content and structure. Either one is acceptable. The major difference is that a VALID document must either contain or point to a DTD (Document Type Definition).

DTDs contain the information processing applications need to interpret the document. They include a definition of the <tags> that are used to mark up the document. The DTD contains the rules defining the structural relationship between <tags>. For example, a <piston> tag may exist within <engine> tags but an <engine> tag cannot exist between <piston> tags. The DTD also specifies the sequence of tags: <year> before <manufacturer>, then <make> and <model>. The DTD is where you specify the tag attributes as well, a font attribute for your <emphasis> tags, as in <emphasis font="bold">, for example. You can also include any other information needed about the structure or grammar of the document you intend to create.
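
Here is a rough sketch of what such a DTD could look like, wrapped in a little Python so it can be tried out. The root name auto, the cylinders attribute and the sample values are assumptions made up for this illustration; the other tag names come from the paragraph above. Note that Python's standard xml.etree.ElementTree parser only checks that a document is well-formed. It reads the DTD but does not enforce it, so a validating parser would be needed to reject the second document.

```python
import xml.etree.ElementTree as ET

# Internal DTD. The root name "auto" and the "cylinders" attribute
# are hypothetical; they are here only to make a complete example.
DTD = """<!DOCTYPE auto [
  <!ELEMENT auto (year, manufacturer, make, model, engine)>
  <!ELEMENT year (#PCDATA)>
  <!ELEMENT manufacturer (#PCDATA)>
  <!ELEMENT make (#PCDATA)>
  <!ELEMENT model (#PCDATA)>
  <!ELEMENT engine (piston+)>
  <!ATTLIST engine cylinders CDATA #IMPLIED>
  <!ELEMENT piston EMPTY>
]>
"""

# Follows the DTD: tags in the declared order, <piston> inside <engine>.
follows_dtd = ET.fromstring(DTD + (
    "<auto><year>1999</year><manufacturer>Ford</manufacturer>"
    "<make>Ford</make><model>Mustang</model>"
    '<engine cylinders="8"><piston/></engine></auto>'
))

# Well-formed, but breaks the DTD: <engine> sits inside <piston>.
# A non-validating parser still accepts it.
breaks_dtd = ET.fromstring(DTD + "<auto><piston><engine/></piston></auto>")
engine_inside_piston = breaks_dtd.find("piston/engine") is not None
```

The second document is the difference between well-formed and valid in miniature: the syntax is fine, but the structure is not what the DTD allows.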

Do I have to have a DTD?

Quite simply, no. A well-formed document is enough.

Why take all the time to make one then?

CONTROL. The DTD gives you control over the structure of your document. Since everything is specified, the processing application has no room to interpret the document arbitrarily.

Okay, we have some rules for documents, but what about the documents themselves?

XML documents consist of elements. Each element may contain other elements, which may or may not hold content visible to viewers, and each of those sub-elements may contain further elements, and so on. XML documents are ordered as Prolog, Root Element and Epilog, although the epilog may not exist in the same form in future versions of XML.

A well-formed document need only contain the Root Element while a Valid document needs the Prolog and the Root Element. The Epilog is never a requirement.

What does the Prolog consist of?

First, you don't need a Prolog at all. If you aren't interested in creating DTDs, a document with nothing but a Root Element is perfectly acceptable. The prolog is an introduction of the document to whatever application is processing it and it can serve to provide info to the surfer as well.

There are 4 parts to the prolog.

1. XML Declaration. This plays much the same role as an <HTML> tag and looks like this in its simplest form: <?xml version="1.0"?> The "<?" and "?>" marks are the same delimiters used for Processing Instructions. The declaration lets the program or browser know that the following document is an XML document, that it may need special handling, and that it complies with version 1.0 of the XML rules. The encoding information follows: <?xml version="1.0" encoding="iso-8859-1"?>. This tells the application or browser which character encoding the document is written in, so name the encoding your file actually uses. Finally, there is the standalone document declaration. This value is either yes or no. <?xml version="1.0" encoding="iso-8859-1" standalone="yes"?> If you specify yes, this tells the application that ALL the rules for this document may be found within the document itself; no tells it that some of them live in an external file. At this time the standalone document declaration appears to be on the way out. It isn't needed. If you are using a DTD it can be specified elsewhere.

2. Comment. Much the same as an HTML comment. <!--Any characters or text can be placed within <comment> tags and will be ignored by the application processing the document-->

3. Processing Instructions, or PIs. <?quicktime version="6.0" bitrate="32Kbps"?> This lets the browser know that files expected to be played by the QuickTime application will be rendered at 32Kbps. If the browser doesn't understand what QuickTime is, more than likely, the PI will be ignored.

4. The document type declaration. This is not the same as a DTD (document type definition). It has three components, and it merely tells the application what kind of XML document is being used. For an auto manufacturer this could be a document type named automl.

Syntax is as follows: <!DOCTYPE name externalDTDpointer internalDTDsubset>

Only name is required. <!DOCTYPE automl>

ExternalDTDpointer tells the application where to find the DTD. <!DOCTYPE automl SYSTEM "http://www.auto.com/automl.dtd">

The internalDTDsubset is optional and supplements the external DTD. Where both declare the same thing, the internal DTD overrides the external. The syntax is <!DOCTYPE automl SYSTEM "http://www.auto.com/automl.dtd" []> with the internal declarations placed between the square brackets.
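
The four parts of the prolog can be seen together in one small document. The sketch below uses Python's standard xml.dom.minidom module to read them back. The comment text and the <make> content are sample values chosen for this illustration, and standalone is set to no here only because the sample points to an external DTD. The parser does not fetch that DTD, so nothing is downloaded.

```python
from xml.dom import Node, minidom

# A prolog with all four parts, followed by a root element.
document_text = (
    '<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>\n'
    '<!--A comment in the prolog-->\n'
    '<?quicktime version="6.0" bitrate="32Kbps"?>\n'
    '<!DOCTYPE automl SYSTEM "http://www.auto.com/automl.dtd" []>\n'
    '<automl><make>Ford</make></automl>'
).encode("iso-8859-1")

doc = minidom.parseString(document_text)

# 1. XML declaration
declared_version = doc.version
declared_encoding = doc.encoding

# 2. Comments and 3. Processing Instructions found outside the root
comments = [n.data for n in doc.childNodes
            if n.nodeType == Node.COMMENT_NODE]
instructions = [(n.target, n.data) for n in doc.childNodes
                if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

# 4. Document type declaration: the name and the external DTD pointer
doctype_name = doc.doctype.name
dtd_location = doc.doctype.systemId

root_name = doc.documentElement.tagName
```

Note that the PI arrives as a target (quicktime) plus one string of data. Splitting that string into version and bitrate is left to the application that understands it.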

The Root Element is the heart and soul of XML and, as such, can represent a well-formed document without any mention of a Prolog or an Epilog. An Element consists of a single pair of tags and whatever exists between them, which may be text or further markup. For our automaker's site our root element might consist of:

<autoinfo>

<make>Ford</make>

</autoinfo>
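
As a quick check that a Root Element alone makes a usable document, here is a sketch that parses the sample above with Python's standard xml.etree.ElementTree module. There is no Prolog and no Epilog, just the root.

```python
import xml.etree.ElementTree as ET

# The whole document is just the root element and its content.
root = ET.fromstring("<autoinfo><make>Ford</make></autoinfo>")

root_tag = root.tag                           # name of the root element
child_tags = [child.tag for child in root]    # elements nested inside it
make_text = root.find("make").text            # content of <make>
```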

This is a very limited view of what a Root Element is and there is MUCH more information to be covered in the next section.

Epilog

The epilog can contain any information that is covered in the Prolog except for DOCTYPE and XML declarations. I suggest using the Prolog to cover all the instructions and info needed and forgoing the use of the Epilog. Once again, the Epilog's fate is yet to be determined in future versions of XML and looks decidedly uncertain.

This section has covered the basics of an XML document: Prolog, Root Element and Epilog, including some basic syntax and some ways XML markup differs from HTML. In the next section we'll get into some detail about the Root Element, Attributes, Entities and more.
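
A small sketch of the epilog rules, using Python's standard xml.dom.minidom module. The comment text is a sample value made up for this illustration. A comment and a PI after the root are accepted, while a DOCTYPE after the root is rejected.

```python
from xml.dom import Node, minidom
from xml.parsers.expat import ExpatError

# A comment and a PI placed after the root element's closing tag.
doc = minidom.parseString(
    '<autoinfo><make>Ford</make></autoinfo>\n'
    '<!--A comment in the epilog-->\n'
    '<?quicktime bitrate="32Kbps"?>'
)

# Everything that follows the root element at document level.
root_index = doc.childNodes.index(doc.documentElement)
epilog_types = [n.nodeType for n in doc.childNodes[root_index + 1:]]

# A DOCTYPE is not allowed once the root element has appeared.
try:
    minidom.parseString('<autoinfo/><!DOCTYPE autoinfo>')
    doctype_in_epilog_accepted = True
except ExpatError:
    doctype_in_epilog_accepted = False
```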