How XML Accommodates Human-Authored Content
Posted on 10 Nov 2002
by Doug Domeny (dougdomeny)
XML is gaining acceptance today, not because it is a great technology looking for a problem, but because today's problems require its flexibility and simplicity. XML enables you to create structured and semi-structured documents that can be transferred and read by people and programs in multiple formats (for example, pages that can be read on the web, handheld devices and print). This "multi-use" of content is the driving force behind the adoption of XML technology.
Today, most of the world's information is locked in paper, unsearchable documents with proprietary file formats, or web pages where search engines return too much data and not enough information. Just think about how much your company has spent to create documents that can't be easily found or distributed because they are unstructured.
XML lets business users create structured documents that can be reused for multiple purposes in-house and exchanged with people and businesses around the world. XML breaks new ground by connecting front-office business users with back-office developers.
Bill Trippe, in his article "Do XML Editors Matter?" (Transform October 2001, page 27), makes this point by saying, "You can view XML as the bridge between the two worlds of structured (relational) and unstructured (document) data." He continues, "On one hand, you have a growing need for content to be tagged at its source and maintained in a structured form. On the other hand, users are resistant to more complex tools and processes."
Like a telephone line, which carries both voice and data, XML can carry information suitable for computers and people. Computer-generated XML is dynamically created by a program for B2B ecommerce or other server-to-server transaction. These applications are addressed by XML standards such as ebXML and SOAP. Human-authored content uses XML for improved search capabilities, multi-channeled publication, and syndication. These applications are addressed by standards such as MathML, NewsML, VoiceXML, and any number of custom XML dialects. This article focuses on how to apply XML to human-authored content.
How XML Accommodates Human-Authored Content
While highly structured data is independent of the style used to present it, unstructured data is full of style and format. Contrast plain text (no style) with rich text (full of style).
Text documents meant for human authoring and reading have design needs that only XML can address. Examples of semi-structured documents include catalogs, press releases, news reports, and technical documentation. Even highly structured data becomes semi-structured if it includes comments, descriptions, or instructions meant to be read by people.
XML supports the development of semi-structured documents that contain both relational meta data (the structure) and free-form (unstructured) formatted text. The meta data (that is, the XML tags) meets the programmatic need for structure. Without meta data, a computer program cannot understand the content. Formatted text meets the human and business need to express richly styled content. Without style, the content is dry and unattractive.
The paragraph you are reading now is an example of formatted text. Most document editors display content (unstructured data) as WYSIWYG (what you see is what you get). For a business user to comfortably create semi-structured textual documents, a document editor must allow the author to add style to the text.
Variations of Structured and Unstructured Data
Two kinds of semi-structured data exist between highly structured and unstructured data:
- highly structured data
- structured data with unstructured elements
- unstructured documents with tagged meta data
- unstructured documents
Structured data with unstructured elements is commonly used in web forms, where most fields are tightly constrained (for example, "State" must be selected from a list and "ZIP" must be all digits), yet a 'comment' field is available for human-readable content.
<product>
    <name>Deluxe Widget</name>
    <listprice units="usd">$19.95</listprice>
    <radius>6mm</radius>
    <description>
    This <em>deluxe <strong>gold</strong> plated</em> product fits most attachments.
    </description>
</product>
For this kind of document, use a DTD or schema to validate the structure, and include an unstructured element (for example, description) that allows both text and tags. In a DTD, this element would typically be defined as
<!ELEMENT description ANY>
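As a rough illustration of how a program might consume such a document, the sketch below reads the tightly structured fields by tag name and flattens the mixed-content description to plain text. It uses Python's standard xml.etree.ElementTree module, which is not part of the original article, and the helper name flatten_text is hypothetical.

```python
import xml.etree.ElementTree as ET

PRODUCT_XML = """
<product>
    <name>Deluxe Widget</name>
    <listprice units="usd">$19.95</listprice>
    <radius>6mm</radius>
    <description>
    This <em>deluxe <strong>gold</strong> plated</em> product fits most attachments.
    </description>
</product>
"""


def flatten_text(element):
    """Join all text inside an element, ignoring inline tags, and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


product = ET.fromstring(PRODUCT_XML)

# Structured fields are read directly by tag name or attribute name.
name = product.findtext("name")
price_units = product.find("listprice").get("units")

# The unstructured field mixes text and tags, so it is flattened instead.
description = flatten_text(product.find("description"))

print(name)
print(price_units)
print(description)
```

The structured fields and the free-form field need different handling, which is why it helps to confine free-form text to a single element such as description.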
Unstructured documents with tagged meta data are less common but offer the best promise for content that can be effectively searched. HTML provides some meta tags, like <CODE>, but XML provides the flexibility to create custom tags.
<owner studentid="2456">Jim Smith</owner> owns a <automobile model="OCC96">Cutlass Ciera</automobile>.
<my:conditional value="birds"><my:reference><my:author>Joe Kluck</my:author> in his article <my:title type="article">Why Chicken have Wings</my:title> <my:bibliography>(<my:source><my:periodical>Poultry Monthly</my:periodical> <my:issue>September 2001</my:issue></my:source>, page <my:page>9</my:page>)</my:bibliography> dispels the usual stereotypes of flightless birds.</my:reference></my:conditional>
This kind of document must be well formed to allow processing by an XML parser but is usually not validated against a DTD or schema. For such a document, XHTML is a natural choice because it is well formed, has extensive formatting capability, and allows custom XML tags to be added without causing display problems in browsers. Note that the namespace prefix "my" is used to distinguish the custom XML tags from standard HTML tags.
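To make the idea of "well formed but not validated" concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module (an assumption; the article names no parser). The wrapper element p, the namespace URI http://example.com/my, and the helper is_well_formed are hypothetical, and the owner and automobile tags are given the "my" prefix for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI bound to the "my" prefix.
MY_NS = {"my": "http://example.com/my"}

FRAGMENT = (
    '<p xmlns:my="http://example.com/my">'
    '<my:owner studentid="2456">Jim Smith</my:owner> owns a '
    '<my:automobile model="OCC96">Cutlass Ciera</my:automobile>.'
    '</p>'
)


def is_well_formed(xml_text):
    """Return True if the text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(FRAGMENT)

# The custom tags can be searched like fields...
owner = root.find("my:owner", MY_NS)
car = root.find("my:automobile", MY_NS)

# ...while the sentence itself is still recoverable as plain text.
sentence = "".join(root.itertext())

print(owner.get("studentid"), owner.text)
print(car.get("model"), car.text)
print(sentence)
```

The parser only checks that tags nest and close correctly, which is all that a search or extraction tool needs from this kind of document.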
Tips for Designing an XML DTD or Schema
XML transfers information between two parties, whether human or machine. Just as two people must know the same language, both parties must speak the same XML dialect. The dialect, defined in the DTD (document type definition) or schema, is the vocabulary and grammar used to describe the information being transferred.
The producer and the processor of XML information must share a common DTD or schema. Because the DTD or schema is vital to the success of XML, this article provides guidelines for designing a DTD or schema. Even if you are not designing a DTD or schema, it is worthwhile to understand the rationale behind their design, since it is the structure of XML data that gives it meaning. This structure changes a random sequence of unintelligible words to speech, that is, it transforms data to information.
When designing a DTD or schema for XML data, analyze the nature of the data and how it is created and processed. Consider how data is stored in a relational database, with a clearly defined structure of records, fields and tables.
Before you begin your design, decide whether to store data as the value of an attribute or as a text element (even if numeric) within tags. Generally, it is better to store data in elements, as this approach is more flexible when used with XSL. (XSL is a specification for transforming XML to HTML or some other XML structure.)
Be careful not to design solely from a developer's perspective. Consider who produces the XML data. If it is produced and processed programmatically, a developer-friendly perspective is appropriate. In fact, XML for B2B transactions should be designed from this perspective to generate fast, reliable and efficient transfer of information. However, if a human will author or read the XML data, consider those needs when designing a DTD or schema.
Elements and Attributes
An attribute is the name-value pair that immediately follows a tag name. An element is a tag along with its attributes and all the text and elements that it encloses. Elements within another element are called child elements. Consider the following example.
<tag_name attr_name1="value1" attr_name2="value2">
    <child_tag attr_name3="value3" />
    <child_with_text>This is some text</child_with_text>
    This text is part of the tag_name element
</tag_name>
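As seen in this illustration, the tags are tag_name, child_tag, and child_with_text. The attribute attr_name1 has a value of "value1". The element tag_name consists of the following attributes and child elements:
- the attributes attr_name1 and attr_name2
- the child element child_tag
- the child element child_with_text
- the text "This text is part of the tag_name element"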
Note that, in XML, every attribute value must be quoted with single (') or double quotes ("). Also, every tag must have a closing tag or end with "/>". Since the child_tag element has no child elements or text, the tag ends with "/>" instead of a closing tag, for example, "</child_tag>".
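The anatomy described above can be inspected programmatically. The following sketch assumes Python's standard xml.etree.ElementTree module, which the article does not mention. One detail is specific to that library: text that follows a child element is stored in that child's tail property rather than on the parent.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<tag_name attr_name1="value1" attr_name2="value2">
    <child_tag attr_name3="value3" />
    <child_with_text>This is some text</child_with_text>
    This text is part of the tag_name element
</tag_name>
"""

root = ET.fromstring(SAMPLE)

# Attributes are name-value pairs attached to the start tag.
attributes = dict(root.attrib)

# Child elements are the elements enclosed by tag_name, in document order.
child_tags = [child.tag for child in root]

# Text enclosed by a child element.
child_text = root.findtext("child_with_text")

# ElementTree keeps text that follows a child element in that child's "tail".
trailing_text = root.find("child_with_text").tail.strip()

print(attributes)
print(child_tags)
print(child_text)
print(trailing_text)
```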
Michael C. Daconta, in his article "Are Elements and Attributes Interchangeable?" (XML Journal volume 2 issue 7, page 42), presents eight practical rules for deciding whether to use elements or attributes. Some rules depend on whether the design is implemented in a DTD or schema. DTDs cannot enforce constraints between attributes and elements as extensively as schemas can. As a result, the decision to use an attribute may depend on whether a value is constrained.
Elements vs. Attributes with Semi-Structured Documents
When creating semi-structured data and content-oriented documents, place human-readable text in elements, not attributes. This is because attributes are part of the structure, not the content. If you can separate structure from content, you can extract content without tags while retaining the human-readable information.
Text within an element should be considered "viewable." Attribute values, on the other hand, are either invisible or rendered in some other way by a graphical object. Use attribute values to modify or further identify specific elements.
<prompt type="boolean">Do you want the information?
    <choice value="true">Yes, please send the information</choice>
    <choice value="false">Don't send me the information</choice>
</prompt>
<photo width="x" height="y" src="URL" alt="Text if photo not rendered or on mouse-over">This is the caption for the photo.</photo>
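The split between viewable element text and structural attribute values in these two examples can be sketched in code. This is only an illustration that assumes Python's standard xml.etree.ElementTree module; the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

PROMPT_XML = """
<prompt type="boolean">Do you want the information?
    <choice value="true">Yes, please send the information</choice>
    <choice value="false">Don't send me the information</choice>
</prompt>
"""

PHOTO_XML = (
    '<photo width="x" height="y" src="URL" '
    'alt="Text if photo not rendered or on mouse-over">'
    'This is the caption for the photo.</photo>'
)

prompt = ET.fromstring(PROMPT_XML)
photo = ET.fromstring(PHOTO_XML)

# Viewable content lives in element text.
question = prompt.text.strip()
caption = "".join(photo.itertext())

# Attribute values qualify the elements; a program reads them, a reader does not.
choices = [(choice.get("value"), choice.text) for choice in prompt.findall("choice")]
prompt_type = prompt.get("type")

print(question)
print(choices)
print(caption)
```

A program can act on the value attributes while showing only the element text to the reader.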
If you follow this rule, the value-of XSL tag, or the nodeValue property of the element's text node in the XML DOM (or the text property in the Microsoft XML DOM), can easily recondition the content for publication on an unformatted device, as illustrated below.
An XSL statement such as
<xsl:value-of select="photo"/>
produces the following text:
"This is the caption for the photo."
Elements vs. Attributes with Database Oriented Data
To contrast attributes with elements, here are two examples of student record data that are traditionally stored in a database. The first example primarily uses elements (element-centric) to store data values. The second example primarily uses attributes (attribute-centric).
<students>
    <student id="2456">
        <name>Jim Smith</name>
        <grade>10</grade>
        <gpa>3.5</gpa>
    </student>
    <student id="2457">
        <name>Mary Jones</name>
        <grade>12</grade>
        <gpa>3.4</gpa>
    </student>
    :
</students>
<students>
    <student id="2456" name="Jim Smith" grade="10" gpa="3.5" />
    <student id="2457" name="Mary Jones" grade="12" gpa="3.4" />
    :
</students>
With relational database data, the choice between attributes and elements does not appear all that important. Only unique keys, which establish a link between elements (such as student id), must be attributes to facilitate the linking of records (that is, other elements). With the attribute-centric approach, each element is a record, and each attribute is a field.
Although either approach works, it is generally recommended to use elements instead of attributes. For instance, to distinguish between first and last name, the element-centric approach can be changed to:
<name>
    <first>Jim</first>
    <last>Smith</last>
</name>
The attribute-centric approach is less favorable because the attribute must be split into two attributes.
<student ... firstname="Jim" lastname="Smith" .../>
Only the element-centric approach lets an existing XSL transform keep working unchanged after such a split.
Element-centric XSL transform:
<xsl:value-of select="name"/>
The transform above results in "Jim Smith" for both element-centric versions of the name, but the attribute-centric approach requires two different transforms.
Attribute-centric XSL transform for one attribute:
<xsl:value-of select="@name"/>
Attribute-centric XSL transform for two attributes:
<xsl:value-of select="@firstname"/> <xsl:value-of select="@lastname"/>
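The same point can be shown outside XSL. The sketch below is an analogy, not the article's transform: it assumes Python's standard xml.etree.ElementTree module, and the function names full_name and full_name_from_attributes are hypothetical. One reader function handles both element-centric forms of the name, while the attribute-centric reader has to change when the attribute is split.

```python
import xml.etree.ElementTree as ET

ELEMENT_SINGLE = '<student id="2456"><name>Jim Smith</name></student>'
ELEMENT_SPLIT = """
<student id="2456">
    <name>
        <first>Jim</first>
        <last>Smith</last>
    </name>
</student>
"""
ATTRIBUTE_SINGLE = '<student id="2456" name="Jim Smith" />'
ATTRIBUTE_SPLIT = '<student id="2456" firstname="Jim" lastname="Smith" />'


def full_name(student):
    """Read the name element's text, whether or not it has first/last children."""
    name = student.find("name")
    return " ".join("".join(name.itertext()).split())


def full_name_from_attributes(student):
    """Attribute-centric reader: must know which attribute layout is in use."""
    if student.get("name") is not None:
        return student.get("name")
    return f'{student.get("firstname")} {student.get("lastname")}'


print(full_name(ET.fromstring(ELEMENT_SINGLE)))
print(full_name(ET.fromstring(ELEMENT_SPLIT)))
print(full_name_from_attributes(ET.fromstring(ATTRIBUTE_SINGLE)))
print(full_name_from_attributes(ET.fromstring(ATTRIBUTE_SPLIT)))
```

The element-centric reader never changes because the text of an element includes the text of its children; attributes have no such nesting.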
Elements vs. Attributes with Object Oriented Data
Besides relational database data, we should also consider object-oriented data, which describes a physical object, such as a car or a wooden barrel. Like the student record, the data is highly structured. Every part and subassembly relates to the others.
For object-oriented data, the relationship between parts and subassemblies is best described using the element approach. For example,
<automobile modelno="OCC96" class="midsize">
    <name>Cutlass Ciera</name>
    <engine size="3.0l">
        <cylinders count="4" />
    </engine>
    <wheels count="4" />
    <doors>
        <door>driver
            <mirror />
            <lock type="4 button combination" />
            <window />
        </door>
        <door>front passenger
            <mirror>OBJECTS IN MIRROR ARE CLOSER THAN THEY APPEAR</mirror>
            <lock type="key" />
            <window />
        </door>
        <door>left rear
            <lock type="child safety" />
            <window openable="no" />
        </door>
    </doors>
</automobile>
Example of an object-oriented approach for a 10 gallon wooden barrel:
<barrel capacity="10g" material="wood">
    <hoops width="2in" dia="2.5ft" material="iron" />
</barrel>
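Because parts nest inside subassemblies, a program can answer questions about the object simply by walking the hierarchy. The sketch below uses a trimmed copy of the automobile example and assumes Python's standard xml.etree.ElementTree module; neither the module nor the variable names come from the original article.

```python
import xml.etree.ElementTree as ET

AUTOMOBILE_XML = """
<automobile modelno="OCC96" class="midsize">
    <name>Cutlass Ciera</name>
    <doors>
        <door>driver
            <lock type="4 button combination" />
            <window />
        </door>
        <door>front passenger
            <lock type="key" />
            <window />
        </door>
        <door>left rear
            <lock type="child safety" />
            <window openable="no" />
        </door>
    </doors>
</automobile>
"""

car = ET.fromstring(AUTOMOBILE_XML)

# Each door's label is its leading text; its lock is a child part.
locks_by_door = {
    door.text.strip(): door.find("lock").get("type")
    for door in car.iter("door")
}

# Doors whose window is marked as not openable.
fixed_window_doors = [
    door.text.strip()
    for door in car.iter("door")
    if door.find("window").get("openable") == "no"
]

print(car.findtext("name"))
print(locks_by_door)
print(fixed_window_doors)
```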
As you can see, both database-oriented and object-oriented data have little text. The data is highly structured and can be easily expressed in a tabular or hierarchical format. However, these highly structured examples would become semi-structured if the student record included teachers' comments or the automobile object included part descriptions and assembly instructions.
Widespread adoption of XML technology depends on DTD and schema designs that provide a structure convenient for humans. In other words, people will use XML only if it is easy and solves a problem.
XML may solve database problems by providing a common format to exchange data. But databases have been around for a long time, and most database problems have been solved. XML may improve B2B commercial transactions, although B2B is not new either--standards like EDI have been in use for some time. XML may also be beneficial for mathematical equations, CAD-CAM, UML (software design and modeling), and other graphically oriented content.
Managing semi-structured content for multiple audiences across multiple media (for example, the web, text-enabled handheld wireless devices, and print) is a new problem. Because of its unique design, XML is perfect for managing such content.