A Quick Introduction to XML
This document provides a quick introduction to some of the terms and concepts used in the analysis of
the XML documents in the Tutorial section of the CellML website. The terms are taken from the original XML specification [http://www.w3.org/TR/1998/REC-xml-19980210] published in February 1998 by
the World Wide Web Consortium.
The following online resources provide more thorough documentation on XML:
http://www.w3.org/XML/ - the W3C's XML page.
http://www.ucc.ie/xml/ - the official XML FAQ.
http://www.xml.com/axml/testaxml.htm - the annotated XML specification.
http://www.oasis-open.org/cover/xml.html - the XML Cover pages.
The following list of terms is by no means exhaustive, and the definitions are in some cases incomplete:
XML
XML stands for eXtensible Markup Language, and it is a standard
for structured text documents developed by the World Wide Web
Consortium [http://www.w3.org/] (W3C). The W3C represents
about 500 paying member companies and is responsible for many of
the standards relating to the internet, including HTML. XML can be
used to structure text in such a way that it is readable by both humans and machines, and it presents a simple format for the exchange
of information across the internet between computers. As such, electronic commerce is the principal application area for XML.
XML is a simplification (or subset) of the Standard Generalized
Markup Language (SGML), which grew out of work begun in the 1970s
on the large-scale storage of structured text documents and was
standardized as ISO 8879 in 1986.
XML document
An XML document contains a prolog and a body. The prolog consists of an XML declaration, possibly followed by a document type
declaration. The body is made up of a single root element, possibly
with some comments and/or processing instructions. An XML document is typically a computer file whose contents meet the requirements laid out in the XML specification. However, XML documents
may also be generated "on the fly" by a computer responding to a request from another computer. For instance, an XML document may
be dynamically compiled from information contained in a database.
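As a rough illustration of a document generated "on the fly", the sketch below builds a prolog and a single root element from in-memory data. It uses Python's standard xml.etree.ElementTree module, which is an assumption made for illustration (the tutorial itself does not prescribe any tool), and the rows list is a hypothetical stand-in for the result of a database query.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for the result of a database query.
rows = [
    ("sub_element_1", "data for sub-element 1"),
    ("sub_element_2", "data for sub-element 2"),
]

# The body: a single root element holding one sub-element per row.
root = ET.Element("root_element")
for tag, text in rows:
    child = ET.SubElement(root, tag)
    child.text = text

# Serialise the prolog (an XML declaration) followed by the body.
# The xml_declaration argument requires Python 3.8 or later.
document = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(document.decode("utf-8"))
```

The resulting bytes could be written to a file or sent directly in response to a request; either way, the receiver sees an ordinary XML document.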
XML Declaration
An XML document should begin with an XML declaration, which, if
present, must be the very first thing in the document. The
declaration is used by the processing software to work out how to
deal with the subsequent XML content. A typical XML declaration is
shown below. The encoding of a document is particularly important,
as XML processors will assume UTF-8 when an 8-bit-per-character
document does not declare its encoding. This will cause characters
to be rendered incorrectly, or the document to be rejected, if it
actually uses the Latin-1 encoding (iso-8859-1) without declaring
it. XML processing applications are required
to handle both UTF-8 and 16-bit-per-character (UTF-16) Unicode
documents, which makes XML a truly international format, able to
handle most modern languages.
<?xml version="1.0" encoding="iso-8859-1"?>
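The effect of the encoding declaration can be sketched with Python's standard xml.etree.ElementTree parser (chosen here only for illustration; other processors may report the problem differently). The same Latin-1 bytes are parsed once with the declaration and once without it; the <word> element is a made-up example.

```python
import xml.etree.ElementTree as ET

# "\u00e9" (e-acute) is the single byte 0xE9 in ISO-8859-1. On its own,
# that byte is not a valid UTF-8 sequence.
declared = (
    '<?xml version="1.0" encoding="iso-8859-1"?><word>caf\u00e9</word>'
).encode("iso-8859-1")
undeclared = "<word>caf\u00e9</word>".encode("iso-8859-1")

# With the declaration, the parser decodes the bytes as Latin-1.
text = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the stray byte.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(text, undeclared_ok)
```

This particular parser refuses the undeclared document outright; a more lenient tool might instead display the wrong character.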
Document Type Declaration
A document author can use an optional document type declaration
after the XML declaration to indicate what the root element of the
XML document will be and possibly to point to a document type
definition. A typical document type declaration for a CellML document is shown below. Note that the document type declaration facility defined in the XML specification provides a lot more functionality than what is discussed or shown here.
<!DOCTYPE model SYSTEM "http://www.cellml.org/cellml/cellml_1_1.dtd">
Start / End Tag
The simplest way of encoding the meaning of a piece of text in
XML is to enclose it inside start and end tags. A start tag consists of
the tag-name in between less-than and greater-than signs, and the
matching end tag has a slash preceding the tag-name, as shown below. A well-formed XML document has an end-tag that matches
every start-tag.
<my_tag> the text data </my_tag>
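A minimal sketch of a well-formedness check, assuming Python's standard xml.etree.ElementTree parser: the parser raises an error when an end-tag is missing or does not match its start-tag. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

matched = is_well_formed("<my_tag> the text data </my_tag>")
mismatched = is_well_formed("<my_tag> the text data </other_tag>")
unclosed = is_well_formed("<my_tag> the text data ")

print(matched, mismatched, unclosed)
```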
Element
The combination of start-tag, data and end-tag is known as an element. The data may be plain text (as in the example above), further
elements (sub-elements), or a combination of text and sub-elements.
A document is usually made up of a tree of elements with a single
root element as shown below.
<root_element>
<sub_element_1> data for sub-element 1 </sub_element_1>
<sub_element_2> data for sub-element 2 </sub_element_2>
</root_element>
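The tree structure can be explored programmatically. The sketch below, which assumes Python's standard xml.etree.ElementTree module, parses the example above and lists the sub-elements of the root.

```python
import xml.etree.ElementTree as ET

source = """<root_element>
<sub_element_1> data for sub-element 1 </sub_element_1>
<sub_element_2> data for sub-element 2 </sub_element_2>
</root_element>"""

root = ET.fromstring(source)

# Iterating over an element yields its direct sub-elements in document order.
children = [(child.tag, child.text.strip()) for child in root]

print(root.tag)
for tag, text in children:
    print(tag, "->", text)
```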
Attribute
Another way of putting data into an XML document is by adding attributes to start tags. The value of the attribute is usually intended to
be data relevant to the content of the current element. Whitespace is
used to separate attributes from the tag-name and each other. Each
attribute has a name followed by an equals sign and the value of the
attribute. The value of the attribute is enclosed in single or double
quotes. In the example below, <my_tag> has two attributes:
att_1 and att_2.
<my_tag att_1="1" att_2="2"> the text data </my_tag>
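As a sketch of how a program might read these attributes, again assuming Python's standard xml.etree.ElementTree module: attribute values arrive as strings whichever quote style was used, so any conversion to a number is up to the application.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are interchangeable around attribute values.
elem = ET.fromstring("<my_tag att_1=\"1\" att_2='2'> the text data </my_tag>")

attrs = dict(elem.attrib)       # all attributes as a name -> value mapping
missing = elem.get("att_3")     # None when the attribute is absent
total = int(attrs["att_1"]) + int(attrs["att_2"])  # explicit conversion

print(attrs, missing, total)
```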
Empty Element
If an element has no content, the end-tag can be left out. In this case,
a slash is added to the end of the start-tag to indicate that this is an
empty element. Element content is anything that the XML specification allows to appear between a start-tag and an end-tag, such as
text, sub-elements, comments and processing instructions. An empty
element may still have attributes, as shown below.
<my_empty_element att_1="1" att_2="2" />
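The empty-element form is shorthand for a start-tag immediately followed by its end-tag. The sketch below, assuming Python's standard xml.etree.ElementTree module, parses both spellings and compares the results.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<my_empty_element att_1="1" att_2="2" />')
long_form = ET.fromstring(
    '<my_empty_element att_1="1" att_2="2"></my_empty_element>'
)

# Both spellings give an element with the same name and attributes,
# no sub-elements and no text content.
same_tag = short_form.tag == long_form.tag
same_attrs = short_form.attrib == long_form.attrib
both_empty = all(len(e) == 0 and not e.text for e in (short_form, long_form))

print(same_tag, same_attrs, both_empty)
```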
Document Type Definition
The Uniform Resource Identifier (URI) in a document type declaration can point to a document known as a document type definition
(DTD). The format for a DTD is defined in the XML Specification
and is not the same as for an XML document. A DTD may contain a
set of rules that specify how the different tags in an XML document
can be used together and the attributes that may belong to each tag.
Most XML processors provide checking of XML documents against
a DTD, allowing applications to quickly and painlessly check that
the structure of an XML document is roughly correct.
DTDs do not allow the specification of constraints on element and
attribute content, such as "the value of the att_1 attribute must
be a number". This kind of validation can be handled by using XML
Schema [http://www.w3.org/XML/Schema], the successor to DTDs,
which defines an XML-based file format.
Comment
A document author can place comments in XML documents to add
annotations intended for other humans reading the document. The
contents of a comment are not regarded as part of the document's
data. A comment is started with a less-than sign, exclamation mark,
and two hyphens, and is ended with two-hyphens and a greater-than
sign, as shown below. Comments may not be placed inside start- or
end-tags.
<my_tag> content <!-- comment on content --></my_tag>
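That a comment is not part of the document's data can be seen with a parser. In the sketch below, Python's standard xml.etree.ElementTree parser (an illustrative choice; by default it discards comments) returns only the text content of the element.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<my_tag> content <!-- comment on content --></my_tag>")

data = elem.text          # the element's text, without the comment
child_count = len(elem)   # the comment does not appear as a sub-element

print(repr(data), child_count)
```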
XML Namespace
Namespaces in XML [http://www.w3.org/TR/REC-xml-names/] is a
companion specification to the main XML specification. It provides
a facility for associating the elements and/or attributes in all or part
of a document with a particular vocabulary (a namespace), as identified
by a URI. The key aspect of the URI is that it is unique. The value of
the URI need not have anything to do with the XML document that uses it,
although typically it would be a good location for the XML Schema or
DTD that defines the rules for the document type. The URI may be
mapped to a prefix which may then be used in front of tag and
attribute names, separated by a colon. If not mapped to a prefix, the
URI sets the default namespace for the current element and all of its
children.
A namespace declaration looks like an attribute on a start tag, but
can be identified by the keyword xmlns. In the following example,
the default namespace is set to the CellML namespace, and the
MathML namespace is declared and mapped to the mathml prefix,
which is then used on a <math> element. Note that the <model>
element and any child elements that carry no default namespace
declaration or namespace prefix of their own (such as the
<component> element) will be in the CellML namespace.
<model
xmlns="http://www.cellml.org/cellml/1.1#"
xmlns:mathml="http://www.w3.org/1998/Math/MathML">
<component> ... </component>
<mathml:math> ... math goes here ... </mathml:math>
</model>
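A namespace-aware parser resolves prefixes and defaults to the underlying URIs. The sketch below assumes Python's standard xml.etree.ElementTree module, which reports each tag as {namespace-URI}local-name; the short prefix m used in the lookup is an arbitrary choice, showing that only the URI matters, not the prefix.

```python
import xml.etree.ElementTree as ET

source = """<model
xmlns="http://www.cellml.org/cellml/1.1#"
xmlns:mathml="http://www.w3.org/1998/Math/MathML">
<component> ... </component>
<mathml:math> ... math goes here ... </mathml:math>
</model>"""

root = ET.fromstring(source)

# Each tag is reported together with the namespace URI it belongs to.
tags = [root.tag] + [child.tag for child in root]

# Lookups use a caller-chosen prefix mapped to the same URI.
math = root.find("m:math", {"m": "http://www.w3.org/1998/Math/MathML"})

for tag in tags:
    print(tag)
```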