XML (Extensible Markup Language)


XML is a technology specification that enables you to create highly structured documents. The ML in XML stands for Markup Language. A markup language is any language that takes raw text and adds annotations. What is special about XML? It focuses on document semantics. This means that you can identify specific document parts and assign them specific meaning.
For example, if you are representing biological sequence data, you can clearly identify which portion of the document contains sequence identifiers and cross-references to public databases, and which portion contains raw sequence data. These sections are clearly marked and organized in a hierarchical document structure. A human reader or a computer program can therefore easily traverse a complete document and extract individual pieces of data.

Here is an example of biological sequence data recorded in HTML and XML.

<html>
<body>
<h1>NM-171533</h1>
Organism: <b>Caenorhabditis elegans</b>
<p>
agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaaggg
aacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgca
aacatgtttattctg
</p>
<p>
[Full sequence has been omitted for brevity.]
</p>
</body>
</html>
                                 HTML Code
 

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Sequence>
<accession>NM-171533</accession>
<organism>Caenorhabditis elegans</organism>
<sequence_data>
agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaagggaacaccttctaccgcttcactttttacaacgctgatgctcagtcaaccatcttcttct
acagctgttttacagtgtacatattgtggaagctcgtgcacatcttcccaattgcaaacatgtttattc
[Full sequence has been omitted for brevity.]
</sequence_data>
</Sequence>
                                XML Code
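
Because the XML version labels each part of the record, a program can pull out individual fields by name. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; the sequence text is shortened and the variable names are illustrative assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML record above (sequence truncated for brevity).
record = """<Sequence>
<accession>NM-171533</accession>
<organism>Caenorhabditis elegans</organism>
<sequence_data>
agcacatgacatgagcagtgccccaaatgatgactgtgagatcgacaaggg
</sequence_data>
</Sequence>"""

root = ET.fromstring(record)

# Each field is addressed by its tag name, not by its position in the text.
accession = root.findtext("accession")
organism = root.findtext("organism")

# Remove line breaks and surrounding whitespace from the raw sequence.
sequence = "".join(root.findtext("sequence_data").split())

print(accession, organism, len(sequence))
```

Doing the same with the HTML version would require guessing that the h1 element holds the accession and the b element holds the organism, because the HTML tags describe presentation rather than meaning.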

So what's wrong with HTML?
  • One tag set for all applications
  • Predefined semantics for each tag
  • Predefined data structures
  • No formal validation
  • HTML is well suited to simple applications, but poorly suited to more demanding ones, such as large or complex collections of data, or data intended to drive scripts or Java applets.

What Does XML Provide?
  • Extensibility: Users can define new tags as needed.
  • Structure: Hierarchical data can be modeled at any level of complexity.
  • Validation: Data can be checked for structural correctness (DTD, XSD).
  • Media independence: The same content can be published in multiple media.

The XML declaration on the first line of the XML example has four parts:
1. <?xml ... ?>
<?xml marks the beginning of the XML declaration, and ?> marks its end.
2. version="1.0"
Gives the XML version; here it states that the document follows the W3C XML 1.0 standard.
3. encoding="UTF-8"
Names the character encoding of the document, such as UTF-8, UTF-16, or GB2312. Default: UTF-8.
4. standalone="yes"
"yes" means the document does not rely on an external DTD (any DTD is included in the document itself); "no" means an external DTD may be referenced. Default: no.
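
As a rough check of these declaration fields, Python's standard xml.dom.minidom records them on the parsed document. The sketch below assumes the version, encoding, and standalone attributes that CPython's minidom sets on its Document object; the Sequence element is only a placeholder.

```python
from xml.dom import minidom

# Placeholder document; only the declaration at the start matters here.
source = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Sequence/>'
doc = minidom.parseString(source)

print(doc.version)     # version from the declaration
print(doc.encoding)    # encoding from the declaration
print(doc.standalone)  # True for "yes", False for "no", None if omitted
```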

Difference between RDBMS and XML format for storing data:

             RDBMS                XML
Structure    Tables               Hierarchical tree, graph
Schema       Fixed in advance     Flexible, "self-describing"
Queries      SQL (simple)         XPath, XQuery
Ordering     None                 Has inherent ordering

Let's see an example: how to model data in XML
Students Table
id    name         age
111   Michael R.   21
112   John D.      20

Majors Table
id    major
111   Biology
112   Physics
112   Computer Science

Grades Table
id    course            grade
111   Math 101          B
111   Biology 101       B+
111   Statistics 101    A
112   Physics 101       A
112   Math 101          A
112   Programming 101   B+

The above tables can be represented in XML as follows:

<Students>
<student id="111">
<name>Michael R.</name>
<age>21</age>
<major>Biology</major>
<results>
<result course="Math 101" grade="B"/>
<result course="Biology 101" grade="B+"/>
<result course="Statistics 101" grade="A"/>
</results>
</student>
<student id="112">
<name>John D.</name>
<age>20</age>
<major>Physics</major>
<major>Computer Science</major>
<results>
<result course="Math 101" grade="A"/>
<result course="Physics 101" grade="A"/>
<result course="Programming 101" grade="B+"/>
</results>
</student>
</Students>
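
The mapping from the three tables to the nested document can be sketched in Python. This is one possible way to do it, assuming the rows are held as plain tuples; the variable names are illustrative, and the element and attribute names follow the XML above.

```python
import xml.etree.ElementTree as ET

# Rows copied from the Students, Majors, and Grades tables above.
students = [(111, "Michael R.", 21), (112, "John D.", 20)]
majors = [(111, "Biology"), (112, "Physics"), (112, "Computer Science")]
grades = [
    (111, "Math 101", "B"), (111, "Biology 101", "B+"),
    (111, "Statistics 101", "A"), (112, "Physics 101", "A"),
    (112, "Math 101", "A"), (112, "Programming 101", "B+"),
]

root = ET.Element("Students")
for sid, name, age in students:
    student = ET.SubElement(root, "student", id=str(sid))
    ET.SubElement(student, "name").text = name
    ET.SubElement(student, "age").text = str(age)
    # The join on id becomes nesting: matching rows turn into child elements.
    for major_sid, major in majors:
        if major_sid == sid:
            ET.SubElement(student, "major").text = major
    results = ET.SubElement(student, "results")
    for grade_sid, course, grade in grades:
        if grade_sid == sid:
            ET.SubElement(results, "result", course=course, grade=grade)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The id column that links the three tables disappears from the majors and results, because the parent student element already carries it.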
Q/A :
1. You're creating a database to contain information about university records: students, courses, grades, etc. Should you use the relational model or XML?
2. You're creating a database to contain information for a university web site: news, academic announcements, admissions, events, research, etc. Should you use the relational model or XML?

Well-Formed XML:
  • The XML declaration (with the XML version), if present, comes first.
  • There is only one root element; all other elements are descendants of the root element.
  • Tags must be correctly closed.
  • Elements must be correctly nested.
  • Any namespace prefix used must be declared.
  • Attribute values must be enclosed in single or double quotation marks.
  • XML tags are case sensitive.
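
Several of these rules can be checked mechanically: a conforming parser rejects any document that breaks them. Below is a small Python sketch using the standard library; the helper name is_well_formed is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b>ok</b></a>"))      # correctly nested and closed
print(is_well_formed("<a><b>bad</a></b>"))     # incorrectly nested
print(is_well_formed("<a>one</a><a>two</a>"))  # more than one root element
print(is_well_formed("<a id=1/>"))             # attribute value without quotes
print(is_well_formed("<Name>x</name>"))        # tag names differ in case
print(is_well_formed("<c:a>x</c:a>"))          # namespace prefix never declared
```

Only the first call should report a well-formed document; each of the others violates one rule from the list.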
The syntax of XML
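XML Namespace: An XML namespace is a collection of XML element and/or attribute names that are guaranteed to be unique. The basic trick is to use DNS (Domain Name System) names to ensure uniqueness. In the following document, for example, two different vocabularies both use an element named body: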
<?xml version="1.0" encoding="ISO-8859-15"?>
<html>
<body>
<p>Welcome to I308 Information representation</p>
</body>
<body>
<author>Prof. David Wild</author>
<days>Monday and Wednesday</days>
</body>
</html>
  • Namespaces solve the problem of elements from different vocabularies sharing the same name.
  • One namespace typically corresponds to one vocabulary, for example one DTD.
  • A namespace is declared in the start tag of an element.
  • A namespace name should be unique. The prefixes xml and xmlns are reserved and cannot be used for your own namespaces.
  • Both elements and attributes can belong to a namespace.
  • A namespace declaration is a special reserved XML attribute that you place in a start tag. The reserved attribute name introduces the prefix you attach to the namespace.
  • The attribute name begins with xmlns:. The colon separates xmlns from the prefix you choose for your namespace.
  • Here is a link about XML namespace

To rectify the above code using namespaces, it should look something like this:
<?xml version="1.0" encoding="ISO-8859-15"?>
<html:html xmlns:html='http://www.w3.org/TR/xhtml1/'>
<html:body>
<html:p>Welcome to I308 Information representation</html:p>
</html:body>
 
<course:body xmlns:course='http://www.example.org/course'>
<course:author>Prof. David Wild</course:author>
<course:days>Monday and Wednesday</course:days>
</course:body>
 
</html:html>
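
Once the two body elements live in different namespaces, a program can tell them apart. A minimal Python sketch, assuming the namespace-qualified document with matching start and end tags; the short prefixes h and c are local to this script and need not match the prefixes used in the document.

```python
import xml.etree.ElementTree as ET

doc = """<html:html xmlns:html='http://www.w3.org/TR/xhtml1/'>
<html:body>
<html:p>Welcome to I308 Information representation</html:p>
</html:body>
<course:body xmlns:course='http://www.example.org/course'>
<course:author>Prof. David Wild</course:author>
<course:days>Monday and Wednesday</course:days>
</course:body>
</html:html>"""

root = ET.fromstring(doc)

# Script-local prefixes mapped to the namespace URIs declared in the document.
ns = {
    "h": "http://www.w3.org/TR/xhtml1/",
    "c": "http://www.example.org/course",
}

html_body = root.find("h:body", ns)
course_body = root.find("c:body", ns)

# ElementTree stores names as {namespace-URI}local-name.
print(html_body.tag)
print(course_body.find("c:author", ns).text)
```

What identifies an element is the namespace URI plus the local name; the prefix is only shorthand.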
Here is a list of resources for studying XML:
W3C Schools
W3C XML
XML.com

There are 5 predefined entity references in XML:


Entity reference   Character   Description
&lt;               <           less than
&gt;               >           greater than
&amp;              &           ampersand
&apos;             '           apostrophe
&quot;             "           quotation mark

Note: Only the characters "<" and "&" are strictly illegal in XML. The greater than character is legal, but it is a good habit to replace it.
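
In practice the replacement is usually left to a library. A short Python sketch using xml.sax.saxutils from the standard library; the sample string and the quotes mapping are assumptions made for illustration.

```python
from xml.sax.saxutils import escape, unescape

raw = 'x < y & name = "O\'Neil"'

# escape() always replaces &, < and >; the extra mapping adds the two quote entities.
quotes = {'"': "&quot;", "'": "&apos;"}
escaped = escape(raw, quotes)
print(escaped)

# unescape() reverses the replacement, given the inverse mapping for the quotes.
restored = unescape(escaped, {entity: char for char, entity in quotes.items()})
print(restored == raw)
```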


XML DTD (Document Type Definition)
  • A Document Type Definition (DTD) defines the legal building blocks of an XML document. It defines the document structure with a list of legal elements and attributes.
  • A DTD can be declared inside an XML document or as an external reference.
  • It is a grammar-like language for specifying elements, attributes, nesting, ordering, and number of occurrences.
  • It has the special attribute types ID and IDREFS.


More information about DTDs, including an example of an internal DTD with XML, is available at W3C.
Elements of a DTD:
<!DOCTYPE rootElement [ ... ]>                  ---> Document type declaration
<!ELEMENT name (sub-elements or type)>          ---> Element declaration
<!ATTLIST elementName attributeName type ...>   ---> Attribute declaration
<!ENTITY entityName "content">                  ---> Entity declaration


Here is an example of DTD with XML

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!-- file name: DTDDemo.xml -->
<!-- Sample DTD for representing protein data-->
<!-- A PROTEIN-SET can have one or more PROTEIN elements -->
<!DOCTYPE PROTEIN-SET[
<!ELEMENT PROTEIN-SET (PROTEIN+)>
<!-- Main PROTEIN Element -->
<!ELEMENT PROTEIN (ACCESSION, ENTRY-NAME, PROTEIN-NAME, GENE-NAME+,
ORGANISM, COMMENT*, KEYWORD*)>
<!-- Sub Elements containing PCDATA -->
<!ELEMENT ACCESSION (#PCDATA)>
<!ELEMENT ENTRY-NAME (#PCDATA)>
<!ELEMENT PROTEIN-NAME (#PCDATA)>
<!ELEMENT GENE-NAME (#PCDATA)>
<!ELEMENT COMMENT (#PCDATA)>
<!ELEMENT KEYWORD (#PCDATA)>
<!-- ORGANISM for referencing NCBI Taxonomy ID -->
<!ELEMENT ORGANISM (#PCDATA)>
<!ATTLIST ORGANISM taxonomy-id NMTOKEN #REQUIRED>
]>
 
<PROTEIN-SET>
<PROTEIN>
<ACCESSION>P26954</ACCESSION>
<ENTRY-NAME>IL3B-MOUSE</ENTRY-NAME>
<PROTEIN-NAME>Interleukin-3 receptor class II beta chain
[Precursor]
</PROTEIN-NAME>
<GENE-NAME>CSF2RB2</GENE-NAME>
<GENE-NAME>AI2CA</GENE-NAME>
<GENE-NAME>IL3RB2</GENE-NAME>
<GENE-NAME>IL3R</GENE-NAME>
<ORGANISM taxonomy-id="10090">Mus musculus</ORGANISM>
<COMMENT>FUNCTION: IN MOUSE THERE ARE TWO CLASSES OF HIGH-AFFINITY IL-3 RECEPTORS. ONE CONTAINS THIS
IL-3-SPECIFIC BETA CHAIN AND THE OTHER CONTAINS THE BETA CHAIN ALSO SHARED BY HIGH-AFFINITY IL-5 AND GM-CSF
RECEPTORS.</COMMENT>
<COMMENT>SUBUNIT: Heterodimer of an alpha and a beta chain.</COMMENT>
<COMMENT>SUBCELLULAR LOCATION: Type I membrane protein.</COMMENT>
<COMMENT>SIMILARITY: BELONGS TO THE CYTOKINE FAMILY OF RECEPTORS.</COMMENT>
<KEYWORD>Receptor</KEYWORD>
<KEYWORD>Glycoprotein</KEYWORD>
<KEYWORD>Signal</KEYWORD>
</PROTEIN>
</PROTEIN-SET>
 
A DTD has several types of declarations, such as element, attribute, entity, and notation declarations. Here is a link describing the different types of declarations.
Some regular-expression-style occurrence indicators commonly used in an XML DTD:

Sign      Number of occurrences
No sign   Exactly one (1)
?         Zero or one (0|1)
*         Zero or more (0|1|many)
+         One or more (1|many)
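
These signs behave like regular-expression quantifiers, which the following Python sketch makes explicit by translating a content model into a regular expression over child element names. It is only an illustration, not a DTD validator: it assumes a flat, comma-separated model without choice groups or nesting, and the function names are invented for this example.

```python
import re

def content_model_to_regex(model):
    """Translate a flat, comma-separated DTD content model into a regex over child names."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?*+" else ""  # no sign means exactly one
        name = item.rstrip("?*+")
        # Each child name is followed by one space, which acts as a separator.
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))

def children_match(model, child_names):
    """Check whether a sequence of child element names satisfies the content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None

# Content model of the PROTEIN element from the DTD above.
model = ("(ACCESSION, ENTRY-NAME, PROTEIN-NAME, GENE-NAME+, "
         "ORGANISM, COMMENT*, KEYWORD*)")

ok = ["ACCESSION", "ENTRY-NAME", "PROTEIN-NAME", "GENE-NAME", "GENE-NAME",
      "ORGANISM", "KEYWORD"]
missing_gene = ["ACCESSION", "ENTRY-NAME", "PROTEIN-NAME", "ORGANISM"]

print(children_match(model, ok))            # GENE-NAME+ satisfied, COMMENT* may be absent
print(children_match(model, missing_gene))  # GENE-NAME+ needs at least one occurrence
```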



To validate the XML against the DTD you need a validator; you can use Oxygen or xmllint. Online validation resources are available at xmlvalidation and W3C.
Here is a picture of an XML validation I did with xmllint on the above XML and DTD. I named the file protein.xml. The validator checks the file and echoes the whole XML file on screen. If the XML file has a syntax error or any other problem, the program reports an error and indicates which part is causing it. You can also use Oxygen, a GUI-based tool that is easy to use and offers many other features.

valid.png

XML Schema (XSD)
  • Like DTDs, it can specify elements, attributes, nesting, ordering, and number of occurrences.
  • It also supports data types, namespaces, keys, (typed) pointers, and more.

XSD Advantages
  • XSD has a much richer language for describing what element or attribute content "looks like." This is related to the type system.
  • XSD supports inheritance, where one schema can inherit from another schema. This is a great feature because it provides the opportunity for reuse.

XSD Disadvantages
  • Verbose language, hard to read and write.
  • Provides no mechanism for adding new primitive data types; user-defined types must be derived from the built-in ones.

Querying XML
Sequence of development
  • XPath ---> consists of path expressions and conditions
  • XSLT ---> XPath + transformation, output formatting
  • XQuery ---> XPath + full-featured query language

XPath
  • Not as mature as SQL
  • No underlying algebra

Expression   Description
nodename     Selects all child elements of the current node with the given name
/            Selects from the root node
//           Selects nodes in the document from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
*            Matches any element node
@*           Matches any attribute node
node()       Matches any node of any kind

Operator   Description
|          Computes the union of two node-sets
+          Addition
-          Subtraction
*          Multiplication
div        Division
mod        Modulus (division remainder)
=          Equal
!=         Not equal
<          Less than
<=         Less than or equal to
>          Greater than
>=         Greater than or equal to
or         Logical or
and        Logical and

Predicates are always embedded in square brackets. They are used to find a specific node or a node that contains a specific value.

Let's look at the Bookstore XML file and think of XML as a tree.
Web scraping with Dapper: let's look at this cool tool. Here is another way, using Google Docs, to do web scraping.
Let's see a demo of XPath using the Bookstore version 2. Use the online XPath Query tool to perform your queries, or download Oxygen, which is freely available at IUware for students.

Some XPath query examples (each query is followed by its expression):
All book titles
    /Bookstore/Book/Title
All book and magazine titles
    /Bookstore/(Book|Magazine)/Title
All book ISBNs
    /Bookstore/Book/data(@ISBN)
All books costing less than $90
    /Bookstore/Book[@Price < 90]
Titles of books costing less than $90 where "Ullman" is an author
    /Bookstore/Book[@Price < 90 and Authors/Author/Last_Name = "Ullman"]/Title
Titles of books with a remark containing "great"
    //Book[contains(Remark, "great")]/Title
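
A few of these queries can also be tried from Python. The sketch below uses xml.etree.ElementTree, which supports only a subset of XPath, so the numeric comparison and contains() are emulated in plain Python. The sample document is a hypothetical stand-in for the Bookstore file; its titles, prices, ISBNs, and first names are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the Bookstore file; all values are made up.
bookstore = ET.fromstring("""
<Bookstore>
  <Book ISBN="ISBN-0001" Price="85">
    <Title>Sample Database Book</Title>
    <Authors>
      <Author><First_Name>Sam</First_Name><Last_Name>Ullman</Last_Name></Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-0002" Price="100">
    <Title>Another Sample Book</Title>
    <Authors>
      <Author><First_Name>Ann</First_Name><Last_Name>Example</Last_Name></Author>
    </Authors>
    <Remark>A great read</Remark>
  </Book>
  <Magazine><Title>Sample Magazine</Title></Magazine>
</Bookstore>
""")

# /Bookstore/Book/Title (paths here are relative to the root element)
titles = [t.text for t in bookstore.findall("Book/Title")]

# /Bookstore/Book/data(@ISBN)
isbns = [b.get("ISBN") for b in bookstore.findall("Book")]

# /Bookstore/Book[@Price < 90]: ElementTree has no numeric predicates,
# so the comparison is done in Python instead.
cheap = [b.findtext("Title") for b in bookstore.findall("Book")
         if float(b.get("Price")) < 90]

# //Book[contains(Remark, "great")]/Title, with contains() emulated in Python
great = [b.findtext("Title") for b in bookstore.iter("Book")
         if "great" in (b.findtext("Remark") or "")]

print(titles, isbns, cheap, great)
```

A full XPath engine, such as the one in Oxygen, evaluates the original expressions directly.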
Here is a nice link to XPath Tutorial

XQuery

  • Compositional language
  • Every XPath expression is also an XQuery expression.
  • Extract information to use in web service.

The most commonly used XQuery expression is the FLWOR expression (FLWOR is an acronym: FOR, LET, WHERE, ORDER BY, RETURN. FLWOR is loosely analogous to SQL's SELECT-FROM-WHERE and can be used to provide join-like functionality over XML documents).

Note: The difference between for and let

for $x in (1,2,3,4) let $y := ("a","b","c") return ($x,$y)
output: 1,a,b,c,2,a,b,c,3,a,b,c,4,a,b,c

for $x in (1,2,3,4) for $y in ("a","b","c") return ($x,$y)
What should be the output?
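
To check your answer, the two forms can be mimicked in Python. This is only an analogy, assuming Python lists stand in for XQuery sequences: let binds the whole sequence once per iteration, while for steps through it item by item.

```python
xs = [1, 2, 3, 4]
ys = ["a", "b", "c"]

# for $x in (1,2,3,4) let $y := ("a","b","c") return ($x,$y)
with_let = []
for x in xs:
    y = ys                    # let: y is the entire sequence
    with_let.extend([x, *y])  # XQuery sequences are flat, so extend rather than nest

# for $x in (1,2,3,4) for $y in ("a","b","c") return ($x,$y)
with_for = []
for x in xs:
    for y in ys:              # for: y takes one item per iteration
        with_for.extend([x, y])

print(with_let)
print(with_for)
```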
Let's perform some XQuery using Bookstore version 2. We will see some XPath and XQuery in the lab session.
Query: Titles of books costing less than $90 where "Ullman" is an author
XQuery:
for $b in doc("Untitled1.xml")/Bookstore/Book
where $b/@Price < 90
and $b/Authors/Author/Last_Name = "Ullman"
return <Book>
{ $b/Title }
</Book>
Query: Titles and author first names of books whose title contains one of the author's first names
XQuery:
for $b in doc("Untitled1.xml")/Bookstore/Book
where some $fn in $b/Authors/Author/First_Name
satisfies contains($b/Title, $fn)
return <Book>
{ $b/Title }
{ $b/Authors/Author/First_Name }
</Book>
Query: The same, but returning only the first names that appear in the title
XQuery:

for $b in doc("Untitled1.xml")/Bookstore/Book
where some $fn in $b/Authors/Author/First_Name
satisfies contains($b/Title, $fn)
return <Book>
{ $b/Title }
{ for $fn in $b/Authors/Author/First_Name
where contains ($b/Title, $fn) return $fn
}
</Book>
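
For comparison, the last FLWOR query can be written as an ordinary loop in Python. This is a loose analogy rather than an XQuery implementation: the sample data is hypothetical (the real Bookstore file is not reproduced here), and the Result wrapper element is an assumption added so that the output has a single root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; titles and names are made up.
bookstore = ET.fromstring("""
<Bookstore>
  <Book Price="85">
    <Title>Sam Explains Databases</Title>
    <Authors>
      <Author><First_Name>Sam</First_Name><Last_Name>Ullman</Last_Name></Author>
      <Author><First_Name>Pat</First_Name><Last_Name>Example</Last_Name></Author>
    </Authors>
  </Book>
  <Book Price="100">
    <Title>Another Sample Book</Title>
    <Authors>
      <Author><First_Name>Ann</First_Name><Last_Name>Example</Last_Name></Author>
    </Authors>
  </Book>
</Bookstore>
""")

result = ET.Element("Result")
for b in bookstore.findall("Book"):                       # for $b in .../Bookstore/Book
    title = b.findtext("Title")
    first_names = [fn.text for fn in b.findall("Authors/Author/First_Name")]
    matching = [fn for fn in first_names if fn in title]  # inner for ... where contains(...)
    if matching:                                          # where some $fn ... satisfies ...
        book = ET.SubElement(result, "Book")              # return <Book> ... </Book>
        ET.SubElement(book, "Title").text = title
        for fn in matching:
            ET.SubElement(book, "First_Name").text = fn

print(ET.tostring(result, encoding="unicode"))
```

The for, where, and return clauses map onto the loop, the if test, and the element construction respectively.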