XML for BIOINFORMATICS
LISA MARYANNE BENTO
DEPARTMENT OF COMPUTER SCIENCE
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
Bioinformatics is a new field of scientific inquiry devoted to answering questions about life by using computational resources. A key goal of bioinformatics is to create database systems and software platforms capable of storing and analyzing large sets of biological data. XML is currently used to encode a wide range of biological data and has rapidly become a critical tool in bioinformatics. The real strength of XML is that it enables communities to create common XML formats and then use these formats to share data. XML therefore enables individual researchers, software applications, and database systems to exchange and share biological data. BSML (Bioinformatic Sequence Markup Language) is an open XML standard for representing and exchanging bioinformatics sequence data. The Distributed Annotation System (DAS), presented here as a case study of XML for bioinformatics, is an XML-based protocol that facilitates the distribution and sharing of genome annotation data.
Bioinformatics is a new field of scientific inquiry devoted to answering questions about life by using computational resources. A key goal of bioinformatics is to create database systems and software platforms capable of storing and analyzing large sets of biological data. Given the exponential growth of biological data sets and the desire to share data for open scientific exchange, the bioinformatics community is continually exploring new options for data representation, storage, and exchange. In the past few years, many in the bioinformatics community have turned to XML to address the pressing needs associated with biological data. XML, or Extensible Markup Language, is a technical specification originally created for data representation and exchange over the Internet. XML is an open standard, officially specified by the World Wide Web Consortium (W3C), and deliberately designed to be operating system and programming language independent. Since its introduction, XML has been successfully used to represent a growing set of biological data, including nucleotide sequences, genome annotations, protein-protein interactions, and signal transduction pathways. XML also forms the backbone of biological data exchange, enabling researchers to aggregate data from multiple heterogeneous data sources.
Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioral or health data,
including those to acquire, store, organize, archive, analyze, or visualize such data.
2.1 What is Bioinformatics?
• Interface of biology and computers
• Analysis of proteins, genes and genomes using computer algorithms and computer databases
• Analysis and storage of the billions of DNA base pairs that are sequenced by genomics projects
Bioinformatics is the science of using and developing computational tools and algorithms to help solve biological problems. These problems include similarity searches of unknown DNA/protein sequences, 3D protein structure prediction, and protein function prediction. The extra information obtained from bioinformatic analysis of unknown data can help researchers design better or more precise experiments. In bioinformatics, we use existing biological databanks to help analyze raw data from various experiments. Biological databases therefore play a vital role in bioinformatics.
2.2 Sequence Databases
Large numbers of DNA, RNA, and protein sequences have been determined in the past decades. Some institutional sequence databases have been set up to harbor these sequences as well as a wealth of associated data. The rate at which new sequences are being added to these databases is exponential. Computational techniques have been developed to allow fast search on these databases.
Maintained by the National Center for Biotechnology Information (NCBI), USA, GenBank contains hundreds of thousands of DNA sequences. It is divided into several sections with sequences grouped according to species, including:
• PLN: Plant sequences
• PRI: Primate sequences
• ROD: Rodent sequences
• MAM: Other mammalian sequences
• VRT: Other vertebrate sequences
• INV: Invertebrate sequences
• BCT: Bacterial sequences
• PHG: Phage sequences
• VRL: Other viral sequences
• SYN: Synthetic sequences
• UNA: Unannotated sequences
• PAT: Patent sequences
• NEW: New sequences
Searches can be made by keywords or by sequence. The entries are just plain text. One important field is the accession number, a code that is unique to the entry and can be used for faster access to it. In the example, the accession number is M12174.
This database is part of an international collaboration effort, which also includes the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). GenBank can be reached through the following locator:
The European Molecular Biology Laboratory is an institution that maintains several sequence repositories, including a DNA database called the Nucleotide Sequence Database. Its organization is similar to that of GenBank, with the entries having roughly the same fields.
EMBL can be reached through the following locator:
The Protein Identification Resource (PIR) is a database of protein sequences cooperatively maintained and distributed by three institutions: the National Biomedical Research Foundation (in the USA), the Martinsried Institute for Protein Sequences (in Europe), and the Japan International Protein Information Database (in Japan). Several web sites provide interfaces to this database, including the following:
The Protein Data Bank is a repository of three-dimensional structures of proteins. For each protein represented, a general information header is provided followed by a list of all atoms present in the structure, with three spatial coordinates for each atom that indicate their position with three decimal places. This repository is maintained in Brookhaven, USA. Access to it can be gained through
XML, or Extensible Markup Language, is a technical specification originally created for data representation and exchange over the Internet. XML is an open standard, officially specified by the World Wide Web Consortium (W3C), and deliberately designed to be operating system and programming language independent. XML is a technology specification that enables you to create highly structured documents. The ML in XML stands for Markup Language. A markup language is any language that takes raw text and adds annotation. XML focuses on document semantics. This means that you can identify specific document parts and assign them specific meaning. For example, if you are representing biological sequence data, you can clearly identify which portion of the document contains sequence identifiers and cross-references to public databases, and which portion contains raw sequence data. These sections are clearly marked and organized in a hierarchical document structure. A human reader or a computer program can therefore easily traverse a complete document and extract individual pieces of data. Using XML, if you want to determine the accession number, simply extract the <accession> element. To determine the organism, extract the <organism> element. XML therefore makes it trivially easy for both humans and computers to identify and extract pieces of data.
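The extraction step described above can be sketched in a few lines of Python using the standard library's ElementTree module. This is only an illustration: the `<accession>` and `<organism>` element names come from the text, while the `<sequence_record>` wrapper, the `<seq_data>` element, and the organism value are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical record. Only <accession> and <organism> are named in the text;
# the wrapper element, <seq_data>, and the organism value are assumptions.
RECORD = """<sequence_record>
  <accession>AY064249</accession>
  <organism>example organism</organism>
  <seq_data>gcaggcgcagtgtgagcggcaacatggcgtccaggtc</seq_data>
</sequence_record>"""


def extract_fields(xml_text):
    """Return the accession number and organism from one record."""
    root = ET.fromstring(xml_text)
    return {
        "accession": root.findtext("accession"),
        "organism": root.findtext("organism"),
    }


fields = extract_fields(RECORD)
print(fields["accession"], "/", fields["organism"])
```

Because each piece of data sits in its own named element, the program never has to guess at column positions or line prefixes.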
3.1 Fundamentals of XML
In its most basic form, an XML document consists of a set of elements. An element represents a discrete unit of data, such as a product listing, news headline, or biological sequence.
An XML element is formally defined with a start tag and a corresponding end tag. Start tags always take the form: <ELEMENT_NAME>, whereas end tags always take the form: </ELEMENT_NAME>.
For example, the Seq-data element is defined with a start <Seq-data> tag and an end </Seq-data> tag.
The complete element therefore looks like this:
<Seq-data>gcaggcgcagtgtgagcggcaacatggcgtccaggtc</Seq-data>
XML requires that every start tag must have a matching end tag. This is true even for empty XML elements. An empty element is one that does not contain any textual data or subelements, but may contain attributes. For example, the following empty element includes a cross-reference to the EMBL database:
<cross-reference database="EMBL" id="M29855"></cross-reference>
XML has specific rules on naming XML elements. Specifically, element names must begin with a letter, an underscore character ("_"), or a colon character (":"). Names can then continue with letters, digits, hyphens, underscores, or colons. Names cannot begin with the letters "xml" or any case combination of "xml," as these are specifically reserved for use by the specification.
XML is also case sensitive, so <Sequence> and <sequence> are different elements.
Every XML document must contain exactly one root element. This root element
represents the entry point for traversing the entire element hierarchy.
Attributes are used to provide additional information about a specific element. You can specify as many attributes for an element as you need, and they need not be placed in any specific order. Attributes are always placed within the start tag and never within the end tag.
XML requires that attribute values appear within quotes. You can use either single quotes (') or double quotes (").
For example, the following excerpt specifies an id attribute:
<Sequence id="AY064249"> </Sequence>
XML documents should (but are not actually required to) begin with an XML prolog. The XML prolog includes an XML declaration and an optional document type declaration, which references a Document Type Definition (DTD). The XML declaration specifies the XML version number and optional character encoding information. The declaration must begin with the characters <?xml and end with the characters ?>.
The following XML prolog specifies XML version 1.0:
<?xml version="1.0"?>
XML comments begin with the characters <!-- and end with the characters -->. Here is an example comment:
<!-- SARS coronavirus Urbani, complete genome. -->
Comments can span multiple lines, if needed.
Processing instructions are special XML directives, used to forward information to a software application. Processing instructions must begin with the characters <? and must end with the characters ?>. Within these tags, a processing instruction consists of two parts:
• The first part is the software target. This indicates the target of the directive, usually specifying a specific software application or a specific type of application.
• The second part is a list of one or more processing instructions. This can be any arbitrary text, but usually takes the form of name/value pairs, called pseudo-attributes.
Processing instructions are frequently used in XSL Transformations (XSLT). With XSLT, you can transform an XML document into another XML format or to an HTML document.
Here is an example XSLT processing instruction:
<?xml-stylesheet type="text/xsl" href="bsml_to_html.xsl"?>
In the line above, we have specified a target value of "xml-stylesheet." We have
also specified two name/value pairs. The first specifies the MIME type of the
transformation document, and the second specifies the name of the specific XSLT template
to use. In this case, we are using the BSML to HTML XSLT style sheet.
Character Encoding
The XML declaration can include optional information about character encoding. The XML specification requires that all XML parsers support Unicode. XML parsers are required to support two specific encodings of Unicode/ISO-10646: UTF-16 and UTF-8. UTF-16 encodes Unicode characters using 16-bit units. For text documents which primarily consist of ASCII characters, UTF-16 can result in inefficient storage and unnecessarily large documents. For these documents, it is more efficient to use the UTF-8 encoding scheme, which stores each ASCII character in a single byte. The following declaration specifies UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
Besides UTF-8 and UTF-16, you can also specify other character encodings, such as one of the ISO-8859 family of character encodings. This includes Latin 1 (ISO 8859-1), which contains characters for English and most Western European languages; Latin 2 (ISO 8859-2), which contains characters for most Eastern European languages; or Cyrillic (ISO 8859-5), which contains characters for Russian and Russian-influenced languages, such as Bulgarian and Macedonian.
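The storage difference between UTF-8 and UTF-16 for ASCII-heavy documents can be checked directly. The following minimal Python sketch encodes the sample sequence from earlier in this section both ways and compares the byte counts. The choice of the "utf-16-le" codec (little-endian, without a byte order mark) is an assumption made so that the comparison counts only the characters themselves.

```python
# Compare the storage cost of the same ASCII-only text in two encodings.
sequence = "gcaggcgcagtgtgagcggcaacatggcgtccaggtc"

utf8_bytes = sequence.encode("utf-8")
# "utf-16-le" writes two bytes per ASCII character and no byte order mark.
utf16_bytes = sequence.encode("utf-16-le")

print("characters:", len(sequence))
print("UTF-8 bytes:", len(utf8_bytes))
print("UTF-16 bytes:", len(utf16_bytes))
```

For sequence data, which consists almost entirely of ASCII letters, the UTF-16 form is twice the size of the UTF-8 form.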
Occasionally, you may want to escape an entire section of text. Text that is stored within a CDATA section is preserved exactly as it is. Reserved characters, such as the less than sign (< ), which would normally be interpreted as markup characters, are no longer interpreted as such. This can be useful if you want to include sample XML or HTML markup examples within your XML document.
CDATA sections must begin with the characters <![CDATA[ and must end with the characters ]]>.
Here is a sample CDATA section from an XML document:
<![CDATA[Start tags always take the form: <ELEMENT_NAME>.]]>
Without the CDATA section, this XML document would result in an error. Specifically, an XML parser would complain that the <ELEMENT_NAME> element was missing a corresponding end tag. However, because this text is actually contained within a CDATA section, the parser knows to ignore all the markup characters and preserve the text as it is.
Creating Well-Formed XML Documents
To be well-formed, an XML document must meet the following requirements:
• Every start tag must have a corresponding end tag. The only exception to this rule is the empty element tag syntax, e.g., <ELEMENT_NAME/>.
• Elements must be properly nested. In other words, a subelement must have its start and end tags defined within the scope of the parent element. For example, this example is nested properly and therefore well-formed:
• All attribute values must appear within quotes.
• Every XML document must have exactly one root element.
• Reserved characters, such as the less than sign, are always treated as markup. If they appear on their own, they must be specified with character escape sequences, or placed within a CDATA section.
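The requirements listed above can be exercised with any XML parser, since a parser must reject a document that is not well-formed. The following is a minimal sketch using Python's standard ElementTree module, which is a nonvalidating parser. The test documents, including the `<note>`, `<a>`, and `<b>` element names, are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative documents, one per rule (element names are hypothetical).
samples = {
    "good": '<Sequence id="AY064249"><Seq-data>gcaggc</Seq-data></Sequence>',
    "missing_end_tag": '<Sequence id="AY064249"><Seq-data>gcaggc</Sequence>',
    "bad_nesting": "<a><b></a></b>",
    "unquoted_attribute": "<Sequence id=AY064249></Sequence>",
    "two_roots": "<a/><b/>",
    "cdata": "<note><![CDATA[Start tags take the form: <ELEMENT_NAME>.]]></note>",
    "no_cdata": "<note>Start tags take the form: <ELEMENT_NAME>.</note>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the "good" and "cdata" samples should be accepted. The "no_cdata" sample fails for the reason given earlier: the parser treats <ELEMENT_NAME> as a start tag and then finds no matching end tag.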
There are many XML editors and command line tools that can test your XML documents for well-formedness. Web browsers such as Internet Explorer 6.0 or Mozilla 1.4 and later also include many built-in XML features and a built-in XML parser. When these browsers load an XML document, the internal XML parser automatically checks for well-formedness and reports any errors.
Creating Valid XML Documents
An XML grammar defines rules for creating XML documents. For example, the BSML specification is actually a grammar, and this grammar spells out specific rules for creating BSML documents. We can therefore build applications that are specifically designed to consume specific types of XML documents. Furthermore, if multiple documents adhere to the same grammar, we can process all these documents using the same software application. We can even create new software applications that aggregate data from multiple disparate sources.
There are two main types of XML grammars: Document Type Definitions (DTDs) and XML Schemas.
A document that adheres to all the critical rules in XML is said to be well-formed. A document that adheres to all the rules of a specific grammar is said to be valid. Consider this sample document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
<Sequence id="AY064249" length="1245" molecule="rna">
The end </Seq-data> tag is missing. The document is therefore not well-formed.
To fix this problem, we simply add the end tag, like this:
Now, consider this sample document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
"http://www.labbook.com/dtd/bsml3_1.dtd">
<Dna id="AY064249" length="1245">
All the start tags have matching end tags, everything is nested properly, and all attribute values appear in quotes. It is therefore well-formed. However, you can now see that the Sequences element contains a Dna element. BSML grammar does not actually specify a Dna element. Therefore, this document does not follow all the rules of the BSML grammar and is considered invalid.
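The difference between well-formed and valid can be made concrete with a small sketch. Python's standard library does not include a validating parser, so the code below stands in for one with a hand-written rule table. That table is an assumption made for illustration: it encodes only one rule, that a Sequences element may contain Sequence or Sequence-import elements, whereas a real validating parser would read the complete rule set from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a grammar: parent element -> allowed children.
# A real validating parser would derive these rules from the BSML DTD.
ALLOWED_CHILDREN = {"Sequences": {"Sequence", "Sequence-import"}}


def find_violations(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no rule recorded for this element
        for child in parent:
            if child.tag not in allowed:
                problems.append(
                    f"<{child.tag}> is not allowed inside <{parent.tag}>"
                )
    return problems


valid_doc = (
    "<Bsml><Definitions><Sequences>"
    '<Sequence id="AY064249" length="1245"/>'
    "</Sequences></Definitions></Bsml>"
)
invalid_doc = (
    "<Bsml><Definitions><Sequences>"
    '<Dna id="AY064249" length="1245"/>'
    "</Sequences></Definitions></Bsml>"
)

print(find_violations(valid_doc))
print(find_violations(invalid_doc))
```

Both documents parse without error, so both are well-formed. Only the second breaks a grammar rule.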
An XML parser (or XML processor) is responsible for parsing an XML document and making its contents available to a calling application. Specific responsibilities include: retrieving XML documents from a local file system or from a network connection, checking to make sure that the document is well-formed, and making the contents of the document available via a standard Application Programming Interface (API).
A typical XML application consists of three distinct layers. Working from right to left, the first layer is an XML document or a set of XML documents. These documents contain useful information, which you want to extract; for example, the documents may contain useful BSML data that you want to analyze further. The second layer is the XML parser. The parser consumes XML documents and makes the content available to the third layer, which is your software application. The XML parser takes care of all XML specific details and enables your application to more easily focus on content and programming logic.
Fig. 3.1 A typical XML application consists of three distinct layers
XML parsers are broadly divided into two types:
• validating parser: this parser is capable of validating a document against an XML grammar, such as a DTD or an XML Schema.
• nonvalidating parser: this parser is not capable of validating a document against an XML grammar.
As a general rule of thumb, nonvalidating parsers tend to be faster and take up less memory. However, validating parsers tend to be more useful, as you can use them to validate documents, and you don't need to include any validation code within your software application.
Fundamentals of XML Namespaces
XML Namespaces were not defined in the original W3C XML 1.0 specification. However, the namespace specification was finalized soon after, and namespaces are now considered a crucial element in the XML family of protocols. They are also a critical building block for other XML specifications, including XML Schemas, XSL Transformations, SOAP, and the Web Service Description Language (WSDL).
XML Namespaces are designed to address two very specific issues.
First, namespaces prevent name conflicts. If your XML document references a single DTD or XML Schema, this is never an issue. However, if your document references two or more XML grammars, you have the potential for name conflicts. For example, two DTDs might define a Sequence element. XML Namespaces lets you attach a namespace to each Sequence element, and therefore uniquely identify each element. Second, namespaces enable you to mark certain elements for processing by a specific software application. From the software module perspective, an XML document consists of actionable elements and nonactionable elements. By filtering for elements from a specific namespace, the software module can determine which elements are actionable and take the appropriate action.
In practice, name conflicts do not actually occur that often. For example, if your document references two grammars, the chance that they both define the same element is small. This is not to minimize name conflicts. It is just to point out that the second scenario of the software module perspective is more common. Hence, let's dig a little deeper into this scenario.
First, consider the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<stylesheet version="1.0">
<template match="/">
<html>
<body>
<h1>BSML Sequence Data:</h1>
<value-of select="Bsml/Definitions/Sequences/Sequence"/>
</body>
</html>
</template>
</stylesheet>
This is an example XSLT document. The document consists of two sets of elements. The first set consists of XSLT specific instructions. For example, stylesheet, template, and value-of are all XSLT instructions. The second set consists of HTML elements. For example, html, body, and h1 are all HTML elements. An XSLT application will consume this document and apply the XSLT transformations. In this specific example, the style sheet is responsible for transforming BSML documents into HTML.
Now, consider this XML document:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h1>BSML Sequence Data:</h1>
<xsl:value-of select="Bsml/Definitions/Sequences/Sequence"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
This document now contains an XML namespace declaration for XSLT. Furthermore, all XSLT elements now have an xsl prefix. For example, the template element is now defined as xsl:template. From the software module perspective, it is now a trivial task to determine which elements are XSLT instructions and which are not. It can therefore more easily carry out the XSL transformation.
To use an XML namespace, you must first declare it. XML namespace declarations can occur within any XML element, but in practice, most developers place them at the top of their document, usually within the root XML element. A namespace declaration is scoped to the element wherein the declaration occurs and all its subelements. A namespace declaration is a special XML attribute consisting of three parts. The first part is the reserved prefix xmlns. The second part is a namespace prefix of your choosing, and the third part is a Uniform Resource Identifier (URI). For example, the following element declares a namespace for XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
The namespace prefix serves as a shortcut to the namespace declaration. You can use whatever namespace prefix you like. However, there are a few common conventions. For instance, the XSLT prefix is usually specified as xsl and the XML Schema prefix is usually specified as xs or xsd. The URI value serves as a unique identifier and enables you or a software module to unambiguously partition elements into discrete namespaces. Values are most often represented as absolute URLs, e.g., http://www.w3.org/1999/XSL/Transform. If you are creating your own namespace, you should have control over the referenced host or URL.
Having declared a namespace, you later reference the namespace via a Qualified Name. A Qualified Name consists of two parts: a namespace prefix and a local element name. The two parts are delimited with a colon character. For example, the following start tag now includes a Qualified Name:
<xsl:template match="/">
In plain English, this start tag now references the template element in the xsl namespace. We already know that every start tag must have a matching end tag. In this case, the end tag must also include a Qualified Name:
</xsl:template>
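To illustrate how a software module can partition elements by namespace, here is a minimal Python sketch using the standard ElementTree module. ElementTree resolves each prefix to its URI and reports element names in the form {URI}local-name, so the check below compares against the URI rather than the xsl prefix. The function name split_by_namespace is made up for this example.

```python
import xml.etree.ElementTree as ET

XSLT_NS = "http://www.w3.org/1999/XSL/Transform"

STYLESHEET = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h1>BSML Sequence Data:</h1>
        <xsl:value-of select="Bsml/Definitions/Sequences/Sequence"/>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>"""


def split_by_namespace(xml_bytes):
    """Separate XSLT instructions from all other elements, in document order."""
    root = ET.fromstring(xml_bytes)
    marker = "{" + XSLT_NS + "}"  # ElementTree's notation for a qualified tag
    xslt, other = [], []
    for elem in root.iter():
        if elem.tag.startswith(marker):
            xslt.append(elem.tag[len(marker):])
        else:
            other.append(elem.tag)
    return xslt, other


xslt_elements, other_elements = split_by_namespace(STYLESHEET.encode("utf-8"))
print("XSLT instructions:", xslt_elements)
print("Other elements:", other_elements)
```

Because the comparison uses the URI, the same code would work unchanged if the document author had chosen a different prefix for the XSLT namespace.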
The complete XSLT example above should now make a lot more sense. It is repeated below:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h1>BSML Sequence Data:</h1>
<xsl:value-of select="Bsml/Definitions/Sequences/Sequence"/>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
3.2 XML and Bioinformatics
XML is currently used to encode a wide range of biological data and has rapidly become a critical tool in bioinformatics. In fact, one recent paper on XML in bioinformatics predicted that XML will soon become "ubiquitous in the bioinformatics community". The real strength of XML is that it enables communities to create common XML formats and then use these formats to share data. XML therefore enables individual researchers, software applications, and database systems to exchange and share biological data. In the end, this enables biologists to more easily access relevant data, aggregate data from multiple sources, and mine this data for important scientific clues.
At the 2002 O'Reilly Open Bioinformatics Conference, Lincoln Stein of the Cold Spring Harbor Laboratory delivered a keynote speech, entitled "Creating a Nation." Stein's presentation and subsequent paper in Nature describe a vision and a blueprint for creating common data formats, supporting open source software projects, and building interoperable web services for exchanging biological data. By historical analogy, Stein likened the current state of bioinformatics to the city states of Medieval Italy:
During the Middle Ages and early Renaissance, Italy was fragmented into dozens of rival city-states controlled by such legendary families as the Estes, Viscontis and Medicis. Though picturesque, this political fragmentation was ultimately damaging to science and commerce because of the lack of standardization in everything from weights and measures to the tax code to the currency to the very dialects people spoke.
In the same vein, Stein argued that bioinformatics is currently dominated by rival groups, rival data formats, and incompatible web sites, and that the lack of clear standards and interoperable software is a "significant hindrance to researchers wishing to exploit the wealth of genome data to its fullest."
According to one bioinformatics expert quoted in the Nature feature, "To answer most interesting biological problems, you need to combine data from many data sources. However, creating seamless access to multiple data sources is extremely difficult". Echoing Stein's sentiments exactly, the researcher concludes that "The key to bioinformatics is integration, integration, integration".
Academic research labs are not the only ones interested in creating interoperable bioinformatics software. The Interoperable Informatics Infrastructure Consortium (I3C) is a consortium of computer companies, biotech companies, and academic research labs devoted to supporting interoperable data and software "to accelerate discovery and solve critical problems in drug development."
Some XML formats currently in use in bioinformatics:
• AGAVE: Architecture for Genomic Annotation, Visualization and Exchange
• BioML: BIOpolymer Markup Language
• BioPAX: Biological Pathways Exchange
• BSML: Bioinformatic Sequence Markup Language
Fig 3.2 The key to bioinformatics is integration of biological data
BSML is an open XML standard for representing and exchanging biological sequences and sequence annotation data. This data can include raw sequence data, sequence features, literature references, networks of biological entities, multiple sets of tabular data, display widgets (used to store visual representations of sequences), and resource information. BSML was originally created by Visual Genomics and was first funded in 1997 by the National Human Genome Research Institute (NHGRI). The goal of the initial NHGRI grant was to develop a standard for representing sequence data in XML. Joseph H. Spitzner, Ph.D., was the primary author of BSML at Visual Genomics. BSML is currently available as a Document Type Definition (DTD), and its data model is at least partially based on preexisting data formats, including the GenBank ASN.1 file format. A number of other organizations have also released BSML conversion programs. For example, the Cold Spring Harbor Laboratory has released a utility for converting GenBank ASN.1 sequence data to BSML. The EBI has also released a utility for converting European Molecular Biology Laboratory (EMBL) documents to BSML.
4.1 BSML Document Structure
The first property is that every BSML document must begin with an XML prolog and must include a reference to the BSML DTD. Every BSML document will therefore begin like this:
<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
Note that we are referencing the BSML 3.1 DTD, available on the Rescentris.com web site.
Second, every BSML document must begin with a root Bsml element. Following the root element, BSML is divided into three main sections.
• Definitions: this section stores biological sequences and sequence annotations. The section can also include tables of associated data and network graphs.
• Research: this section stores information about experimental research, such as experimental conditions, program queries, or search parameters.
• Display: this section stores display widgets and references external image files. This section is primarily used by software applications that are capable of visually rendering sequence data.
4.2 Representing Sequences
Sequence elements are used to store raw sequence data and can also be used to store annotations about the raw sequence. The first important detail to note is that all sequence data must appear within the BSML Definitions section. This section contains a Sequences element, which can contain any number of Sequence-import or Sequence elements. Sequence-import elements are used to reference sequence data stored within other BSML files. For example, consider the following document fragment:
<Sequence-import source="sars_sequence1.bsml" id="AY278741"/>
This document references sequence id=AY278741 in the sars_sequence1.bsml file. In contrast to the Sequence-import element, the Sequence element is used to define sequence data within a BSML document. The actual raw sequence data can be stored within the BSML document itself or within an external text or binary file. To represent raw sequence data, use the Seq-data element. To import data from an external text or binary file, use the Seq-data-import element. When importing data, you must specify a source attribute specifying the location of the file, and a format attribute specifying the text/binary format.
For example, consider the following document excerpts:
<Sequence id="AY278741" length="29727">
<Seq-data-import format="IUPACaa" source="sars_sequence.txt" />
</Sequence>

<Sequence id="AY278741" length="29727">
<Seq-data>
1 atattaggtt tttacctacc caggaaaagc caaccaacct cgatctcttg tagatctgtt
61 ctctaaacga actttaaaat ctgtgtagct gtcgctcggc tgcatgccta gtgcacctac
121 gcagtataaa caataataaa ttttactgtc gttgacaaga aacgagtaac tcgtccctct
[For brevity, sequence is truncated.]
</Seq-data>
</Sequence>
Fig. 4.1 Element hierarchy of a BSML document
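As a rough illustration of how an application might consume an inline Seq-data element, the following Python sketch reads a Sequence fragment and strips the position numbers and whitespace from the raw text. The fragment reuses two lines of the excerpt above. The helper name read_sequence and the decision to keep only alphabetic characters are assumptions made for this example, not part of BSML itself.

```python
import xml.etree.ElementTree as ET

# Two lines of the excerpt above, wrapped so the fragment is well-formed.
FRAGMENT = """<Sequence id="AY278741" length="29727">
<Seq-data>
1 atattaggtt tttacctacc caggaaaagc caaccaacct cgatctcttg tagatctgtt
61 ctctaaacga actttaaaat ctgtgtagct gtcgctcggc tgcatgccta gtgcacctac
</Seq-data>
</Sequence>"""


def read_sequence(xml_text):
    """Return (id, declared length, residues) for one Sequence element."""
    seq = ET.fromstring(xml_text)
    raw = seq.findtext("Seq-data", default="")
    # Keep letters only: drops position numbers, spaces, and line breaks.
    residues = "".join(ch for ch in raw if ch.isalpha())
    return seq.get("id"), int(seq.get("length")), residues


seq_id, declared_length, residues = read_sequence(FRAGMENT)
print(seq_id, declared_length, len(residues))
print(residues[:30])
```

The declared length describes the full genome, so it will not match the residue count of a truncated excerpt like this one.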
Main attributes of the Sequence element
4.3 Representing Sequence Features
A sequence feature is any piece of annotation that provides additional details regarding a specific location or range of sequence data. In BSML, each sequence can contain any number of features. Features are formally nested within a Feature-tables element and individual features are defined within a Feature element. Two types of features are supported: positional and nonpositional. Positional features are tied to specific sequence locations and can be used to represent a host of sequence annotations, including protein-coding regions, locations of predicted genes, single nucleotide polymorphisms (SNPs), etc. Nonpositional features are not tied to any specific region of sequence, but are instead associated with the sequence record as a whole. For example, you can attach literature references that are associated with the entire sequence record.
Nonpositional features are slightly less complex than positional features. This new example adds a single nonpositional feature detailing the direct submission to GenBank. More specifically, it lists the primary contributors of the work and their affiliation with the Centers for Disease Control and Prevention. The Reference element contains a list of authors, a title, and the complete journal reference. For references to published material, you can include cross-reference identifiers to MEDLINE and PubMed.
The record now includes a single nonpositional feature, describing the direct
submission to GenBank.
<?xml version="1.0"?>
<!DOCTYPE Bsml PUBLIC "-//Labbook, Inc. BSML DTD//EN"
<Sequence id="AY278741" title="AY278741" molecule="rna" length="29727" db-
<Attribute name="definition" content="SARS coronavirus Urbani, complete genome."/>
<Attribute name="submission-date" content="21-APR-2003"/>
<Attribute name="version" content="AY278741.1 GI:30027617"/>
<Attribute name="source" content="SARS coronavirus Urbani"/>
<Reference id="REF1" title="Direct Submission">
<RefAuthors>Bellini,W.J., Campagnoli,R.P., Icenogle,J.P., Monroe,S.S., Nix,W.A.,
Oberste,M.S., Pallansch,M.A. and Rota,P.A. </RefAuthors>
<RefJournal>Submitted (17-APR-2003) Division of Viral and Rickettsial Diseases,
Centers for Disease Control and Prevention, 1600 Clifton RD, NE, Atlanta, GA 30333,
</Reference> </Feature-table> </Feature-tables>
[For brevity, sequence is truncated.]
</Seq-data> </Sequence> </Sequences> </Definitions> </Bsml>
DTDs for Bioinformatics
Document Type Definitions (DTDs) describe XML document structures. They contain specific rules, which constrain the content of XML documents. These rules are extremely specific: for example, a PROTEIN element must contain an ORGANISM element, and the ORGANISM element must include a taxonomy_id attribute. Within the world of XML, a set of these constraint rules is formally known as a grammar. XML grammars specify rules for constructing documents. Grammars can be written in DTDs or in XML Schemas. DTDs (and grammars in general) are important for several reasons. First, DTDs define specific constraints on XML documents and provide an easy method to verify that all the constraints are followed. A document which purports to follow a specific DTD is formally known as an XML instance document. An instance document which follows all the grammar rules is said to be valid. Validity checking can be performed by a validating XML parser, freeing you from the task of writing your own validation code.
The second reason DTDs are important is that they can easily be shared within communities or industries. With a common grammar, people, research labs, and software applications can more easily exchange and distribute data. BSML is one of the DTDs developed for bioinformatics.
Sample DTD for representing protein data: protein.dtd
<!-- Sample DTD for representing protein data-->
<!-- A PROTEIN_SET can have one or more PROTEIN elements -->
<!ELEMENT PROTEIN_SET (PROTEIN+)>
<!-- Main PROTEIN Element -->
GENE_NAME+,ORGANISM, COMMENT*, KEYWORD*)>
<!-- Sub Elements containing PCDATA -->
<!ELEMENT ACCESSION (#PCDATA)>
<!ELEMENT ENTRY_NAME (#PCDATA)>
<!ELEMENT PROTEIN_NAME (#PCDATA)>
<!ELEMENT GENE_NAME (#PCDATA)>
<!ELEMENT COMMENT (#PCDATA)>
<!ELEMENT KEYWORD (#PCDATA)>
<!-- ORGANISM for referencing NCBI Taxonomy ID -->
<!ELEMENT ORGANISM (#PCDATA)>
<!ATTLIST ORGANISM
taxonomy_id NMTOKEN #REQUIRED>
• Elements are declared with the prefix <!ELEMENT. For example, the third line declares the PROTEIN_SET element. Each element is declared with a specific content model, which defines the set of valid content that can exist within the specified element. For example, the PROTEIN_SET element can only contain PROTEIN elements.
• Attributes are declared with the prefix <!ATTLIST. For example, the final three lines specify that all ORGANISM elements must include a taxonomy_id attribute.
• A number of elements are defined to contain #PCDATA.
A sample instance document that adheres to the protein DTD
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE PROTEIN_SET SYSTEM "protein.dtd">
<PROTEIN_NAME>Interleukin-3 receptor class II beta chain</PROTEIN_NAME>
<ORGANISM taxonomy_id="10090">Mus musculus</ORGANISM>
<COMMENT>FUNCTION: IN MOUSE THERE ARE TWO CLASSES
OF HIGH-AFFINITY IL-3 RECEPTORS. ONE CONTAINS THIS IL-3-SPECIFIC
BETA CHAIN AND THE OTHER CONTAINS THE BETA
CHAIN ALSO SHARED BY HIGH-AFFINITY IL-5 AND GM-CSF
<COMMENT>SUBUNIT: Heterodimer of an alpha and a beta chain.
<COMMENT>SUBCELLULAR LOCATION: Type I membrane protein.
<COMMENT>SIMILARITY: BELONGS TO THE CYTOKINE FAMILY
OF RECEPTORS.
</PROTEIN_SET>
XML instance documents reference DTDs via a Document Type Declaration. The declaration is part of the XML prolog and must be specified before the root XML element. Instance documents can include internal DTDs, reference external DTDs, or both. For example, the following document includes an internal DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dna [<!ELEMENT dna (#PCDATA)>]>
The <!DOCTYPE prefix indicates the Document Type Declaration, and "dna" specifies the name of the root element. All the actual DTD rules are specified between the opening and closing square brackets. Following the Document Type Declaration, the document continues with instance data, in this case beginning with the root dna element. By definition, internal DTDs are tied to specific instance documents and cannot be shared among multiple documents. Internal DTDs are therefore most useful during the initial stages of DTD development, where you may want to keep DTD rules together with a sample instance document.
For example, you can move the DTD above into a separate file, named dna.dtd:
<!-- External DTD -->
<!ELEMENT dna (#PCDATA)>
You can then create an external reference within your instance document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dna SYSTEM "dna.dtd">
The Document Type Declaration now includes a SYSTEM keyword, followed by a
Uniform Resource Identifier (URI). In this case, we specify a relative file location, but you
could just as easily specify an absolute URL to a specific web location.
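As a rough illustration of how software sees the Document Type Declaration, the following Python sketch uses the standard library's expat bindings to report the root element name, the system identifier, and whether an internal DTD is present. The function name read_doctype and the placeholder sequence "ACGT" are assumptions made for this example; note that the parser only reads the declaration and does not validate the document.

```python
from xml.parsers import expat


def read_doctype(xml_text):
    """Return the parts of the Document Type Declaration found in xml_text."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # expat passes the root element name, the SYSTEM and PUBLIC
        # identifiers (None when absent), and a flag for an internal DTD.
        found["root"] = name
        found["system_id"] = system_id
        found["public_id"] = public_id
        found["internal_subset"] = bool(has_internal_subset)

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


# Internal DTD: rules sit between the square brackets.
internal = read_doctype('<!DOCTYPE dna [<!ELEMENT dna (#PCDATA)>]><dna>ACGT</dna>')
# External DTD: only a reference to dna.dtd is present (the file is not fetched).
external = read_doctype('<!DOCTYPE dna SYSTEM "dna.dtd"><dna>ACGT</dna>')
print(internal)
print(external)
```

In the first call the system identifier is absent and the internal-subset flag is set; in the second call the situation is reversed.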
The SYSTEM keyword generally designates DTDs that are used locally within a
specific application or organization. To reference a DTD that is publicly available, use the
PUBLIC keyword. For example, the following document references the NCBI TinySeq DTD:
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN"
<!-- Content Continues... --> </TSeq>
When referencing a public DTD, you must specify the PUBLIC keyword, followed by a public identifier, followed by a URL. Although not required by the XML specification, most public identifiers are specified as Formal Public Identifiers (FPIs), as defined by the International Organization for Standardization (ISO). FPIs have a peculiar syntax, where the / and // characters serve as token delimiters. The first token is either + or -: a + indicates that the organization is formally registered with ISO, and a - indicates that it is not (as the example above shows, NCBI is not registered with ISO). The second token indicates the owner of the DTD. The third token indicates the name of the DTD, and the fourth token indicates the natural language of the DTD, in this case English.
In theory, an XML parser could extract the public identifier and look up the DTD in a DTD catalog. However, the XML specification requires that a public identifier always be accompanied by a system identifier, typically a URL. In practice, therefore, most XML parsers simply ignore the public identifier and retrieve the DTD from the URL.
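The four-token reading of a Formal Public Identifier can be sketched in a few lines of Python. This is only an illustration of the description above, not a full FPI parser: the function name parse_fpi is an assumption, and identifiers that do not split into exactly four tokens are rejected rather than interpreted.

```python
import re


def parse_fpi(public_id):
    """Split a Formal Public Identifier into the four tokens described above."""
    # Both "/" and "//" act as delimiters, so split on any run of slashes.
    tokens = [token.strip() for token in re.split(r"/+", public_id)]
    if len(tokens) != 4:
        raise ValueError(f"expected 4 tokens, found {len(tokens)}: {public_id!r}")
    registered, owner, name, language = tokens
    return {
        "registered": registered == "+",  # "+" means registered with ISO
        "owner": owner,
        "name": name,
        "language": language,
    }


fpi = parse_fpi("-//NCBI//NCBI TSeq/EN")
print(fpi)
```

For the NCBI TinySeq identifier, the owner is NCBI, the DTD name is NCBI TSeq, the language is EN, and the registered flag is false.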
Every time you declare a new element, it can be defined with one of five options:
• EMPTY: the element cannot contain any text or any child elements.
• ANY: the element can contain any text or any child element(s).
• #PCDATA: the element can contain regular text.
• Child Elements: the element can contain a specific set of child elements.
• MIXED: the element can contain text data interspersed with child elements.
EMPTY
The EMPTY keyword is used to indicate that the declared element cannot contain any text
or any child elements.
<!ELEMENT DB_REFERENCE EMPTY>
The ANY keyword is used to indicate that the declared element can contain any text or any
defined child element:
<!ELEMENT BODY ANY>
<!ELEMENT H1 (#PCDATA)>
<!ELEMENT H2 (#PCDATA)>
<!ELEMENT B (#PCDATA)>
<!ELEMENT I (#PCDATA)>
The BODY element uses the ANY keyword and can therefore contain any of the defined
elements, including H1, H2, B, and I.
The #PCDATA keyword is used to indicate that the declared element can contain text. PCDATA stands for Parsed Character Data. The XML parser will analyze PCDATA text and replace all entities with their defined text substitution strings. For example, the declaration below defines a SEQUENCE element:
<!ELEMENT SEQUENCE (#PCDATA)>
Below is a sample document fragment:
Elements can be declared to contain other elements, thereby creating hierarchical content.
For example, consider the following declaration:
<!ELEMENT PROTEIN (ACCESSION, NAME, DESCRIPTION)>
This defines a protein element, which must contain three subelements: an accession
number, name, and description. Each subelement is separated by a comma, and instance
documents must follow the exact same order of elements.
When declaring child elements, each element can be appended with an occurrence
operator. This operator determines the number of times the element may appear and is
based on regular expression syntax.
Operator      Meaning
No operator   Exactly one instance of the element is required
?             Zero or one instance of the element may appear
*             Zero or more instances of the element may appear
+             One or more instances of the element may appear
<!-- Sample DTD for representing protein data-->
<!-- A PROTEIN_SET can have one or more PROTEIN elements -->
<!ELEMENT PROTEIN_SET (PROTEIN+)>
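Because the occurrence operators follow regular expression syntax, a flat content model can be translated almost directly into a regular expression. The Python sketch below does this for simple comma-separated sequences only; it does not handle nested groups or the | choice operator, and the function names are assumptions made for this illustration rather than part of any XML library.

```python
import re


def content_model_to_regex(model):
    """Translate a flat DTD sequence such as 'A+, B?, C*' into a regex."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        # Separate the element name from an optional ?, * or + operator.
        if item[-1] in "?*+":
            name, operator = item[:-1], item[-1]
        else:
            name, operator = item, ""
        # Each child name is followed by one space so names cannot run together.
        parts.append(f"(?:{re.escape(name)} ){operator}")
    return re.compile("".join(parts))


def children_match(model, child_names):
    """Check whether a list of child element names satisfies the content model."""
    pattern = content_model_to_regex(model)
    return pattern.fullmatch("".join(name + " " for name in child_names)) is not None


print(children_match("ACCESSION, NAME, DESCRIPTION", ["ACCESSION", "NAME", "DESCRIPTION"]))
print(children_match("ACCESSION, NAME, DESCRIPTION", ["NAME", "ACCESSION", "DESCRIPTION"]))
print(children_match("PROTEIN+", []))
```

The first call succeeds because the children appear in the declared order; the second fails on ordering, and the third fails because + requires at least one PROTEIN element.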
The final option for element declaration is mixed content. This indicates that an element can contain text data interspersed with specific child elements. Mixed content declarations require a special syntax, defined as follows:
<!ELEMENT ELEMENT-NAME (#PCDATA | CHILD1 | CHILD2 | ...)*>
When using mixed content, you cannot constrain the order of the child
elements or apply occurrence operators to them.
XML 1.0 defines a total of 10 different attribute types.
The CDATA type indicates that the attribute value may be set to any arbitrary text string. For example, the
following rule defines a CDATA species attribute for a PROTEIN element:
<!ATTLIST PROTEIN species CDATA #REQUIRED>
The following PROTEIN element is therefore valid:
<PROTEIN species="Homo Sapiens"/>
Attributes can be restricted to a specific set of values by using the enumeration construct. Valid values must be placed within parentheses and separated by a vertical bar. For example, the following defines an enumerated source attribute, which is restricted to a list of four possible values:
<!ATTLIST SEQUENCE source (WormBase | FlyBase |
Ensembl | UCSC)
The following SEQUENCE element is therefore valid:
The third attribute type is ID, which requires that the attribute value contain a unique identifier. By using an ID attribute, each XML element can be assigned a unique identifier. You can then later reference those elements with IDREF attributes. This enables you to create a web of internal links within a single document. Consider a simple DTD for defining protein-protein interactions. The DTD consists of PROTEIN elements and INTERACTION elements. Each protein is assigned a unique identifier and each interaction contains INTERACTOR elements, which reference those identifiers. Here is the complete DTD:
<!-- Sample DTD for representing protein-protein interactions -->
<!ELEMENT SUBMISSION (PROTEIN+, INTERACTION+)>
<!-- Proteins must have a unique ID, and a text description -->
<!ELEMENT PROTEIN EMPTY>
<!ATTLIST PROTEIN id ID #REQUIRED>
<!ATTLIST PROTEIN description CDATA #REQUIRED>
<!-- Interactions use IDREF attributes to reference proteins -->
<!ELEMENT INTERACTION (INTERACTOR+)>
<!ELEMENT INTERACTOR EMPTY>
<!ATTLIST INTERACTOR reference IDREF #REQUIRED>
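A validating parser enforces the ID and IDREF rules automatically. To make the rule itself concrete, here is a small Python sketch that performs the same consistency check by hand on a document shaped like the DTD above. The sample identifiers (P1, P2, P9) and descriptions are hypothetical, and the standard library parser used here does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance data following the SUBMISSION DTD above.
SAMPLE = """
<SUBMISSION>
  <PROTEIN id="P1" description="hypothetical protein one"/>
  <PROTEIN id="P2" description="hypothetical protein two"/>
  <INTERACTION>
    <INTERACTOR reference="P1"/>
    <INTERACTOR reference="P2"/>
  </INTERACTION>
  <INTERACTION>
    <INTERACTOR reference="P1"/>
    <INTERACTOR reference="P9"/>
  </INTERACTION>
</SUBMISSION>
"""


def unresolved_references(xml_text):
    """Return every INTERACTOR reference that matches no PROTEIN id."""
    root = ET.fromstring(xml_text)
    known_ids = {protein.get("id") for protein in root.iter("PROTEIN")}
    return [
        interactor.get("reference")
        for interactor in root.iter("INTERACTOR")
        if interactor.get("reference") not in known_ids
    ]


missing = unresolved_references(SAMPLE)
print(missing)
```

Here the reference to P9 is reported because no PROTEIN element declares that identifier; a validating parser would reject the document for the same reason.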
Each attribute declaration also specifies one of four default behaviors:
#IMPLIED: The attribute is optional.
#REQUIRED: The attribute is required and must be specified.
Default value: The attribute has a defined default value. If no value is specified, the default value is automatically used. For example, the following attribute declaration specifies Homo Sapiens as the default species:
<!ATTLIST PROTEIN species CDATA "Homo Sapiens">
#FIXED: The attribute is hard coded to a specific value.
XML Schemas for Bioinformatics
XML Schema represents the successor to Document Type Definitions (DTDs), offering more features, flexibility, and complexity. The XML Schema specification is an official recommendation of the World Wide Web Consortium (W3C). The W3C specifically created XML Schemas to address several deficiencies with XML 1.0 Document Type Definitions (DTDs).
• XML Schema provides built-in data types, such as integers, floats, and dates.
• XML Schemas use regular XML syntax.
• Schemas support several object-oriented practices, which are simply not provided by DTDs.
• Schemas also support basic inheritance concepts.
• Schemas provide several additional validation rules.
• Schemas provide full support for XML Namespaces.
6.1 Essential Concepts
The <schema> element
An XML Schema document begins with an XML prolog and a root XML element. For
XML Schemas, the root element is the schema element:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
In the line above, the schema element defines a namespace prefix "xs," which references the namespace for XML Schemas. The prefix could be named anything, but most people and applications use "xs" or "xsd." By declaring a namespace for XML Schemas, you can easily reference schema specific elements, such as xs:annotation and xs:complexType. These elements are referenced via qualified names. A qualified name consists of a namespace prefix, followed by a colon, and a local name. For example, the start tag <xs:annotation> references the annotation element in the XML Schema namespace.
The xs:annotation element is used to document the schema, for example:
Sample XML Schema for representing Protein data.
Simple Types vs. Complex Types
A simple type contains a single value, such as a string, integer, or date value. A complex
type can contain more than one value, usually in the form of attribute values or child elements.
An element that contains only a single string value is therefore considered a simple
type. By contrast, consider the following element:
<organism taxonomy_id="10090">Mus musculus</organism>
This element contains more than one value; specifically, it contains a string value,
"Mus musculus," and an attribute named taxonomy_id. The organism element is
therefore considered a complex type.
Global Elements vs. Local Elements
A global element is any element which is a direct child of the root schema element. For example, protein_set, protein, organism, and cross-reference are all direct children of the schema element, and are therefore considered global in scope. By contrast, a local element is any element which is scoped within another schema construct and is not a direct child of the schema element. For example, the protein element is global in scope, but it defines a number of local elements, such as entry_name, protein_name, and gene_name. Global elements can be referenced and reused multiple times throughout your schema. To do so, you simply specify the ref attribute and specify the name of the global element. By their very nature, local elements cannot be reused. In fact, local elements are scoped to their direct parent, and cannot be referenced outside this scope.
Creating Instance Documents
Instance documents are not required to specifically reference a grammar document, but most do so in practice. References to external XML Schemas are always specified within the root element of the instance document. You have two main options. The first pertains to schemas without a declared namespace. In this case, you use the noNamespaceSchemaLocation attribute. The second option pertains to schemas with a declared namespace. In this case, you use the alternative schemaLocation attribute. We explore the first option.
A sample instance document, which adheres to the protein schema,
The root element includes an "xsi" namespace declaration that references the XML Schema instance specification. Again, the prefix "xsi" is just a convention and you are free to use whatever prefix you like. By declaring the instance namespace, your document is free to reference schema instance constructs. For example, your document can now reference either the noNamespaceSchemaLocation attribute or the schemaLocation attribute. Our protein schema has no declared namespace, and therefore the example uses the noNamespaceSchemaLocation attribute. The value of this attribute specifies the location of the associated schema. For example, the value "protein.xsd" refers to a file named protein.xsd, which must be located in the same directory as the instance document, but you could just as easily specify an absolute or relative URL and point to any location on the Internet.
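To show how the schema reference appears to software, the Python sketch below reads the noNamespaceSchemaLocation attribute from the root element of a minimal instance document. The instance document here is a hypothetical stand-in (a protein_set root with one empty protein element), and the function name schema_location is an assumption; the code only locates the schema reference and does not validate anything.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema instance specification (the "xsi" prefix).
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical, heavily shortened instance document.
INSTANCE = (
    '<protein_set xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:noNamespaceSchemaLocation="protein.xsd">'
    "<protein/>"
    "</protein_set>"
)


def schema_location(xml_text):
    """Return the noNamespaceSchemaLocation value of the root element, if any."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified attribute names as {namespace}localname.
    return root.get(f"{{{XSI}}}noNamespaceSchemaLocation")


location = schema_location(INSTANCE)
print(location)
```

Because the lookup uses the namespace URI rather than the prefix, it works equally well if the document chooses a prefix other than "xsi".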
6.2 Simple Types
Built-in Schema Types
The XML Schema specification includes support for 44 built-in data types, including
integers, floats, doubles, and dates. For example:
<xs:element name="protein_name" type="xs:string"/>
This declares a protein_name element with a string data type.
Beyond the 44 built-in data types, XML Schema provides a mechanism for
creating new types.
XML Schema provides two primary mechanisms by which you can derive new types. The first is derivation by extension. This means that the newly derived type has the same properties as the base type, plus a few additionally specified properties. The second is derivation by restriction. This means that the newly derived type has the same properties as the base type, but that additional restrictions are placed on the newly derived type. Facets enable you to derive new types via restriction by placing specific restrictions on data values. There are a total of 12 facets, including length, minLength, maxLength, pattern, and enumeration. To use the schema facets, you must create a new data type. This new type must be based on an existing data type in the type hierarchy.
The Pattern Facet
The pattern facet restricts string values to those that match a regular expression pattern.
Example Illustrates how to declare and reference a named simple type
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="
<xs:element ref="protein" minOccurs="1"
</xs:element>
<xs:element name="accession" type="accessionType"
minOccurs="1" maxOccurs="1"/>
<xs:simpleType name="accessionType">
<xs:restriction base="xs:string">
<xs:minLength value="4"/>
<xs:maxLength value="8"/>
</xs:restriction>
</xs:simpleType>
We are declaring a new data type, named accessionType. This type will contain string data, but the string data must be between four and eight characters in length. The first line of the type declaration indicates that we are declaring a new simpleType. This is a simple type because it will contain a single value, and will not contain any attributes or child elements. The second line indicates that the accessionType will be derived from the built-in string data type, and that we will be deriving via restriction. The third and fourth lines specify the minLength and maxLength facets.
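The effect of these facets can be imitated in a few lines of Python, which may help clarify what a schema validator checks for accessionType. The function names are assumptions, and the pattern "[A-Z][0-9]{5}" is a hypothetical example that does not come from the protein schema. Python's regular expression dialect also differs in places from the XML Schema dialect, so this is a sketch of the idea rather than an equivalent implementation.

```python
import re


def satisfies_length_facets(value, min_length=4, max_length=8):
    """Mimic the minLength and maxLength facets declared for accessionType."""
    return min_length <= len(value) <= max_length


def satisfies_pattern_facet(value, pattern):
    """Mimic the pattern facet; schema patterns must match the whole value."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical accession values.
print(satisfies_length_facets("P26954"))      # six characters, within 4..8
print(satisfies_length_facets("P1"))          # too short
print(satisfies_pattern_facet("P26954", r"[A-Z][0-9]{5}"))
print(satisfies_pattern_facet("p26954", r"[A-Z][0-9]{5}"))
```

A value has to pass every facet declared on its type, so a stricter accession type could combine both checks.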
Parsing XML in Perl
Perl remains the programming language of choice in bioinformatics. Perl has excellent support for processing and manipulating text, finding regular expression patterns, retrieving files via the Internet, and connecting to a wide variety of relational databases. This makes it an ideal language for parsing flat text files, such as GenBank Flat File records, integrating biological data from multiple sources, and performing sequence analysis. Building on these strengths, the bioinformatics community has developed BioPerl, a very successful open source module that includes numerous features, including the ability to retrieve biological data from remote data sources, run BLAST searches, and manipulate sequence data. Perl also has excellent support for XML, and is supported by a wide variety of third party open source XML modules. XML parsers are divided into two groups: tree-based and event-based parsers. A tree-based interface like DOM parses an XML document and builds an in-memory tree of all its XML elements. An event-based parser like SAX reads the document sequentially. Each time the parser encounters a significant piece of data, such as a start tag, it immediately fires an event. To extract the XML data, your application must register to receive parsing events and act upon them appropriately. In event-based parsing, the XML parser is referred to as the event producer and your application handler is referred to as the event consumer. An event-based parser is always sequential.
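The difference between the two parsing styles is language independent. The sketch below uses Python's standard library rather than Perl, purely for illustration, to extract the same values once with a tree-based (DOM) parser and once with an event-based (SAX) parser. The sample document and the accession values ACC0001 and ACC0002 are hypothetical placeholders.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document with placeholder accession values.
DOC = (
    "<PROTEIN_SET>"
    "<PROTEIN><ACCESSION>ACC0001</ACCESSION></PROTEIN>"
    "<PROTEIN><ACCESSION>ACC0002</ACCESSION></PROTEIN>"
    "</PROTEIN_SET>"
)


def accessions_via_dom(xml_text):
    """Tree-based: build the whole document in memory, then query it."""
    dom = minidom.parseString(xml_text)
    return [node.firstChild.data for node in dom.getElementsByTagName("ACCESSION")]


class AccessionHandler(xml.sax.ContentHandler):
    """Event consumer: collects the text of each ACCESSION element."""

    def __init__(self):
        super().__init__()
        self.accessions = []
        self._buffer = None  # a list only while inside an ACCESSION element

    def startElement(self, name, attrs):
        if name == "ACCESSION":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "ACCESSION":
            self.accessions.append("".join(self._buffer))
            self._buffer = None


def accessions_via_sax(xml_text):
    """Event-based: the parser (producer) calls the handler as it reads."""
    handler = AccessionHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.accessions


print(accessions_via_dom(DOC))
print(accessions_via_sax(DOC))
```

Both functions return the same list. The tree-based version is shorter, while the event-based version never holds the whole document in memory, which matters for very large sequence files.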
The Distributed Annotation System
The Distributed Annotation System (DAS) is an XML based protocol that facilitates the distribution and sharing of genome annotation data. Since its introduction, DAS has become one of the most widely used protocols for biological data exchange, and is now implemented at numerous laboratories, including UCSC, the Institute for Genomic Research (TIGR), and the European Bioinformatics Institute (EBI). The core of the DAS protocol is built around a small set of XML queries, and a corresponding set of XML Document Type Definitions (DTDs). It therefore provides an exciting window into the use of XML at the cutting edge of bioinformatics.
8.1 Genome Annotation
A crucial step in genomic analysis and interpretation is genome annotation. At a very broad level, genome annotation is simply the process of analyzing regions of raw sequence data and adding notes, observations, and predictions. For example, annotation includes the identification of exons (protein-coding portions of genes) and introns, and the categorization of repeat-coding regions. Genome annotation may also include the linking of sequence data to already cataloged genes, making computerized predictions on the location of novel genes, and identifying sequence similarities among species. In short, annotation attempts to decipher and analyze raw sequence data and connect it to biological function.
Ensembl is one example of a genome annotation project.
Genome annotation presents numerous technical challenges. First, annotation is highly decentralized. Second, it is not likely that one organization will be able to coordinate and centralize all genomic annotations. In response to these challenges, Lincoln Stein of Cold Spring Harbor Laboratory, along with Sean Eddy and LaDeana Hillier, both of Washington University in St. Louis, set out to build a distributed protocol for genome annotation.
DAS is formally specified by a client/server protocol and a set of XML Document Type Definitions (DTDs). Client applications connect to DAS servers, send queries, and receive XML encoded data back. For example, a client can request all genomic annotations within a specified region of human chromosome 11, or request only a subset of those annotations. All DAS servers adhere to the same specification and encode annotation data in the same XML format. Client applications can therefore easily aggregate data from multiple servers. Without DAS, a user would need to manually surf through three different web sites to compare annotation data. With DAS, a user can open a single client application and simply specify three DAS servers. Behind the scenes, the client application connects to each DAS server, aggregates the response data, and presents a unified data view.
DAS specifies two types of servers: reference servers and annotation servers. Reference servers hold a "reference map" to the genome and store the raw genomic sequence data. Annotation servers hold the actual genomic annotation. The two server types are not mutually exclusive, and a single server can act as both a reference and an annotation server. In order to work successfully, DAS requires that the community at large agree on a set of genomic reference maps. For example, for the human genome, most DAS servers use the public genome assembly, available from NCBI. Multiple versions of this assembly exist and new versions are continually published as new data is generated and finalized. In order to accurately compare data from multiple DAS servers, each of the annotation servers must use the same assembly, and the same version of that assembly.
The WormBase DAS Viewer is a web-based DAS client, and the BioJava DAS Client is a stand-alone DAS client. Ensembl/Sanger and Wormbase.org are examples of DAS servers.
8.3 DAS Protocol Overview
The DAS protocol is built on a simple pattern of requests and responses. DAS clients issue requests in the form of Internet URLs, and servers issue responses encoded in XML. Currently, there are only eight different DAS commands, and each command triggers a different XML response from the server.
8.4 DAS commands
Transportation for the DAS protocol is handled by HTTP (HyperText Transfer Protocol), the same protocol used to connect web browsers and web servers.
8.5 DAS Requests
Each DAS request is specified as an Internet URL. The URL is defined by five
components, which appear in the following order:
• Site-specific component: this is the Internet domain name, followed by a path to the DAS server application
• /das: a required prefix indicating the beginning of a DAS command
• [Data source]: a data source element (required for most commands) indicating the data source of interest
• [DAS Command]: the actual DAS command, e.g., entry-points, dna, features, etc.
• [Command Parameters]: each DAS command can include specific parameters to refine the actual query.
[Figure: the DAS client sends an HTTP request (URL) and the DAS server returns an HTTP response (XML).]
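The five components can be assembled mechanically, as the following Python sketch suggests. The host name das.example.org, the data source name "elegans", the parameter name "segment", and its value syntax are all assumptions chosen for illustration; consult the DAS specification or the target server for the exact commands and parameters it supports.

```python
from urllib.parse import urlencode


def build_das_url(site, command, data_source=None, params=None):
    """Assemble a DAS request URL from its components, in the required order."""
    parts = [site.rstrip("/"), "das"]   # site-specific component, then /das
    if data_source:                     # required for most, but not all, commands
        parts.append(data_source)
    parts.append(command)               # e.g. entry-points, dna, features
    url = "/".join(parts)
    if params:
        # Keep ":" and "," readable inside parameter values.
        url += "?" + urlencode(params, safe=":,")
    return url


# Hypothetical server, data source, and parameter values.
features_url = build_das_url(
    "http://das.example.org/cgi-bin",
    "features",
    data_source="elegans",
    params={"segment": "I:1,1000"},
)
print(features_url)
```

A client would then issue an ordinary HTTP GET for this URL and parse the XML document that the server returns.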