XML and Databases

来源：互联网发布：对外投资数据编辑：程序博客网时间：2024/04/19 23:20

XML and Databases

1.0 Introduction
2.0 Is XML a Database?
3.0 Why Use a Database?
4.0 Data versus Documents
   4.1 Data-Centric Documents
   4.2 Document-Centric Documents
   4.3 Data, Documents, and Databases
5.0 Storing and Retrieving Data
   5.1 Transferring Data
   5.2 Mapping Document Structure to Database Structure
      5.2.1 Template-Driven Mappings
      5.2.2 Model-Driven Mappings
         5.2.2.1 The Table Model
         5.2.2.2 The Data-Specific Object Model
   5.3 Data Types, Null Values, Character Sets, and All That Stuff
      5.3.1 Data Types
      5.3.2 Binary Data
      5.3.3 Null Data
      5.3.4 Character Sets
      5.3.5 Processing Instructions
      5.3.6 Storing Markup
   5.4 Generating DTDs from Database Schema and Vice Versa
6.0 Storing and Retrieving Documents
   6.1 Content Management Systems and Relational Databases
      6.1.1 Out-of-the-Box Use
      6.1.2 Writing Your Own System
7.0 XML Database Products
8.0 Additional Links
9.0 Comments and Feedback

1.0 Introduction

This paper briefly discusses the relationship between XML and databases and describes some of the types of software available to process XML documents with databases. Although it is not intended to be exhaustive, I hope that it describes some of the major issues in using XML with databases. It is somewhat biased towards relational databases simply because that is where my experience is.

2.0 Is XML a Database?

Before we start talking about XML and databases, we need to answer a question that occurs to many people: "Is XML a database?" In the strictest sense -- that is, when "XML" means an XML document -- the answer is no. Although an XML document contains data, without any additional software to help process that data, it is no more a database than any other text file.

In a more liberal sense -- that is, when "XML" means an XML document and all the surrounding XML tools and technologies -- then the answer is yes, but... It is yes because XML provides many of the things found in databases: storage (the XML document), schemas (DTDS, XML schema languages), query languages (XQL, XML-QL, QUILT, etc.), programming interfaces (SAX, DOM), and so on. It is but... because it lacks many of the things found in real databases: efficient storage, indexes, security, transactions and data integrity, multi-user access, triggers, queries across multiple documents, and so on.

So while it may be possible to use XML as a database in environments with not too much data, few users, and modest performance requirements, it will fail in most production environments, which have many users, strict data integrity requirements, and the need for good performance. Furthermore, given the low price and ease of use of databases like dBASE and Access, there seems little reason to use XML as a database even in the first case.

3.0 Why Use a Database?

The first question you need to ask yourself when you start thinking about XML and databases is why you want to use a database in the first place. Do you have legacy data you want to expose? Are you looking for a place to store your Web pages? Is the database used by an e-commerce application in which XML is used as a data transport? The answers to these questions will strongly influence your choice of database and middleware (if any), as well as how you use that database.

For example, suppose you have an e-commerce application that uses XML as a data transport. It is a good bet that your data has a highly regular structure and that things such as the entities and encodings used in XML documents are not important to you -- after all, you are interested only in the data, not how it is physically stored in the document. If your application is relatively simple, a relational database and data transfer middleware might satisfy your needs; if it is large and complex, you might want a complete application development environment that supports XML.

On the other hand, suppose you have a Web site built from a number of prose-oriented XML documents. Not only do you want to manage the site, you would like to provide a way for users to search its contents. Your documents are likely to have a highly irregular structure and things such as entity usage are probably important to you because they are a fundamental part of how your documents are structured. In this case, you might need some sort of "native XML" database that performs versioning, tracks entity usage, and supports a query language like XQL.

4.0 Data versus Documents

Perhaps the most important factor in choosing a database is whether you are using the database to store data or documents. If you want to store data, you will need a database that is tuned for data storage, such as a relational or object-oriented database, as well as middleware to transfer data between the database and XML documents. On the other hand, if you want to store documents, you will need a content management system which is designed specifically to store documents.

Although it is possible to store documents in a relational or object-oriented database yourself, you will usually find yourself duplicating the work of a content management system. Similarly, although a content management system is usually built on top of an object-oriented or hierarchical database, trying to use it as a database will probably prove to be frustrating.

Whether you need to store data or documents can often be answered by simply looking at your XML documents. The reason for this is that XML documents fall into two rough categories: data-centric and document-centric.

4.1 Data-Centric Documents

Data-centric documents are characterized by fairly regular structure, fine-grained data (that is, the smallest independent unit of data is at the level of a PCDATA-only element or an attribute), and little or no mixed content. The order in which sibling elements and PCDATA occurs is often not significant. Good examples are XML documents containing sales orders, flight schedules, restaurant menus, and so on. Data-centric documents are usually designed for machine consumption and the fact that XML is involved may be superfluous -- it is often simply the data transport.

For example, the following sales order document is data-centric:

 <Orders> <SalesOrder SONumber="12345"> <Customer CustNumber="543"> <CustName>ABC Industries</CustName> <Street>123 Main St.</Street> <City>Chicago</City> <State>IL</State> <PostCode>60609</PostCode> </Customer> <OrderDate>981215</OrderDate> <Line LineNumber="1"> <Part PartNumber="123"> <Description> <p><b>Turkey wrench:</b><br /> Stainless steel, one-piece construction, lifetime guarantee.</p> </Description> <Price>9.95</Price> </Part> <Quantity>10</Quantity> </Line> <Line LineNumber="2"> <Part PartNumber="456"> <Description> <p><b>Stuffing separator:<b><br /> Aluminum, one-year guarantee.</p> </Description> <Price>13.27</Price> </Part> <Quantity>5</Quantity> </Line> </SalesOrder> </Orders>

Note that in the XML world, many prose-rich documents are, in fact, data-centric. For example, consider a page on Amazon.com that displays information about a book. Although the page is largely text, the structure of that text is highly regular, much of it is common to all pages describing books, and each piece of page-specific text is limited in size. Thus, the page could be built from a simple, data-centric XML document -- containing page-specific text retrieved from the database -- and an XSL stylesheet that adds the boilerplate text. In general, any Web site that dynamically constructs HTML documents today by filling a template with database data can probably be replaced by data-centric XML documents and one or more XSL stylesheets.

For example, the following lease document:

 <Lease> <Lessee>ABC Industries</Lessee> agrees to lease the property at <Address>123 Main St., Chicago, IL</Address> from <Lessor>XYZ Properties</Lessor> for a term of not less than <LeaseTerm TimeUnit="Months">18</LeaseTerm> at a cost of <Price Currency="USD" TimeUnit="Months">1000</Price>. </Lease>

could be built from the following XML document and a simple stylesheet:

 <Lease> <Lessee>ABC Industries</Lessee> <Address>123 Main St., Chicago, IL</Address> <Lessor>XYZ Properties</Lessor> <LeaseTerm TimeUnit="Months">18</LeaseTerm> <Price Currency="USD" TimeUnit="Months">1000</Price> </Lease>

4.2 Document-Centric Documents

Document-centric documents are characterized by irregular structure, larger grained data (that is, the smallest independent unit of data might be at the level of an element with mixed content or the entire document itself), and lots of mixed content. The order in which sibling elements and PCDATA occurs is almost always significant. Good examples are books, email, advertisements, and almost any XHTML document. Document-centric documents are generally designed for human consumption.

For example, the following product description is document-centric:

 <Product> <Name>Turkey Wrench</Name> <Developer>Full Fabrication Labs, Inc.</Developer> <Summary>Like a monkey wrench, but not as big.</Summary> <Description> <Para>The turkey wrench, which comes in both right- and left-handed versions (skyhook optional), is made of the finest stainless steel. The Readi-grip rubberized handle quickly adapts to your hands, even in the greasiest situations. Adjustment is possible through a variety of custom dials.</Para> <Para>You can:</Para> <List> <Item><Link URL="Order.html">Order your own turkey wrench</Link></Item> <Item><Link URL="Wrenches.htm">Read more about wrenches</Link></Item> <Item><Link URL="catalog.zip">Download the catalog</Link></Item> </List> <Para>The turkey wrench costs just $19.99 and, if you order now, comes with a hand-crafted shrimp hammer as a bonus gift.</Para> </Description>

4.3 Data, Documents, and Databases

In reality, the distinction between data-centric and document-centric documents is not always clear. For example, an otherwise data-centric document, such as an invoice, might contain large-grained, irregularly structured data, such as a part description. And an otherwise document-centric document, such as a user's manual, might contain fine-grained, regularly structured data (often metadata), such as an author's name and a revision date. In spite of this, characterizing your documents as data-centric or document-centric will help you decide whether you are interested in data or documents, which in turn dictates the kind of system you need.

To store and/or retrieve data, you can use a database (usually relational, object-oriented, or hierarchical) and middleware (either built-in or third party), or you can use an XML server (a platform for building distributed applications, such as e-commerce, that use XML for data transfer) or XML-enabled Web server (a Web server that can build XML documents from data in a database). To store documents, you will need a content management system or persistent DOM implementation. The issues in using each type of system are discussed in sections 5.0, "Storing and Retrieving Data", and 6.0, "Storing and Retrieving Documents". You can find a list of available software in XML Database Products.

5.0 Storing and Retrieving Data

Data of the kind that is found in data-centric documents can originate either in the database (in which case you want to expose it as XML) or in an XML document (in which case you want to store it in a database). An example of the former is the vast amount of legacy data stored in relational databases; an example of the latter is data exposed on the Web as XML that you want to store in your database for further processing. Thus, depending on your needs, you may need software that transfers data from an XML document to the database, from the database to an XML document, or both.

5.1 Transferring Data

When storing data in the database, it is often acceptable to discard much of the information about a document, such as its name and DTD, as well as its physical structure, such as entity definition and usage, the order in which attribute values and sibling elements occur, the way in which binary data is stored (Base64 v. unparsed entity v. something else), CDATA sections, and encoding information. Similarly, when retrieving data from the database, the resulting XML document is likely to contain no CDATA or entity usage (other than the predefined entities lt, gt, amp, apos, and quot) and the order in which sibling elements and attributes appear is likely to be the order in which the data was returned by the database.

Although this may seem shocking at first, it is often quite reasonable. For example, consider the case in which XML is used as a data format for transferring a sales order from one database to another. In this case, it really doesn't matter whether the sales order number is stored in the document before or after the sales order date, nor does it matter if the customer's name is stored in a CDATA section, an external entity, or directly as PCDATA. All that is important is that the relevant data is transferred from the first database to the second. Thus, the data transfer software needs to consider hierarchical order (which groups the information about a single sales order) and little else.

One consequence of ignoring information about the document and its physical structure is that "round-tripping" a document -- that is, storing the data from a document in the database and then reconstructing the document from that data -- often results in a different document, even in the canonical sense of the term. Whether this is acceptable depends on your needs and might influence your choice of database and data transfer middleware.

5.2 Mapping Document Structure to Database Structure

In order to transfer data between an XML document and a database, it is necessary to map document structure to database structure and vice versa. Such mappings fall into two general categories: template-driven and model-driven.

5.2.1 Template-Driven Mappings

In a template-driven mapping, there is no predefined mapping between document structure and database structure. Instead, you embed commands in a template that is processed by the data transfer middleware. For example, consider the following template (not used by any real product), in which SELECT statements are embedded in <SelectStmt> elements:

 <?xml version="1.0"?> <FlightInfo> <Intro>The following flights have available seats:</Intro> <SelectStmt>SELECT Airline, FltNumber, Depart, Arrive FROM Flights</SelectStmt> <Conclude>We hope one of these meets your needs</Conclude> </FlightInfo>

When processed by the data-transfer middleware, each SELECT statement might be replaced by its results, formatted as XML:

 <?xml version="1.0"?> <FlightInfo> <Intro>The following flights have available seats:</Intro> <Flights> <Row> <Airline>ACME</Airline> <FltNumber>123</FltNumber> <Depart>Dec 12, 1998 13:43</Depart> <Arrive>Dec 13, 1998 01:21</Arrive> </Row> ... </Flights> <Conclude>We hope one of these meets your needs</Conclude> </FlightInfo>

Template-driven mappings can be tremendously flexible. For example, some products allow you to place result set values wherever you want in the result set -- including using them as parameters in a subsequent SELECT statement -- rather than simply formatting the results themselves, as is shown above. Others support programming constructs such as for loops and if statements. And still others support parameterization of SELECT statements, such as through HTTP parameters.

Currently, template-driven mappings are available only for transferring data from a relational database to an XML document.

5.2.2 Model-Driven Mappings

In a model-driven mapping, a data model of some sort is imposed on the structure of the XML document and this is mapped, either implicitly or explicitly, to the structures in the database and vice versa. What is lost in flexibility is gained in simplicity, since the system, which is based on a concrete model, generally does more work for the user. Because the result of transferring data from the database to an XML document follows a single model, XSL is commonly integrated into model-driven products to provide the flexibility found in template-driven systems.

Two models for viewing the data in an XML document are common: the table model and the data-specific object model. Other models of the data in an XML document are also possible. For example, through the use of ID and IDREF attributes, an XML document can be used to represent a directed graph. However, these models do not seem to be widely supported by the available middleware.

5.2.2.1 The Table Model

The table model is used by many of the middleware packages for transferring data between an XML document and a relational database. It models the XML document as a single table or set of tables. That is, the structure of the XML document must be similar to the following, where the <database> element does not exist in the single-table case:

 <database> <table> <row> <column1>...</column1> <column2>...</column2> ... </row> ... </table> ... </database>

The term "table" is loosely interpreted to mean a single result set (when transferring data from the database to XML) or a single table or updateable view (when transferring data from XML to the database). If data from more than one result set is desired (when transferring data from the database) or the XML document contains more deeply nested elements than those necessary to represent a set of tables (when transferring data to the database), then the transfer is simply not possible.

5.2.2.2 The Data-Specific Object Model

The second common model for data in an XML document is a tree of data-specific objects, in which element types generally correspond to objects and content models, attributes, and PCDATA correspond to properties. This model maps directly to object-oriented and hierarchical databases and can be mapped to relational databases using traditional object-relational mapping techniques or SQL 3 object views. Note that this model is not the Document Object Model (DOM); the DOM models the document itself, not the data in the document, and is used to build content managment systems over relational databases, as is described in section 6.1.2.

For example, the sales order document shown above could be viewed as a tree of objects from five classes -- Orders, SalesOrder, Customer, Line, and Part -- as shown in the following diagram:

 Orders | SalesOrder / | Customer Line Line | | Part Part

When modelling an XML document as a tree of data-specific objects, there is no requirement that elements necessarily correspond to objects. For example, if an element contains only PCDATA, such as the CustName element in the sales order document, it can be reasonably viewed as a property because, like a property, it contains a single, scalar value. Similarly, it is sometimes useful to model elements with mixed or element content as properties. An example of this is the Description element in the sales order document: although it has mixed content in the form of XHTML, it is more useful to view the description as a single property because its component pieces are meaningless by themselves.

5.3 Data Types, Null Values, Character Sets, and All That Stuff

This section discusses a number of issues related to storing data from XML documents in databases. Generally, you will have no choice about how the middleware you choose resolves these issues, but you should be aware that these issues exist, as it might help you in choosing your middleware.

5.3.1 Data Types

XML does not support data types in any meaningful sense of the word. Except for unparsed entities, all data in an XML document is text, even if it represents another data type, such as a date or an integer. Generally, the data transfer middleware will convert data from text (in the XML document) to other types (in the database) and vice versa. However, the number of text formats recognized for a particular data type are likely to be limited, such as to those supported by a given JDBC driver. Dates are most likely to cause problems, as the range of possible formats is enormous. Numbers, with their various international formats, might cause problems as well.

5.3.2 Binary Data

There are two common ways to store binary data in XML documents: unparsed entities and Base64 encoding (a MIME encoding that maps binary data to a subset of US-ASCII). For relational databases, both can prove problematic, as the rules governing how binary data is sent to or retrieved from the database can be quite strict and can cause problems for the middleware. In addition, there is no standard notation for indicating that an element contains Base64-encoded data, so the middleware might not recognize this encoding at all. Finally, the notation associated with an unparsed entity or Base64 encoded element is one of those things that might be discarded by the middleware when storing data in the database. Therefore, if binary data is important to you, be sure to check that your middleware supports it.

5.3.3 Null Data

In the database world, null data means data that simply isn't there. This is very different from a value of 0 (for numbers) or zero length (for a string). For example, suppose you have data collected from a weather station. If the thermometer isn't working, a null value is stored in the database rather than a 0, which would mean something different altogether.

XML also supports the concept of null data through optional element types and attributes. If the value of an optional element type or attribute is null, it simply isn't included in the document. As with databases, empty elements or attributes containing zero length strings are not null: their value is a zero-length string.

When mapping the structure of an XML document to the database and vice versa, you should be careful that optional element types and attributes are mapped to nullable columns and vice versa. The result of not doing so is likely to be an insertion error (when transferring data to the database) or invalid document (when transferring data from the database).

Because the XML community is likely to have a more flexible notion of what is meant by null than the database community -- in particular, many XML users are likely to consider empty elements or attributes containing zero-length strings to be "null" -- you should check how your chosen middleware handles this situation. Some middleware may offer the user the choice of defining what constitutes "null" in an XML document.

5.3.4 Character Sets

By definition, an XML document can contain any Unicode character except some of the control characters. Unfortunately, many databases offer limited or no support for Unicode and require special configuration to handle non-ASCII encodings of character data. If your data contains non-ASCII characters, be sure to check how and if both your database and middleware handle these characters.

5.3.5 Processing Instructions

Processing instructions are not part of the "data" of an XML document and it is likely that most middleware cannot handle them. The problem is that, especially with a strict mapping of the XML document structure to database structure, processing instructions simply do not fit because they can occur virtually anywhere in the document. Thus, it is hard for middleware to know where to store them and when to retrieve them. If processing instructions and "round-tripping" of documents are important to you, be sure to check how your middleware handles this.

5.3.6 Storing Markup

As was mentioned in section 5.2.2, it is sometimes useful to store elements with element or mixed content in the database without further parsing. The most common way to do this is simply to store the markup itself in the database. Unfortunately, this leads to a problem when retrieving the data: it is impossible to determine whether such markup in the database is, in fact, true markup or was originally an entity that represented a markup character, such as can be done with the lt and gt entities.

For example, if the following description:

 <description> <b>Confusing example:</b> &lt;foo/&gt; </description>

is stored in the database as:

 <b>Confusing example:</b> <foo/>

the database cannot tell whether <b> and <foo> are markup or text. There are several possible solutions to this, such as flagging markup in some way or using entities for non-markup markup characters, but you must be careful that these are acceptable to your other applications that use the data. For example, if you want to search for a less-than sign ("<") and an lt entity ("<") is stored in the database, you need to be aware of this.

5.4 Generating DTDs from Database Schema and Vice Versa

A common question when transferring data between XML documents and a database is how to generate XML DTDs from database schema and vice versa. In short, this is a fairly straight-forward operation, but the results are generally less useful than many users would like. (It should also be noted that this is generally a one-time operation, as most applications, and especially all vertical applications, work with a known set of DTDs/relational schema. The obvious exceptions are tools that store random XML documents in relational databases or that expose relational data as XML documents; in the latter case, it is not clear how useful DTDs are.)

For example, the following (simplistic) procedure generates a relational schema from a DTD:

For each element type with element or mixed content, create a table and a primary key column.
For each element type with mixed content, create a separate table in which to store the PCDATA, linked to the parent table through the parent's primary key.
For each single-valued attribute of that element type, and for each singly-occurring child element type with PCDATA-only content, create a column in that table. If the child element type or attribute is optional, make the column nullable.
For each multi-valued attribute and for each multiply-occurring child element type with PCDATA-only content, create a separate table to store values, linked to the parent table through the parent's primary key.
For each child element type with element or mixed content, link the parent element's table to the child element's table with the parent's primary key.

The following (simplistic) procedure generates a DTD from a relational schema:

For each table, create an element.
For each column in a table, create an attribute or a PCDATA-only child element.
For each primary key/foreign key relationship in which a column of the table contributes the primary key, create a child element.

Unfortunately, there are a number of drawbacks to these procedures. For example, there is no way to definitively predict data types or column lengths from the DTD. Any prediction, such as that made by reading a sample document, can be defeated by simply placing data of another "type" or of one character more length in a document. (The long-term solution to this problem is the use of data types in XML schema documents.) Similarly, when generating a DTD from a relational schema, there is no way to predict the order in which child elements "should" occur or whether a column, such as a database-internal row identifier, should be transferred at all. In both cases, naming collisions are possible.

Despite these drawbacks, these algorithms still provide a useful starting point for generating DTDs from relational schema and vice versa.

6.0 Storing and Retrieving Documents

Documents tend to originate either in XML or some other format, such as RTF, PDF, or SGML, which is then converted to XML. (For information about documents built from data inserted into a template, see the discussion at the end of section 4.1). Thus, if you work with XML documents, as opposed to the data stored in XML documents, you probably need a way to store and retrieve your documents, as well as ways to convert them to and from other formats. This section discusses only the former case; for information about conversion software, see any of the XML software Web sites listed in section 3.0, "Additional Links" of XML Database Products.

For very simple document sets, the file system or some sort of version control system (such as is used for software version control) may be adequate for your needs. However, if you have a complex document set, you usually need a content management system.

(The term content management system, as opposed to document management system, reflects the fact that such systems generally allow you to break your documents into discrete content fragments, such as examples, procedures, chapters, or sidebars, as well as metadata, such as author names, revision dates, and document numbers, rather than having to manage each document as a whole. Not only does this simplify such things as how to coordinate the work of multiple writers working on the same document, it allows you to assemble entirely new documents from existing components.)

Unlike the case where data from an XML document is stored in the database, content management systems generally support "round-tripping" of documents, as things such as physical structure are often critical to document maintenance. Content management systems also provide a number of other capabilities, including:

Version and access control
Search engines
Editors
Publishing engines, such as to paper, CD, or the Web
Separation of content and style
Extensibility through scripting or programming
Integration of database data

Because my background is in databases, not content management, I will not describe content management systems in any further detail. Fortunately, such detail generally isn't necessary, as the systems themselves hide most technical issues from the user.

6.1 Content Management Systems and Relational Databases

Because relational database systems are widespread, and because the phrase "storing XML documents in a database" seems, for many people, to be synonymous with using a content management system, many people assume that relational databases are a good way to store their XML documents (as opposed to the data in those documents). Whether this is a good idea is an open question.

One the one hand, many people who have tried to do this swear that it is a very bad idea. The reason they give is that relational databases do not inherently deal well with many of the things needed by a content management system: order, hierarchy, irregular structure, and fields of highly variable length. For example, if you want to store information about the order in which PCDATA and child elements appear in a parent element, you must store this in a separate column and order the children yourself. Nor can you easily, using SQL, formulate the query, "get me all chapters in which the third paragraph mentions part 123 in bold".

On the other hand, a number of large content management systems, such as BladeRunner (Interleaf), SigmaLink (STEP), Parlance Content Manager (XyEnterprise), and Target 2000 (Progressive Information Technology), are based on relational databases. One of the reasons cited is that relational databases scale better than object-oriented databases.

What is clear is that a relational database is not, in itself, a content management system. That is, if you want to do content management with a relational database, your choices are to buy a content management system based on a relational database, accept limitations in functionality with out-of-the-box use, or write your own system.

6.1.1 Out-of-the-Box Use

At its worst, out-of-the-box use is limited to storing XML documents in unparsed form in a single column as a BLOB. This has obvious drawbacks, such as the inability to create a new document from parts of existing documents, but it is simple and works in some cases, such as when you have a description written in XHTML and no need to break it into smaller pieces. Furthermore, relational databases are steadily increasing their text processing capabilities, with full-text indexing and the ability to perform specialized searches, such as proximity searches or searches that use thesauri to look for synonyms. Thus, storing unparsed documents in a single column might suffice for simple document management.

Fortunately, relational databases are becoming more XML aware, so out-of-the-box use is often better than simply storing documents in a single column. Although many of the new XML features are oriented more towards data transfer than document management, they do fit the needs of some applications, especially those that fall in between data transfer and content management.

6.1.2 Writing Your Own System

If you decide to write your own system, you should think very hard before you start. If you really need a full-featured content management system, you are probably wasting your time -- all you will do is duplicate years of somebody else's hard work. On the other hand, if your needs are simple, writing your own system might be worth your time.

The general strategy in writing your own system is to map the DOM (or a subset of the DOM) to the database using an object-relational mapping. That is, you create one table in the database for each object in the DOM you are interested in and connect these with primary and foreign keys. Whether this is an exact mapping of the DOM or an approximate one, such as that described below, depends on your needs. Note that this appears to be the strategy used by "native" XML databases built on relational databases.

An example of this type of system has been described by Mark Birbeck on the XML-L mailing list. The system consists of five tables:

Attribute definition: defines attributes, including their type, legal values, and so on
Element/attribute association: defines which attributes apply to which elements
Content model definition: defines which elements can contain which other elements
Attribute values: contains attribute values and pointers to the appropriate rows in the attribute definition and element/attribute association tables
Element values: contains element values (PCDATA or pointers to other element values), the order in which the element occurs in its parent, a pointer to the row that contains the value of the parent element, and a pointer to the appropriate row in the element/attribute association table

The first three tables are equivalent to a simple DTD; the next two tables contain the actual data. By repeatedly querying the latter two tables, it is possible to reconstruct any part of an XML document. For complete details, see the topics "Record ends, Mixed content, and storing XML documents on relational database" and "storing XML documents on relational database" in the XML-L archives for December, 1998.

For information about other strategies for storing hierarchies such as XML in a relational database, see the topic "XML documents in relational databases" in the XML-L archives for November, 1998.

7.0 XML Database Products

For an up-to-date list of database products you can use with XML, see XML Database Products.

8.0 Additional Links

For an ongoing discussion of the issues surrounding XML and databases (native, relational, object-oriented, etc.), join the XML:DB mailing list.

The following are links to other articles and white papers discussing XML and databases:

XML Enters the DBMS Arena by Edmund X. DeJesus. A high-level discussion about XML and databases, focusing on the difference between "native" XML databases and XML add-ons to existing databases.
XMach-1: A Benchmark for XML Data Management by Timo Boehme and Erhard Rahm. A benchmark for native XML databases and XML-enabled databases.
XML Data Servers: An Infrastructure for Effectively Using XML in Electronic Commerce by Doug Barry. A white paper discussing XML, databases, and how they work together. I haven't read it, but the table of contents looks good. Cost: US$595
XML APIs for databases by Ramnivas Laddad. Describes how to implement SAX and DOM over result sets (the "table model"). Includes source code.
Modeling relational data in XML A discussion of the "table" and "data-specific objects" models.
XML representation of a relational database A complete description of the "table model" of an XML document.
Mapping Objects To Relational Databases by Scott W. Ambler. An excellent paper on object-relational mappings, which are used when the "data-specific objects" model is used with a relational database. More links on the AmbySoft O-R mapping page.

9.0 Comments and Feedback

Please send comments and feedback to Ronald Bourret at rpbourret@rpbourret.com. Please note that I travel often and may take two to three weeks to respond.

Thanks to John Cowan, Dylan Walsh, Nick Leaton, and others for their helpful comments.

Back to Ron Bourret's home page

XML and Databases

XML and Databases

Table of Contents

1.0 Introduction

2.0 Is XML a Database?

3.0 Why Use a Database?

4.0 Data versus Documents

4.1 Data-Centric Documents

4.2 Document-Centric Documents

4.3 Data, Documents, and Databases

5.0 Storing and Retrieving Data

5.1 Transferring Data

5.2 Mapping Document Structure to Database Structure

5.2.1 Template-Driven Mappings

5.2.2 Model-Driven Mappings

5.3 Data Types, Null Values, Character Sets, and All That Stuff

5.3.1 Data Types

5.3.2 Binary Data

5.3.3 Null Data

5.3.4 Character Sets

5.3.5 Processing Instructions

5.3.6 Storing Markup

5.4 Generating DTDs from Database Schema and Vice Versa

6.0 Storing and Retrieving Documents

6.1 Content Management Systems and Relational Databases

6.1.1 Out-of-the-Box Use

6.1.2 Writing Your Own System

7.0 XML Database Products

8.0 Additional Links

9.0 Comments and Feedback