XML and Databases

来源:互联网 发布:网络问卷 编辑:程序博客网 时间:2024/04/26 01:48
2001年05月25日 13:48:00

XML and Databases

Copyright 1999, 2000 by Ronald Bourret
September, 1999 (Last updated November 2000)

Table of Contents

1.0 Introduction
2.0 Is XML a Database?
3.0 Why Use a Database?
4.0 Data versus Documents
4.1 Data-Centric Documents
4.2 Document-Centric Documents
4.3 Data, Documents, and Databases
5.0 Storing and Retrieving Data
5.1 Transferring Data
5.2 Mapping Document Structure to Database Structure
5.2.1 Template-Driven Mappings
5.2.2 Model-Driven Mappings
5.2.2.1 The Table Model
5.2.2.2 The Data-Specific Object Model
5.3 Data Types, Null Values, Character Sets, and All That Stuff
5.3.1 Data Types
5.3.2 Binary Data
5.3.3 Null Data
5.3.4 Character Sets
5.3.5 Processing Instructions
5.3.6 Storing Markup
5.4 Generating DTDs from Database Schema and Vice Versa
6.0 Storing and Retrieving Documents
6.1 Content Management Systems and Relational Databases
6.1.1 Out-of-the-Box Use
6.1.2 Writing Your Own System
7.0 XML Database Products
8.0 Additional Links
9.0 Comments and Feedback

1.0 Introduction

This paper briefly discusses the relationship between XML and databases and describes some of the types of software available to process XML documents with databases. Although it is not intended to be exhaustive, I hope that it describes some of the major issues in using XML with databases. It is somewhat biased towards relational databases simply because that is where my experience is.

2.0 Is XML a Database?

Before we start talking about XML and databases, we need to answer a question that occurs to many people: "Is XML a database?" In the strictest sense -- that is, when "XML" means an XML document -- the answer is no. Although an XML document contains data, without any additional software to help process that data, it is no more a database than any other text file.

In a more liberal sense -- that is, when "XML" means an XML document and all the surrounding XML tools and technologies -- then the answer is yes, but... It is yes because XML provides many of the things found in databases: storage (the XML document), schemas (DTDS, XML schema languages), query languages (XQL, XML-QL, QUILT, etc.), programming interfaces (SAX, DOM), and so on. It is but... because it lacks many of the things found in real databases: efficient storage, indexes, security, transactions and data integrity, multi-user access, triggers, queries across multiple documents, and so on.

So while it may be possible to use XML as a database in environments with not too much data, few users, and modest performance requirements, it will fail in most production environments, which have many users, strict data integrity requirements, and the need for good performance. Furthermore, given the low price and ease of use of databases like dBASE and Access, there seems little reason to use XML as a database even in the first case.

3.0 Why Use a Database?

The first question you need to ask yourself when you start thinking about XML and databases is why you want to use a database in the first place. Do you have legacy data you want to expose? Are you looking for a place to store your Web pages? Is the database used by an e-commerce application in which XML is used as a data transport? The answers to these questions will strongly influence your choice of database and middleware (if any), as well as how you use that database.

For example, suppose you have an e-commerce application that uses XML as a data transport. It is a good bet that your data has a highly regular structure and that things such as the entities and encodings used in XML documents are not important to you -- after all, you are interested only in the data, not how it is physically stored in the document. If your application is relatively simple, a relational database and data transfer middleware might satisfy your needs; if it is large and complex, you might want a complete application development environment that supports XML.

On the other hand, suppose you have a Web site built from a number of prose-oriented XML documents. Not only do you want to manage the site, you would like to provide a way for users to search its contents. Your documents are likely to have a highly irregular structure and things such as entity usage are probably important to you because they are a fundamental part of how your documents are structured. In this case, you might need some sort of "native XML" database that performs versioning, tracks entity usage, and supports a query language like XQL.

4.0 Data versus Documents

Perhaps the most important factor in choosing a database is whether you are using the database to store data or documents. If you want to store data, you will need a database that is tuned for data storage, such as a relational or object-oriented database, as well as middleware to transfer data between the database and XML documents. On the other hand, if you want to store documents, you will need a content management system which is designed specifically to store documents.

Although it is possible to store documents in a relational or object-oriented database yourself, you will usually find yourself duplicating the work of a content management system. Similarly, although a content management system is usually built on top of an object-oriented or hierarchical database, trying to use it as a database will probably prove to be frustrating.

Whether you need to store data or documents can often be answered by simply looking at your XML documents. The reason for this is that XML documents fall into two rough categories: data-centric and document-centric.

4.1 Data-Centric Documents

Data-centric documents are characterized by fairly regular structure, fine-grained data (that is, the smallest independent unit of data is at the level of a PCDATA-only element or an attribute), and little or no mixed content. The order in which sibling elements and PCDATA occurs is often not significant. Good examples are XML documents containing sales orders, flight schedules, restaurant menus, and so on. Data-centric documents are usually designed for machine consumption and the fact that XML is involved may be superfluous -- it is often simply the data transport.

For example, the following sales order document is data-centric:

   >Orders<      >SalesOrder SONumber="12345"<         >Customer CustNumber="543"<            >CustName/CustName<            >Street<123 Main St.>/Street<            >City/City<            >State/State<            >PostCode<60609>/PostCode<         >/Customer<         >OrderDate<981215>/OrderDate<         >Line LineNumber="1"<            >Part PartNumber="123"<               >Description<                  >p<>b/b<>br /<                  Stainless steel, one-piece construction,                  lifetime guarantee.>/p<               >/Description<               >Price<9.95>/Price<            >/Part<            >Quantity<10>/Quantity<         >/Line<         >Line LineNumber="2"<            >Part PartNumber="456"<               >Description<                  >p<>bb<>br /<                  Aluminum, one-year guarantee.>/p<               >/Description<               >Price<13.27>/Price<            >/Part<            >Quantity<5>/Quantity<         >/Line<      >/SalesOrder<   >/Orders<

Note that in the XML world, many prose-rich documents are, in fact, data-centric. For example, consider a page on Amazon.com that displays information about a book. Although the page is largely text, the structure of that text is highly regular, much of it is common to all pages describing books, and each piece of page-specific text is limited in size. Thus, the page could be built from a simple, data-centric XML document -- containing page-specific text retrieved from the database -- and an XSL stylesheet that adds the boilerplate text. In general, any Web site that dynamically constructs HTML documents today by filling a template with database data can probably be replaced by data-centric XML documents and one or more XSL stylesheets.

For example, the following lease document:

   >Lease<      >Lessee/Lessee< agrees to lease the property at      >Address<123 Main St., Chicago, IL>/Address< from >Lessor/Lessor< for a term of not less than >LeaseTerm      TimeUnit="Months"<18>/LeaseTerm< at a cost of >Price      Currency="USD" TimeUnit="Months"<1000>/Price<.   >/Lease<

could be built from the following XML document and a simple stylesheet:

   >Lease<      >Lessee/Lessee<      >Address<123 Main St., Chicago, IL>/Address<      >Lessor/Lessor<      >LeaseTerm TimeUnit="Months"<18>/LeaseTerm<      >Price Currency="USD" TimeUnit="Months"<1000>/Price<   >/Lease<

4.2 Document-Centric Documents

Document-centric documents are characterized by irregular structure, larger grained data (that is, the smallest independent unit of data might be at the level of an element with mixed content or the entire document itself), and lots of mixed content. The order in which sibling elements and PCDATA occurs is almost always significant. Good examples are books, email, advertisements, and almost any XHTML document. Document-centric documents are generally designed for human consumption.

For example, the following product description is document-centric:

   >Product<   >Name/Name<   >Developer/Developer<   >Summary/Summary<   >Description<   >Para/Para<      >Para/Para<      >List<   >Item<>Link URL="Order.html"/Link<>/Item<   >Item<>Link URL="Wrenches.htm"/Link<>/Item<   >Item<>Link URL="catalog.zip"/Link<>/Item<   >/List<      >Para/Para<      >/Description<   

4.3 Data, Documents, and Databases

In reality, the distinction between data-centric and document-centric documents is not always clear. For example, an otherwise data-centric document, such as an invoice, might contain large-grained, irregularly structured data, such as a part description. And an otherwise document-centric document, such as a user's manual, might contain fine-grained, regularly structured data (often metadata), such as an author's name and a revision date. In spite of this, characterizing your documents as data-centric or document-centric will help you decide whether you are interested in data or documents, which in turn dictates the kind of system you need.

To store and/or retrieve data, you can use a database (usually relational, object-oriented, or hierarchical) and middleware (either built-in or third party), or you can use an XML server (a platform for building distributed applications, such as e-commerce, that use XML for data transfer) or XML-enabled Web server (a Web server that can build XML documents from data in a database). To store documents, you will need a content management system or persistent DOM implementation. The issues in using each type of system are discussed in sections 5.0, "Storing and Retrieving Data", and 6.0, "Storing and Retrieving Documents". You can find a list of available software in XML Database Products.

5.0 Storing and Retrieving Data

Data of the kind that is found in data-centric documents can originate either in the database (in which case you want to expose it as XML) or in an XML document (in which case you want to store it in a database). An example of the former is the vast amount of legacy data stored in relational databases; an example of the latter is data exposed on the Web as XML that you want to store in your database for further processing. Thus, depending on your needs, you may need software that transfers data from an XML document to the database, from the database to an XML document, or both.

5.1 Transferring Data

When storing data in the database, it is often acceptable to discard much of the information about a document, such as its name and DTD, as well as its physical structure, such as entity definition and usage, the order in which attribute values and sibling elements occur, the way in which binary data is stored (Base64 v. unparsed entity v. something else), CDATA sections, and encoding information. Similarly, when retrieving data from the database, the resulting XML document is likely to contain no CDATA or entity usage (other than the predefined entities lt, gt, amp, apos, and quot) and the order in which sibling elements and attributes appear is likely to be the order in which the data was returned by the database.

Although this may seem shocking at first, it is often quite reasonable. For example, consider the case in which XML is used as a data format for transferring a sales order from one database to another. In this case, it really doesn't matter whether the sales order number is stored in the document before or after the sales order date, nor does it matter if the customer's name is stored in a CDATA section, an external entity, or directly as PCDATA. All that is important is that the relevant data is transferred from the first database to the second. Thus, the data transfer software needs to consider hierarchical order (which groups the information about a single sales order) and little else.

One consequence of ignoring information about the document and its physical structure is that "round-tripping" a document -- that is, storing the data from a document in the database and then reconstructing the document from that data -- often results in a different document, even in the canonical sense of the term. Whether this is acceptable depends on your needs and might influence your choice of database and data transfer middleware.

5.2 Mapping Document Structure to Database Structure

In order to transfer data between an XML document and a database, it is necessary to map document structure to database structure and vice versa. Such mappings fall into two general categories: template-driven and model-driven.

5.2.1 Template-Driven Mappings

In a template-driven mapping, there is no predefined mapping between document structure and database structure. Instead, you embed commands in a template that is processed by the data transfer middleware. For example, consider the following template (not used by any real product), in which SELECT statements are embedded in >SelectStmt< elements:

   >?xml version="1.0"?<   >FlightInfo<      >Intro/Intro<      >SelectStmt