The Michigan Benchmark

Authors: Kanda Runapongsa, Jignesh M. Patel, H. V. Jagadish, Yun Chen,
and Shurug Al-Khalifa

Department of Electrical Engineering and Computer Science
The University of Michigan, Ann Arbor, MI 48109, USA
{krunapon, jignesh, jag, yunc, shurug}@eecs.umich.edu

Abstract

We propose a micro-benchmark for XML data management to aid engineers in designing improved XML processing engines. This benchmark is inherently different from application-level benchmarks, which are designed to help users choose between alternative products. We primarily attempt to capture the rich variety of data structures and distributions possible in XML, and to isolate their effects, without imitating any particular application. The benchmark specifies a single data set against which carefully specified queries can be used to evaluate system performance for XML data with various characteristics.

We have used the benchmark to analyze the performance of three database systems: two native XML DBMSs and a commercial ORDBMS. The benchmark reveals key strengths and weaknesses of these systems. We find that commercial relational techniques are effective for XML query processing in many cases, but are sensitive to query rewriting, and require better support for efficiently determining indirect structural containment.

Contents

  1. Introduction
  2. Related Work
  3. Benchmark Data Set
  4. Benchmark Queries
  5. The Benchmark in Action
  6. Conclusions

1. Introduction

XML query processing has taken on considerable importance recently, and several XML databases have been constructed on a variety of platforms. There has naturally been an interest in benchmarking the performance of these systems, and a number of benchmarks have been proposed [6, 19, 21]. The focus of currently proposed benchmarks is to assess the performance of a given XML database in performing a variety of representative tasks. Such benchmarks are valuable to potential users of a database system in providing an indication of the performance that the user can expect on their specific application. The challenge is to devise benchmarks that are sufficiently representative of the requirements of "most" users. The TPC series of benchmarks accomplished this, with reasonable success, for relational database systems. However, no benchmark has been successful in the realm of ORDBMSs and OODBMSs, which have extensibility and user-defined functions that lead to great heterogeneity in the nature of their use. It is too soon to say whether any of the current XML benchmarks will be successful in this respect -- we certainly hope that they will be.

One aspect that current XML benchmarks do not focus on is the performance of the basic query evaluation operations such as selections, joins, and aggregations. A "micro-benchmark" that highlights the performance of these basic operations can be very helpful to a database developer in understanding and evaluating alternatives for implementing these basic operations. A number of questions related to performance may need to be answered: What are the strengths and weaknesses of specific access methods? Which areas should the developer focus attention on? On what basis should one choose between two alternative implementations? Questions of this nature are central to well-engineered systems. Application-level benchmarks, by their nature, are unable to deal with these important issues in detail. For relational systems, the Wisconsin benchmark [11] provided the database community with an invaluable engineering tool to assess the performance of individual operators and access methods. The work presented in this paper is inspired by the simplicity and the effectiveness of the Wisconsin benchmark for measuring and understanding the performance of relational DBMSs. The goal of this paper is to develop a comparable benchmark for XML DBMSs. The benchmark that we propose to achieve this goal is called the Michigan benchmark.

A challenging issue in designing any benchmark is the choice of the benchmark's data set. If the data is specified to represent a particular "real application", it is likely to be quite uncharacteristic for other applications with different data characteristics. Thus, holistic benchmarks can succeed only if they are able to find a real application with data characteristics that are reasonably representative for a large class of different applications.

For a micro-benchmark, the challenges are different. The benchmark data set must be complex enough to incorporate data characteristics that are likely to have an impact on the performance of query operations. However, at the same time, the benchmark data set must be simple so that it is not only easy to pose and understand queries against the data set, but also easy to pinpoint the component of the system that is performing poorly. We attempt to achieve this balance by using a data set that has a simple schema but carefully orchestrated structure. In addition, random number generators are used sparingly in generating the benchmark's data set. The Michigan benchmark uses random generators for only two attribute values, and derives all other data parameters from these two generated values. Furthermore, as in the Wisconsin benchmark, we use appropriate attribute names to reflect the domain and distribution of the attribute values.

When designing benchmark data sets for relational systems, the primary data characteristics that are of interest are the distribution and domain of the attribute values and the cardinality of the relations. Moreover, there may be a few additional secondary characteristics, such as clustering and tuple/attribute size. In XML databases, besides the distribution and domain of attribute values and cardinality, there are several other characteristics, such as tree fanout and tree depth, that are related to the structure of XML documents and contribute to the rich structure of XML data. An XML benchmark must incorporate these additional features into the benchmark data and query set design. The Michigan benchmark achieves this by using a data set that incorporates these characteristics without introducing unnecessary complexity into the data set generation, and by carefully designing the benchmark queries that test the impact of these characteristics on individual query operations.

The main contributions of this paper are:

2. Related Work

Several generators for synthetic XML data have been proposed [1, 5]. Aboulnaga et al. [1] proposed a data generator that accepts as many as 20 parameters to allow a user to control the properties of the generated data. Such a large number of parameters adds a level of complexity that may interfere with the ease of use of a data generator. Furthermore, this data generator does not make available the schema of the data, which some systems could exploit. Most recently, Barbosa et al. [5] proposed a template-based data generator for XML, which can generate multiple tunable data sets. In contrast to these previous data generators, the data generator in this proposed benchmark produces an XML data set designed to test different XML data characteristics that may affect the performance of XML engines. In addition, the data generator requires only a few parameters to vary the scale of the data set. The schema of the data set is also made available for systems to exploit.

Four benchmarks [6, 19, 21, 27] have been proposed for evaluating the performance of XML data management systems. XMach-1 [6] and XMark [21] generate XML data that models data from particular Internet applications. In XMach-1 [6], the data is based on a web application that consists of text documents, schemaless data, and structured data. In XMark [21], the data is based on an Internet auction application that consists of relatively structured and data-oriented parts. XOO7 [19] is an XML version of the OO7 benchmark [17], which is a benchmark for OODBMSs. The OO7 schema and instances are mapped into a Document Type Definition (DTD), and the eight OO7 queries are translated into the query languages of three query processing engines: Lore [14, 18], Kweelt [20], and an ORDBMS. Recognizing that different applications require different benchmarks, Yao et al. [27] have recently proposed XBench, a family of application-level benchmarks.

While each of these benchmarks provides an excellent measure of how a test system would perform against data and queries in its targeted XML application, it is difficult to extrapolate the results to data sets and queries that differ from those in the targeted domain. Although the queries in these benchmarks are designed to test different performance aspects of XML engines, they cannot be used to observe how system performance changes as the XML data characteristics change. In contrast, our benchmark provides distinct queries to analyze system performance with respect to different XML data characteristics, such as tree fanout and tree depth, and different query characteristics, such as predicate selectivity.

Finally, we note that [2] presents desiderata for an XML database benchmark, identifies key components and operations, and enumerates ten challenges that the XML benchmark should address. The central focus of this work is application-level benchmarks, rather than micro-benchmarks of the sort we propose.

3. Benchmark Data Set

In this section, we first discuss the characteristics of XML data sets that can have a significant impact on the performance of query operations. Then, we present the schema and the generation algorithm for the benchmark data.

3.1 A Discussion of the Data Characteristics

In a relational paradigm, the primary data characteristics are the selectivity of attributes (important for simple selection operations) and the join selectivity (important for join operations). In an XML paradigm, there are several complicating characteristics to consider as discussed in Section 3.1.1 and Section 3.1.2.

3.1.1 Depth and Fanout

Depth and fanout are two structural parameters important to tree-structured data. The depth of the data tree can have a significant performance impact, for instance, when computing indirect containment relationships between ancestor and descendant nodes in the tree. Similarly, the fanout of the tree node can affect the way in which the DBMS stores the data, and answers queries that are based on selecting children in a specific order (for example, selecting the last child).

One potential way of evaluating the impact of fanout and depth is to generate a number of distinct data sets with different values for each of these parameters and then run queries against each data set. The drawback of this approach is that the large number of data sets makes the benchmark harder to run and understand. Instead, our approach is to fold these into a single data set.

We create a base benchmark data set of a depth of 16. Then, using a "level" attribute, we can restrict the scope of a query to data sets of a certain depth, thereby quantifying the impact of the depth of the data tree. Similarly, we specify high (13) and low (2) fanouts at different levels of the tree, as shown in Figure 1. The fanout of 1/13 at level 8 means that every thirteenth node at this level has a single child, and all other nodes are childless leaves. This variation in fanout is designed to permit queries that focus on isolating the fanout factor. For instance, the number of nodes is the same (2,704) at levels 7 and 9. Nodes at level 7 have a fanout of 13, whereas nodes at level 9 have a fanout of 2. A pair of queries, one against each of these two levels, can be used to isolate the impact of fanout. In the rightmost column of Figure 1, "% of Nodes" is the percentage of the nodes at each level relative to the total number of nodes in the document.
Level   Fanout   Nodes      % of Nodes
1       2        1          0.0
2       2        2          0.0
3       2        4          0.0
4       2        8          0.0
5       13       16         0.0
6       13       208        0.0
7       13       2,704      0.4
8       1/13     35,152     4.8
9       2        2,704      0.4
10      2        5,408      0.7
11      2        10,816     1.5
12      2        21,632     3.0
13      2        43,264     6.0
14      2        86,528     11.9
15      2        173,056    23.8
16      -        346,112    47.6
Figure 1: Distribution of the Nodes in the Base Data Set
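The distribution in Figure 1 follows directly from the per-level fanouts. The following Python sketch (our own illustration, not part of the benchmark distribution) reproduces the node counts and percentages of the base data set; a fractional fanout such as 1/13 simply means that the next level has one thirteenth as many nodes.

from fractions import Fraction

# Fanout of each level of the base data set (DSx1); level 16 nodes are leaves.
fanouts = [2, 2, 2, 2, 13, 13, 13, Fraction(1, 13), 2, 2, 2, 2, 2, 2, 2, None]

nodes = [1]                                  # a single root node at level 1
for f in fanouts[:-1]:
    nodes.append(int(nodes[-1] * f))         # nodes(l+1) = nodes(l) * fanout(l)

total = sum(nodes)                           # 727,615 nodes, i.e. roughly 728K
for level, (f, n) in enumerate(zip(fanouts, nodes), start=1):
    fanout = "-" if f is None else str(f)
    print(f"{level:5d} {fanout:>6} {n:>9,} {100 * n / total:10.1f}")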

3.1.2 Data Set Granularity

To keep the benchmark simple, we choose a single large document tree as the default data set. If it is important to understand the effect of document granularity, one can modify the benchmark data set to treat each node at a given level as the root of a distinct document. One can compare the performance of queries on this modified data set against queries on the original data set.

3.1.3 Scaling

A good benchmark needs to be able to scale in order to measure the performance of databases on a variety of platforms. In the relational model, scaling a benchmark data set is easy -- we simply increase the number of tuples. However, with XML, there are many scaling options, such as increasing the number of nodes, the depth, or the fanout. We would like to isolate the effect of the number of nodes from effects due to other structural changes, such as depth and fanout. We achieve this by keeping the tree depth constant for all scaled versions of the data set and changing the fanout of nodes at only a few levels, namely levels 5-8. In the design of the benchmark data set, we deliberately keep the fanout of the bottom few levels of the tree constant. This design implies that the percentage of nodes in the lower levels of the tree (levels 9-16) is nearly constant across all the data sets. This allows us to easily express queries that focus on a specified percentage of the total number of nodes in the database. For example, to select approximately 1/16 of all the nodes, irrespective of the scale factor, we use the predicate aLevel = 13.

We propose to scale the Michigan benchmark in discrete steps. The default data set, called DSx1, has 728K nodes, arranged in a tree of depth 16 and a fanout of 2 for all levels except levels 5, 6, 7, and 8, which have fanouts of 13, 13, 13, and 1/13, respectively. From this data set we generate two additional "scaled-up" data sets, called DSx10 and DSx100, such that the numbers of nodes in these data sets are approximately 10 and 100 times the number of nodes in the base data set, respectively. We achieve this scaling by varying the fanout of the nodes at levels 5-8. For the data set DSx10, levels 5-7 have a fanout of 39, whereas level 8 has a fanout of 1/39. For the data set DSx100, levels 5-7 have a fanout of 111, whereas level 8 has a fanout of 1/111. The total numbers of nodes in the data sets DSx10 and DSx100 are 7,180K and 72,351K, respectively (this translates into scale factors of 9.9x and 99.4x).
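Because only the fanouts of levels 5-8 change across scale factors, the sizes of the scaled data sets can be checked with the same recurrence used for Figure 1. A minimal sketch (the helper name is ours):

from fractions import Fraction

def total_nodes(mid_fanout):
    # Levels 1-4 and 9-15 have a fanout of 2, levels 5-7 have a fanout of
    # mid_fanout, level 8 has a fanout of 1/mid_fanout, and level 16 is a leaf level.
    fanouts = [2] * 4 + [mid_fanout] * 3 + [Fraction(1, mid_fanout)] + [2] * 7
    count, total = 1, 1                      # the root node at level 1
    for f in fanouts:
        count = int(count * f)
        total += count
    return total

base = total_nodes(13)                       # DSx1: 727,615 nodes
for name, fanout in [("DSx10", 39), ("DSx100", 111)]:
    n = total_nodes(fanout)
    print(name, f"{n:,}", f"{n / base:.1f}x")   # roughly 7,180K (9.9x) and 72,351K (99.4x)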

3.2 Schema of Benchmark Data

The construction of the benchmark data is centered around the element type BaseType. Each BaseType element has the following attributes:

  1. aUnique1: A unique integer generated by traversing the entire data tree in a breadth-first manner. This attribute also serves as the element identifier.
  2. aUnique2: A unique integer generated randomly.
  3. aLevel: An integer set to the level of the node.
  4. aFour: An integer set to aUnique2 mod 4.
  5. aSixteen: An integer set to (aUnique1 + aUnique2) mod 16. This attribute is generated using both of the unique attributes to avoid a correlation between the value of this attribute and other derived attributes.
  6. aSixtyFour: An integer set to aUnique2 mod 64.
  7. aString: A string approximately 32 bytes in length.

The content of each BaseType element is a long string that is approximately 512 bytes in length. The generation of the element content and the string attribute aString is described in Section 3.3.

In addition to the attributes listed above, each BaseType element has two sets of subelements. The first is of type BaseType. The number of repetitions of this subelement is determined by the fanout of the parent element, as described in Figure 1. The second subelement is an OccasionalType, and can occur either 0 or 1 times. The presence of the OccasionalType element is determined by the value of the attribute aSixtyFour of the parent element: a BaseType element has a nested (leaf) element of type OccasionalType if the aSixtyFour attribute has the value 0. An OccasionalType element has content that is identical to the content of its parent but has only one attribute, aRef. The OccasionalType element refers to the BaseType node whose aUnique1 value equals the parent's aUnique1 - 11 (the reference is achieved by assigning this value to the aRef attribute). In the case where no BaseType element has the parent's aUnique1 - 11 value (e.g., the top few nodes in the tree), the OccasionalType element refers to the root node of the tree. The XML Schema specification of the benchmark data is shown in Figure 2.

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
	 targetNamespace="http://www.eecs.umich.edu/db/mbench/bm.xsd"
	 xmlns="http://www.eecs.umich.edu/db/mbench/bm.xsd"
	 elementFormDefault="qualified">
	<xsd:complexType name="BaseType" mixed="true">
		<xsd:sequence>
			<xsd:element name="eNest" type="BaseType" maxOccurs="unbounded">
				<xsd:key name="aU1PK">
					<xsd:selector xpath=".//eNest"/>
					<xsd:field xpath="@aUnique1"/>
				</xsd:key>
				<xsd:unique name="aU2">
					<xsd:selector xpath=".//eNest"/>
					<xsd:field xpath="@aUnique2"/>
				</xsd:unique>
			</xsd:element>
			<xsd:element name="eOccasional" type="OccasionalType" minOccurs="0">
				<xsd:keyref name="aU1FK" refer="aU1PK">
				<xsd:selector xpath="../eOccasional"/>
				<xsd:field xpath="@aRef"/>
				</xsd:keyref>
			</xsd:element>
		</xsd:sequence>
		<xsd:attributeGroup ref="BaseTypeAttrs"/>
	</xsd:complexType>
	<xsd:complexType name="OccassionalType">
		<xsd:simpleContent>
			<xsd:extension base="xsd:string">
				<xsd:attribute name="aRef" type="xsd:integer" use="required"/>
			</xsd:extension>
		</xsd:simpleContent>
	</xsd:complexType>
	<xsd:attributeGroup name="BaseTypeAttrs">
		<xsd:attribute name="aUnique1" type="xsd:integer" use="required"/>
		<xsd:attribute name="aUnique2" type="xsd:integer" use="required"/>
		<xsd:attribute name="aLevel" type="xsd:integer" use="required"/>
		<xsd:attribute name="aFour" type="xsd:integer" use="required"/>
		<xsd:attribute name="aSixteen" type="xsd:integer" use="required"/>
		<xsd:attribute name="aSixtyFour" type="xsd:integer" use="required"/>
		<xsd:attribute name="aString" type="xsd:string" use="required"/ >
	</xsd:attributeGroup>
</xsd:schema>
Figure 2: Benchmark Specification in XML Schema
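As an illustration of the derivation rules above, the following Python sketch generates the attributes of a single eNest element and the optional eOccasional reference. It is our own sketch, not the benchmark's reference generator; the function names and the root identifier are assumptions.

import random

def make_base_attrs(aUnique1, level, rng=random):
    # aUnique2 stands in for a randomly generated unique integer; the real
    # generator guarantees uniqueness across the whole document.
    aUnique2 = rng.randrange(1_000_000_000)
    return {
        "aUnique1": aUnique1,                      # breadth-first position, also the element id
        "aUnique2": aUnique2,
        "aLevel": level,
        "aFour": aUnique2 % 4,
        "aSixteen": (aUnique1 + aUnique2) % 16,    # uses both values to avoid correlation
        "aSixtyFour": aUnique2 % 64,
    }

def occasional_ref(attrs, root_id=1):
    # An eOccasional child exists only when aSixtyFour == 0; its aRef points to
    # the node whose aUnique1 equals the parent's aUnique1 - 11, or to the root
    # when no such node exists (root_id = 1 is an assumption).
    if attrs["aSixtyFour"] != 0:
        return None
    ref = attrs["aUnique1"] - 11
    return ref if ref >= root_id else root_id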

3.3 Generating the String Attributes and Element Content

The element content of each BaseType element is a long string. Since this string is meant to simulate a piece of text in a natural language, it is not appropriate to generate this string from a uniform distribution. Selecting pieces of text from real sources, however, involves many difficulties, such as how to maintain roughly constant size for each string, how to avoid idiosyncrasies associated with the specific source, and how to generate more strings as required for a scaled benchmark. Moreover, we would like to have benchmark results applicable to a wide variety of languages and domain vocabularies.

To obtain string values with a distribution similar to that of natural language text, we generate these long strings synthetically, in a carefully stylized manner. We begin by creating a pool of 2^16 (over sixty thousand) synthetic words. The words are divided into 16 buckets, with exponentially growing bucket occupancy: the first bucket has only one word, the second has two words, the third has four words, and so on, so that bucket i contains 2^(i-1) words. The words are not meaningful in any language, but simply encode the bucket from which they are drawn and the word number within the bucket. For example, "15twentynineB14" indicates that this is the 1,529th word from the fourteenth bucket. To keep the size of the vocabulary in the last bucket at roughly 30,000 words, words in the last bucket are derived from words in the other buckets by adding the suffix "ing" (to get exactly 2^15 words in the sixteenth bucket, we add the dummy word "oneB0ing").

The value of the long string is generated from the template shown in Figure 3, where "PickWord" is a placeholder for a word picked from the word pool described above. To pick a word for "PickWord", a bucket is chosen, with each bucket equally likely, and then a word is picked from the chosen bucket, with each word equally likely. Thus, we obtain a discrete Zipf distribution with parameter roughly 1. We use the Zipf distribution since it seems to reflect word occurrence probabilities accurately in a wide variety of situations. The value of the aString attribute is simply the first line of the long string that is stored as the element content.

Sing a song of PickWord,
A pocket full of PickWord
Four and twenty PickWord
All baked in a PickWord.


When the PickWord was opened,
The PickWord began to sing;
Wasn't that a dainty PickWord
To set before the PickWord?


The King was in his PickWord,
Counting out his PickWord;
The Queen was in the PickWord
Eating bread and PickWord.


The maid was in the PickWord
Hanging out the PickWord;
When down came a PickWord,
And snipped off her PickWord!

Figure 3: Generation of the String Element Content
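The word pool and the PickWord substitution can be sketched as follows. This is our simplified illustration: the spelled-out word naming of the paper is replaced by a compact scheme, the dummy padding word is omitted, and only the first stanza of the Figure 3 template is shown.

import random

# Bucket i (1-based) holds 2^(i-1) synthetic words, so small buckets are far
# more likely to be drawn from than large ones once a bucket is chosen uniformly.
buckets = [[f"word{j}B{i}" for j in range(2 ** (i - 1))] for i in range(1, 17)]

def pick_word(rng=random):
    # Choosing a bucket uniformly and then a word uniformly within the bucket
    # yields an approximately Zipfian distribution over the whole pool.
    return rng.choice(rng.choice(buckets))

def element_content(rng=random):
    template = ("Sing a song of {}, A pocket full of {} "
                "Four and twenty {} All baked in a {}.")
    return template.format(*(pick_word(rng) for _ in range(4)))

print(element_content())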

Through the above procedures, we now have a data set whose structure facilitates the study of the impact of data characteristics on system performance, and whose element and attribute content simulates a piece of text in a natural language.

4. Benchmark Queries

In creating the data set above, we make it possible to tease apart data with different characteristics, and to issue queries with well-controlled yet vastly differing data access patterns. We are more interested in evaluating the cost of individual pieces of core query functionality than in evaluating the composite performance of application-level queries. Knowing the costs of individual basic operations, we can estimate the cost of any complex query by adding up the relevant piecewise costs (keeping in mind the pipelined nature of evaluation, and the changes in sizes of intermediate results when operators are pipelined).

We find it useful to refer to queries as "selection queries", "join queries" and the like, to clearly indicate the functionality of each query. A complex query that involves many of these simple operations can take time that varies monotonically with the time required for these simple components.

In the following subsections, we describe each of these different types of queries in detail. In these queries, the types of the nodes are assumed to be BaseType unless specified otherwise.

4.1 Selection

Relational selection identifies the tuples that satisfy a given predicate over its attributes. XML selection is both more complex and more important because of the tree structure. Consider a query, against a popular bibliographic database, that seeks books, published in the year 2002, by an author with a name including the string "Blake". This apparently straightforward selection query involves matching a 4-node "query pattern" against the database, with predicates associated with each of the four nodes (namely book, year, author, and name). Once a match has been found for this pattern, we may be interested in returning only the book element, all the nodes that participated in the match, or various other possibilities. We attempt to organize the various sources of complexity in the following subsections.

4.1.1 Returned Structure

In a relation, once a tuple is selected, the tuple is returned. In XML, as we saw in the example above, once an element is selected, one may return the element, as well as some structure related to the element, such as the sub-tree rooted at the element. Query performance can be significantly affected by how the data is stored and when the returned result is materialized.

To understand the role of the returned structure in query performance, we use the query, "Select all elements with aSixtyFour = 2." The selectivity of this query is 1/64 (1.6%). (Note that details about the computation of the selectivities of these queries can be found in Query Selectivity Computation.) This query is run in the following cases:

The remaining queries in the benchmark simply return the unique identifier attributes of the selected nodes (aUnique1 for eNest and aRef for eOccasional), except when explicitly specified otherwise. This design choice ensures that the cost of producing the final result is a small portion of the query execution cost.

4.1.2 Simple Selection

Even XML queries involving only one element and a few predicates can show considerable diversity. We examine the effect of simple selection predicates in this set of queries.