Table of Contents
Qizx is a XML database engine designed for query speed. This is made possible by the underlying technology, and by a non-naive query compiler that takes advantage of indexes automatically. The result is that Qizx is really one the very fastest XML query engines available today, with nearly no need for intervention of the administrators or developers.
Nevertheless, it would be illusory to believe that the way queries are written has no influence on their execution speed. In general, however smart a compiler is, it cannot always compensate for unadapted or poorly written programs. This remark is relevant for classical programming languages like C or Java, so it is probably even more true for XQuery, which is a new and complex language and whose execution — like for any database language — is dependent on the actual data being queried.
Simply put, this chapter aims at helping you to answer this question: does my query contain some constructs that prevent Qizx from optimizing it?
Please note the following points::
A fairly good knowledge of XML Query is desirable to fully understand the ideas exposed here.
These indications are applicable to Qizx. No representations are made about other XQuery implementations. However some suggestions simply represent common sense.
The query optimizer in Qizx will certainly be improved in the course of time, therefore some recommendations can become obsolete. It is recommended to read the updated version of this document coming with a new release.
To illustrate what has been told above, let's take a simple example:
The following query produces a simple report about tests performed by a particular agent named John.
for $t in collection("/tests")/test[ agent = "John" ] return <test id="{ $t/id }">{ $t/date }</test>
The collection named /tests contains a large number of documents describing individual tests performed on some device. It is part of a XML Library built and indexed by Qizx.
The main element of each document is named test
. Each test
element has
a sub-element agent
which contains the name of the person who conducted the test
a sub-element id
which uniquely identifies the test
a sub-element date
giving the date of the test, etc.
Written as such, this query can be executed at optimal speed by Qizx.
Here are alternate ways of producing the same results, with variable efficacy:
Use a where clause:
for $t in collection("/tests")/test where $t/agent = "John" return <test id="{$t/id}">{ $t/date }</test>
This is equivalent to the first optimal query, because in fact the where clause $t/agent = "John"
is automatically transformed into a predicate [ ./agent = "John" ]
.
Iterate on documents:
let $docs as node() := collection("/tests") for $doc in $docs where $doc/test/agent = "John" return <test id="{$doc/test/id}">{ $doc/test/date }</test>
This is very inefficient, because collection("/tests")
has to be first expanded into a sequence of document nodes, then a separate query must be performed on each document. If the documents are small, this results in a dumb traversal taking no advantage of indexes. This example might look a bit contrived, but in practice such a situation can happen fairly easily.
Iterate on agent:
(: iterate on 'agent' instead of 'test' :) for $t in collection("/tests")/test/agent where $t = "John" return <test id="{$t/../id}">{ $t/../date }</test>
Almost equivalent to the first optimal query, but the expressions $t/../id
and $t/../date
are a bit slower and unlikely to be optimized in future versions.
An example where the name of an element is provided as a parameter:
for $t in collection("/tests")/*[ name() = $name and ./agent = "John" ] return <test id="{ $t/id }">{ $t/date }</test>
Assuming that $name
contains the QName
"test
", this query is equivalent to the preceding ones. However here the compiler is not able to find the optimization, so this query will execute much more slowly than the others.
By the way, in this case dynamic evaluation can solve the problem efficiently. Let's build the query as a string, then evaluate it with the function x:eval
(an extension function) :
for $t in x:eval(concat('collection("/tests")/', $name, '[ $t/agent = "John" ]')) return <test id="{$t/id}">{ $t/date }</test>
This section surveys the most important categories of expressions: Path expressions and text search.
Main advice: avoid using the function contains()
if possible.
The contains()
function is used to search any string. Using contains on a document or an element node means that the text contents of the node has to be "flattened" first (in other words, all the pieces of text contained anywhere inside the node have to be concatenated) before performing a linear text search. This can be extremely slow on a large document or collection of documents.
Recommended practices:
Use full-text functions/expressions when possible. Instead of searching any string, you generally want to look for words. The full-text functions rely on indexes and are very efficient.
If you definitely need to use contains()
— you look for specific characters — try to reduce the domain where the string is searched. For example instead of //CHAPTER[contains(. 'x^2')]
that searches for 'x^2
' in the whole CHAPTER
element, you would use //CHAPTER[contains(.//FORMULA, 'x^2')]
because you know the searched string can appear only inside a FORMULA
element contained within the chapter.
To search for a full-text expression wherever inside documents, use something like:
/*[ . ftcontains full_text_expression ]
But by all means not //*[. ftcontains
with a double-slash. This would be utterly inefficient and can even end up with a ...
]OutOfMemory
exception. In the same way avoid something like /elem/*[. ftcontains
or ...
]/elem//*[. ftcontains
...
]
Qizx generally does a good job with Path Expressions. It detects the parts of a path expression that can be optimized using indexes, compiles this parts into a fast-executing query, and evaluates the (possible) non-indexable remainder of the expression as a filter on the indexed query.
All XML elements are indexed, so using for example //CHAPTER
on a large collection is very efficient (no need to scan the entire collection).
This also applies to leaf nodes like comment()
, processing-instruction()
and text()
.
Element with simple contents matching a value.
For example in the predicate [ agent="John" ].
The match can be of several kinds:
simple equality: =
eq
order comparison: <
<=
>
>=
lt
le
gt
ge
pattern matching functions on text fn:matches()
, x:like()
and x:ulike()
.
Note that the non-equal operator !=
or ne
cannot be indexed properly.
The value of the element can be indexed in different types: string, number or date. So the test can involve numeric or date/dateTime values. See the chapter Configuring the indexing process for more details about data conversions.
Notice that only simple element contents are indexed: <operator>Jo<br/>hn</operator>
would not be matched by operator = "John"
.
Examples of predicates that use indexes in a path expression:
[id = "id234"] [./date > xs:date("2003-01-01")] [.//weight > 10.5] [x:like(name, "Al%")] [matches(name, "Al[a-z]+")]
Attributes matching a given value: similar to Element with simple contents.
Examples:
test[ @id = "id234" ] test[ @date > xs:date("2003-01-01") ] test[ @width > 10 and @width < 100 ] test[ x:like(@name, "Al%") ]
inequality comparisons (>
>=
<
<=
gt
ge
lt
le
) can take significantly more time than equality.
A range comparison like @width > 10 and @width < 100 is not recognized as such, it is preferable to use the x:in-range function which is likely to be more efficient.
A predicate like x:like(., pattern) cannot be optimized if the first argument is '.' (current element). But this use is unlikely.
Full-text predicates:
//SPEECH[ . ftcontains "romeo juliet" all words ] //SPEECH[ . ftcontains "to be or not to be" ]
See the Full-text extensions for more information.
A combination of indexable steps using the axes child::
, descendant::
and descendant-or-self::
, or the abbreviations '/
' and '//
'.
For example collection("/tests")/test[ agent = "John" ]
is fully indexed thus does not require to actually access the documents to be executed.
Note that descendant
is as efficient as child
, so specifying intermediate steps in a path will no make the query faster (rather the opposite!).
Examples:
$root/x[@x = 1]//y/z[@y <= 2] $root/X//Z[ .//agent = "John" ] $root(...)//X[ creation/@date > xs:date("2003-01-01")]/Y[props/weight > 10] $root//test[ x:like(operator[@level > 5], "Al%") ]
Here the expression $root
stands for any expression that can be used as the root of a path-expression, such as collection()
or doc()
.
A wildcard element name like in child::*
can be optimized, but only if it is followed by an indexable step:
$root/*[@x = 1] $root/*/operator
Therefore the following queries are not indexable:
$root/*/* $root//*
An explicit path like /elem1/elem2/elem3
is not more efficient than //elem3
. In fact it can be slightly less efficient.
Similarly, it is not useful to avoid using //
by writing something like /*/*/elem3
: the expression //elem3
is quite as fast.
The and
connector in predicates is properly optimized.
A range test like [ 10 < @width and @width < 100 ]
is currently not recognized as such, so it will evaluate much more slowly than if it were properly optimized. Therefore it is highly recommended to use the extension function x:in-range($item, $lower-bound, $upper-bound)
. In this example: [ x:in-range(@width, 10, 100) ]
.
Notice that the predicate [ 10 < child and child < 100 ]
is not strictly equivalent to [ x:in-range(child, 10, 100) ]
when there are several children child
elements, so it cannot in principle be replaced automatically.
The or
connector in predicates is also optimized.
A Path Expression used as a predicate is optimized. For example here is a query for a device
element that contains at least one fault
element at any depth:
$root/device[.//fault]
The not()
and empty()
functions used in predicates are now optimized (as of version 3.0), when their argument can be indexed. Examples:
$root/device[ not(empty(./fault)) ]
$root/device[ empty(fault) ]
Similarly, predicates comparing count()
on a sub-path are optimized (as of version 3.0):
$root/device[ count(./fault) >= 3 ]
$root/device[ count(./fault) = 0 ] (: similar to empty() :)
Many language features that cannot be compiled to use indexes. Qizx tries to use indexes as much as possible, and leaves the non-indexable features to the plain XQuery interpreter.
Non-indexable features:
In general, any expression, used as a predicate, which is not mentioned above is non-indexable.
Predicates containing position()
or last()
, explicitly or implicitly, like in child[2]
which is a short form for child[position() = 2]
.
Since the semantics of these functions are dependent on the evaluation context, it is nearly impossible to index their values.
Function last()
can be a performance killer. Use with care.
Axes ancestor::
, ancestor-or-self::
, parent::
or '..
', preceding::
, preceding-sibling::
, following::
, following-sibling::
.
Node tests like node()
or prefix:*
.
As suggested above, the '*
' node test (meaning any element) cannot always be optimized. It is preferable to avoid using this wildcard when possible.
Expressions not currently optimized but which are likely to be optimized in next versions:
Recognition of range test like 10 < @width and @width < 100
.
Quantified expressions some
and ...
in...
satisfies...
every
used as predicate or where clause....
in...
satisfies...
Predicates which are implicitly a join, like:
for $t in collection("/invoices")/invoice, $c in collection("/clients")/client[ @id = $t/client-id ] return ...
This is equivalent to the following query, which is more obviously an equi-join:
for $t in collection("/invoices")/invoice, $c in collection("/clients")/client where $c/@id = $t/client-id return ...