Qizx uses indexes to greatly increase the speed of queries over XML Libraries.
By default, Qizx indexes most of the information available in XML documents: elements, attributes, other nodes, and full-text. This is done automatically, so in most cases the database administrator does not need to specify indexes explicitly.
In most other XML database engines, obtaining optimal, or even decent, query speed requires defining specific indexes manually. Moreover, when such a system is in production and new queries need to be optimized, new indexes must be added, which means reindexing the whole database. This is error-prone, time-consuming and costly.
You need to read this chapter only if you want to enhance or customize the indexing used by default in Qizx.
Qizx supports customization through an Indexing Specification associated with an XML Library. An Indexing Specification lets you:
Modify or extend the conversions performed on the values of attributes and simple elements.
Qizx automatically recognizes and converts numeric and date values in attributes and simple elements, so that queries using those data types can be boosted by indexes, for example:
//item[ weight = 10 ]
//event[ @date >= xs:date("2007-12-31") ]
This mechanism actually extends the XQuery language, since it allows number and date comparisons even if the values in documents do not conform to the syntax of XML Schema types. For example, with suitable indexing, the queries above could respectively match:
<item><weight>10.0Kg</weight>...</item>
and:
<event date="12/31/2007">...</event>
The default conversions are compatible with the standard XQuery semantics.
This extension can currently be used only on documents stored in an XML Library. It is therefore not available in Qizx/open.
The default conversions can be tuned, extended or suppressed; specific conversions can be added for specific contexts; and custom converters can be plugged into Qizx.
Suppress full-text indexing where it is not needed.
From v3.0, other full-text customization is achieved through the API.
Tune miscellaneous parameters, e.g. the maximum length of indexed element values.
The next section explains indexing in Qizx in more detail; the following section explains how to configure this indexing.
This section explains how indexing works in Qizx: which indexes are built, what the default rules and conversions are, and what an Indexing Specification is.
Qizx creates and exploits the following indexes:
Given an element name, this index returns all XML elements in all documents of a Library that bear this name. It also contains information about structural relationships (child/descendant).
Given an attribute name and a value, this index returns all elements that have an attribute with this name and value.
There are three types of attribute indexes, according to the type of the attribute value: text, numeric and date/time.
When indexing, Qizx attempts to convert textual attribute values into numeric or date values by successively applying converters called "Sieves". These objects are pluggable and can be redefined, as explained in the following sections.
By default, all attribute values are indexed as raw strings.
If the value can be converted to a double number, then it is added to the numeric attribute index.
If it can be converted to a date or date-time value, then it is added to the date/time attribute index.
For example, in an element instance like <elem num="12.0" date="12/31/2004"/>, the attribute num is added to the numeric index and the attribute date is added to the date/time index. The element elem can thus be found by a query like elem[@num=12] or elem[@date=xs:date("2004-12-31")] (notice the non-string values in the queries).
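The attribute cascade described above can be sketched in plain Java. This is only an illustration, not Qizx code: the class AttributeClassifier, its Target enum and the two date patterns are assumptions made for the example, and Double.parseDouble merely stands in for the real numeric Sieve.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.EnumSet;
import java.util.Locale;

// Hypothetical illustration; none of these names belong to the Qizx API.
class AttributeClassifier {
    enum Target { STRING, NUMERIC, DATE }

    // Assumed date patterns for this sketch: ISO dates, then US-style dates.
    static final String[] DATE_PATTERNS = { "yyyy-MM-dd", "MM/dd/yyyy" };

    // Every value goes to the string index; a numeric or date target is
    // added when the corresponding conversion succeeds.
    static EnumSet<Target> classify(String value) {
        EnumSet<Target> targets = EnumSet.of(Target.STRING);
        if (isDouble(value)) {
            targets.add(Target.NUMERIC);
        } else if (isDate(value)) {
            targets.add(Target.DATE);
        }
        return targets;
    }

    // Double.parseDouble is a stand-in for the numeric Sieve; it is more
    // permissive than a real Sieve might be (it accepts "1e3" or "NaN").
    static boolean isDouble(String value) {
        try {
            Double.parseDouble(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // A value counts as a date only if one pattern consumes the whole string.
    static boolean isDate(String value) {
        for (String pattern : DATE_PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            format.setLenient(false);
            ParsePosition position = new ParsePosition(0);
            if (format.parse(value, position) != null
                    && position.getIndex() == value.length()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(classify("12.0"));
        System.out.println(classify("12/31/2004"));
        System.out.println(classify("hello"));
    }
}
```

With these assumptions, the value of num in the example is classified as string plus numeric, and the value of date as string plus date.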
Given an element name and a value, this index returns all elements that have a simple content corresponding to this value.
Note: "simple content" is a sequence of characters that appears as the only content of an element (in contrast with "mixed content"). For example, <e>1234</e> is an element with simple content.
By default, such content is indexed if it is recognized as a "token", i.e. some text without whitespace. For example, the content of <e>1234</e> is indexed as simple content but the content of <p>this is a paragraph</p> is not (nevertheless, the words inside element p are put into the full-text index).
When recognized as a token, a simple content is indexed much in the same way as an attribute: numeric or date/time values are detected and added respectively to the simple-content numeric index and the simple-content date/time index. The default date pattern is the same as for attributes.
Given a word, this index returns all the text nodes that contain an occurrence of this word.
Words are extracted from element contents using a "Word Sieve", which also normalizes the words (for example, it removes accents and converts the word to lowercase).
The Word Sieve is also used when parsing full-text queries. Consequently there can be only one word sieve per Library.
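The kind of normalization a Word Sieve performs can be sketched with the standard java.text.Normalizer class. The class SimpleWordSieve below is hypothetical and is not the Qizx implementation; it only shows why the same normalization has to be applied to documents and to queries so that both sides produce the same words.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of word extraction and normalization.
class SimpleWordSieve {
    static List<String> words(String text) {
        // Decompose accented letters, drop the combining marks, lowercase.
        String stripped = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
        List<String> words = new ArrayList<>();
        // Split on anything that is neither a letter nor a digit.
        for (String candidate : stripped.split("[^\\p{L}\\p{N}]+")) {
            if (!candidate.isEmpty()) {
                words.add(candidate);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // \u00C9 is an accented capital E, \u00E9 an accented e, \u00E0 an accented a.
        List<String> fromDocument = words("\u00C9t\u00E9 \u00E0 Paris");
        List<String> fromQuery = words("ete paris");
        // A query word can only match if both sides were normalized alike.
        System.out.println(fromDocument.containsAll(fromQuery));
    }
}
```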
An Indexing Specification is associated with an XML Library. It applies to all documents of the XML Library.
As a consequence, if two documents have incompatible indexing requirements, they have to be stored in two different Libraries. However, this is unlikely because Indexing Specifications allow fairly fine tuning.
Generally speaking, an Indexing Specification can contain:
Values of general parameters.
Specifications for full-text indexing.
Rules for recognizing numeric and date values in element content and attributes.
An indexing specification is an XML document.
The root element is named indexing.
The indexing element bears attributes defining global properties.
It contains a list of rules applicable to elements or to attributes.
Example:
<indexing word-min='1' word-max='30' string-max='50'
          xmlns:my="http://www.acme.com/ns/my">
  <!-- Rules for all elements. -->
  <element as="numeric+string" />
  <element as="date" sieve="FormatDateSieve" format="yyyy-MM-dd" />
  <element as="string" />
  <!-- A specific rule for element NumData:
       disable full-text indexing inside this element. -->
  <element name="NumData" full-text="false" />
  <!-- Rules for all attributes. -->
  <attribute />
  <attribute as="numeric+string" />
  <attribute as="date+string" />
  <attribute as="string" />
  <!-- A specific rule for attribute my:Date of element my:Invoice:
       its format is a localized date. -->
  <attribute name="my:Date" context="my:Invoice" as="date"
             sieve="FormatDateSieve" format="MM/dd/yyyy" />
</indexing>
These properties apply globally to a specification. They appear as attributes of the top element indexing:
string-max
An integer value specifying the maximum length of a String key (default is 50). An element content or attribute value longer than this value is not indexed.
The purpose is to avoid cluttering the indexes with useless long values (like a complete paragraph).
word-max
An integer value specifying the maximum length of a word in the full-text index (default is 30).
The purpose is to prevent long strings without whitespace from being treated as words if they are never to be searched in full-text mode.
word-min
An integer value specifying the minimum length of a word in the full-text index (default is 2).
This is a simple way of supporting "stop words".
word-sieve
Deprecated. From version 3.0, Text Tokenizers replace word sieves and are plugged through the Java API only. See section "Custom Sieves".
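The effect of the three length parameters can be summarized by a small filter. The class IndexLimits is a hypothetical sketch written for this explanation; it assumes, as the descriptions above suggest, that a value exactly as long as the limit is still accepted.

```java
// Hypothetical illustration of string-max, word-min and word-max.
class IndexLimits {
    final int stringMax;
    final int wordMin;
    final int wordMax;

    IndexLimits(int stringMax, int wordMin, int wordMax) {
        this.stringMax = stringMax;
        this.wordMin = wordMin;
        this.wordMax = wordMax;
    }

    // A string key longer than string-max is not indexed.
    boolean acceptsStringKey(String value) {
        return value.length() <= stringMax;
    }

    // A word is put into the full-text index only if its length lies
    // between word-min and word-max (both bounds assumed inclusive).
    boolean acceptsWord(String word) {
        return word.length() >= wordMin && word.length() <= wordMax;
    }

    public static void main(String[] args) {
        // Defaults quoted in the text: string-max=50, word-min=2, word-max=30.
        IndexLimits defaults = new IndexLimits(50, 2, 30);
        System.out.println(defaults.acceptsWord("a"));      // shorter than word-min
        System.out.println(defaults.acceptsWord("index"));
        System.out.println(defaults.acceptsStringKey("AB-1234"));
    }
}
```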
How rules work:
Each rule defines a conversion method from a text value (contained in a simple element, or in an attribute) to data types like number (double floating-point) and date.
For each text value to convert, rules are applied in sequence (from the most specific to the least specific).
When a rule succeeds (i.e. its conversion method works on the text value considered), the converted value is stored in the indexes specified in the rule, and the conversion process for this text value finishes.
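The "first successful rule wins" behavior can be sketched as a simple loop. The class RuleChain and its Rule type are hypothetical and do not reflect the Qizx implementation; the three rules only mimic a numeric rule, an ISO date rule and a final string rule, with standard Java parsers standing in for the Sieves.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration of rule application order.
class RuleChain {
    // A rule pairs a target index name with a conversion that may fail.
    static final class Rule {
        final String target;
        final Function<String, Optional<Object>> convert;

        Rule(String target, Function<String, Optional<Object>> convert) {
            this.target = target;
            this.convert = convert;
        }
    }

    static Optional<Object> tryDouble(String text) {
        try {
            return Optional.<Object>of(Double.valueOf(text.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    static Optional<Object> tryIsoDate(String text) {
        try {
            return Optional.<Object>of(LocalDate.parse(text.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    // Rules are assumed to be already sorted, most specific first.
    static final List<Rule> RULES = Arrays.asList(
            new Rule("numeric+string", RuleChain::tryDouble),
            new Rule("date+string", RuleChain::tryIsoDate),
            new Rule("string", text -> Optional.<Object>of(text)));

    // Returns the target of the first rule whose conversion succeeds.
    static String firstMatch(List<Rule> rules, String text) {
        for (Rule rule : rules) {
            if (rule.convert.apply(text).isPresent()) {
                return rule.target;
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(RULES, "10"));
        System.out.println(firstMatch(RULES, "2007-12-31"));
        System.out.println(firstMatch(RULES, "ten"));
    }
}
```

Because the string conversion never fails, it only makes sense as the last rule of the chain: any rule placed after it would never be reached.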
A rule has different properties:
Whether it applies to element content or to attribute values: the tag <element/> or <attribute/> is used respectively.
A name and/or a context (optional) to restrict the applicability of the rule to specific element/attribute names in a particular context of ancestor elements. Default rules (no name and no context) apply to all elements or attributes.
What are the target indexes: date, numeric, string or a combination.
The conversion method: this is called a Sieve, it is implemented by a Java class, and can be passed parameters.
Element rules generally serve to define how simple element content is indexed. They can also enable or disable full-text indexing through their attribute "full-text".
Element rules are specified in the indexing specification by an empty element named element. Its properties are defined by attributes:
name
When specified, the name indicates that the rule applies only to elements which have this name. The name can of course have a namespace prefix.
If the name is absent, the rule is a default rule applicable to all elements of documents to be indexed.
context
It is a specification of ancestors of the element to which the rule applies.
The element names are separated by a space or a slash. Names can have a namespace prefix.
A name can also be the wildcard '*' matching any element.
The rightmost name matches the parent, the leftmost matches the "oldest" ancestor, as in an XSLT pattern.
Example: this rule applies to an element named birth-date, child of customer, itself grand-child of invoice:
<element name="birth-date" context="invoice/*/customer" as="date" sieve="FormatDateSieve" format="MM/dd/yyyy" />
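One way to read the context attribute is as a pattern compared, right to left, with the nearest ancestors of the element. The class ContextMatcher below is a hypothetical sketch of that reading, not Qizx code; it assumes the pattern is not anchored at the document root and it ignores namespace prefixes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of context matching.
class ContextMatcher {
    // ancestors: names of the ancestor elements, root first, parent last.
    static boolean matches(String context, List<String> ancestors) {
        String trimmed = context.trim();
        if (trimmed.isEmpty()) {
            return true; // no context: the rule applies everywhere
        }
        // Names are separated by spaces or slashes.
        String[] names = trimmed.split("[ /]+");
        if (names.length > ancestors.size()) {
            return false;
        }
        // Align the rightmost name with the parent element.
        int offset = ancestors.size() - names.length;
        for (int i = 0; i < names.length; i++) {
            String name = names[i];
            if (!name.equals("*") && !name.equals(ancestors.get(offset + i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Ancestors of a birth-date element, root first.
        List<String> ancestors = Arrays.asList("invoices", "invoice", "party", "customer");
        System.out.println(matches("invoice/*/customer", ancestors));
        System.out.println(matches("invoice customer", ancestors));
    }
}
```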
as
Specifies the target indexes. Possible values are:
date: if the conversion to a date is possible (using a DateSieve, as explained hereafter), then the content is indexed as a date (Date Simple Content index), else this rule fails.
date+string: same as date, but the content is indexed both as a date and as a string value (Simple Content index).
number: if the numeric conversion is possible (using a NumberSieve, as explained hereafter), then the content is indexed as a numeric value (Numeric Simple Content index), else this rule fails.
number+string: same as number, but the content is also indexed as a string value.
string: the content is indexed as a string value. This never fails, so it must be the last applicable rule in a particular context.
sieve
Specifies an analyzer which performs conversion from string to number or date.
A rule where as is date or date+string must specify a DateSieve; a rule where as is number or number+string must specify a NumberSieve.
A predefined Sieve can be selected here, or a custom Java class can be specified (see section "Custom Sieves").
Parameters for sieves (predefined or custom) are specified as additional attributes of the rule.
Predefined Sieve classes.
sieve="FormatNumberSieve" is the default when attribute as specifies a numeric conversion.
Parameters:
Optional parameter format (as specified by java.text.DecimalFormat). By default the format corresponds to double literals in the XQuery language, or to the xs:double type in XML Schema.
Optional parameter locale specifies the locale for the format. Values accepted are similar to the values accepted by java.util.Locale, for example en-US or de.
<element name="amount" context="invoice" as="numeric" format="000.0#" />
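Since the format and locale parameters follow java.text.DecimalFormat, their effect can be explored with that class directly. The helper NumberFormatDemo below is hypothetical and is not the FormatNumberSieve source; in particular, requiring the pattern to consume the whole value is an assumption of this sketch.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;
import java.util.OptionalDouble;

// Hypothetical illustration of the format and locale parameters.
class NumberFormatDemo {
    // Parses text with a DecimalFormat pattern; empty if the pattern does
    // not match or does not consume the whole text.
    static OptionalDouble parse(String text, String pattern, Locale locale) {
        DecimalFormat format =
                new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(locale));
        ParsePosition position = new ParsePosition(0);
        Number number = format.parse(text, position);
        if (number == null || position.getIndex() != text.length()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(number.doubleValue());
    }

    public static void main(String[] args) {
        // Same pattern, two locales: the decimal separator differs.
        System.out.println(parse("012.5", "000.0#", Locale.US));
        System.out.println(parse("12,5", "000.0#", Locale.GERMANY));
        // A literal suffix in the pattern lets a value carry a unit.
        System.out.println(parse("10.0Kg", "0.0'Kg'", Locale.US));
        // Without the suffix in the pattern, the trailing unit is rejected.
        System.out.println(parse("10.0Kg", "0.0", Locale.US));
    }
}
```

The third call shows how a value such as 10.0Kg could be made numeric by a pattern with a literal suffix, under the assumption that the Sieve hands the pattern to DecimalFormat unchanged.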
sieve="ISODateSieve" is the default when attribute as specifies a date. This sieve accepts a date or dateTime value in ISO 8601 format, for example 2006-05-05 or 2006-05-05T12:30:00Z.
There is no additional parameter.
sieve="FormatDateSieve" specifies a date conversion with a format similar to patterns accepted by java.text.SimpleDateFormat.
Parameters:
Optional parameter format (as specified by java.text.SimpleDateFormat). By default the local "short format" is used (for example MM/DD/YYYY in US locale), and the time-zone is the default time-zone of the Java Runtime.
Optional parameter timezone specifies the default time-zone for the sieve. Values accepted are similar to the values accepted by java.util.TimeZone.
Attention:
Optional parameter locale specifies the locale for the format. Values accepted are similar to the values accepted by java.util.Locale, for example en-US or de.
Optional parameter lenient accepts a boolean value (true or false). By default the sieve is not lenient: it accepts only values strictly matching the format.
Example:
<element name="edit-date" as="date" sieve="FormatDateSieve" format="yyyy-MM-dd" timezone="GMT-5" />
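The timezone and lenient parameters correspond to settings of java.text.SimpleDateFormat, so their effect can be checked with that class. The helper DateFormatDemo below is a hypothetical sketch and not the FormatDateSieve source; the fixed US locale and the requirement that the whole value be consumed are assumptions.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration of the format, timezone and lenient parameters.
class DateFormatDemo {
    // Returns the parsed date, or null if the text does not match the pattern.
    static Date parse(String text, String pattern, String timezone, boolean lenient) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        format.setTimeZone(TimeZone.getTimeZone(timezone));
        format.setLenient(lenient);
        ParsePosition position = new ParsePosition(0);
        Date date = format.parse(text, position);
        return (date != null && position.getIndex() == text.length()) ? date : null;
    }

    public static void main(String[] args) {
        // The same text denotes different instants in different time-zones.
        Date inGmtMinus5 = parse("2007-12-31", "yyyy-MM-dd", "GMT-5", false);
        Date inGmt = parse("2007-12-31", "yyyy-MM-dd", "GMT", false);
        long hours = (inGmtMinus5.getTime() - inGmt.getTime()) / 3600000L;
        System.out.println("offset in hours: " + hours);

        // February 30 is rejected in strict mode and rolled over to March
        // in lenient mode.
        System.out.println(parse("2007-02-30", "yyyy-MM-dd", "GMT", false));
        System.out.println(parse("2007-02-30", "yyyy-MM-dd", "GMT", true));
    }
}
```

Strict parsing, the documented default, is usually the safer choice for indexing: a lenient sieve would index an impossible date under a different day rather than reject it.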
full-text
Value is yes or no.
Full-text indexing is enabled by default for elements.
Setting no disables full-text indexing inside the applicable element, including its descendant elements (unless explicitly re-enabled).
Full-text indexing can be re-enabled on descendant elements by a rule with full-text="yes".
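The inheritance of the full-text setting can be pictured as walking from the document root down to the element and keeping the most recent explicit setting. The class FullTextScope is a hypothetical sketch of this behavior; for brevity its rules are keyed by element name only and ignore the context attribute.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of full-text enabling and disabling.
class FullTextScope {
    // path: element names from the root down to the element considered.
    // rules: explicit full-text settings by element name (true means yes).
    static boolean isIndexed(List<String> path, Map<String, Boolean> rules) {
        boolean enabled = true; // full-text is enabled by default for elements
        for (String name : path) {
            Boolean explicit = rules.get(name);
            if (explicit != null) {
                enabled = explicit; // the setting applies to descendants too
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        Map<String, Boolean> rules = new HashMap<>();
        rules.put("NumData", false); // full-text="no"
        rules.put("Label", true);    // full-text="yes", re-enables below NumData

        System.out.println(isIndexed(Arrays.asList("doc", "p"), rules));
        System.out.println(isIndexed(Arrays.asList("doc", "NumData", "row"), rules));
        System.out.println(isIndexed(Arrays.asList("doc", "NumData", "Label"), rules));
    }
}
```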
Attribute rules are very similar to element rules.
The main difference is about full-text indexing:
The full-text yes/no attribute is applicable only to the considered attribute.
Full-text indexing is not enabled for attributes by default.
Actually, full-text search in attribute values is not yet supported in the current version of Qizx.
The default indexing specification is set when creating a new XML Library. It can be written as follows:
<indexing>
  <element as="numeric+string"/>
  <element as="date+string" />
  <element as="string" />
  <attribute as="numeric+string" />
  <attribute as="date+string" />
  <attribute as="string" />
</indexing>
Interpretation:
if a simple element content can be converted by the default numeric sieve [first element rule], then it is indexed both as a number and as a string,
else if its value can be converted by the default date sieve [second element rule], then it is indexed both as a date and as a string,
else it is indexed as a string if its length is less than the string-max parameter.
The same for attributes.
Full-text indexing is enabled by default, and uses the default TextTokenizer.
The default rules are not implicitly used when you write a new Indexing Specification (see below). You therefore have to copy these rules explicitly into your Indexing Specification if you want to use them.
In the current version of Qizx, simple Indexing Specifications can be edited using Qizx Studio. See below for more details.
In the general case, indexing specifications have to be written as an XML file and then stored into an XML Library (and then indexes should be rebuilt if necessary).
The recommended practice is to start from the default specification as provided above and add rules and/or modify default rules.
The following important points should be remembered:
The default rules are not implicit. That is, you have to copy these rules into your Indexing Specification if you want to use them. The reason is that you may want to not use some of these rules.
When performing queries on an XML Library, Qizx relies on the actual indexes of the Library. This means that if some information is not indexed, then the corresponding queries would return no result.
Note: it would be unmanageable to use indexes in some parts of a Library, and a "fallback" strategy in some other parts.
For example if your Indexing Specification blocks indexing of numeric values, then a query like //good[@weight > 100]
will not work (because it relies on the numeric value of attribute 'weight').
Similarly, if your Indexing Specification blocks full-text indexing in some parts of documents, then a full-text query will find no result in those parts.
The specification is stored in the Library. It is initialized when creating the library, then used automatically when documents are added.
Select the Library concerned.
Right-click and select " " in the menu, then " " in the sub-menu. This opens a dialog that lets you select the file containing the Indexing Specification. You can also use the button " " to select the default rules.
When you push the button " ", the specification is parsed and stored if valid. You are then prompted to rebuild the indexes entirely. This is highly recommended since the indexing rules have changed. If the Library is empty, this is of course not necessary.
Command line options for creating a new Library in a Library group with a custom Indexing Specification:
qizx -group groupLocation -library libName -indexing specification -create
Here 'groupLocation' is the directory that contains the group of Libraries, 'libName' is the name of the Library to create and 'specification' is the path of an XML file that contains the specification.
Command line options for changing the Indexing Specification of an existing Library:
qizx -group groupLocation -library libName -indexing specification -reindex
It is necessary to use the option -reindex to rebuild the indexes, unless the Library contains no document.
A custom Sieve is necessary when you want to index some content in numeric or date/time form and the default Sieves provided with Qizx are not suitable:
Numeric value: the value cannot be parsed by the Java class java.text.DecimalFormat.
Date/time value: the value cannot be parsed by java.text.SimpleDateFormat and is not an ISO date.
Full-text: you want more capabilities than those provided by the default TextTokenizer (for example to handle a specific language). Full-text customization has changed in 3.0 and is achieved through the Java API: see the package com.qizx.api.fulltext and its plug-in interface FullTextFactory.
As seen above, a custom sieve is specified by a sieve attribute in an element or attribute rule. The value of the sieve attribute is the fully qualified name of a Java class.
Custom Sieve Java classes must of course be accessible through the CLASSPATH of your application (or, more exactly, by its class loader).
For more details, refer to the Java documentation of interfaces below and to the source code of default implementations (provided in the distribution).
A custom number Sieve must implement the interface com.qizx.api.Indexing.NumberSieve.
The default implementation is com.qizx.util.text.FormatNumberSieve.
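The conversion logic that a custom number Sieve might wrap can be sketched independently of Qizx. The class WeightParser below is purely hypothetical: it does not implement NumberSieve, whose methods must be taken from the Javadoc, and the units it accepts (kg and g) are an assumption chosen for the example.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical conversion logic; not an implementation of the Qizx interface.
class WeightParser {
    // A decimal number followed by a unit, with optional surrounding spaces.
    private static final Pattern WEIGHT = Pattern.compile(
            "\\s*([-+]?\\d+(?:\\.\\d+)?)\\s*(kg|g)\\s*", Pattern.CASE_INSENSITIVE);

    // Converts values such as "10.0Kg" or "500 g" to kilograms; empty if the
    // text is not a weight, in which case a rule using this logic would fail.
    static OptionalDouble toKilograms(String text) {
        Matcher matcher = WEIGHT.matcher(text);
        if (!matcher.matches()) {
            return OptionalDouble.empty();
        }
        double amount = Double.parseDouble(matcher.group(1));
        boolean grams = matcher.group(2).equalsIgnoreCase("g");
        return OptionalDouble.of(grams ? amount / 1000.0 : amount);
    }

    public static void main(String[] args) {
        System.out.println(toKilograms("10.0Kg"));
        System.out.println(toKilograms("500 g"));
        System.out.println(toKilograms("heavy"));
    }
}
```

Normalizing all values to one unit before indexing keeps numeric comparisons in queries meaningful across documents that mix units.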
A custom date Sieve must implement the interface com.qizx.api.Indexing.DateSieve.
The default implementation is com.qizx.util.text.ISODateSieve.
Another predefined implementation is com.qizx.util.text.FormatDateSieve, which is based on Java SimpleDateFormat.
Example:
<indexing word-sieve="com.mybusiness.xmlapp.WordSieve">
  <element as="number" sieve="com.mybusiness.xmlapp.NumberSieve"/>
  <element as="date" sieve="com.mybusiness.xmlapp.DateSieve"
           param1="..." param2="..."/>
  <element as="token" />
  <attribute as="number" sieve="com.mybusiness.xmlapp.NumberSieve"/>
  <attribute as="date" sieve="com.mybusiness.xmlapp.DateSieve"/>
  <attribute as="token" />
</indexing>