Starting from version 3.0, Qizx supports most of the new XQuery Full-Text candidate standard.
The full-text facility of former versions is completely deprecated and no longer available. A migration guide can be found at the end of this chapter.
The first section is an introduction to the standard XQuery Full-Text (XQFT). Since there is currently little literature about this new standard other than the specifications themselves, we hope you will find this tutorial useful.
Support of the XQuery Full-Text facilities in Qizx is detailed in the second section.
This tutorial (http://www.xmlmind.com/_tutorials/XQueryFullText), after a short presentation of the main concepts, introduces the main features through concrete examples.
Qizx 3.0 supports a large part of mandatory and optional XQuery Full-Text features.
The following two sections detail supported and unsupported features. To understand them, it is recommended to have some acquaintance with XQFT, either through the W3C specifications or by reading our tutorial.
Qizx now supports many full-text features, but some capabilities - namely stemming and thesaurus - are highly language-specific and can only be supported by specialized extensions.
To offer the best language support, Qizx full-text can be extended through the Java API. It is possible to plug objects supporting:
Text Tokenization (see below).
Stemming (but no implementation is available by default).
Thesaurus lookup (no implementation is available by default).
Scoring: score computation can be redefined by plugging another Scorer.
Scoring: see dedicated section below.
Operator not in: supported.
Operator ftnot: fully supported.
Order (keyword ordered): fully supported.
Cardinality (occurs ... times): fully supported.
Proximity (keywords window and distance): supported with the "words" unit (see the examples after this list).
Ignore (keyword without content): supported, except in some corner cases.
Language:
The language, if specified, is used for finding a Text Tokenizer (see below), for stemming, and for Thesaurus lookup.
In the API, the related methods of FullTextFactory have a language argument.
Case sensitivity (option 'case sensitive').
Note: queries using this feature can be significantly slower, especially if a large number of documents are searched.
Diacritic characters sensitivity (option 'diacritics sensitive').
Note: queries using this feature can be significantly slower, especially if a large number of documents are searched.
Wildcards (option 'with wildcards').
Note: looking up the indexes for wildcard matches is normally quite fast (depending, of course, on the size of the indexes). A wildcard character in first position (e.g. ".*tion") can induce a measurable overhead (typically a few tens of milliseconds).
Stemming (option 'with stemming').
Supported, but no stemmer is available by default. Stemmers can be plugged in through the API (see below).
Mixing Stemming and Case Sensitivity is not guaranteed to return proper results, as stemmers can fold the case.
Thesaurus (option 'with thesaurus').
Supported, but no Thesaurus is available by default. Thesaurus drivers can be plugged in through the API (see below).
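For illustration, here are a few queries showing some of the operators and options listed above. The element names (LINE, SPEECH, STAGEDIR) are only illustrative, in the style of the Shakespeare examples used at the end of this chapter:
//SPEECH[ . ftcontains "Juliet" occurs at least 2 times ]
Matches SPEECH elements containing at least two occurrences of "Juliet".
//LINE[ . ftcontains "York" not in "New York" ]
Matches an occurrence of "York" that is not part of the phrase "New York" (mild not).
//LINE[ . ftcontains "love" ftand "death" window 5 words ]
Both words must occur within a window of 5 words.
//LINE[ . ftcontains "Hamlet" case sensitive ]
Matches only with the exact case.
//LINE[ . ftcontains "H.*let" with wildcards ]
Wildcard match, as in the correspondence table at the end of this chapter.
//SPEECH[ . ftcontains "night" without content STAGEDIR ]
Text contained in STAGEDIR descendants is ignored when matching.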
Some combinations of operators and options have unspecified or unclear meanings; therefore no guarantee can be given about the results returned.
Examples:
("yellow" ftor "red") distance at least 3 words
Does not make sense since the distance cannot be computed if only one of the two words is present.
"York" ftand ftnot "New" window 2 words
Could be interpreted as an occurrence of "York" without "New" nearby. This is, however, semantically questionable and needs to be clarified in the specifications.
Window and Distance "big units" ("sentence" and "paragraph"). Might be supported in the future.
Scope ("same sentence", "different paragraph" etc). Might be supported in the future.
Stop-words:
We regard stop-words as a feature from the past, only useful when it was important to reduce the size of indexes.
Scoring in Qizx is document-based. This means that all matched nodes belonging to a given document get the same score.
Computing scores on a node basis is a new concept, not yet well understood. In addition, it would probably be costly in terms of computation time. It is possible that future versions of Qizx will optionally offer node-based scoring.
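Assuming the standard score clause (score variables) of XQFT is used, a ranking query could look like the following sketch; the element names are illustrative, and because scoring is document-based, all SPEECH elements of the same document receive the same score:
for $sp score $s in //SPEECH[ . ftcontains "Juliet" ftand "Romeo" ]
order by $s descending
return $sp/SPEAKER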
Score computation for a document relies on two values associated with each term (word) of a query:
Relative term frequency in a document: the frequency of the word in the document divided by the average frequency in all documents. So if a term is more frequent in the considered document than the average, the score will be higher for that document.
Inverse Document Frequency: the total number of documents divided by the number of documents that contain the term. When a term is present in a smaller number of documents, it is considered more relevant and gets a higher score.
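Purely as an illustration (not necessarily the formula used by the default scorer), these two values are typically combined per query term, in the spirit of classical TF-IDF scoring:
score(document) ≈ sum over query terms t of relativeTermFrequency(t, document) × inverseDocumentFrequency(t)
so a term that is both unusually frequent in a document and rare in the whole document set contributes most to that document's score.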
The exact formulas used for computing the score are defined by a pluggable object implementing the interface com.qizx.api.fulltext.Scorer.
The default scorer is implemented by the class com.qizx.api.util.fulltext.DefaultScorer.
The default scorer no longer supports document ranking through a metadata property "ft-weight" set on a Document.
This feature has been disabled in 3.1 because it makes scoring too slow. Future versions will provide a faster mechanism, plus features for fast heuristic scoring on very large document sets.
Tokenization is the process of chunking text into "words", here called "tokens". It is in general very language-specific.
Tokenizers can be plugged in through the API. See the package com.qizx.api.fulltext.
The Qizx distribution contains a generic Tokenizer that works with most Western languages, without taking into account linguistic particularities.
Overlapping tokens are not supported.
Overlapping tokens would occur, for example, with a compound word like "new-born", if one insists on indexing both the whole word new-born and each of the two words new and born.
The recommended practice is to always split compound words into simple words, for example by treating the dash as whitespace. This works correctly both for indexing and for queries. In some languages, German for example, this can be a difficult task and may require using a dictionary.
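For example, assuming a Tokenizer that treats the dash as a separator, the text "new-born babe" is indexed as the tokens new, born, babe; both of the following queries then match it, because both query strings are tokenized to the same two-word phrase:
//LINE[ . ftcontains "new-born" ]
//LINE[ . ftcontains "new born" ]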
A token is not allowed to span element boundaries.
A situation where a word is split by an element boundary seems very unusual; the only example we can think of is an element used to mark a "drop cap" or initial letter in a paragraph, as in:
<p><big>O</big>nce upon a time...</p>
but that is definitely not a good idea.
Qizx full-text can be extended through the Java API. This allows plugging in language-specific functionality such as Stemming and Thesaurus lookup (for tokenization, please see the previous section).
The Java package com.qizx.api.fulltext contains several interfaces defining extension points. The package com.qizx.api.util.fulltext contains basic implementations.
For more information, please read the Javadoc documentation of these interfaces and classes.
Plugging is performed through the interface com.qizx.api.fulltext.FullTextFactory, which creates the other objects: Tokenizers, Scorers, Stemmers, and Thesaurus drivers.
A new implementation of FullTextFactory can be set on each Library or XQuerySession interface. Notice that the implementation used for querying has to be consistent with the FullTextFactory used for indexing documents in order to get meaningful results; in particular, the same Tokenizer (i.e. created by the same factory) should be used.
Stemming: supported through an implementation of the interface com.qizx.api.fulltext.Stemmer.
No stemmer is officially supported in the distribution. A sample implementation of a Stemmer based on the snowball package is available in the API samples.
Mixing Stemming and Case Sensitivity is not guaranteed to return proper results, as stemmers can fold the case.
Thesaurus: supported through an implementation of the interface com.qizx.api.fulltext.Thesaurus.
No Thesaurus is officially supported in the distribution. A very simple implementation of a Thesaurus is available in the API samples.
When introducing standard XQuery Full-Text in Qizx 3.0, we discarded the former full-text facilities based on the extension function ft:contains (also accessible under the name x:fulltext). This radical decision was motivated by the wish to keep Qizx clean and avoid unnecessary legacy.
To help migrate queries written with the former function ft:contains, a correspondence table is provided below. To follow this section, it is recommended that you be familiar with the standard full-text syntax and capabilities; a tutorial is provided above.
Table 7.1. Correspondence from former full-text
Description | Former full-text | Standard full-text |
---|---|---|
Simple term | //LINE[ ft:contains("Juliet") ] | //LINE[ . ftcontains "Juliet" ] |
Specify a sub-context | //SPEECH[ ft:contains("Juliet", SPEAKER)] | //SPEECH[ SPEAKER ftcontains "Juliet" ] |
All words | //LINE[ft:contains("Juliet AND romeo")] or //LINE[ft:contains("Juliet romeo")] | //LINE[ . ftcontains "Juliet romeo" all words] or //LINE[ . ftcontains "Juliet" ftand "romeo" ] |
All words (from a computed string sequence) | declare variable $w := ("Juliet", "romeo"); //SPEECH[ ft:all-words($w) ] | declare variable $w := ("Juliet", "romeo"); //SPEECH[ . ftcontains { $w } all words] |
Any word in a list | //LINE[ft:contains( "Juliet OR romeo")] | //LINE[ . ftcontains "Juliet romeo" any word] or //LINE[ . ftcontains "Juliet" ftor "romeo" ] |
Any word (from a computed string sequence) | declare variable $w := ("Juliet", "romeo"); //LINE[ ft:any-word($w) ] | declare variable $w := ("Juliet", "romeo"); //LINE[ . ftcontains { $w } any word ] |
Exclude a word | //LINE[ft:contains( "Juliet AND NOT romeo")] or //LINE[ft:contains("Juliet -Romeo")] | //LINE[ . ftcontains "Juliet" ftand ftnot "romeo" ] |
Phrase | //LINE[ft:contains( "'to be or not to be'")] | //LINE[ . ftcontains "to be or not to be" ] |
Phrase (from a computed string sequence) | declare variable $ph := ("to be", "or not", "to be"); //SPEECH[ ft:phrase($ph) ] | declare variable $ph := ("to be", "or not", "to be"); //SPEECH[ . ftcontains { $ph } phrase ] |
Phrase with window | //LINE[ft:contains( "'to be the question'~10")] | //LINE[ . ftcontains "to be the question" window 10 words ] |
And of Phrases | //SPEECH[ft:contains( "'to be' AND 'to die, to sleep'")] | //SPEECH[ . ftcontains "to be" ftand "to die, to sleep"] |
Or of Phrases | //SPEECH[ft:contains( "'to be' OR 'to die, to sleep'")] | //SPEECH[ . ftcontains "to be" ftor "to die, to sleep"] |
Phrase1 but not phrase2 | //SPEECH[ft:contains( " 'to be' NOT 'to die' ")] | //SPEECH[ . ftcontains "to be" ftand ftnot "to die" ] |
Term with wildcard | //LINE[ ft:contains("H%let") ] or //LINE[ ft:contains("H_mlet") ] | //LINE[ . ftcontains "H.*let" with wildcards ] or //LINE[ . ftcontains "H.mlet" with wildcards ] |