Starting from version 3.0, Qizx implements the standard XQuery Full-Text from the W3C (abbreviated XQFT hereafter).
Please see chapter Standard Full-Text for more information about standard full-text support. That chapter contains a section explaining how to migrate your Qizx 2.2 applications from the former full-text functionalities.
This chapter introduces the full-text extension functions added in version 3.1:
A simplified search function that uses a simpler and more familiar query syntax than the XQuery Full-Text standard.
Note: it is actually similar to the former full-text function (in Qizx 2.2 and before), but beware that the syntax is somewhat different.
Utility functions for highlighting full-text terms, generating summary snippets, looking up indexes and finding spell-checking suggestions.
The justification for a simplified full-text search facility is the following:
A standard XQFT query is not an object that can be manipulated by an XQuery script. This makes it more difficult for an XQuery application to synthesize a full-text query and then execute it, unless one resorts to a dynamic evaluation function like the Qizx x:eval().
The standard XQuery Full-Text from the W3C is not yet a completely stable specification (in July 2009, it reached the stage of Candidate Recommendation, and it can take up to one year before it becomes a definitive standard).
The standard W3C full-text syntax is a bit complex and unusual, even for advanced users (those users who would otherwise have no difficulty with a query like: title:product +"beta quality" -alpha).
This syntax is very simple and resembles the one found in most full-text engines. Notice that there is no notion of Fields, since XQuery itself provides all the means of searching specific parts of XML documents.
Search Capability | Examples | Remarks |
---|---|---|
Simple word (without quotes) | Hello | Tokenized according to the language and configuration. Note that a compound word like never-ending can actually be tokenized into 2 words, equivalent to the phrase "never ending". |
Wildcard | ?ello *ell* | Can be used in place of a simple word inside a phrase. |
Phrase (single or double quotes) | "Hello world" 'Hello, world!' | Tokenized according to the language and configuration. |
Phrase with proximity | "hello world"~3 | Same meaning as in "window 3 words" of the standard syntax: matches "hello new world", but not "hello brand new world". |
Required term | +world +'Hello world' | Acts like an ftand, while plain terms act like an ftor. |
Negated term | -hello -"old world" | Such terms must not be found in the searched document or fragment. |
Juxtaposition | hello "brave new world" +me -you | Terms without + are ORed. Terms with + are ANDed. The example on the left is equivalent to: |
function ft:contains ($query, [$options])
function ft:contains ($query, $context, $options)
returns true if the search context matches the full-text query.
Note: this function is similar to the former ft:contains function of Qizx up to version 2.2, but beware that the query syntax is not quite the same.
This function is typically used as a predicate in a Path Expression. Examples:
//SPEECH[ ft:contains("+romeo +juliet") ],
//SPEECH[ ft:contains(" 'to be or not to be' ", LINE, <options/>) ]
Returned value: true if the context matches the query, false otherwise.
Parameter $query: A query using the simple full-text syntax.
Parameter $context (optional): A node, or sequence of nodes, inside which the full-text expression is searched for. Note: this is the equivalent of a Field in classical full-text engines.
When the context parameter is not specified, the current context node '.' is used implicitly, as in the first example above. Note that when the function is called with 2 arguments, the last argument represents the options, not the context.
When the context parameter is present, it specifies a smaller search domain (in general inside the current context node). The 2nd example above finds SPEECH elements which contain at least one LINE element which in turn contains the phrase 'to be or not to be'.
Parameter $options (optional): An element (conventionally named "options") bearing attributes:
attribute case: value is "sensitive" or "insensitive" (using only the first characters, e.g. "sens", is allowed)
attribute diacritics: value is "sensitive" or "insensitive"
attribute language: value is a legal language name, used for tokenizing words and phrases, and stemming. This option must precede stemming and thesaurus options if used (see below).
attribute stemming: value is a boolean "true" or "false". Assumes that the application provides a Stemmer implementation (see the Java API documentation).
attribute thesaurus: value is a thesaurus URI. Assumes that the application provides a Thesaurus implementation (see the API documentation).
Example:
<options language="fr" diacritics="sensitive"/>
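As an illustration of the two calling forms described above, the following sketch passes options with and without an explicit context. It reuses the SPEECH and LINE element names from the earlier examples; the option values are chosen only for illustration.

```xquery
(: Two arguments: the query and the options. The context is the implicit '.' :)
//SPEECH[ ft:contains("+romeo +juliet", <options case="sensitive"/>) ],

(: Three arguments: the query, an explicit context, then the options :)
//SPEECH[ ft:contains("+romeo +juliet", LINE,
                      <options case="insensitive" diacritics="insensitive"/>) ]
```

In the first form the second argument is read as options, never as a context, so a context can only be given with the three-argument form.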
function ft:score ($sequence, [$length], [$start])
returns the sequence sorted by decreasing full-text score. Optionally, the result sequence can be 'sliced' in pages by specifying the first element and the length of a page.
The input sequence is typically a full-text search expression using either ft:contains() or the standard operator 'contains text'.
The purpose of this function is to simplify the use of scoring, but also to make it more efficient than the 'for score ... order by $score descending' pattern of the XQFT standard. Future versions of Qizx could enhance this function to make it even more efficient by allowing fast heuristic scoring strategies.
When $length and $start are used, this function is an optimized equivalent of:
fn:subsequence(
    for $hit score $score in $sequence
    order by $score descending
    return $hit,
    $start, $length )
Example:
ft:score( //SPEECH[ ft:contains("hello +world") ], 10 )
Returned value: The input sequence ordered by descending score, possibly sliced.
Parameter $sequence: A query using the simple full-text syntax (function ft:contains), or the standard 'contains text' operator.
Parameter $length (optional): Number of results to be returned. Used for slicing results. If not specified, the value is 10.
Parameter $start (optional): rank of the first hit to be returned. Used for slicing results.
function ft:highlight ($node, $query, [$options]) as node()
Transforms an XML fragment (document or node) by replacing each occurrence of the words of a full-text query by an XML template that contains the word. This is called highlighting because it can typically be used with a formatting language (HTML) to render the word with some styling, using for example CSS.
Words within an ftnot clause are not highlighted.
Word occurrences are highlighted individually. For example if the query specifies a phrase, all occurrences of the words of this phrase will be highlighted, whether they belong to an occurrence of the phrase or not.
Example:
let $doc := <P>this is some text searched by a query.</P>
return ft:highlight( $doc, "query text", <options word-wrap="B"/> )
returns:
<P>this is some <B>text</B> searched by a <B>query</B>.</P>
Returned value: A copy of the node in which all occurrences of the full-text query words are replaced by the specified pattern.
Parameter $node: an XML fragment (document or node) to be highlighted.
Parameter $query: An expression which is one of:
The operator contains text. Example:
ft:highlight($node, . contains text "hello world" any word)
Note: the expression must be exactly 'contains text'; a boolean combination is not allowed. The context part (here '.') is ignored. Full-text options following contains text are taken into account.
The function ft:contains(). The optional context argument is ignored. Full-text options are taken into account.
ft:highlight($node, ft:contains(" 'hello world' "))
Note: in this example, although the query requires a phrase, all individual occurrences of the words 'hello' and 'world' will be highlighted, not the phrase only.
a string (using the simple full-text syntax). In that case it is not possible to specify options.
ft:highlight($node, "hello world")
Parameter $options (optional): An element (conventionally named "options") with attributes containing the options. There are two ways of specifying how a word is "highlighted":
The first way uses a simple element bearing an attribute, similar to the SPAN element of HTML with a class attribute:
attribute word-wrap: its value is the name of an element used to wrap the word. Default is "B".
optional attribute word-style: value is the name of an attribute placed on the word-wrapper element. It is not present by default.
optional attribute word-pattern: value is a pattern that is used to give a value to attribute word-style. If it contains the character %, this character is replaced by the rank of the word in the query.
Example:
let $doc := <P>this is some text returned by a XQuery expression.</P>
return ft:highlight( $doc, "xquery +text", <options word-wrap="SPAN" word-style="class" word-pattern="hilite%"/> )
produces:
<P>this is some <SPAN class="hilite1">text</SPAN> returned by a <SPAN class="hilite0">XQuery</SPAN> expression.</P>
The second way uses a function called by name (XQuery cannot pass a function as a parameter of another function):
attribute word-function: value is the name of a function that is called for each occurrence of a word to highlight. The value returned must be a node, which replaces the word.
The called function must be compatible with this signature:
function($word as xs:string, $word-rank as xs:int, $node as text()):
$word is a string which receives the value of the word.
$word-rank is an integer which receives the rank of the word in the query.
$node is the text node that contains the word. This makes it possible to test arbitrarily complex conditions.
Example that highlights a word with bold if it is inside a TITLE, otherwise with a span/class:
declare function local:hilite($word, $word-rank, $node) {
    if ($node/parent::TITLE)
    then <B>{$word}</B>
    else <span class="hilite{$word-rank}">{$word}</span>
};
let $doc := <P>this is some text searched by a query.</P>
return ft:highlight( $doc, . contains text "query text" all words, <options word-function="local:hilite"/> )
function ft:snippet ($node, $query, [$options]) as element()
Extracts a representative snippet from a document. Words from a full-text query are "highlighted" in the same way as with the ft:highlight function. This makes it possible to get a result similar to the snippets produced by most major web search engines.
A snippet is an element that contains text fragments and highlighted words.
Example:
for $doc in //SPEECH[ ft:contains("hello +world") ]
return ft:snippet($doc, "hello +world")
Returned value: An element node containing the snippet.
Parameter $node: an XML document or node to be represented.
Parameter $query: A string (simple syntax query) or an expression using contains text, for example . contains text "hello world".
Parameter $options (optional): An element (conventionally named "options") with attributes.
Options similar to ft:highlight:
attribute word-wrap: its value is the name of an element used to wrap the word. Default is "B".
optional attribute word-style: value is the name of an attribute placed on the word-wrapper element. It is not present by default.
optional attribute word-pattern: value is a pattern that is used to give a value to attribute word-style. If it contains the character %, this character is replaced by the rank of the word in the query.
attribute word-function: value is the name of a function that is called for each occurrence of a word to highlight. The value returned must be a node, which replaces the word.
The called function must be compatible with this signature: function($word as xs:string, $word-rank as xs:int, $node as text())
Specific options:
attribute snippet: its value is the name of an element used to wrap the snippet. Default is "snippet".
optional attribute length: the maximum number of words in the snippet. Default value is 20.
optional attribute work-size: the maximum number of words, counted from the start, that are examined to find the best parts of the document. Default value is 500.
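The options above can be combined in a single options element. The following is a minimal sketch that assumes the SPEECH documents used in the earlier examples; the wrapper name "abstract" and the class prefix "hilite" are arbitrary illustrative choices.

```xquery
(: Snippets of at most 30 words, wrapped in an <abstract> element,
   with each query word wrapped in <SPAN class="hiliteN"> :)
for $doc in //SPEECH[ ft:contains("hello +world") ]
return ft:snippet($doc, "hello +world",
                  <options snippet="abstract" length="30"
                           word-wrap="SPAN" word-style="class"
                           word-pattern="hilite%"/>)
```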
function ft:word-count($word as xs:string) as xs:integer?
returns the total count of occurrences of this word in the current XML Library.
Example:
ft:word-count("hamlet") (: counts occurrence of Hamlet, HAMLET etc. :)
Parameter $word: A string containing a single word. Character case and diacritics are not taken into account.
Returned value: A positive integer, or the empty sequence if the word is not found or if there is no connection to an XML Library.
function ft:word-doc-count($word as xs:string) as xs:integer?
returns the total count of documents in the current XML Library that contain at least one occurrence of this word.
Parameter $word: A string containing a single word. Character case and diacritics are not taken into account.
Returned value: A positive integer, or the empty sequence if the word is not found.
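Because both counting functions return nothing when the word is not indexed, a caller usually has to test for that case. A minimal sketch, using only the two functions described above and standard XQuery functions:

```xquery
let $word := "hamlet"
let $occurrences := ft:word-count($word)      (: total occurrences, or empty :)
let $documents := ft:word-doc-count($word)    (: documents containing it, or empty :)
return
  if (empty($occurrences))
  then concat($word, ": not indexed")
  else concat($word, ": ", $occurrences, " occurrences in ",
              $documents, " documents")
```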
function ft:word-lookup([$word-pattern as xs:string?]) as xs:string*
returns a list of words indexed in the current XML Library that match the pattern. If no pattern is passed, then all the words indexed in the Library are returned.
Attention: words are sorted ignoring character case and diacritics, and the different forms in which a word occurs are not returned. For example ft:word-lookup("cafe") does not return a sequence like ("CAFE", "CAFÉ", "Cafe", "Café", "cafe", "café") even if these forms occur in the XML Library. This situation is likely to change in later versions, which will optimize case-sensitive and diacritics-sensitive searches, but that will require changing the representation of indexes.
Parameter $word-pattern: A string containing a wildcard pattern (standard syntax, case and diacritics insensitive). If absent, then all the words indexed in the Library are listed.
Returned value: A sorted list of strings, or the empty sequence if no word matches. Sorting is done ignoring character case and diacritics.
function ft:suggest($word as xs:string) as xs:string*
returns a list of words that are "close to" the specified word, sorted by increasing distance. The distance used is a simple Levenshtein distance, where differences in case or diacritics have a lesser weight than deletions or insertions. The function also tries space insertion (e.g. "myword" can yield "my word").
Note: this function is not a spell-checking facility, it can only return words that actually appear in a document of the Library.
Parameter $word: A string containing a single word. Character case and diacritics are taken into account for distance calculation.
Returned value: A string sequence containing at most 20 suggestions. A best effort is made to return at least one suggestion.
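One possible use of ft:suggest is a "did you mean" fallback when a word typed by a user is not indexed. The sketch below relies only on the behaviour described above (ft:word-count returns nothing for an unknown word, and suggestions are sorted by increasing distance); the misspelled input is just an example.

```xquery
let $typed := "hamlte"
return
  if (exists(ft:word-count($typed)))
  then $typed                  (: the word is indexed: keep it as typed :)
  else ft:suggest($typed)[1]   (: otherwise take the closest indexed word, if any :)
```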
This section is a short tutorial showing how to use Qizx full-text functionalities.
The most classical way of doing full-text queries is to look for whole documents matching a full-text expression anywhere in their contents. For example, using standard XQuery Full-Text:
/*[ . contains text "printing press" ] (: uses implicit collection :)
or the same using the simplified syntax:
/*[ ft:contains(" 'printing press' ") ] (: notice the quotes :)
The two examples above return a sequence of the root elements of the matching documents. If you want to retrieve the Document objects themselves, use xlib:document():
for $doc in /*[ . contains text "printing press" ]
return xlib:document($doc)
The Advanced Search by Google offers the possibility to search for pages that match "all these words", "this exact wording or phrase", "one or more of these words", but not pages that have "any of these unwanted words" (words are specified in form fields).
This is easy to implement with XQFT and Qizx, assuming that you have the field values in 4 variables named $all, $exact, $any, $unwanted:
/*[ . contains text { $all } all words
        ftand { $any } any word
        ftand { $exact } phrase
        ftand ftnot { $unwanted } any word ]
Note that if all fields are empty, no error is detected but no document is returned.
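Part of the same form can also be handled with the simplified syntax, by assembling the query as a string. The following is only a sketch: local:simple-query is a hypothetical helper, not a Qizx function, and it covers only the "all", "exact" and "unwanted" fields. It does not escape quotes occurring in the field values.

```xquery
(: Hypothetical helper, not part of Qizx: builds a simple-syntax query string. :)
declare function local:simple-query(
    $all as xs:string*,       (: words that must all be present :)
    $exact as xs:string?,     (: phrase that must be present :)
    $unwanted as xs:string*   (: words that must be absent :)
) as xs:string
{
  string-join(
    ( for $w in $all return concat("+", $w),
      if ($exact) then concat('+"', $exact, '"') else (),
      for $w in $unwanted return concat("-", $w) ),
    " ")
};

/*[ ft:contains(local:simple-query(("printing", "press"), "movable type", "offset")) ]
```

With these arguments the helper produces the query string +printing +press +"movable type" -offset, which is then passed to ft:contains like any other string.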
The function ft:score is designed to make it easier to find the best-scoring documents and list them in pages. To display the first 10 documents matching a query:
ft:score( /*[ . contains text "printing press" ] , 10)
To display the next 10 documents by descending score, increment a variable $start (initialized to 0) by 10 and use it as the third argument of ft:score:
ft:score( /*[ . contains text "printing press" ] , 10, $start)
Popular web search engines display a short abstract of each document showing highlighted terms of the full-text query. The function ft:snippet makes this easy in Qizx:
let $query := "printing press"
for $doc in /*[ . contains text { $query } ]
return ft:snippet($doc, $query)
The output of ft:snippet and ft:highlight functions can be controlled finely (see the reference documentation).
This query finds the 10 best matching documents, and for each document returns a snippet where the query terms are in bold:
for $doc in ft:score(/*[ . contains text { $all } all words
                             ftand { $any } any word
                             ftand { $exact } phrase
                             ftand ftnot { $unwanted } any word ], 10)
return <div><h4>{ xlib:document($doc) }</h4>
  { ft:snippet($doc,
               . contains text { $all } all words
                   ftand { $any } any word
                   ftand { $exact } phrase,
               <options word-wrap="b"/>) }</div>