Chapter 12. General XQuery extension functions

Table of Contents

1. Serialization
1.1. Serialization to XML, HTML, XHTML, plain text
1.2. JSON Serialization
2. Parsing
2.1. XML Parsing
2.2. Semi-structured Content Parsing
2.2.1. JSON parser
2.2.2. HTML parser
2.2.3. HTML5 parser
3. XSL Transformation
4. Dynamic evaluation
5. Query extensions
5.1. Estimated count and pagination
5.2. Pattern-matching
5.3. Range testing
6. Date and Time
6.1. Differences with W3C specifications
6.2. Cast Extensions
6.3. Additional constructors
6.4. Additional accessors
7. Error handling

These general purpose functions belong to the namespace denoted by the predefined "x:" prefix. The x: prefix refers to namespace "com.qizx.functions.ext".

1. Serialization

1.1. Serialization to XML, HTML, XHTML, plain text

Serialization, the process of converting XML nodes into a stream of characters, is defined in the XQuery 1.0 specifications. However, there is no standard function for performing serialization.

x:serialize can output a document or a node into XML, HTML, XHTML or plain text, to a file or to the default output stream.

x:serialize( $node as node(), $options as element(options) )
  as xs:string?

Description: Serializes the node and all its content into text. The output can be a file (see options below).

Parameter $node: an XML tree to be serialized to text.

Parameter $options: an element bearing options in the form of attributes: see below.

Returned value: The path of the output file if specified, otherwise the serialized result.

The options argument (which may be absent) has the form of an element of name "options" whose attributes are used to specify different options. For example:

x:serialize( $doc,
             <options output="out\doc.xml"
                      encoding="ISO-8859-1" indent="yes"/>)

This mechanism is similar to XSLT's xsl:output specification and is very convenient since the options can be computed or extracted from an XML document.
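As a minimal sketch of computed options, the query below builds the options element with an enclosed expression and returns the serialized text as a string, since no output file is given. The variable $want-indent and the sample document are hypothetical and serve only as illustration.

```xquery
(: Hypothetical switch, used only for this illustration :)
declare variable $want-indent as xs:boolean := true();

(: The indent attribute is computed; with no output/file option the text is returned as a string :)
let $opts := <options method="XML"
                      indent="{ if ($want-indent) then 'yes' else 'no' }"/>
return x:serialize(<doc><p>text</p></doc>, $opts)
```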

Table 12.1. Implemented serialization options

option name | values | description
method | XML (default), XHTML, HTML, or TEXT | Output method.
output / file | a file path | Output file. If this option is not specified, the generated text is returned as a string.
version | default "1.0" | Version generated in the XML declaration. No validity check.
standalone | "yes" or "no" | No check is performed.
encoding | must be the name of an encoding supported by the JRE | The name supplied is generated in the XML declaration. If different from UTF-8, it forces the output of the XML declaration.
indent | "yes" or "no" (default "no") | Output indented.
indent-value (extension) | integer value | Specifies the number of space characters used for indentation.
omit-xml-declaration | "yes" or "no" (default "no") | Controls the output of an XML declaration.
include-content-type | "yes" or "no" (default "no") | For XHTML and HTML methods, if the value is "yes", a META element specifying the content type is added at the beginning of element HEAD.
escape-uri-attributes | "yes" or "no" (default "yes") | For XHTML and HTML methods, escapes URI attributes (i.e. specific HTML attributes whose value is a URI).
doctype-public | the public ID in the DOCTYPE declaration | Triggers the output of the DOCTYPE declaration. Must be used together with the doctype-system option.
doctype-system | the system ID in the DOCTYPE declaration | Triggers the output of the DOCTYPE declaration.
auto-dtd (extension) | "yes" or "no" (default "yes") | If the node is a document node and if this document has DTD information, then output a DOCTYPE declaration. DTD information is available as follows:

  • A Document stored in an XML Library may have properties storing this information (dtd-system-id and dtd-public-id) initially set by import.

  • A parsed document gets DTD information from the XML parser.

  • A constructed node has no DTD information.


 

1.2. JSON Serialization

This function transforms an XML tree representing JSON data into JSON format.

The XML JSON tree is typically built by the x:content-parse function but can also be built with XQuery constructors.

In future versions supporting XQuery 3.0 Maps and Arrays, this function will also be able to serialize such data into JSON format.

x:serialize-json( $json-data as item(), $options as element(options) )
  as xs:string?

Description: Serializes the element and all its content into JSON format. The output can be a file (see options below) or a string.

Parameter $json-data: an XML tree representing JSON data to be serialized. This tree must conform to the JSON schema used by Qizx (see below).

Parameter $options: an element bearing options in the form of attributes: see below.

Returned value: The path of the output file if specified, otherwise the serialized result.

The options argument (which may be absent) has the form of an element of name "options" whose attributes are used to specify different options. For example:

x:serialize-json( $doc, <options file="json.xml" />)

with $doc holding an XML document representing JSON data in the Qizx/JSON representation:

<?xml version='1.0'?>
<map xmlns="com.qizx.json">
  <pair name="a">
    <number>1.0</number>
  </pair>
  <pair name="b">
    <array>
      <boolean>true</boolean>
      <string>str</string>
      <map/>
    </array>
  </pair>
  <pair name="nothing">
    <null/>
  </pair>
</map>

then the file json.xml will contain:

{ "a": 1.0, "b": [ true, "str", {  } ], "nothing": null }

Table 12.2. Implemented JSON serialization options

option name | values | description
method | XML (default), XHTML, HTML, or TEXT | Output method.
output / file | a file path | Output file. If this option is not specified, the generated text is returned as a string.
indent | integer value | Specifies the number of space characters used for indentation.
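The JSON tree does not have to come from a parser. As a sketch, the query below builds one with XQuery constructors in the "com.qizx.json" namespace and serializes it to a string with an indentation of two spaces. The prefix j and the sample data are arbitrary choices made for this illustration.

```xquery
(: The prefix "j" is an arbitrary choice; only the namespace URI matters :)
declare namespace j = "com.qizx.json";

let $data :=
  <j:map>
    <j:pair name="id"><j:number>42</j:number></j:pair>
    <j:pair name="tags">
      <j:array>
        <j:string>a</j:string>
        <j:string>b</j:string>
      </j:array>
    </j:pair>
  </j:map>
(: No output/file option, so the JSON text is returned as a string :)
return x:serialize-json($data, <options indent="2"/>)
```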

 

2. Parsing

2.1. XML Parsing

function x:parse($xml-text)
  as node()?

Parses a string representing an XML document and returns a node built from that parsing. This is useful for converting a string from any origin into a node.

Note that function x:eval could be used too (and it is more powerful, since any kind of node can be built with it), but there are some syntax differences: for example in x:eval, the curly braces { and } have to be escaped by duplicating them.

Parameter $xml-text: A well-formed XML document as a string.

Returned value: A node of the Data Model if the string could be correctly parsed; the empty sequence if the argument was the empty sequence. An error is raised if there is a parsing error.
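A short sketch of typical use: the string below is a made-up document, and once parsed it can be navigated like any other node. With this input the count of item elements should be 2.

```xquery
(: Hypothetical XML text, e.g. received from an external source :)
let $text := "<order id='17'><item>pen</item><item>ink</item></order>"
let $parsed := x:parse($text)
(: The parsed result is an ordinary node and supports path expressions :)
return count($parsed//item)
```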

2.2. Semi-structured Content Parsing

From version 4.2, Qizx offers a generic mechanism to plug in Content Importers, i.e. parsers of "semi-structured data": data that is not XML but can easily be transformed into an XML representation, and then stored and manipulated in an XML database such as Qizx.

For example:

  • various dialects of HTML can be transformed into XML. The resulting XML can be serialized back into HTML using the x:serialize function above.

  • JSON can be mapped into XML: Qizx offers a built-in facility for parsing JSON data, using a specific schema for its XML representation.

  • Parsers for other formats are planned after version 4.2: Mime Mail (RFC822), CSV, and probably some office formats like RTF.

function x:parse-content($string, $format-name [, $options])
  as node()?

function x:content-parse($string, $format-name [, $options])
  as node()?

Parses a string representing semi-structured data in the format named by $format-name and returns a node built from that parsing.

Note: content-parse is the old name for parse-content and will be deprecated.

Parameter $string: The semi-structured content to parse, as a string.

Parameter $format-name: A string naming the Content Importer. For example "html", "json". The recognized names are described for each Content Importer.

Parameter $options: An XML node with an attribute for each option. For example <options namespaces="true"/>

Returned value: A node of the Data Model if the string could be correctly parsed; the empty sequence if the argument was the empty sequence. An error is raised if there is a parsing error.

function x:parse-url-content($url, $format-name [, $options])
  as node()?

Parses semi-structured data located at $url, in the format named by $format-name, and returns a node built from that parsing.

Parameter $url: A well-formed URL. Supported URL protocols currently are http: and file: (by default).

Parameter $format-name: A string naming the Content Importer. For example "html", "json". The recognized names are described for each Content Importer.

Parameter $options: An XML node with an attribute for each option. For example <options namespaces="true"/>

Returned value: A node of the Data Model if the content could be correctly parsed; the empty sequence if the argument was the empty sequence. An error is raised if there is a parsing error.

2.2.1. JSON parser

format argument:

for invoking the JSON parser, the value of the $format-name argument is "json" or "text/json".

Options

No options available to date.

Generated XML

Example:

x:content-parse('{ "a" : 1, b:[true, "str", {}], nothing:null}',
                "json")

Produces

<?xml version='1.0'?>
<map xmlns="com.qizx.json">
  <pair name="a">
    <number>1.0</number>
  </pair>
  <pair name="b">
    <array>
      <boolean>true</boolean>
      <string>str</string>
      <map/>
    </array>
  </pair>
  <pair name="nothing">
    <null/>
  </pair>
</map>

Schema:

  • A JSON map is represented by a map element with as many child pair elements as there are key-value pairs in the map.

  • A pair element has an attribute name for the value of the key. Its child element represents the value.

  • A JSON array is represented by an array element with as many child elements as there are array items.

  • JSON values are trivially represented as elements boolean, number, string.

  • A JSON null value is represented by the empty element null.

  • All elements use the namespace "com.qizx.json".
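Because the result follows the schema above, parsed JSON can be queried with ordinary path expressions. The sketch below assumes the same sample input as the earlier example (with quoted keys) and binds the arbitrary prefix j to the JSON namespace. Given the tree shown above, the first expression would yield the text of the number element and the second would count three array entries.

```xquery
(: The prefix "j" is an arbitrary choice for this illustration :)
declare namespace j = "com.qizx.json";

let $json := x:parse-content('{ "a": 1, "b": [true, "str", {}], "nothing": null }',
                             "json")
return (
  (: value stored under key "a" :)
  $json//j:pair[@name = "a"]/j:number/string(),
  (: number of items in the array stored under key "b" :)
  count($json//j:pair[@name = "b"]/j:array/*)
)
```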

2.2.2. HTML parser

HTML parsing is performed by the TagSoup parser, which can parse HTML "as it is found in the wild", i.e. possibly malformed.

format argument:

for invoking the HTML parser, the value of the $format-name argument is "html" or "text/html".

Options

Recognized options are either TagSoup options or short names for SAX features.

TagSoup options:

  • "ignore-bogons": A value of "true" indicates that the parser will ignore unknown elements.

  • "bogons-empty": A value of "true" indicates that the parser will give unknown elements a content model of EMPTY; a value of "false", a content model of ANY.

  • "root-bogons": A value of "true" indicates that the parser will allow unknown elements to be the root of the output document.

  • "default-attributes": A value of "true" indicates that the parser will return default attribute values for missing attributes that have default values.

  • "translate-colons": A value of "true" indicates that the parser will translate colons into underscores in names.

  • "restart-elements": A value of "true" indicates that the parser will attempt to restart the restartable elements.

  • "ignorable-whitespace": A value of "true" indicates that the parser will transmit whitespace in element-only content via the SAX ignorableWhitespace callback. Normally this is not done, because HTML is an SGML application and SGML suppresses such whitespace.

  • "cdata-elements": A value of "true" indicates that the parser will process the script and style elements (or any elements with type='cdata' in the TSSL schema) as SGML CDATA elements (that is, no markup is recognized except the matching end-tag).

SAX features:

Short names are used: for example "namespaces" is a short name for "http://xml.org/sax/features/namespaces".

  • "namespaces"

  • "namespace-prefixes"

  • "external-general-entities"

  • "external-parameter-entities"

  • etc. See the TagSoup documentation for the complete list.
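The following sketch parses a deliberately sloppy, made-up HTML fragment with the "html" importer, passing the SAX feature "namespaces" as an option, and then serializes the resulting tree back to HTML with x:serialize. The exact shape of the repaired tree depends on TagSoup and is not shown here.

```xquery
(: Hypothetical malformed input: unclosed p and b elements :)
let $html := "<p>first paragraph<p>second <b>paragraph"
let $tree := x:parse-content($html, "html", <options namespaces="true"/>)
(: Round trip: write the repaired tree back out as HTML text :)
return x:serialize($tree, <options method="HTML" indent="yes"/>)
```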

2.2.3. HTML5 parser

HTML5 parsing is performed by the parser by Henri Sivonen and Mozilla Foundation (c) 2007-2010.

format argument:

for invoking the HTML5 parser, the value of the $format-name argument is "html5" or "text/html5".

Options

In addition to SAX features (short names), recognizable options are:

  • "unicode-normalization-checking"

  • "html4-mode-compatible-with-xhtml1-schemata"

  • "mapping-lang-to-xml-lang"

  • "scripting-enabled"

3. XSL Transformation

The x:transform function invokes an XSLT style-sheet on a node and can retrieve the results of the transformation as a tree, or let the style-sheet output the results.

This is a useful feature when one wants to transform a document (for example extracted from the XML Libraries) or a computed fragment of XML into different output formats like HTML, XSL-FO etc.

This example transforms the document $doc and writes the result to the file out\doc.xml:

x:transform( $doc, "ssheet1.xsl",
             <parameters param1="one" param2="two"/>,
             <options output-file="out\doc.xml" indent="yes"/>)

The next example returns a new document tree. Suppose we have this very simple stylesheet which renames the element "doc" to "newdoc":

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version ="1.0" >
  <xsl:template match="doc">
     <newdoc><xsl:apply-templates/></newdoc>
  </xsl:template>
</xsl:stylesheet>

The following XQuery expression:

x:transform( <doc>text</doc>, "ssheet1.xsl", <parameters/> )

returns:

<newdoc>text</newdoc>

x:transform( $source as node(), 
             $stylesheet-URI as xs:string, 
             $xslt-parameters as element(parameters) 
             [, $options as element(options)] )
  as node()?

Transforms the source tree through an XSLT stylesheet. If no output file is explicitly specified in the options, the function returns a new tree.

Parameter $source: an XML tree to be transformed. It does not need to be a complete document.

Parameter $stylesheet-URI: the URI of an XSLT stylesheet. Stylesheets are cached and reused for consecutive transformations.

Parameter $xslt-parameters: an element holding parameter values to pass to the XSLT engine. The parameters are specified in the form of attributes. The name of an attribute matches the name of an xsl:param declaration in the stylesheet (namespaces can be used). The value of the attribute is passed to the XSLT transformer.

Parameter $options: [optional argument] an element holding options in the form of attributes: see below.

Returned value: if the path of an output file is not specified in the options, the function returns a new document tree which is the result of the transformation of the source tree. Otherwise, it returns the empty sequence.

Table 12.3. XSLT transform options

option name | values | description
output-file | An absolute file path. | Output file. If this option is not specified, the generated tree is returned by the function, otherwise the function returns an empty sequence.
XSLT output properties (instruction xsl:output): version, standalone, encoding, indent, omit-xml-declaration etc. | | These options are used by the style-sheet for outputting the transformed document. They are ignored if no output-file option is specified.
Specific options of the XSLT engine (Saxon or default XSLT engine) | | An invalid option may cause an error.

About the efficiency of the connection with XSLT

The connection with an XSLT engine uses generic JAXP interfaces, and thus must copy XML trees passed in both directions. This is not as efficient as it could be and can even cause memory problems if the size of processed documents is larger than a few dozen megabytes, depending on the available memory size.

4. Dynamic evaluation

The following functions allow dynamically compiling and executing XQuery expressions.

function x:eval( $expression as xs:string )
  as item()*

Compiles and evaluates a simple expression provided as a string.

The expression is executed in the context of the current query: it can use global variables, functions and namespaces of the current static context. It can also use the current item '.' if defined in the evaluation context.

However there is no access to the local context (for example if x:eval is invoked inside a function, the arguments or the local variables of the function are not visible.)

Parameter $expression: a simple expression (cannot contain prologue declarations).

Returned value: evaluated value of the expression.

Example:

declare variable $x := 1;
declare function local:fun($p as xs:integer) { $p * 2 };

let $expr := "1 + $x, local:fun(3)"
return x:eval($expr)

This should return the sequence (2, 6).
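Since function arguments and local variables are not visible to the evaluated expression, one possible workaround is to splice their values into the expression string. The sketch below does this for a hypothetical function local:scaled. It is only practical for values with a simple literal form, such as numbers. With an argument of 2, the evaluated string is "21 * 2" and the result should be 42.

```xquery
declare function local:scaled($factor as xs:integer) {
  (: $factor is local and therefore invisible to x:eval,
     so its value is inserted into the expression text :)
  x:eval(concat("21 * ", string($factor)))
};

local:scaled(2)
```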

5. Query extensions

5.1. Estimated count and pagination

The following functions can be used to quickly estimate the count of documents returned by a query, when an exact count of all results would take too long to compute. They are designed to work on tens of millions of documents.

The estimated count provided by these functions is valid under the following conditions:

  • There is zero or one result ("hit") of the query per document. The functions count documents, not nodes.

  • The query is applied to a homogeneous domain (a collection, typically): that is, each document in the domain has a chance to match the query (or in other terms, the domain does not contain documents that cannot match the query, and would only distort the count estimation).

Examples: assume collection /Products (the domain) contains only documents whose main node is 'Product'.

The estimated count in the following example would be the size of collection('/Products'):

x:count-estimate(collection('/Products')//Product)

In the following example the function looks at the first result items and estimates the total number by extrapolation: this would be a fraction of the size of the domain represented by collection('/Products'). The accuracy can be controlled by an optional parameter (see below):

x:count-estimate(collection('/Products')//Product [ price > 10 ])

Functions:

function x:count-estimate( $query [, $min-count as xs:integer ])
  as xs:integer

Returns an estimated count of documents matching the query.

Parameter $query: any query that matches one node per document at most. Should be an expression, which is evaluated within the x:count-estimate function. Passing an already evaluated sequence brings no benefit.

Parameter $min-count: Optional (default value is 200). Controls the accuracy of the count estimation. This is the number of result items enumerated before doing the estimation. The estimated count is then obtained by comparing the current position in the search domain to the size of the domain, and extrapolating. A larger $min-count gives better accuracy, but can lead to slower execution.

Returned value: an integer item. If smaller than $min-count, this value represents the exact count. Otherwise the value is strongly rounded to provide a precision of about 10% (for example 11000 instead of 10653).

function x:paged-query( $page-start as xs:integer, $page-size as xs:integer, $query [, $min-count as xs:integer ])
  as item()*

Similar to x:count-estimate() but in addition returns a "page" of result items.

This function could be implemented with x:count-estimate() and subsequence($query, $page-start, $page-size), but it combines the two operations in a slightly more efficient way.

Parameter $query: any query that matches one node per document at most. This expression is in fact a function (or "lambda expression") passed to the x:paged-query function itself. Passing an expression already evaluated (e.g. using a variable) would bring no benefit.

Parameter $page-start: The desired start position in the result sequence.

Parameter $page-size: The desired number of result items.

Parameter $min-count: Optional (default value is 200). Controls the accuracy of the count estimation. This is the number of result items enumerated before doing the estimation. The estimated count is then obtained by comparing the current position in the search domain to the size of the domain, and extrapolating. A larger $min-count gives better accuracy, but can lead to slower execution.

Returned value: A sequence made of: first an integer item which is the estimated count, exactly like in x:count-estimate(), then the items of the page.
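As a sketch of how the returned sequence can be taken apart, the query below asks for the first 20 hits of the /Products example used earlier, then separates the estimated count (first item) from the page items (the rest). The wrapper element name page and its attribute are hypothetical.

```xquery
(: The query is passed directly as an expression, not through a variable :)
let $result := x:paged-query(1, 20, collection('/Products')//Product[ price > 10 ])
let $estimate := $result[1]               (: estimated total count :)
let $page := subsequence($result, 2)      (: the items of the requested page :)
return <page estimated-total="{ $estimate }">{ $page }</page>
```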

5.2. Pattern-matching

The following functions match the string-value of nodes (elements and attributes) with a pattern.

Example 1: this expression returns true if the value of the attribute @lang matches the SQL-style pattern:

x:like( "en%", $node/@lang )

Example 2: this expression returns true if the content of the element 'NAME' matches the pattern:

$p/NAME[ x:like( "Theo%" ) ]

function x:like( $pattern as xs:string [, $context-nodes as node()* ])
  as xs:boolean

Returns true if the pattern matches the string-value of at least one node in the node sequence argument.

Parameter $pattern: a SQL-style pattern: the wildcard '_' matches any single character, the wildcard '%' matches any sequence of characters.

Parameter $context-nodes: optional sequence of nodes. The function checks sequentially the string-value of each node against the pattern. If absent, the argument defaults to '.', the current item. This makes sense inside a predicate, like in example 2 above.

Returned value: a boolean.

function x:ulike( $pattern as xs:string [, $context-nodes as node()* ])
  as xs:boolean

This function is very similar to x:like, except that the pattern has syntax à la Unix ("glob pattern"). The character '?' is used instead of '_' (single character match), and '*' instead of '%' (multi-character match).

Note: these functions — as well as the standard fn:matches function, and the full-text functions — are automatically recognized by the query optimizer which uses library indexes to boost their execution whenever possible.
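The two pattern syntaxes can be compared side by side. In this sketch, $p is a small made-up fragment. Based on the descriptions above, the first two expressions should both select the NAME element containing "Theodore", and the third should be true because "Anna" matches "Ann" followed by exactly one character.

```xquery
(: Hypothetical data used only for this illustration :)
let $p := <PEOPLE><NAME>Theodore</NAME><NAME>Anna</NAME></PEOPLE>
return (
  $p/NAME[ x:like("Theo%") ],     (: SQL-style: % matches any sequence of characters :)
  $p/NAME[ x:ulike("Theo*") ],    (: glob-style: * plays the role of % :)
  x:ulike("Ann?", $p/NAME)        (: ? matches one character, like _ in x:like :)
)
```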

5.3. Range testing

This function tests whether an item belongs to a range, in an optimized way.

This function is used typically to optimize a predicate in a Library query, for example

 //element[ x:in-range(@weight, 1, 10) ] 

which is equivalent to

//element[ @weight >= 1 and @weight <= 10 ]

The reason for this function is that the query optimizer is not able to detect such a double test in all situations. The function could become useless in later versions of Qizx, after improvement of the query optimizer.

function x:in-range( $value, $low-bound as item(), $high-bound as item() )
  as xs:boolean

function x:in-range( $value, $low-bound as item(), $high-bound as item(), 
                     $low-included as xs:boolean,
                     $high-included as xs:boolean )
  as xs:boolean

Returns true if at least one item from the sequence $value belongs to the range defined by other parameters.

Parameter $value: Any sequence of items. Items must be comparable to the bounds, otherwise a type error is raised.

Parameters $low-bound, $high-bound: Lower and upper bounds of the range. They must be of compatible types.

Parameter $low-included: If $low-included is equal to true(), the comparison used is $low-bound <= $value, otherwise $low-bound < $value. If absent, <= is assumed.

Parameter $high-included: If $high-included is equal to true(), the comparison used is $value <= $high-bound, otherwise $value < $high-bound. If absent, <= is assumed.

Returned value: True if at least one item from the sequence $value belongs to the range defined by $low-bound, $high-bound.
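The effect of the inclusion flags can be sketched with literal values. Following the parameter descriptions above, the three calls below would be expected to yield true, false and true respectively.

```xquery
(
  x:in-range((0, 5, 12), 1, 10),            (: 5 lies between 1 and 10 :)
  x:in-range(10, 1, 10, true(), false()),   (: upper bound excluded, so 10 < 10 fails :)
  x:in-range(10, 1, 10, true(), true())     (: both bounds included, as in the short form :)
)
```

In a Library query, the five-argument form //element[ x:in-range(@weight, 1, 10, true(), false()) ] would correspond to //element[ @weight >= 1 and @weight < 10 ].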

6. Date and Time

6.1. Differences with W3C specifications

Qizx is compliant with the W3C Recommendation. The only differences at present are extensions of the cast operation: Qizx can directly cast date, time, dateTime and durations to and from double values representing seconds, and keeps the extended "constructors" that build date, dateTime, etc, from numeric components like days, hours, minutes, etc.

6.2. Cast Extensions

In order to make computations easier, Qizx can:

  • Cast xdt:yearMonthDuration to numeric values: this yields the number of months. The following expression returns 13:

    xdt:yearMonthDuration("P1Y1M") cast as xs:integer

  • Conversely, cast a numeric value representing months to xdt:yearMonthDuration. The following expression holds true:

    xdt:yearMonthDuration(13) = xdt:yearMonthDuration("P1Y1M")

  • Cast xdt:dayTimeDuration to double: this yields the number of seconds. The following expression returns 7201:

    xdt:dayTimeDuration("PT2H1S") cast as xs:double

  • Conversely, cast a numeric value representing seconds to xdt:dayTimeDuration.

  • Cast xs:dateTime to double. This returns the number of seconds elapsed since "the Epoch", i.e. 1970-01-01T00:00:00Z. If the timezone is not specified, it is considered to be UTC (GMT).

  • Conversely, cast a numeric value representing seconds from the origin to a dateTime with GMT timezone.

  • Cast from/to the xs:date type in a similar way (like a dateTime with time equal to 00:00:00).

    xs:date("1970-01-02") cast as xs:double = 86400

  • Cast from/to the xs:time type in a similar way (seconds from 00:00:00).

    xs:time("01:00:00") cast as xs:double = 3600
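One use of these casts is plain arithmetic on moments in time. As a sketch with two made-up timestamps that are 90 minutes apart, subtracting the two doubles should give 5400 seconds.

```xquery
(: Hypothetical timestamps, both in UTC :)
let $start := xs:dateTime("2001-01-01T00:00:00Z")
let $end   := xs:dateTime("2001-01-01T01:30:00Z")
(: Each cast yields seconds since the Epoch; the difference is the elapsed time :)
return ($end cast as xs:double) - ($start cast as xs:double)
```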

6.3. Additional constructors

These constructors allow date, time, dateTime objects to be built from numeric components (this is quite useful in practice).

function xs:date( $year as xs:integer,
                  $month as xs:integer,
                  $day as xs:integer )
  as xs:date

Builds an xs:date from a year, a month, and a day in integer form. The implicit timezone is used.

For example xs:date(1999, 12, 31) returns the same value as xs:date("1999-12-31").

function xs:time( $hour as xs:integer,
                  $minute as xs:integer,
                  $second as xs:double )
  as xs:time

Builds an xs:time from an hour and a minute as integers, and seconds as a double. The implicit timezone is used.

function xs:dateTime( $year as xs:integer, $month as xs:integer, $day as xs:integer, 
                      $hour as xs:integer, $minute as xs:integer, $second as xs:double 
                      [, $timezone as xs:double] )
  as xs:dateTime

Builds an xs:dateTime from the six components that constitute date and time.

A timezone can be specified: it is expressed as a signed number of hours (ranging from -14 to 14), otherwise the implicit timezone is used.
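A sketch of the seven-argument form: the call below passes a timezone of +1 hour. If the timezone argument behaves as described, the comparison with the equivalent lexical form should hold. The chosen date is arbitrary.

```xquery
(: year, month, day, hour, minute, seconds, timezone in hours :)
xs:dateTime(1999, 12, 31, 23, 59, 59.5, 1)
  = xs:dateTime("1999-12-31T23:59:59.5+01:00")
```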

6.4. Additional accessors

These functions are kept for compatibility. They are slightly different from the standard functions:

  • they accept several date/time and duration types for the argument (so for example we have get-minutes instead of get-minutes-from-time, get-minutes-from-dateTime etc.),

  • but they do not accept untypedAtomic (node contents): such an argument should be cast to the proper type before being used. So the standard functions might be as convenient here.

function get-seconds( $moment )
  as xs:double?

Returns the "second" component from an xs:time, xs:dateTime, or xs:duration.

Can replace fn:seconds-from-dateTime, fn:seconds-from-time, fn:seconds-from-duration, except that the returned type is double instead of decimal, and an argument of type xdt:untypedAtomic is not valid.

function get-all-seconds( $duration )
  as xs:double?

Returns the total number of seconds from an xs:duration. This does not take into account months and years.

For example get-all-seconds(xs:duration("P1YT1H")) returns 3600.

function get-minutes( $moment )
  as xs:integer?

Returns the "minute" component from an xs:time, xs:dateTime, or xs:duration.

function get-hours( $moment )
  as xs:integer?

Returns the "hour" component from an xs:time, xs:dateTime, or xs:duration.

function get-days( $moment )
  as xs:integer?

Returns the "day" component from an xs:date, xs:dateTime, xs:gDay, xs:gMonthDay or xs:duration.

function get-months( $moment )
  as xs:integer?

Returns the "month" component from an xs:date, xs:dateTime, xs:gYearMonth, xs:gMonth, xs:gMonthDay or xs:duration.

function get-years( $moment )
  as xs:integer?

Returns the "year" component from an xs:date, xs:dateTime, xs:gYear, xs:gYearMonth or xs:duration.

function get-timezone( $moment )
  as xs:duration?

Returns the "timezone" component from any date/time type and xs:duration.

The returned value is like timezone-from-* except that the returned type is xs:duration, not xdt:dayTimeDuration.

7. Error handling

Early versions of XQuery had no mechanism to handle run-time errors. Qizx has provided its own try/catch since its very first version.

Qizx now supports the standard try/catch defined in XQuery 3.0.
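As a sketch of the standard form, the query below catches any error raised while opening a document and reports the error code and description through the variables defined by XQuery 3.0. The err prefix is declared explicitly here as a precaution; the document name is made up.

```xquery
(: Namespace of the standard error variables in XQuery 3.0 :)
declare namespace err = "http://www.w3.org/2005/xqt-errors";

try {
    doc("unreachable.xml")
}
catch * {
    (: $err:code and $err:description are bound inside the catch clause :)
    <error code="{ $err:code }" msg="{ $err:description }"/>
}
```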

For the record, the try/catch construct provided by early versions of Qizx (still supported) is documented here:

try { expr } catch($error) { fallback-expr }

The try/catch extended language construct first evaluates the body expr. If no error occurs, then the result of the try/catch is the return value of this expression.

If an error occurs, the local variable $error receives a string value which is the error message, and fallback-expr is evaluated (with possible access to the error message). The resulting value of the try/catch is in this case the value of this fallback expression. An error in the evaluation of the fallback-expression is not caught.

The type of this expression is the type that encompasses the types of both arguments.

Important

The body (first expression) is guaranteed to be evaluated completely before exiting the try/catch - unless an error occurs. In other terms, lazy evaluation, which is used in most Qizx expressions, does not apply here.

This is especially important when functions with side-effects are called in the body. If such functions generate errors, these errors are caught by the try/catch, as one can expect. Otherwise lazy evaluation could produce strange effects.

Example: tries to open a document, returns an element error with an attribute msg containing the error message if the document cannot be opened.

try {
    doc("unreachable.xml")
}
catch($err) {
    <error msg="{$err}"/>
}