Qizx fe-4.4p2 API

com.qizx.api.util.fulltext
Class FullTextSnippetExtractor

java.lang.Object
  extended by com.qizx.xdm.XMLPullStreamBase
      extended by com.qizx.api.util.fulltext.FullTextSnippetExtractor
All Implemented Interfaces:
FullTextPullStream, XMLPullStream

public class FullTextSnippetExtractor
extends com.qizx.xdm.XMLPullStreamBase
implements FullTextPullStream

Extracts a snippet of text from an XML node, attempting to show the key terms of a full-text query.

As an implementation of the FullTextPullStream interface, this object separates full-text term occurrences from the surrounding plain text, so that the terms can be "highlighted" by enclosing them within an XML element.

To retrieve the results, call start on the source node, then call moveToNextEvent repeatedly. The events returned are of type FT_TERM and TEXT, followed finally by XMLPullStream.END.

Example: the full-text query is: . ftcontains 'romeo juliet', and the FullTextSnippetExtractor is used on this document:

<PLAY>
  <TITLE>The Tragedy of Romeo and Juliet</TITLE>
  <FM> ...
  

Events generated could be successively:

TEXT     text='The Tragedy of '  wordCount=3
FT_TERM  text='Romeo'            wordCount=1, termPosition=0
TEXT     text=' and '            wordCount=1
FT_TERM  text='Juliet'           wordCount=1, termPosition=1
END
(Note that this is only an example: the actual snippet may differ.)
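
The event sequence above can be consumed with a simple pull loop. The sketch below is not Qizx code: it declares a small stand-in interface, SnippetEvents, that mirrors only three of the documented methods (moveToNextEvent, getText, getTermPosition), and a hypothetical ReplayStream that replays the example events, so that it compiles without the library. The numeric values of TEXT, FT_TERM and END are placeholders chosen for the sketch; real code would presumably use the constants from XMLPullStream and FullTextPullStream and call start(node) on a FullTextSnippetExtractor before looping.

```java
import java.util.ArrayList;
import java.util.List;

public class SnippetEventLoop {
    // Placeholder event codes for this sketch only. The real values come from
    // XMLPullStream.TEXT, FullTextPullStream.FT_TERM and XMLPullStream.END.
    static final int END = 0, TEXT = 1, FT_TERM = 2;

    /** Hypothetical stand-in mirroring three documented extractor methods. */
    interface SnippetEvents {
        int moveToNextEvent();
        String getText();
        int getTermPosition();
    }

    /** Hypothetical stream replaying fixed rows of {event, text, termPosition}. */
    static class ReplayStream implements SnippetEvents {
        private final Object[][] rows;
        private int index = -1;

        ReplayStream(Object[][] rows) { this.rows = rows; }

        public int moveToNextEvent() {
            index++;
            return index < rows.length ? (Integer) rows[index][0] : END;
        }
        public String getText() { return (String) rows[index][1]; }
        public int getTermPosition() { return (Integer) rows[index][2]; }
    }

    /** Pulls events until END and records one description line per event. */
    static List<String> describe(SnippetEvents stream) {
        List<String> lines = new ArrayList<>();
        for (int event = stream.moveToNextEvent(); event != END;
                 event = stream.moveToNextEvent()) {
            if (event == FT_TERM) {
                lines.add("term #" + stream.getTermPosition() + ": " + stream.getText());
            } else if (event == TEXT) {
                lines.add("text: " + stream.getText());
            }
        }
        return lines;
    }

    /** The four example events listed above (END is produced by the stream). */
    static SnippetEvents example() {
        return new ReplayStream(new Object[][] {
            {TEXT, "The Tragedy of ", -1},
            {FT_TERM, "Romeo", 0},
            {TEXT, " and ", -1},
            {FT_TERM, "Juliet", 1},
        });
    }

    public static void main(String[] args) {
        for (String line : describe(example())) {
            System.out.println(line);
        }
    }
}
```

The loop checks the event type before calling getTermPosition, since the term position is documented only for FT_TERM events.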

The convenience method makeSnippet provides a simpler way to build a snippet. Example (assuming 'session' is an XQuerySession or a Library):

 Expression ftquery =
    session.compileExpression(". ftcontains 'romeo juliet' all words");
 FullTextSnippetExtractor ftx = new FullTextSnippetExtractor(ftquery);
 Node snippet = ftx.makeSnippet(node, session.getQName("div"),
                                session.getQName("span"),
                                session.getQName("class"), "ft_");
Results would be a node of the form:
  <div>The Tragedy of <span class="ft_0">Romeo</span> and <span class="ft_1">Juliet</span></div>
 


Field Summary
static int GAP
          A pseudo-event that represents skipped words
 
Fields inherited from interface com.qizx.api.fulltext.FullTextPullStream
FT_TERM
 
Fields inherited from interface com.qizx.api.XMLPullStream
COMMENT, DOCUMENT_END, DOCUMENT_START, ELEMENT_END, ELEMENT_START, END, PROCESSING_INSTRUCTION, START, TEXT
 
Constructor Summary
FullTextSnippetExtractor(Expression query)
          Creates a FullTextSnippetExtractor from a compiled expression.
FullTextSnippetExtractor(com.qizx.queries.FullText.Selection query, FullTextFactory ftFactory)
          For internal use.
FullTextSnippetExtractor(String simpleSyntaxQuery, FullTextFactory fulltextFactory, String language)
          Creates a FullTextSnippetExtractor from a query string using the simple full-text syntax.
 
Method Summary
 Node getCurrentNode()
          Returns the current node, if the implementation is able to provide it.
 int getMaxSnippetSize()
          Gets the current maximum number of words in a snippet.
 int getMaxWorkSize()
          Gets the maximum number of words examined to create a snippet.
 QName getName()
          Returns the name of the current element node, or if the node is not an element, returns the name of the parent element.
 int getQueryTermCount()
          Returns the number of terms in the query.
 String[] getQueryTerms()
          Returns the terms of the query as a String array.
 int getTermPosition()
          On a FT_TERM event, returns the rank of the term (word, wildcard) in the full-text query.
 String getText()
          Returns the textual contents of an atomic node.
 int getWordCount()
          On a TEXT or FT_TERM event, returns the number of words in the text chunk.
 Node makeSnippet(Node node, QName wrapperElement, QName hiliterElement, QName styleAttribute, String stylePrefix)
          Directly builds a snippet from a source Node.
 int moveToNextEvent()
          Moves the event stream one step forward.
 void setMaxSnippetSize(int maxSnippetSize)
          Sets the maximum number of words in a snippet.
 void setMaxWorkSize(int maxWorkSize)
          Sets the maximum number of words examined to create a snippet.
 void start(Node node)
          Searches snippet components in a source XML document or node.
 
Methods inherited from class com.qizx.xdm.XMLPullStreamBase
getAttributeCount, getAttributeName, getAttributeValue, getCurrentEvent, getDTDName, getDTDPublicId, getDTDSystemId, getEncoding, getInternalSubset, getNamespaceCount, getNamespacePrefix, getNamespaceURI, getTarget, getTextLength
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface com.qizx.api.XMLPullStream
getAttributeCount, getAttributeName, getAttributeValue, getCurrentEvent, getDTDName, getDTDPublicId, getDTDSystemId, getEncoding, getInternalSubset, getNamespaceCount, getNamespacePrefix, getNamespaceURI, getTarget, getTextLength
 

Field Detail

GAP

public static final int GAP
A pseudo-event that represents skipped words

See Also:
Constant Field Values
Constructor Detail

FullTextSnippetExtractor

public FullTextSnippetExtractor(Expression query)
                         throws EvaluationException
Creates a FullTextSnippetExtractor from a compiled expression. The expression must be either a compiled full-text predicate or a string using the simple full-text syntax.

Parameters:
query - a compiled full-text predicate, or a string using the simple full-text syntax.
Throws:
EvaluationException
See Also:
FullTextHighlighter

FullTextSnippetExtractor

public FullTextSnippetExtractor(String simpleSyntaxQuery,
                                FullTextFactory fulltextFactory,
                                String language)
                         throws DataModelException
Creates a FullTextSnippetExtractor from a query string using the simple full-text syntax.

Parameters:
simpleSyntaxQuery - a query using the simple full-text syntax.
fulltextFactory - a FullTextFactory used with the language parameter to get a tokenizer (both at compile-time and run-time).
language - language used for the options of the full-text query
Throws:
DataModelException - if the query is incorrect

FullTextSnippetExtractor

public FullTextSnippetExtractor(com.qizx.queries.FullText.Selection query,
                                FullTextFactory ftFactory)
                         throws EvaluationException
For internal use.

Throws:
EvaluationException
Method Detail

getMaxSnippetSize

public int getMaxSnippetSize()
Gets the current maximum number of words in a snippet.


setMaxSnippetSize

public void setMaxSnippetSize(int maxSnippetSize)
Sets the maximum number of words in a snippet. Default is 25.

Parameters:
maxSnippetSize - a positive integer

getMaxWorkSize

public int getMaxWorkSize()
Gets the maximum number of words examined to create a snippet. Default is 500.


setMaxWorkSize

public void setMaxWorkSize(int maxWorkSize)
Sets the maximum number of words examined to create a snippet.

When the scanned document or node belongs to an indexed XML Library, indexes are used to skip directly to occurrences of full-text terms, thus reducing the work load.

Parameters:
maxWorkSize - a positive integer.

makeSnippet

public Node makeSnippet(Node node,
                        QName wrapperElement,
                        QName hiliterElement,
                        QName styleAttribute,
                        String stylePrefix)
                 throws DataModelException
Directly builds a snippet from a source Node.

Parameters:
node - source document
wrapperElement - name of an element used to wrap the whole snippet
hiliterElement - name of an element used to wrap each highlighted term
styleAttribute - optional name (can be null) of an attribute of the hiliter element bearing a style indication
stylePrefix - a prefix for the style indicator, if styleAttribute is used.
Returns:
a Node representing the full-text snippet
Throws:
DataModelException - if there is a problem accessing the input node
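
To illustrate how the five parameters shape the result, the following sketch assembles equivalent markup as a plain string from a list of snippet pieces. It does not show how makeSnippet is implemented (the real method returns a Node): Piece, render and escape are hypothetical names, element and attribute names are passed as plain strings rather than QName, and omitting the attribute when styleAttribute is null is an assumption based on the parameter being described as optional.

```java
import java.util.Arrays;
import java.util.List;

public class SnippetMarkupSketch {
    /** Hypothetical snippet piece; a negative termPosition marks plain text. */
    static final class Piece {
        final String text;
        final int termPosition;

        Piece(String text, int termPosition) {
            this.text = text;
            this.termPosition = termPosition;
        }
    }

    /** Minimal escaping of character data (a simplification for the sketch). */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps the pieces in `wrapper` and each term in `hiliter`. */
    static String render(List<Piece> pieces, String wrapper, String hiliter,
                         String styleAttribute, String stylePrefix) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(wrapper).append('>');
        for (Piece p : pieces) {
            if (p.termPosition < 0) {
                out.append(escape(p.text));   // plain text between terms
                continue;
            }
            out.append('<').append(hiliter);
            if (styleAttribute != null) {
                // Style value is the prefix followed by the term position, e.g. ft_0.
                out.append(' ').append(styleAttribute).append("=\"")
                   .append(stylePrefix).append(p.termPosition).append('"');
            }
            out.append('>').append(escape(p.text))
               .append("</").append(hiliter).append('>');
        }
        return out.append("</").append(wrapper).append('>').toString();
    }

    /** Pieces matching the Romeo and Juliet example from the class description. */
    static List<Piece> example() {
        return Arrays.asList(
            new Piece("The Tragedy of ", -1), new Piece("Romeo", 0),
            new Piece(" and ", -1), new Piece("Juliet", 1));
    }

    public static void main(String[] args) {
        System.out.println(render(example(), "div", "span", "class", "ft_"));
    }
}
```

With the arguments "div", "span", "class" and "ft_", the sketch yields the same markup as the example result in the class description, which makes the role of stylePrefix visible: each term's style value is the prefix followed by its position in the query.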

start

public void start(Node node)
           throws DataModelException
Searches snippet components in a source XML document or node. The method moveToNextEvent can then be used to extract those components.

Parameters:
node - a node (document or element) from which to extract a snippet.
Throws:
DataModelException - if there is a problem accessing the source node

moveToNextEvent

public int moveToNextEvent()
                    throws DataModelException
Description copied from interface: XMLPullStream
Moves the event stream one step forward.

Specified by:
moveToNextEvent in interface XMLPullStream
Returns:
the next event. If the stream has reached its end, returns XMLPullStream.END.
Throws:
DataModelException - may be thrown by the stream implementation in case access to data is impossible (deleted document, closed Library).

getTermPosition

public int getTermPosition()
Description copied from interface: FullTextPullStream
On a FT_TERM event, returns the rank of the term (word, wildcard) in the full-text query. Depends on the actual implementation of this interface.

Example: in the following query, the term 'romeo' has position 0, and the term 'juliet' has position 1.

 . ftcontains "romeo juliet" all words
 

Note that excluded terms (following ftnot or not in) are ignored.

Specified by:
getTermPosition in interface FullTextPullStream

getWordCount

public int getWordCount()
Description copied from interface: FullTextPullStream
On a TEXT or FT_TERM event, returns the number of words in the text chunk. For a FT_TERM, the value returned is 1, because phrases are not recognized as a whole.

Specified by:
getWordCount in interface FullTextPullStream
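
Because each event reports its own word count, a caller can total the words of a snippet while consuming the stream. The sketch below assumes the counts have already been collected from getWordCount(); Chunk, totalWords and termCount are hypothetical names, and the sample values are those of the example events in the class description.

```java
import java.util.Arrays;
import java.util.List;

public class WordCountSketch {
    /** Hypothetical record of one TEXT or FT_TERM event and its word count. */
    static final class Chunk {
        final boolean term;   // true for FT_TERM, false for TEXT
        final int wordCount;  // value getWordCount() would report for the event

        Chunk(boolean term, int wordCount) {
            this.term = term;
            this.wordCount = wordCount;
        }
    }

    /** Total number of words across all chunks of the snippet. */
    static int totalWords(List<Chunk> chunks) {
        int total = 0;
        for (Chunk c : chunks) {
            total += c.wordCount;
        }
        return total;
    }

    /** Number of highlighted terms; each FT_TERM chunk is a single word. */
    static int termCount(List<Chunk> chunks) {
        int count = 0;
        for (Chunk c : chunks) {
            if (c.term) {
                count++;
            }
        }
        return count;
    }

    /** Word counts of the example events: 3, 1, 1, 1. */
    static List<Chunk> example() {
        return Arrays.asList(new Chunk(false, 3), new Chunk(true, 1),
                             new Chunk(false, 1), new Chunk(true, 1));
    }

    public static void main(String[] args) {
        List<Chunk> chunks = example();
        System.out.println(totalWords(chunks) + " words, "
                           + termCount(chunks) + " highlighted");
    }
}
```

For the example events this totals 6 words, 2 of them highlighted, which is well under the default maximum snippet size of 25.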

getText

public String getText()
Description copied from interface: XMLPullStream
Returns the textual contents of an atomic node. On PROCESSING_INSTRUCTION, returns the contents without the target name. On element and document events, returns null.

Specified by:
getText in interface XMLPullStream
Overrides:
getText in class com.qizx.xdm.XMLPullStreamBase
Returns:
a String for the direct contents of the current leaf node

getName

public QName getName()
Description copied from interface: XMLPullStream
Returns the name of the current element node, or if the node is not an element, returns the name of the parent element.

Specified by:
getName in interface XMLPullStream
Returns:
the latest element name

getQueryTermCount

public int getQueryTermCount()
Description copied from interface: FullTextPullStream
Returns the number of terms in the query.

Specified by:
getQueryTermCount in interface FullTextPullStream

getQueryTerms

public String[] getQueryTerms()
Description copied from interface: FullTextPullStream
Returns the terms of the query as a String array.

Specified by:
getQueryTerms in interface FullTextPullStream

getCurrentNode

public Node getCurrentNode()
Description copied from interface: XMLPullStream
Returns the current node, if the implementation is able to provide it. Otherwise, null is returned.

Specified by:
getCurrentNode in interface XMLPullStream

© 2010 Axyana Software