Qizx fe-4.4p2 API

com.qizx.api.fulltext
Interface TextTokenizer

All Known Subinterfaces:
Indexing.WordSieve
All Known Implementing Classes:
DefaultTextTokenizer, DefaultWordSieve

public interface TextTokenizer

Pluggable text tokenizer compatible with standard full-text features. Analyzes text chunks to extract and normalize words.

To parse words, the tokenizer is first initialized on a text chunk by calling start(char[], int) or start(CharSequence). The nextToken() method is then called repeatedly until it returns END.
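The calling protocol described above can be sketched as follows. This is a minimal illustration, not Qizx code: MiniTokenizer is a hypothetical stand-in for the subset of TextTokenizer used here, LetterDigitTokenizer is a toy implementation, and the values given to END and WORD are placeholders (the real values are listed under Constant Field Values).

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLoopSketch {
    // Hypothetical stand-in for the part of TextTokenizer used in this sketch.
    // The constant values are placeholders, not the real Qizx values.
    interface MiniTokenizer {
        int END = 0;
        int WORD = 1;
        void start(char[] text, int length);
        int nextToken();
        char[] getTokenChars();
    }

    // Toy implementation: a word is a run of letters or digits, lower-cased.
    static class LetterDigitTokenizer implements MiniTokenizer {
        private char[] text;
        private int length;
        private int pos;
        private int tokenStart;
        private int tokenEnd;

        public void start(char[] text, int length) {
            this.text = text;
            this.length = length;
            this.pos = 0;
        }

        public int nextToken() {
            // Skip separators, then consume one run of word characters.
            while (pos < length && !Character.isLetterOrDigit(text[pos])) pos++;
            if (pos >= length) return END;
            tokenStart = pos;
            while (pos < length && Character.isLetterOrDigit(text[pos])) pos++;
            tokenEnd = pos;
            return WORD;
        }

        public char[] getTokenChars() {
            char[] out = new char[tokenEnd - tokenStart];
            for (int i = 0; i < out.length; i++) {
                out[i] = Character.toLowerCase(text[tokenStart + i]);
            }
            return out;
        }
    }

    // Initializes the tokenizer on a chunk, then calls nextToken() until END.
    static List<String> words(MiniTokenizer tokenizer, String chunk) {
        char[] chars = chunk.toCharArray();
        tokenizer.start(chars, chars.length);
        List<String> result = new ArrayList<>();
        for (int type = tokenizer.nextToken();
             type != MiniTokenizer.END;
             type = tokenizer.nextToken()) {
            if (type == MiniTokenizer.WORD) {
                result.add(new String(tokenizer.getTokenChars()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words(new LetterDigitTokenizer(), "Hello, Qizx world!"));
    }
}
```

With a real implementation such as DefaultTextTokenizer, the same loop would presumably apply, with the constants taken from TextTokenizer itself.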


Field Summary
static int END
          Code returned by nextToken when the end of the text to tokenize is reached.
static int PARAGRAPH
          Code returned by nextToken when a paragraph boundary is recognized.
static int SENTENCE
          Code returned by nextToken when a sentence boundary is recognized.
static int WORD
          Code returned by nextToken when a word is recognized.
 
Method Summary
 void copyTokenTo(char[] array, int start)
          Copies the current token into a character array.
 void defineSpecialChar(char ch)
          Defines a character to recognize when parsing of special characters is enabled.
 int getDigitMax()
          Returns the maximum number of digits a word can contain.
 char[] getTokenChars()
          Returns the current token as a new character array.
 int getTokenLength()
          Returns the original length of the last token returned by nextToken.
 int getTokenOffset()
          Returns the offset (in source text chunk) of the last token returned by nextToken.
 boolean gotWildcard()
          Returns true if wildcard characters have been recognized in the current token.
 boolean isAcceptingWildcards()
          Returns true if wildcard characters are recognized.
 boolean isParsingSpecialChars()
          Returns true if special characters are recognized.
 int nextToken()
          Returns the type of the next token, or END if no more tokens can be found.
 void setAcceptingWildcards(boolean acceptingWildcards)
          If set to true, wildcard characters are recognized.
 void setDigitMax(int max)
          Sets the maximum number of digits a word can contain.
 void setParsingSpecialChars(boolean parsingSpecialChars)
          If set to true, special characters are recognized.
 void start(char[] text, int length)
          Starts the analysis of a new text chunk.
 void start(CharSequence text)
          Starts the analysis of a new text chunk.
 

Field Detail

END

static final int END
Code returned by nextToken when the end of the text to tokenize is reached.

See Also:
Constant Field Values

WORD

static final int WORD
Code returned by nextToken when a word is recognized.

See Also:
Constant Field Values

SENTENCE

static final int SENTENCE
Code returned by nextToken when a sentence boundary is recognized.

Not yet supported.

See Also:
Constant Field Values

PARAGRAPH

static final int PARAGRAPH
Code returned by nextToken when a paragraph boundary is recognized.

Not yet supported.

See Also:
Constant Field Values

Method Detail

start

void start(char[] text,
           int length)
Starts the analysis of a new text chunk.

Parameters:
text - characters to tokenize
length - number of characters in the text array

start

void start(CharSequence text)
Starts the analysis of a new text chunk.

Parameters:
text - fragment to tokenize

nextToken

int nextToken()
Returns the type of the next token, or END if no more tokens can be found.

Returns:
the type of the next token: END, WORD, SENTENCE, PARAGRAPH, or the code of a character if the option 'Special Characters' is set on this tokenizer using the method setParsingSpecialChars(boolean).

getTokenOffset

int getTokenOffset()
Returns the offset (in source text chunk) of the last token returned by nextToken.

Returns:
an index in the source text fragment

getTokenLength

int getTokenLength()
Returns the original length of the last token returned by nextToken. Most often equal to the length of the array returned by getTokenChars, but can differ if normalization or stemming is performed.

Returns:
word length
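The difference between the original length and the length of the normalized token can be illustrated with a small sketch. StemmingTokenizer below is a hypothetical toy, not a Qizx class: it lower-cases each word and drops one trailing 's' as a naive stemming rule, and its END and WORD values are placeholders. The offset and original length are used to recover the source text of the token.

```java
import java.util.Locale;

public class TokenSpanSketch {
    // Toy stand-in (not Qizx code): words are runs of letters; the normalized
    // form is lower-cased and loses one trailing 's' (naive stemming).
    static class StemmingTokenizer {
        static final int END = 0;   // placeholder value
        static final int WORD = 1;  // placeholder value
        private CharSequence text;
        private int pos;
        private int offset;
        private int length;

        void start(CharSequence text) {
            this.text = text;
            this.pos = 0;
        }

        int nextToken() {
            while (pos < text.length() && !Character.isLetter(text.charAt(pos))) pos++;
            if (pos >= text.length()) return END;
            offset = pos;
            while (pos < text.length() && Character.isLetter(text.charAt(pos))) pos++;
            length = pos - offset;
            return WORD;
        }

        // Position and length of the token as it appears in the source chunk.
        int getTokenOffset() { return offset; }
        int getTokenLength() { return length; }

        // Normalized form; may be shorter than the original token.
        char[] getTokenChars() {
            String word = text.subSequence(offset, offset + length)
                              .toString().toLowerCase(Locale.ROOT);
            if (word.length() > 3 && word.endsWith("s")) {
                word = word.substring(0, word.length() - 1);
            }
            return word.toCharArray();
        }
    }

    // Returns "original->normalized" for the first word, or null if none.
    static String firstWordMapping(String chunk) {
        StemmingTokenizer t = new StemmingTokenizer();
        t.start(chunk);
        if (t.nextToken() != StemmingTokenizer.WORD) return null;
        int from = t.getTokenOffset();
        String original = chunk.substring(from, from + t.getTokenLength());
        return original + "->" + new String(t.getTokenChars());
    }

    public static void main(String[] args) {
        System.out.println(firstWordMapping("  Tokens are extracted"));
    }
}
```

Here the original token spans 6 characters starting at offset 2, while the normalized array holds 5 characters.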

getTokenChars

char[] getTokenChars()
Returns the current token as a new character array.


copyTokenTo

void copyTokenTo(char[] array,
                 int start)
Copies the current token into a character array.

Parameters:
array - destination array. Must be large enough to hold the token, starting at the given offset.
start - offset in the destination array.

isParsingSpecialChars

boolean isParsingSpecialChars()
Returns true if special characters are recognized.

See Also:
defineSpecialChar(char)

setParsingSpecialChars

void setParsingSpecialChars(boolean parsingSpecialChars)
If set to true, special characters are recognized. Otherwise, they are ignored like whitespace.

See Also:
defineSpecialChar(char)

getDigitMax

int getDigitMax()
Returns the maximum number of digits a word can contain.


setDigitMax

void setDigitMax(int max)
Sets the maximum number of digits a word can contain. Beyond the specified value (default is 4), a token is not retained as a word.
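The digit limit can be pictured with the following sketch. It assumes that digits are counted anywhere in the token and that a token is rejected only when the count exceeds the limit; the helper isRetainedAsWord is hypothetical and is not part of the Qizx API.

```java
public class DigitMaxSketch {
    // Hypothetical helper: true if the token contains at most digitMax digits.
    // Assumption: digits are counted wherever they occur in the token.
    static boolean isRetainedAsWord(String token, int digitMax) {
        int digits = 0;
        for (int i = 0; i < token.length(); i++) {
            if (Character.isDigit(token.charAt(i))) digits++;
        }
        return digits <= digitMax;
    }

    public static void main(String[] args) {
        int digitMax = 4; // default value stated in the documentation
        System.out.println(isRetainedAsWord("2010", digitMax));
        System.out.println(isRetainedAsWord("20100", digitMax));
        System.out.println(isRetainedAsWord("fe44p2", digitMax));
    }
}
```

Under these assumptions, a year such as 2010 is kept as a word with the default limit, whereas a five-digit number is not.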


defineSpecialChar

void defineSpecialChar(char ch)
Defines a character to recognize when parsing of special characters is enabled.


isAcceptingWildcards

boolean isAcceptingWildcards()
Returns true if wildcard characters are recognized.

The wildcard character sequences are ".", ".?", ".*", ".+", and ".{n,m}".
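The listed sequences coincide with the wildcard syntax of XQuery Full Text. Assuming they carry the corresponding meaning (any single character, optionally followed by a qualifier for zero or one, zero or more, one or more, or n to m occurrences), a token with wildcards can be sketched as a regular expression. The helpers toRegex and matches are hypothetical and do not reflect how Qizx evaluates wildcards internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WildcardSketch {
    // Qualifier that may follow a '.' wildcard: ?, *, + or {n,m}.
    private static final Pattern QUALIFIER =
        Pattern.compile("[?*+]|\\{\\d+,\\d+\\}");

    // Translates a token with wildcards into a java.util.regex Pattern.
    // Every character other than a wildcard sequence is matched literally.
    static Pattern toRegex(String token) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < token.length()) {
            char c = token.charAt(i);
            if (c == '.') {
                regex.append('.');
                i++;
                // Look for an optional qualifier right after the dot.
                Matcher m = QUALIFIER.matcher(token);
                m.region(i, token.length());
                if (m.lookingAt()) {
                    regex.append(m.group());
                    i = m.end();
                }
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if the whole word matches the token with wildcards.
    static boolean matches(String token, String word) {
        return toRegex(token).matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("colo.?r", "colour"));
        System.out.println(matches("w.{1,2}d", "word"));
        System.out.println(matches("c.t", "cart"));
    }
}
```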


setAcceptingWildcards

void setAcceptingWildcards(boolean acceptingWildcards)
If set to true, wildcard characters are recognized. Otherwise, they are ignored.


gotWildcard

boolean gotWildcard()
Returns true if wildcard characters have been recognized in the current token. Requires the option AcceptingWildcards to be set to true.


© 2010 Axyana Software