Qizx fe-4.4p2 API

com.qizx.api.util.fulltext
Class DefaultTextTokenizer

java.lang.Object
  extended by com.qizx.api.util.fulltext.DefaultTextTokenizer
All Implemented Interfaces:
TextTokenizer
Direct Known Subclasses:
DefaultWordSieve

public class DefaultTextTokenizer
extends Object
implements TextTokenizer

Generic Text Tokenizer, suitable for most Western languages.

A word is either 1) a sequence of letters or digits beginning with a letter, or 2) a number without an exponent. Words never contain a dash or an apostrophe.
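
As a rough illustration of the two word rules above, the following self-contained sketch classifies candidate strings. It is not Qizx code: the class WordRuleSketch and its methods are hypothetical, and it assumes that "a number" means a plain run of digits, which the description above does not spell out.

```java
class WordRuleSketch {
    // Rule 1: letters or digits only, and the first character is a letter.
    static boolean isAlphanumericWord(String s) {
        if (s.isEmpty() || !Character.isLetter(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Rule 2 (assumed reading): digits only, so an exponent such as "e5" is rejected.
    static boolean isPlainNumber(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // A candidate is a word if it satisfies either rule.
    static boolean isWord(String s) {
        return isAlphanumericWord(s) || isPlainNumber(s);
    }

    public static void main(String[] args) {
        String[] samples = {"hello", "x86", "2010", "1e5", "well-known", "don't"};
        for (String s : samples) {
            System.out.println(s + " -> " + isWord(s));
        }
    }
}
```

Under these assumptions, "well-known" and "don't" are rejected as single words because of the dash and the apostrophe; a tokenizer following the rules above would instead split them into separate words.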


Field Summary
 
Fields inherited from interface com.qizx.api.fulltext.TextTokenizer
END, PARAGRAPH, SENTENCE, WORD
 
Constructor Summary
DefaultTextTokenizer()
           
 
Method Summary
 void copyTokenTo(char[] array, int start)
          Copies the current token into a character array.
 void defineSpecialChar(char ch)
          Defines a character to recognize when parsing of special characters is enabled.
 int getDigitMax()
          Returns the maximum number of digits a word can contain.
 char[] getTokenChars()
          Returns the current token as a new character array.
 int getTokenLength()
          Returns the original length of the last word returned by nextToken.
 int getTokenOffset()
          Returns the offset (in the source text chunk) of the last word returned by nextToken.
 boolean gotWildcard()
          Returns true if wildcard characters have been recognized in the current token.
 boolean isAcceptingWildcards()
          Returns true if wildcard characters are recognized.
 boolean isParsingSpecialChars()
          Returns true if special characters are recognized.
 int nextToken()
          Returns the type of the next token, or END if no more tokens can be found.
 void setAcceptingWildcards(boolean acceptingWildcards)
          If set to true, wildcard characters are recognized.
 void setDigitMax(int max)
          Sets the maximum number of digits a word can contain.
 void setParsingSpecialChars(boolean parsingSpecialChars)
          If set to true, special characters are recognized.
 void start(char[] text, int length)
          Starts the analysis of a new text chunk.
 void start(CharSequence text)
          Starts the analysis of a new text chunk.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DefaultTextTokenizer

public DefaultTextTokenizer()

Method Detail

start

public void start(char[] text,
                  int length)
Description copied from interface: TextTokenizer
Starts the analysis of a new text chunk.

Specified by:
start in interface TextTokenizer
Parameters:
text - characters to tokenize
length - number of characters in the text array

start

public void start(CharSequence text)
Description copied from interface: TextTokenizer
Starts the analysis of a new text chunk.

Specified by:
start in interface TextTokenizer
Parameters:
text - fragment to tokenize

copyTokenTo

public void copyTokenTo(char[] array,
                        int start)
Description copied from interface: TextTokenizer
Copies the current token into a character array.

Specified by:
copyTokenTo in interface TextTokenizer
Parameters:
array - destination array. Must be large enough to hold the token starting at the given offset.
start - offset in the destination array.
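
The sizing requirement on the destination array can be illustrated without the library. The sketch below is a hypothetical stand-in (CopyTokenSketch.copyTokenTo is not the Qizx method) that copies a token into a buffer at a given offset. The explicit bounds check is an assumption made for illustration, since this page does not say how DefaultTextTokenizer reacts to a destination that is too small.

```java
import java.util.Arrays;

class CopyTokenSketch {
    // Hypothetical stand-in: copies 'token' into 'array' beginning at index 'start'.
    static void copyTokenTo(char[] token, char[] array, int start) {
        // The destination needs room for the whole token after 'start'.
        if (start < 0 || start + token.length > array.length) {
            throw new IllegalArgumentException("destination too small for token");
        }
        System.arraycopy(token, 0, array, start, token.length);
    }

    public static void main(String[] args) {
        char[] buffer = new char[8];
        Arrays.fill(buffer, '_');
        copyTokenTo("word".toCharArray(), buffer, 2);
        System.out.println(new String(buffer));
    }
}
```

With the real tokenizer, a caller would presumably size the buffer from the current token, for instance from the length of the array returned by getTokenChars(), plus the chosen offset.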

getTokenChars

public char[] getTokenChars()
Description copied from interface: TextTokenizer
Returns the current token as a new character array.

Specified by:
getTokenChars in interface TextTokenizer

getTokenOffset

public int getTokenOffset()
Description copied from interface: TextTokenizer
Returns the offset (in the source text chunk) of the last word returned by nextToken.

Specified by:
getTokenOffset in interface TextTokenizer
Returns:
an index in the source text fragment

getTokenLength

public int getTokenLength()
Description copied from interface: TextTokenizer
Returns the original length of the last word returned by nextToken. Usually equal to the length of the array returned by getTokenChars, but may differ if normalization or stemming is performed.

Specified by:
getTokenLength in interface TextTokenizer
Returns:
word length

isAcceptingWildcards

public boolean isAcceptingWildcards()
Description copied from interface: TextTokenizer
Returns true if wildcard characters are recognized.

Wildcard character sequences are ".", ".?", ".*", ".+", and ".{n,m}".

Specified by:
isAcceptingWildcards in interface TextTokenizer
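
The wildcard sequences listed above all start with a period, optionally followed by a quantifier. The following sketch locates them with java.util.regex. It is only an illustration, not how Qizx detects wildcards: the class WildcardSketch is hypothetical, and n and m are assumed to be unsigned decimal integers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WildcardSketch {
    // A period, optionally followed by ?, *, + or {n,m}.
    private static final Pattern WILDCARD =
        Pattern.compile("\\.(?:\\?|\\*|\\+|\\{\\d+,\\d+\\})?");

    // Returns every wildcard sequence found in the token, in order of appearance.
    static List<String> findWildcards(String token) {
        List<String> found = new ArrayList<>();
        Matcher m = WILDCARD.matcher(token);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {"t.st", "wor.*d", "a.{2,3}b", "plain"};
        for (String s : samples) {
            System.out.println(s + " -> " + findWildcards(s));
        }
    }
}
```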

setAcceptingWildcards

public void setAcceptingWildcards(boolean acceptingWildcards)
Description copied from interface: TextTokenizer
If set to true, wildcard characters are recognized. Otherwise, they are ignored.

Specified by:
setAcceptingWildcards in interface TextTokenizer

isParsingSpecialChars

public boolean isParsingSpecialChars()
Description copied from interface: TextTokenizer
Returns true if special characters are recognized.

Specified by:
isParsingSpecialChars in interface TextTokenizer
See Also:
TextTokenizer.defineSpecialChar(char)

setParsingSpecialChars

public void setParsingSpecialChars(boolean parsingSpecialChars)
Description copied from interface: TextTokenizer
If set to true, special characters are recognized. Otherwise, they are ignored like whitespace.

Specified by:
setParsingSpecialChars in interface TextTokenizer
See Also:
TextTokenizer.defineSpecialChar(char)

defineSpecialChar

public void defineSpecialChar(char ch)
Description copied from interface: TextTokenizer
Defines a character to recognize when parsing of special characters is enabled.

Specified by:
defineSpecialChar in interface TextTokenizer

gotWildcard

public boolean gotWildcard()
Description copied from interface: TextTokenizer
Returns true if wildcard characters have been recognized in the current token. Requires the option AcceptingWildcards to be set to true.

Specified by:
gotWildcard in interface TextTokenizer

nextToken

public int nextToken()
Description copied from interface: TextTokenizer
Returns the type of the next token, or END if no more tokens can be found.

Specified by:
nextToken in interface TextTokenizer
Returns:
the type of the next token: END, WORD, SENTENCE, PARAGRAPH, or, if special-character parsing has been enabled on this tokenizer with setParsingSpecialChars(), the code of a special character.
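
The usual calling pattern suggested by start, nextToken, getTokenChars and getTokenOffset is a loop that runs until END is returned. The sketch below shows that loop against a simplified stand-in so that it runs without the Qizx jar. StandInTokenizer, the numeric values of END and WORD, and the splitting rule (any run of letters or digits) are all assumptions made for illustration; the stand-in never emits SENTENCE or PARAGRAPH and does not reproduce the behavior of DefaultTextTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

class TokenLoopSketch {
    // Stand-in codes; the real values of TextTokenizer.END and WORD are not given on this page.
    static final int END = 0;
    static final int WORD = 1;

    // Simplified stand-in exposing four of the documented method names.
    static class StandInTokenizer {
        private CharSequence text = "";
        private int pos, tokenStart, tokenEnd;

        void start(CharSequence newText) {
            text = newText;
            pos = tokenStart = tokenEnd = 0;
        }

        int nextToken() {
            // Skip everything that is neither a letter nor a digit.
            while (pos < text.length() && !Character.isLetterOrDigit(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return END;
            }
            tokenStart = pos;
            while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) {
                pos++;
            }
            tokenEnd = pos;
            return WORD;
        }

        char[] getTokenChars() {
            return text.subSequence(tokenStart, tokenEnd).toString().toCharArray();
        }

        int getTokenOffset() {
            return tokenStart;
        }
    }

    // Collects "offset:word" entries by looping until END is returned.
    static List<String> words(CharSequence text) {
        StandInTokenizer tokenizer = new StandInTokenizer();
        tokenizer.start(text);
        List<String> out = new ArrayList<>();
        for (int type = tokenizer.nextToken(); type != END; type = tokenizer.nextToken()) {
            if (type == WORD) {
                out.add(tokenizer.getTokenOffset() + ":" + new String(tokenizer.getTokenChars()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, well-known world"));
    }
}
```

With the real class, the same loop would presumably compare the returned type against TextTokenizer.END and TextTokenizer.WORD instead of the stand-in constants.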

getDigitMax

public int getDigitMax()
Description copied from interface: TextTokenizer
Returns the maximum number of digits a word can contain.

Specified by:
getDigitMax in interface TextTokenizer

setDigitMax

public void setDigitMax(int max)
Description copied from interface: TextTokenizer
Sets the maximum number of digits a word can contain. Beyond the specified value (default is 4), a token is not retained as a word.

Specified by:
setDigitMax in interface TextTokenizer
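
To make the digit limit concrete, here is a small stand-alone sketch. It is hypothetical (DigitMaxSketch is not part of Qizx) and assumes that the limit applies to the total count of digit characters in a token; the exact counting rule is not specified on this page.

```java
class DigitMaxSketch {
    // Default limit stated in the description of setDigitMax.
    static final int DEFAULT_DIGIT_MAX = 4;

    // Assumed rule: a token is kept as a word only if it has at most 'digitMax' digits.
    static boolean isRetained(String token, int digitMax) {
        int digits = 0;
        for (int i = 0; i < token.length(); i++) {
            if (Character.isDigit(token.charAt(i))) {
                digits++;
            }
        }
        return digits <= digitMax;
    }

    public static void main(String[] args) {
        String[] samples = {"2010", "12345", "abc"};
        for (String s : samples) {
            System.out.println(s + " -> " + isRetained(s, DEFAULT_DIGIT_MAX));
        }
    }
}
```

Under this assumption, a four-digit year such as "2010" is kept with the default limit, while "12345" is dropped unless the limit is raised with setDigitMax.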

© 2010 Axyana Software