FULLTEXT Lucene Analyzers
Lucene Analyzers are crucial components that prepare text for indexing and searching. They transform raw text into a stream of tokens (usually words or terms) that Lucene can efficiently store and query.
How Lucene Analyzers Work
An Analyzer is essentially a pipeline composed of:
- CharFilter (optional): operates on the raw input stream, performing character-level transformations. For example, a CharFilter might strip HTML tags or normalize Unicode characters.
- Tokenizer: the first mandatory step. A Tokenizer breaks the character stream into a sequence of raw tokens. Different tokenizers define what constitutes a "word" or "term": a simple whitespace tokenizer splits on spaces, while a more sophisticated one might handle hyphens, apostrophes, or language-specific rules.
- TokenFilter (optional): after tokenization, one or more TokenFilters can be applied to modify, add, or remove tokens. Common filters include:
  - LowerCaseFilter: converts all tokens to lowercase for case-insensitive searching.
  - StopFilter: removes common words (e.g., "the," "a," "is") that are often irrelevant for search. See the "Stopwords Language" setting when configuring a FULLTEXT index to affect this behavior for standard analyzers.
This entire process is applied both when you index documents into Lucene and when you query the index. The same or a compatible analysis chain must be used for both indexing and searching to ensure that the terms in the query match the terms in the index.
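As a minimal sketch of this principle (plain Python, not the Lucene API; `analyze` is a hypothetical stand-in for a Tokenizer followed by a LowerCaseFilter), applying the same chain on both sides is what makes a query term match an indexed term:

```python
import re

def analyze(text):
    """Tokenize on non-word characters, then lowercase -- a toy
    stand-in for a Tokenizer plus LowerCaseFilter chain."""
    tokens = re.findall(r"\w+", text)   # Tokenizer step
    return [t.lower() for t in tokens]  # TokenFilter step

# Index side: document text -> terms stored in the index.
indexed_terms = set(analyze("The Quick Brown Fox"))

# Query side: the SAME chain is applied to the query string,
# so "QUICK" matches the indexed term produced from "Quick".
query_terms = analyze("QUICK fox")
matches = [t for t in query_terms if t in indexed_terms]
print(matches)  # ['quick', 'fox']
```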
Why Choose One Analyzer Over Another
The choice of analyzer is driven by your specific search requirements and the nature of your data. Here are some key factors:
- Language specificity: Different languages have different grammatical rules, character sets, and common words. Using a language-specific analyzer (e.g., EnglishAnalyzer, FrenchAnalyzer) that incorporates stemming and stop words for that language generally yields much better search results than a generic one.
- Case sensitivity: If you need case-insensitive search (which is common), choose an analyzer that includes a LowerCaseFilter. If "Apple" and "apple" should be treated as distinct terms, choose an analyzer that does not lowercase.
- Punctuation and special characters: How should punctuation be handled? Should "U.S.A." be treated as one token or three? Should "product-id" be one token, or "product" and "id"? Analyzers differ here: the WhitespaceAnalyzer splits only on whitespace, while the StandardAnalyzer handles punctuation more intelligently.
- Stop words: For many general-purpose searches, common words like "a," "the," and "is" provide little value and can be removed to reduce index size and improve performance. However, for some applications (e.g., searching legal documents or code snippets), these words might be crucial.
Ultimately, the best analyzer is the one that produces the most relevant search results for your specific application and user needs. Often, this involves experimentation and sometimes creating a custom analyzer by combining standard tokenizers and filters to achieve the desired behavior.
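To see why the choice matters, here is a short sketch (plain Python; both function names are illustrative, not Lucene classes) contrasting a whitespace-only policy with a punctuation-aware policy on the same input:

```python
import re

def whitespace_tokens(text):
    # Whitespace-only policy: punctuation stays inside the tokens,
    # roughly in the spirit of WhitespaceAnalyzer plus lowercasing.
    return text.lower().split()

def punctuation_aware_tokens(text):
    # Punctuation-aware policy: split on anything that is not a word
    # character, roughly in the spirit of StandardAnalyzer.
    return [t.lower() for t in re.findall(r"\w+", text)]

sample = "U.S.A. product-id"
print(whitespace_tokens(sample))         # ['u.s.a.', 'product-id']
print(punctuation_aware_tokens(sample))  # ['u', 's', 'a', 'product', 'id']
```

A query for "product" would match under the second policy but not the first, which is exactly the kind of trade-off the factors above are weighing.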
BASIS Whitespace/Punctuation Analyzer
Class Name
com.basis.textsearch.WhitespaceAndPunctuationAnalyzer
Use Cases
This type of analyzer is useful for search scenarios where:
- You want to treat punctuation marks and whitespace as clear delimiters between words or terms.
- You want case-insensitive searching, as all tokens are converted to lowercase.
- You want to preserve underscores and other unlisted symbols within tokens (e.g., product_id).
How it Works
Tokenization
A character is included as part of the current token unless it is one of the following:
- Whitespace characters (e.g., space, tab, newline)
- These punctuation characters: ! . , ? : ; ( ) % [ ] { } \ / ' " $
Any sequence of characters that is not whitespace and not one of the listed punctuation marks forms a token. When the tokenizer encounters one of these delimiter characters, it marks the end of the current token and the beginning of a new one (if subsequent characters are token characters).
Filtering
After the WhitespaceAndPunctuationTokenizer breaks the input string into tokens, a lowercase filter is applied. This filter simply takes each token produced by the tokenizer and converts all its characters to their lowercase equivalents. This enables case-insensitive queries against the index.
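The tokenization and filtering rules above can be emulated with a short Python sketch (illustrative only; the actual analyzer is the Java class com.basis.textsearch.WhitespaceAndPunctuationAnalyzer):

```python
# Delimiters described above: whitespace plus these punctuation characters.
DELIMITERS = set('!.,?:;()%[]{}\\/\'"$')

def wp_analyze(text):
    """Emulate the tokenizer (split on whitespace and the listed
    punctuation), then apply the lowercase filter."""
    tokens, current = [], []
    for ch in text:
        if ch.isspace() or ch in DELIMITERS:
            if current:                        # delimiter ends the token
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)                 # token character
    if current:
        tokens.append("".join(current))
    return [t.lower() for t in tokens]         # lowercase filter step

print(wp_analyze("Hello, World! How are you?"))
# ['hello', 'world', 'how', 'are', 'you']
```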
Processing Examples
Example 1: Simple Sentence
- Input String: "Hello, World! How are you?"
- Tokenization:
  - "Hello" (, is a delimiter)
  - "World" (! is a delimiter)
  - "How" (space is a delimiter)
  - "are" (space is a delimiter)
  - "you" (? is a delimiter)
- Filtering:
  - "hello"
  - "world"
  - "how"
  - "are"
  - "you"
- Final Tokens: ["hello", "world", "how", "are", "you"]
- Search Queries:
  - "hell*" would return "Hello"
  - "hello," would return no results
  - "hello" would return "Hello"
Example 2: String with Numbers and Special Characters
- Input String: "Product_ID: AB123 (USD $100)"
- Tokenization:
  - "Product_ID" (: is a delimiter)
  - "AB123" (space is a delimiter)
  - "USD" (space is a delimiter)
  - "100" ($ and ) are delimiters)
  - Note: _ is not a listed delimiter, so it remains part of the token. The parentheses ( ) and the dollar sign $ are delimiters.
- Filtering:
  - "product_id"
  - "ab123"
  - "usd"
  - "100"
- Final Tokens: ["product_id", "ab123", "usd", "100"]
- Search Queries:
  - "ab*" would return "AB123"
  - "AB123," would return "AB123"
  - "$100" would not return any results
  - "100" would return "100"
Example 3: String with Leading/Trailing Delimiters and Multiple Delimiters
- Input String: ",,,Data!!! (Important)"
- Tokenization:
  - "Data" (the leading , characters and trailing ! characters are delimiters)
  - "Important" (( and ) are delimiters)
- Filtering:
  - "data"
  - "important"
- Final Tokens: ["data", "important"]
- Search Queries:
  - "(Important)" would return no results
  - "iMpO*" would return "Important"
  - "Data!!!" would return no results
  - "data" would return "Data"
BASIS Case-Insensitive Whitespace Analyzer
Class Name
com.basis.textsearch.CaseInsensitiveWhitespaceAnalyzer
Use Cases
This type of analyzer is useful for search scenarios where:
- You want to treat ONLY whitespace characters as clear delimiters between words or terms.
- You want case-insensitive searching, as all tokens are converted to lowercase.
- You want to index values where every non-whitespace character is treated as significant and kept as part of the token.
How it Works
Tokenization
A character is included as part of the current token unless it is a whitespace character (e.g., space, tab, newline). Any sequence of non-whitespace characters forms a token; when the tokenizer encounters whitespace, it marks the end of the current token and the beginning of a new one (if subsequent characters are token characters).
Filtering
After the CaseInsensitiveWhitespaceTokenizer breaks the input string into tokens, a lowercase filter is applied. This filter simply takes each token produced by the tokenizer and converts all its characters to their lowercase equivalents. This enables case-insensitive queries against the index.
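The behavior above reduces to whitespace splitting plus lowercasing, sketched here in Python (illustrative only; the actual analyzer is the Java class com.basis.textsearch.CaseInsensitiveWhitespaceAnalyzer):

```python
def ciw_analyze(text):
    # str.split() with no argument splits on any run of whitespace,
    # matching the tokenizer; lowercasing is the filter step.
    return [t.lower() for t in text.split()]

print(ciw_analyze("Hello, World! How are you?"))
# ['hello,', 'world!', 'how', 'are', 'you?']
print(ciw_analyze("AB123-AA.4"))
# ['ab123-aa.4']
```

Note that punctuation survives inside the tokens, which is why queries must include it (or use a wildcard) to match, as the examples below show.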
Processing Examples
Example 1: Simple Sentence
- Input String: "Hello, World! How are you?"
- Tokenization:
  - "Hello," (space is a delimiter)
  - "World!" (space is a delimiter)
  - "How" (space is a delimiter)
  - "are" (space is a delimiter)
  - "you?"
- Filtering:
  - "hello,"
  - "world!"
  - "how"
  - "are"
  - "you?"
- Final Tokens: ["hello,", "world!", "how", "are", "you?"]
- Search Queries:
  - "hell*" would return "hello"
  - "hello," would return "hello"
  - "hello" would not return any results
Example 2: String with Numbers and Special Characters
- Input String: "AB123-AA.4"
- Tokenization:
  - "AB123-AA.4"
- Filtering:
  - "ab123-aa.4"
- Final Tokens: ["ab123-aa.4"]
- Search Queries:
  - "AB*" would return "AB123-AA.4"
  - "ab*" would return "AB123-AA.4"
  - "Ab123" would return no results
  - "ab123-aa*" would return "AB123-AA.4"
  - "*123-aA.4" would return "AB123-AA.4"
See Also
Full Text Indexing and Searching