Custom Token Filter Lucene

2015/10/25

To implement our filter, we will extend the TokenFilter class from the org.apache.lucene.analysis package and override the incrementToken method. This method returns a boolean value: it should return true if another token is available for processing in the token stream, and false once the stream is exhausted and there is nothing further to analyze.

package pl.solr.analysis;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class ReverseFilter extends TokenFilter {
  // Shared view of the current token's text; the same instance is reused for every token.
  private final CharTermAttribute charTermAttr;

  protected ReverseFilter(TokenStream ts) {
    super(ts);
    this.charTermAttr = addAttribute(CharTermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    // Advance the wrapped stream; false means there are no more tokens.
    if (!input.incrementToken()) {
      return false;
    }

    // buffer() may be longer than the term, so only the first length() chars are valid.
    int length = charTermAttr.length();
    char[] buffer = charTermAttr.buffer();
    char[] newBuffer = new char[length];
    for (int i = 0; i < length; i++) {
      newBuffer[i] = buffer[length - 1 - i];
    }
    // Replace the term text with its reversed form.
    charTermAttr.setEmpty();
    charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);
    return true;
  }
}
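The reversal step in the filter above can also be done without the temporary array. The following is only a sketch in plain Java with no Lucene dependency: ReverseSketch and reverseInPlace are hypothetical names, and the char array stands in for the buffer returned by CharTermAttribute.buffer(), which may be longer than the term itself.

```java
// Hypothetical, Lucene-free sketch of the reversal logic used by the filter.
class ReverseSketch {

    // Reverses buffer[0..length) in place; chars at index >= length are left untouched.
    static void reverseInPlace(char[] buffer, int length) {
        for (int i = 0, j = length - 1; i < j; i++, j--) {
            char tmp = buffer[i];
            buffer[i] = buffer[j];
            buffer[j] = tmp;
        }
    }

    public static void main(String[] args) {
        // Two trailing slots simulate unused capacity in a term buffer.
        char[] buffer = "lucene??".toCharArray();
        reverseInPlace(buffer, 6);
        System.out.println(new String(buffer, 0, 6)); // prints "enecul"
    }
}
```

Like the filter itself, this swaps individual char values, so a term containing surrogate pairs would come out with the pair halves in the wrong order.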

A TokenStream is a class that can produce a series of tokens when requested. There are two different styles of TokenStream: Tokenizer and TokenFilter.

A Tokenizer reads characters from a java.io.Reader and creates tokens, whereas a TokenFilter takes tokens in, and produces new tokens by either adding or removing whole tokens or altering the attributes of the incoming tokens.

public TokenStream tokenStream(String fieldName, Reader reader)
{
    return new StopFilter(true, new LowerCaseTokenizer(reader), stopWords);
}

In this analyzer, LowerCaseTokenizer produces the initial set of tokens from a Reader and feeds them to a StopFilter. The LowerCaseTokenizer emits tokens that are runs of adjacent letters in the original text, lowercasing each character in the process. Following this tokenization and lowercasing, StopFilter removes words that appear in a stop-word list while preserving accurate positionIncrements.
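To make the effect of this chain concrete, here is a rough, Lucene-free sketch of the same three steps: split on non-letters, lowercase, drop stop words. StopChainSketch, Token, and analyze are hypothetical names invented for this illustration, and it assumes a removed stop word adds one to the position increment of the next surviving token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the LowerCaseTokenizer -> StopFilter chain.
class StopChainSketch {

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    static List<Token> analyze(String text, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int skipped = 0; // stop words dropped since the last emitted token
        // Iterate one step past the end so the final token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && Character.isLetter(text.charAt(i));
            if (letter) {
                current.append(Character.toLowerCase(text.charAt(i)));
            } else if (current.length() > 0) {
                String term = current.toString();
                current.setLength(0);
                if (stopWords.contains(term)) {
                    skipped++;
                } else {
                    out.add(new Token(term, 1 + skipped));
                    skipped = 0;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        System.out.println(analyze("The quick brown fox", stopWords));
        // prints [quick(+2), brown(+1), fox(+1)]
    }
}
```

The increment of 2 on "quick" records that a token was removed before it, which is what allows phrase queries to account for the gap.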

Buffering is commonly needed in TokenStream implementations. Low-level Tokenizers buffer up characters to form tokens at boundaries such as whitespace or nonletter characters. TokenFilters that emit additional tokens into the stream they're filtering must queue the incoming token and the additional ones, then emit them one at a time.
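The queueing pattern for filters that inject tokens can be sketched without Lucene. In this illustration, BufferingFilterSketch is a hypothetical class that wraps a plain Iterator of strings instead of a TokenStream, and the extras map (also an assumption) lists the additional tokens to emit after a given input token.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical, Lucene-free stand-in for a TokenFilter that injects extra tokens.
class BufferingFilterSketch implements Iterator<String> {
    private final Iterator<String> input;
    private final Map<String, List<String>> extras;
    // Tokens waiting to be emitted, in order.
    private final Deque<String> pending = new ArrayDeque<>();

    BufferingFilterSketch(Iterator<String> input, Map<String, List<String>> extras) {
        this.input = input;
        this.extras = extras;
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty() || input.hasNext();
    }

    @Override
    public String next() {
        if (pending.isEmpty()) {
            if (!input.hasNext()) {
                throw new NoSuchElementException();
            }
            // Queue the incoming token followed by any additional ones.
            String token = input.next();
            pending.add(token);
            pending.addAll(extras.getOrDefault(token, Collections.emptyList()));
        }
        // Emit exactly one token per call.
        return pending.poll();
    }

    public static void main(String[] args) {
        Map<String, List<String>> extras =
            Collections.singletonMap("quick", Arrays.asList("fast"));
        Iterator<String> it =
            new BufferingFilterSketch(Arrays.asList("quick", "fox").iterator(), extras);
        while (it.hasNext()) {
            System.out.println(it.next()); // prints quick, fast, fox on separate lines
        }
    }
}
```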

The TokenStream never explicitly creates a single object holding all attributes for a token. Instead, you interact with a separate, reused attribute interface for each element of the token.

TokenStream is a subclass of AttributeSource. AttributeSource is a useful and generic means of providing strongly typed yet fully extensible attributes without requiring runtime casting, thus resulting in good performance.
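The idea of typed, shared attributes can be illustrated with a small class-keyed registry. This is a simplified sketch, not Lucene's implementation: MiniAttributeSource, TermText, and PositionStep are hypothetical names, and the Supplier parameter is an assumption that replaces Lucene's own mechanism for instantiating attributes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, heavily simplified model of a class-keyed attribute registry.
class MiniAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // Returns the single shared instance for the given type, creating it on first request.
    <T> T addAttribute(Class<T> type, Supplier<T> factory) {
        Object existing = attributes.get(type);
        if (existing == null) {
            existing = factory.get();
            attributes.put(type, existing);
        }
        // Callers receive a typed reference and never need to cast themselves.
        return type.cast(existing);
    }

    // Hypothetical attribute types, each describing one element of a token.
    static final class TermText {
        final StringBuilder text = new StringBuilder();
    }

    static final class PositionStep {
        int increment = 1;
    }

    public static void main(String[] args) {
        MiniAttributeSource source = new MiniAttributeSource();
        TermText producer = source.addAttribute(TermText.class, TermText::new);
        TermText consumer = source.addAttribute(TermText.class, TermText::new);
        producer.text.append("lucene");
        System.out.println(consumer.text);        // prints "lucene"
        System.out.println(producer == consumer); // prints "true"
    }
}
```

Because every stage asking for the same type gets the same object, a filter can modify the term in place and downstream consumers see the change without any per-token allocation.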