Solr Getting Started: Official 6.0 Documentation Reading Notes, Part 8: Filters

Part 3: Understanding Analyzers, Tokenizers, and Filters
Filter Descriptions
You configure each filter with a <filter> element in schema.xml as a child of <analyzer>, following the <tokenizer> element. Filter definitions should follow a tokenizer or another filter definition because they take a TokenStream as input. For example:
A filter is configured after the <tokenizer> element (or after another filter), and it receives a TokenStream as input:
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>...
</analyzer>
</fieldType>

Filter factories must implement org.apache.solr.analysis.TokenFilterFactory. Filters are similar to tokenizers in that both consume a TokenStream and produce tokens, and you can chain any combination of filters after the tokenizer.

Arguments may be passed to filter factories to modify their behavior by setting attributes on the <filter> element. For example:
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; " />
<filter class="solr.LengthFilterFactory" min="2" max="7"/>
</analyzer>
</fieldType>

Tokens produced by the tokenizer can be modified again by the filters that follow.
The filters available in this release are listed below.
Filters discussed in this section:
ASCII Folding Filter
Beider-Morse Filter
Classic Filter
Common Grams Filter
Collation Key Filter
Daitch-Mokotoff Soundex Filter
Double Metaphone Filter
Edge N-Gram Filter
English Minimal Stem Filter
Fingerprint Filter
Hunspell Stem Filter
Hyphenated Words Filter
ICU Folding Filter
ICU Normalizer 2 Filter
ICU Transform Filter
Keep Word Filter
KStem Filter
Length Filter
Lower Case Filter
Managed Stop Filter
Managed Synonym Filter
N-Gram Filter
Numeric Payload Token Filter
Pattern Replace Filter
Phonetic Filter
Porter Stem Filter
Remove Duplicates Token Filter
Reversed Wildcard Filter
Shingle Filter
Snowball Porter Stemmer Filter
Standard Filter
Stop Filter
Suggest Stop Filter
Synonym Filter
Token Offset Payload Filter
Trim Filter
Type As Payload Filter
Type Token Filter
Word Delimiter Filter
ASCII Folding Filter
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the Basic Latin Unicode block (the first 127 ASCII characters) to their ASCII equivalents, if one exists.
Factory class: solr.ASCIIFoldingFilterFactory
Arguments: None
Example:
<analyzer>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
In: "á" (Unicode character 00E1)
Out: "a" (ASCII character 97)


Converts non-ASCII Unicode characters to their ASCII equivalents.
Note: judging from the source code, this filter does accept arguments, even though the guide lists none.
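To make the folding concrete, here is a rough Python sketch (my own approximation; function name is mine). The real ASCIIFoldingFilter uses hand-built mapping tables covering many Unicode blocks, while this version only strips combining accent marks:

```python
import unicodedata

def ascii_fold(token):
    # Decompose accented characters (NFKD), then drop anything outside the
    # ASCII range, e.g. the combining accent marks left behind.
    # The real filter covers far more mappings than this approximation.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if ord(c) < 128)
```

For example, ascii_fold("á") returns "a", mirroring the In/Out example above.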
Beider-Morse Filter

Implements the Beider-Morse Phonetic Matching (BMPM) algorithm, which allows identification of similar names, even if they are spelled differently or in different languages. More information about how this works is available in the section on Phonetic Matching.
Factory class: solr.BeiderMorseFilterFactory
Arguments:
nameType: Types of names. Valid values are GENERIC, ASHKENAZI, or SEPHARDIC. If not processing Ashkenazi or Sephardic names, use GENERIC.
ruleType: Types of rules to apply. Valid values are APPROX or EXACT.
concat: Defines if multiple possible matches should be combined with a pipe ("|").
languageSet: The language set to use. The value "auto" will allow the Filter to identify the language, or a comma-separated list can be supplied.

Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX"
concat="true" languageSet="auto">
</filter>
</analyzer>


This filter is for matching personal names, using the Beider-Morse Phonetic Matching (BMPM) algorithm. It sounds impressive and is worth a closer look: it can identify similar names across different languages and name types.


Classic Filter

This filter takes the output of the Classic Tokenizer and strips periods from acronyms and "'s" from possessives.
Factory class: solr.ClassicFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.ClassicFilterFactory"/>
</analyzer>
In: "I.B.M. cat's can't"
Tokenizer to Filter: "I.B.M", "cat's", "can't"
Out: "IBM", "cat", "can't"


It must be paired with the Classic Tokenizer; it strips the periods from acronyms and the trailing "'s" from possessives.

Common Grams Filter

This filter creates word shingles by combining common tokens such as stop words with regular tokens. This is useful for creating phrase queries containing common words, such as "the cat." Solr normally ignores stop words in queried phrases, so searching for "the cat" would return all matches for the word "cat."
Factory class: solr.CommonGramsFilterFactory
Arguments:
words: (a common word file in .txt format) Provide the name of a common word file, such as stopwords.txt.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball" so Solr can read the stopwords file.
ignoreCase: (boolean) If true, the filter ignores the case of words when comparing them to the common word file. The default is false.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
</analyzer>

In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"

Combines the common words listed in a stop-word style file with adjacent tokens to form shingles.
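A simplified Python sketch of the idea (my own; the real filter also tracks token positions, types, and case handling via ignoreCase):

```python
def common_grams(tokens, common_words, separator="_"):
    # Emit each token, plus a joined "gram" for every adjacent pair in
    # which either side is a common word.
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common_words or nxt in common_words):
            out.append(tok + separator + nxt)
    return out
```

With common_words={"the"}, the input tokens "the", "cat" yield "the", "the_cat", "cat", so a phrase query for "the cat" can match the combined gram.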

Collation Key Filter

Collation allows sorting of text in a language-sensitive way. It is usually used for sorting, but can also be used with advanced searches. We've covered this in much more detail in the section on Unicode Collation.

Used for language-sensitive sorting and for advanced searches.
Daitch-Mokotoff Soundex Filter
Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of similar names, even if they are
spelled differently. More information about how this works is available in the section on Phonetic Matching.
Factory class: solr.DaitchMokotoffSoundexFilterFactory
Arguments:
inject : (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DaitchMokotoffSoundexFilterFactory" inject="true"/>
</analyzer>


This is another filter built on a phonetic name-matching algorithm.
Double Metaphone Filter

This filter creates tokens using the DoubleMetaphone encoding algorithm from commons-codec. For more
information, see the Phonetic Matching section.
Factory class: solr.DoubleMetaphoneFilterFactory
Arguments:
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated.
Example:
Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory"/>
</analyzer>

In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "Kuczewski"(4), "KSSK"(4), "KXFS"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding). Note that "Kuczewski" has two encodings, which are
added at the same position.
Example:
Discard original token (inject="false").
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.DoubleMetaphoneFilterFactory" inject="false"/>
</analyzer>
In: "four score and Kuczewski"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "Kuczewski"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "KSSK"(4), "KXFS"(4)
Note that "Kuczewski" has two encodings, which are added at the same position


Yet another filter built on a phonetic matching algorithm; note that a single token can produce multiple encodings. Is this effective for Chinese? Probably not.

Edge N-Gram Filter

This filter generates edge n-gram tokens of sizes within the given range.
Factory class: solr.EdgeNGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 1) The maximum gram size.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "f", "s", "a", "t"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>

In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 4 to 6.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="6"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four", "score", "and", "twenty"
Out: "four", "scor", "score", "twen", "twent", "twenty"


This has the same effect as the Edge N-Gram Tokenizer described earlier, except there is no parameter to choose which side the grams are taken from.
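The examples above can be reproduced with a few lines of Python (a sketch of the behavior, not Solr's implementation; function names are mine):

```python
def edge_ngrams(token, min_gram=1, max_gram=1):
    # Prefixes of the token whose length lies in [min_gram, max_gram].
    top = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, top + 1)]

def edge_ngram_filter(tokens, min_gram=1, max_gram=1):
    out = []
    for tok in tokens:
        out.extend(edge_ngrams(tok, min_gram, max_gram))
    return out
```

With the defaults (1, 1) the tokens "four", "score", "and", "twenty" become "f", "s", "a", "t", matching the first example above.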

English Minimal Stem Filter

This filter stems plural English words to their singular form.
Factory class: solr.EnglishMinimalStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"


English stemming filter. Does it have any effect on Chinese? No.

Fingerprint Filter

This filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.
This can be useful for clustering/linking use cases.
Factory class: solr.FingerprintFilterFactory
Arguments:
separator : The character used to separate tokens combined into the single output token. Defaults to " " (a
space character).
maxOutputTokenSize : The maximum length of the summarized output token. If exceeded, no output token is
emitted. Defaults to 1024.
Example:

<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.FingerprintFilterFactory" separator="_" />
</analyzer>
In: "the quick brown fox jumped over the lazy dog"
Tokenizer to Filter: "the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"
Out: "brown_dog_fox_jumped_lazy_over_quick_the"


This filter sorts and de-duplicates the input tokens, then concatenates them into a single token using the given separator.
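The whole filter fits in one line of Python (a sketch; function name is mine):

```python
def fingerprint(tokens, separator=" ", max_output_token_size=1024):
    # De-duplicate, sort, and join the tokens into a single output token.
    # If the result exceeds max_output_token_size, emit nothing.
    joined = separator.join(sorted(set(tokens)))
    return [joined] if len(joined) <= max_output_token_size else []
```

Applied to the example sentence above with separator="_", it produces the single token "brown_dog_fox_jumped_lazy_over_quick_the".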


Hunspell Stem Filter
The Hunspell Stem Filter provides support for several languages. You must provide the dictionary (.dic) and
rules (.aff) files for each language you wish to use with the Hunspell Stem Filter. You can download those
language files here. Be aware that your results will vary widely based on the quality of the provided dictionary
and rules files. For example, some languages have only a minimal word list with no morphological information.
On the other hand, for languages that have no stemmer but do have an extensive dictionary file, the Hunspell
stemmer may be a good choice.
Factory class: solr.HunspellStemFilterFactory
Arguments:
dictionary: (required) The path of a dictionary file.
affix: (required) The path of a rules file.
ignoreCase: (boolean) controls whether matching is case sensitive or not. The default is false.
strictAffixParsing: (boolean) controls whether the affix parsing is strict or not. If true, an error while
reading an affix rule causes a ParseException, otherwise is ignored. The default is true.
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HunspellStemFilterFactory"
dictionary="en_GB.dic"
affix="en_GB.aff"
ignoreCase="true"
strictAffixParsing="true" />
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"

This filter stems words according to user-supplied rule and dictionary files.
Hyphenated Words Filter

This filter reconstructs hyphenated words that have been tokenized as two tokens because of a line break or other intervening whitespace in the field text. If a token ends with a hyphen, it is joined with the following token and the hyphen is discarded. Note that for this filter to work properly, the upstream tokenizer must not remove trailing hyphen characters. This filter is generally only useful at index time.

Factory class: solr.HyphenatedWordsFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
</analyzer>
In: "A hyphen- ated word"
Tokenizer to Filter: "A", "hyphen-", "ated", "word"
Out: "A", "hyphenated", "word"


This filter rejoins words that were split across a trailing hyphen, discarding the hyphen.
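The joining rule can be sketched in Python (my own simplification; the real filter also handles offsets and edge cases such as a trailing hyphen on the last token):

```python
def join_hyphenated(tokens):
    # If a token ends with "-", glue it to the next token, dropping the hyphen.
    out, i = [], 0
    while i < len(tokens):
        if tokens[i].endswith("-") and i + 1 < len(tokens):
            out.append(tokens[i][:-1] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```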

Keep Word Filter

This filter discards all tokens except those that are listed in the given word list. This is the inverse of the Stop Words Filter. This filter can be useful for building specialized indices for a constrained set of terms.
Factory class: solr.KeepWordFilterFactory
Arguments:
words: (required) Path of a text file containing the list of keep words, one per line. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple filename in the Solr config directory.
ignoreCase: (true/false) If true then comparisons are done case-insensitively. If this argument is true, then the
words file is assumed to contain only lowercase words. The default is false.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
Where keepwords.txt contains:
happy
funny
silly
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "funny"
Example:
Same keepwords.txt, case insensitive:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"
ignoreCase="true"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
Example:
Using LowerCaseFilterFactory before filtering for keep words, no ignoreCase flag.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Filter to Filter: "happy", "sad", "or", "funny"
Out: "happy", "funny"


This filter is the opposite of the stop-word filter: only words that appear in the given word list are kept, and everything else is discarded. The file lists one word per line.
Lines beginning with # are ignored.
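The three examples above can be reproduced with this Python sketch (my own; a plain set stands in for keepwords.txt):

```python
def keep_word_filter(tokens, keep_words, ignore_case=False):
    # Discard every token not present in keep_words
    # (the inverse of a stop filter). With ignore_case=True the
    # keep_words set is assumed to contain lowercase words.
    if ignore_case:
        lowered = {w.lower() for w in keep_words}
        return [t for t in tokens if t.lower() in lowered]
    return [t for t in tokens if t in keep_words]
```

Case-sensitive matching keeps only "funny" from "Happy, sad or funny"; with ignore_case=True, "Happy" survives as well.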


KStem Filter

KStem is an alternative to the Porter Stem Filter for developers looking for a less aggressive stemmer. KStem was written by Bob Krovetz, ported to Lucene by Sergio Guzman-Lara (UMASS Amherst). This stemmer is only

appropriate for English language text.
Factory class: solr.KStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"


This one only applies to English, so it is of limited use here.


Length Filter

This filter passes tokens whose length falls within the min/max limit specified. All other tokens are discarded.
Factory class: solr.LengthFilterFactory
Arguments:
min: (integer, required) Minimum token length. Tokens shorter than this are discarded.
max: (integer, required, must be >= min) Maximum token length. Tokens longer than this are discarded.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="3" max="7"/>
</analyzer>
In: "turn right at Albuquerque"
Tokenizer to Filter: "turn", "right", "at", "Albuquerque"
Out: "turn", "right"


Tokens whose length falls within the configured range are kept; all others are discarded.
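The filter is a one-line predicate in Python (sketch; function name is mine):

```python
def length_filter(tokens, min_len, max_len):
    # Keep tokens whose length is within [min_len, max_len]; discard the rest.
    return [t for t in tokens if min_len <= len(t) <= max_len]
```

With min=3 and max=7, "at" (too short) and "Albuquerque" (too long) are dropped, as in the example above.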

Lower Case Filter

Converts any uppercase letters in a token to the equivalent lowercase token. All other characters are left
unchanged.
Factory class: solr.LowerCaseFilterFactory
Arguments: None
Example:


<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
In: "Down With CamelCase"
Tokenizer to Filter: "Down", "With", "CamelCase"
Out: "down", "with", "camelcase"


Converts uppercase letters to lowercase; everything else is unchanged.




Managed Stop Filter

This is a specialized version of the Stop Words Filter Factory that uses a set of stop words that are managed from a REST API.
Arguments:
managed: The name that should be used for this set of stop words in the managed REST API.
Example:
With this configuration the set of words is named "english" and can be managed via /solr/collection_name
/schema/analysis/stopwords/english
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedStopFilterFactory" managed="english"/>
</analyzer>
See Stop Filter for example input/output.



I have not yet worked out how the stop words are actually managed through the REST API.


Managed Synonym Filter

This is a specialized version of the Synonym Filter Factory that uses a synonym mapping that is managed from a REST API.
Arguments:
managed: The name that should be used for this mapping on synonyms in the managed REST API.
Example:
With this configuration the set of mappings is named "english" and can be managed via /solr/collection_n
ame/schema/analysis/synonyms/english
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
</analyzer>
See Synonym Filter for example input/output.


Synonyms managed via the REST API.

N-Gram Filter

Generates n-gram tokens of sizes in the given range. Note that tokens are ordered by position and then by gram
size.
Factory class: solr.NGramFilterFactory
Arguments:
minGramSize: (integer, default 1) The minimum gram size.
maxGramSize: (integer, default 2) The maximum gram size.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "o", "u", "r", "fo", "ou", "ur", "s", "c", "o", "r", "e", "sc", "co", "or", "re"
Example:
A range of 1 to 4.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "f", "fo", "fou", "four", "s", "sc", "sco", "scor"
Example:
A range of 3 to 5.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
</analyzer>
In: "four score"
Tokenizer to Filter: "four", "score"
Out: "fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore"



Same effect as the N-Gram Tokenizer described earlier.
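A Python sketch of per-token n-gram generation (my own; note the exact emission order differs between Lucene versions, so the result is best compared as a set):

```python
def ngrams(token, min_gram=1, max_gram=2):
    # All substrings of token with length in [min_gram, max_gram],
    # taken at every start position.
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                out.append(token[start:start + size])
    return out
```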


Numeric Payload Token Filter

This filter adds a numeric floating point payload value to tokens that match a given type. Refer to the Javadoc for the org.apache.lucene.analysis.Token class for more information about token types and payloads.
Factory class: solr.NumericPayloadTokenFilterFactory
Arguments:
payload: (required) A floating point value that will be added to all matching tokens.
typeMatch: (required) A token type name string. Tokens with a matching type name will have their payload set
to the above floating point value.
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.NumericPayloadTokenFilterFactory" payload="0.75"
typeMatch="word"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0.75], "bang"[0.75], "boom"[0.75]


Adds a floating-point payload to tokens of the matching type; I cannot see from here which token types are available.

Pattern Replace Filter

This filter applies a regular expression to each token and, for those that match, substitutes the given replacement string in place of the matched pattern. Tokens which do not match are passed through unchanged.
Factory class: solr.PatternReplaceFilterFactory
Arguments:
pattern: (required) The regular expression to test against each token, as per java.util.regex.Pattern.
replacement: (required) A string to substitute in place of the matched pattern. This string may contain
references to capture groups in the regex pattern. See the Javadoc for java.util.regex.Matcher.
replace: ("all" or "first", default "all") Indicates whether all occurrences of the pattern in the token should be
replaced, or only the first.
Example:
Simple string replace:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogydog"
Example:
String replacement, first occurrence only:

<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="cat" replacement="dog"
replace="first"/>
</analyzer>
In: "cat concatenate catycat"
Tokenizer to Filter: "cat", "concatenate", "catycat"
Out: "dog", "condogenate", "dogycat"
Example:
More complex pattern with capture group reference in the replacement. Tokens that start with non-numeric
characters and end with digits will have an underscore inserted before the numbers. Otherwise the token is
passed through.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(\D+)(\d+)$"
replacement="$1_$2"/>
</analyzer>
In: "cat foo1234 9987 blah1234foo"
Tokenizer to Filter: "cat", "foo1234", "9987", "blah1234foo"
Out: "cat", "foo_1234", "9987", "blah1234foo"



Replaces token text using a regular expression: matches are replaced, and tokens that do not match the pattern pass through unchanged. You can choose to replace only the first occurrence or all occurrences.
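All three examples can be reproduced with Python's re module (a sketch; function name is mine). One caveat: Python spells capture-group references r"\1" where the Solr attribute uses "$1":

```python
import re

def pattern_replace(tokens, pattern, replacement, replace="all"):
    # Apply the regex to each token; non-matching tokens pass through
    # unchanged. count=0 means "replace all" in re.sub.
    regex = re.compile(pattern)
    count = 0 if replace == "all" else 1
    return [regex.sub(replacement, tok, count=count) for tok in tokens]
```

For the capture-group example, pattern r"(\D+)(\d+)$" with replacement r"\1_\2" turns "foo1234" into "foo_1234" while "9987" and "blah1234foo" pass through.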

Phonetic Filter

This filter creates tokens using one of the phonetic encoding algorithms in the org.apache.commons.codec.
language package. For more information, see the section on Phonetic Matching.
Factory class: solr.PhoneticFilterFactory
Arguments:
encoder: (required) The name of the encoder to use. The encoder name must be one of the following (case insensitive): "DoubleMetaphone", "Metaphone", "Soundex", "RefinedSoundex", "Caverphone" (v2.0), "ColognePhonetic", or "Nysiis".
inject: (true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens
are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact
spelling of the target word may not match.
maxCodeLength: (integer) The maximum length of the code to be generated by the Metaphone or Double
Metaphone encoders.
Example:
Default behavior for DoubleMetaphone encoding.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"/>
</analyzer>

In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "FR"(1), "score"(2), "SKR"(2), "and"(3), "ANT"(3), "twenty"(4), "TNT"(4)
The phonetic tokens have a position increment of 0, which indicates that they are at the same position as the
token they were derived from (immediately preceding).
Example:
Discard original token.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
inject="false"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "FR"(1), "SKR"(2), "ANT"(3), "TWNT"(4)
Example:
Default Soundex encoder.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="Soundex"/>
</analyzer>
In: "four score and twenty"
Tokenizer to Filter: "four"(1), "score"(2), "and"(3), "twenty"(4)
Out: "four"(1), "F600"(1), "score"(2), "S600"(2), "and"(3), "A530"(3), "twenty"(4), "T530"(4)


Another filter built on phonetic matching algorithms. Is it really useful for Chinese?

Porter Stem Filter

This filter applies the Porter Stemming Algorithm for English. The results are similar to using the Snowball Porter
Stemmer with the language="English" argument. But this stemmer is coded directly in Java and is not based
on Snowball. It does not accept a list of protected words and is only appropriate for English language text.
However, it has been benchmarked as four times faster than the English Snowball stemmer, so can provide a
performance enhancement.
Factory class: solr.PorterStemFilterFactory
Arguments: None
Example:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>


In: "jump jumping jumped"
Tokenizer to Filter: "jump", "jumping", "jumped"
Out: "jump", "jump", "jump"


This one only applies to English.



Remove Duplicates Token Filter

The filter removes duplicate tokens in the stream. Tokens are considered to be duplicates if they have the same
text and position values.
Factory class: solr.RemoveDuplicatesTokenFilterFactory
Arguments: None
Example:
One example of where RemoveDuplicatesTokenFilterFactory is useful is when a synonym file used in conjunction with a stemmer causes some synonyms to be reduced to the same stem. Consider the
following entry from a synonyms.txt file:
Television, Televisions, TV, TVs
When used in the following configuration:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
In: "Watch TV"
Tokenizer to Synonym Filter: "Watch"(1) "TV"(2)
Synonym Filter to Stem Filter: "Watch"(1) "Television"(2) "Televisions"(2) "TV"(2) "TVs"(2)
Stem Filter to Remove Dups Filter: "Watch"(1) "Television"(2) "Television"(2) "TV"(2) "TV"(2)
Out: "Watch"(1) "Television"(2) "TV"(2)


Removes tokens that are duplicated at the same position.
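Modeling a token as a (text, position) pair, the de-duplication is straightforward (a Python sketch; function name is mine):

```python
def remove_duplicates(tokens_with_positions):
    # Drop any (text, position) pair that has already been emitted;
    # the same text at a different position is kept.
    seen, out = set(), []
    for tok, pos in tokens_with_positions:
        if (tok, pos) not in seen:
            seen.add((tok, pos))
            out.append((tok, pos))
    return out
```

Feeding in the post-stemming stream from the example above collapses the repeated "Television" and "TV" at position 2.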
Reversed Wildcard Filter

This filter reverses tokens to provide faster leading wildcard and prefix queries. Tokens without wildcards are not
reversed.
Factory class: solr.ReversedWildcardFilterFactory
Arguments:
withOriginal (boolean) If true, the filter produces both original and reversed tokens at the same positions. If
false, produces only reversed tokens.
maxPosAsterisk (integer, default = 2) The maximum position of the asterisk wildcard ('*') that triggers the
reversal of the query term. Terms with asterisks at positions above this value are not reversed.
maxPosQuestion (integer, default = 1) The maximum position of the question mark wildcard ('?') that triggers the reversal of the query term. To reverse only pure suffix queries (queries with a single leading asterisk), set this to 0 and maxPosAsterisk to 1.
maxFractionAsterisk (float, default = 0.0) An additional parameter that triggers the reversal if asterisk ('*')
position is less than this fraction of the query token length.
minTrailing (integer, default = 2) The minimum number of trailing characters in a query token after the last
wildcard character. For good performance this should be set to a value larger than 1.
Example:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
</analyzer>
In: "*foo *bar"
Tokenizer to Filter: "*foo", "*bar"
Out: "oof*", "rab*"


Reversed wildcard filter. I am not entirely clear on the details yet; can it really speed up queries?
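The core idea is just token reversal, sketched below (my own simplification; the real factory also maintains positions and flags, and the query side decides, based on the wildcard-position arguments, when to use the reversed form). A leading-wildcard query such as "*foo" is slow because it must scan the term dictionary, but against reversed terms it becomes the cheap prefix query "oof*":

```python
def reversed_wildcard(tokens, with_original=False):
    # Emit each token reversed; optionally keep the original too,
    # as withOriginal="true" does.
    out = []
    for tok in tokens:
        if with_original:
            out.append(tok)
        out.append(tok[::-1])
    return out
```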
Shingle Filter

This filter constructs shingles, which are token n-grams, from the token stream. It combines runs of tokens into a
single token.
Factory class: solr.ShingleFilterFactory
Arguments:
minShingleSize: (integer, default 2) The minimum number of tokens per shingle.
maxShingleSize: (integer, must be >= 2, default 2) The maximum number of tokens per shingle.
outputUnigrams: (true/false) If true (the default), then each individual token is also included at its original
position.
outputUnigramsIfNoShingles: (true/false) If false (the default), then individual tokens will be output if no
shingles are possible.
tokenSeparator: (string, default is " ") The default string to use when joining adjacent tokens to form a shingle.
Example:
Default behavior.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"/>
</analyzer>
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
Example:
A shingle size of four, do not include original token.

<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="4"
outputUnigrams="false"/>
</analyzer>
In: "To be, or not to be."
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "not"(4), "to"(5), "be"(6)
Out: "To be"(1), "To be or"(1), "To be or not"(1), "be or"(2), "be or not"(2), "be or not to"(2), "or not"(3), "or not
to"(3), "or not to be"(3), "not to"(4), "not to be"(4), "to be"(5)


Combines runs of consecutive tokens into new shingle tokens (token n-grams). You can choose whether the original single tokens are also emitted at their positions.
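Ignoring position bookkeeping, shingle construction can be sketched in Python (my own function; the real filter also assigns position increments):

```python
def shingle_filter(tokens, min_size=2, max_size=2,
                   output_unigrams=True, separator=" "):
    # For each position, optionally emit the original token, then every
    # shingle of min_size..max_size consecutive tokens starting there.
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out
```

With the defaults, "To be, or what?" yields the same stream as the first example above.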



Snowball Porter Stemmer Filter

Factory class: solr.SnowballPorterFilterFactory
Arguments:
language: (default "English") The name of a language, used to select the appropriate Porter stemmer to use.
Case is significant. This string is used to select a package name in the "org.tartarus.snowball.ext" class
hierarchy.
protected: Path of a text file containing a list of protected words, one per line. Protected words will not be
stemmed. Blank lines and lines that begin with "#" are ignored. This may be an absolute path, or a simple file
name in the Solr config directory.
Example:
Default behavior:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flip", "flip"
Example:
French stemmer, English words:

<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
In: "flip flipped flipping"
Tokenizer to Filter: "flip", "flipped", "flipping"
Out: "flip", "flipped", "flipping"
Example:
Spanish stemmer, Spanish words:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
</analyzer>
In: "cante canta"
Tokenizer to Filter: "cante", "canta"
Out: "cant", "cant"


Another stemming filter. It supports many languages, though there is nothing for Chinese; it can be used for English.


Standard Filter
This filter removes dots from acronyms and the substring "'s" from the end of tokens. This filter depends on the
tokens being tagged with the appropriate term-type to recognize acronyms and words with apostrophes.
Factory class: solr.StandardFilterFactory

Removes the dots from acronyms and the trailing "'s" from tokens.
Stop Filter

This filter discards, or stops analysis of, tokens that are on the given stop words list. A standard stop words list is
included in the Solr config directory, named stopwords.txt, which is appropriate for typical English language text.
Factory class: solr.StopFilterFactory
Arguments:
words: (optional) The path to a file that contains a list of stop words, one per line. Blank lines and lines that
begin with "#" are ignored. This may be an absolute path, or path relative to the Solr config directory.
format: (optional) If the stopwords list has been formatted for Snowball, you can specify format="snowball"
so Solr can read the stopwords file.
ignoreCase: (true/false, default false) Ignore case when testing for stop words. If true, the stop list should contain lowercase words.

enablePositionIncrements: if luceneMatchVersion is 4.4 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words.
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "what"(4)
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)


Stop word filter: words listed in the given file are filtered out; case sensitivity is configurable.
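The behavior above can be sketched in a few lines of Python. This is not Solr's implementation, just a minimal illustration of how stop words are dropped and how removed tokens leave position holes:

```python
# Minimal sketch (not Solr code) of StopFilter behavior: tokens on the
# stop list are dropped; surviving tokens keep their original positions,
# so gaps in the position sequence mark the removed words.

def stop_filter(tokens, stopwords, ignore_case=False):
    """tokens: list of str; returns list of (token, position) pairs."""
    stops = {w.lower() for w in stopwords} if ignore_case else set(stopwords)
    out = []
    for pos, tok in enumerate(tokens, start=1):
        key = tok.lower() if ignore_case else tok
        if key not in stops:
            out.append((tok, pos))
    return out

# Case-sensitive: capitalized "To" is not stopped.
print(stop_filter(["To", "be", "or", "what"], {"to", "be", "or"}))
# → [('To', 1), ('what', 4)]

# ignoreCase=true: "To" is stopped as well.
print(stop_filter(["To", "be", "or", "what"], {"to", "be", "or"}, ignore_case=True))
# → [('what', 4)]
```

This mirrors the two examples above: without ignoreCase the capitalized "To" survives, with it only "what" remains.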

Suggest Stop Filter

Like Stop Filter, this filter discards, or stops analysis of, tokens that are on the given stop words list. Suggest
Stop Filter differs from Stop Filter in that it will not remove the last token unless it is followed by a token
separator. For example, a query "find the" would preserve the 'the' since it was not followed by a space,
punctuation etc., and mark it as a KEYWORD so that following filters will not change or remove it. By contrast, a
query like "find the popsicle" would remove "the" as a stopword, since it's followed by a space. When
using one of the analyzing suggesters, you would normally use the ordinary StopFilterFactory in your index
analyzer and then SuggestStopFilter in your query analyzer.
Factory class: solr.SuggestStopFilterFactory
Arguments:
words: (optional; default: StopAnalyzer#ENGLISH_STOP_WORDS_SET ) The name of a stopwords file to
parse.
format: (optional; default: wordset) Defines how the words file will be parsed. If words is not specified, then format must not be specified. The valid values for the format option are:
wordset: This is the default format, which supports one word per line (including any intra-word whitespace) and allows whole-line comments beginning with the "#" character. Blank lines are ignored.
snowball: This format allows for multiple words specified on each line, and trailing comments may be
specified using the vertical line ("|"). Blank lines are ignored.
ignoreCase: (optional; default: false) If true, matching is case-insensitive.

Example:
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
words="stopwords.txt" format="wordset"/>
</analyzer>
In: "The The"
Tokenizer to Filter: "the"(1), "the"(2)
Out: "the"(2)


Similar to the Stop Filter, but the last token is not removed when it is not followed by a token separator (such as a trailing space).
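The "keep the last token unless a separator follows" rule can be sketched as follows. This is a simplified illustration, not Solr's implementation; it treats a trailing space in the raw query as the separator signal:

```python
# Minimal sketch (not Solr code) of SuggestStopFilter: a stop word is kept
# if it is the last token and was NOT followed by a separator, so that a
# partial query like "find the" can still suggest completions of "the".

def suggest_stop_filter(query, stopwords):
    ends_with_sep = query != query.rstrip()  # trailing space => separator seen
    tokens = query.split()
    out = []
    for i, tok in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if tok.lower() in stopwords and (not is_last or ends_with_sep):
            continue  # stopped
        out.append(tok)
    return out

print(suggest_stop_filter("find the", {"the"}))           # → ['find', 'the']
print(suggest_stop_filter("find the popsicle", {"the"}))  # → ['find', 'popsicle']
```

"find the" keeps "the" because nothing follows it yet; "find the popsicle" stops "the" because a space (and another word) follows.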
Synonym Filter

This filter does synonym mapping. Each token is looked up in the list of synonyms and if a match is found, then
the synonym is emitted in place of the token. The position value of the new tokens are set such they all occur at
the same position as the original token.
Factory class: solr.SynonymFilterFactory

For the following examples, assume a synonyms file named mysynonyms.txt:

couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
Example:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
In: "teh ginormous, humungous sofa"
Tokenizer to Filter: "teh"(1), "ginormous"(2), "humungous"(3), "sofa"(4)
Out: "the"(1), "large"(2), "large"(3), "couch"(4), "sofa"(4), "divan"(4)


Synonym replacement: tokens matching entries in the configuration file are replaced with, or expanded to, their synonyms.
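The two rule forms in mysynonyms.txt can be sketched as a lookup table. This is not Solr's implementation; it just shows how "a,b,c" expands each member to the whole group, "x => y" replaces, and how all emitted synonyms share the original token's position:

```python
# Minimal sketch (not Solr code) of SynonymFilter with mysynonyms.txt-style
# rules. Replacement tokens inherit the position of the original token.

RULES = {
    "couch": ["couch", "sofa", "divan"],   # couch,sofa,divan (expansion)
    "sofa":  ["couch", "sofa", "divan"],
    "divan": ["couch", "sofa", "divan"],
    "teh":   ["the"],                      # teh => the
    "huge":  ["large"],                    # huge,ginormous,humungous => large
    "ginormous": ["large"],
    "humungous": ["large"],
    "small": ["tiny", "teeny", "weeny"],   # small => tiny,teeny,weeny
}

def synonym_filter(tokens):
    out = []
    for pos, tok in enumerate(tokens, start=1):
        for syn in RULES.get(tok, [tok]):
            out.append((syn, pos))  # all synonyms occupy the same position
    return out

print(synonym_filter(["teh", "small", "couch"]))
# → [('the', 1), ('tiny', 2), ('teeny', 2), ('weeny', 2),
#    ('couch', 3), ('sofa', 3), ('divan', 3)]
```

This reproduces the first example above: "teh" is corrected, "small" is replaced by three synonyms at position 2, and "couch" expands to its group at position 3.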
Token Offset Payload Filter
This filter adds the numeric character offsets of the token as a payload value for that token.
Factory class: solr.TokenOffsetPayloadTokenFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TokenOffsetPayloadTokenFilterFactory"/>
</analyzer>
In: "bing bang boom"
Tokenizer to Filter: "bing", "bang", "boom"
Out: "bing"[0,4], "bang"[5,9], "boom"[10,14]


Records each token's absolute character offsets as its payload.
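Computing those [start, end) offsets can be sketched as below; this is an illustration of the payload values, not Solr's implementation:

```python
# Minimal sketch (not Solr code) of TokenOffsetPayloadTokenFilter: each
# whitespace-separated token is annotated with its character offsets.

def offsets(text):
    out, start = [], 0
    for tok in text.split():
        start = text.index(tok, start)  # locate token from last position
        end = start + len(tok)
        out.append((tok, start, end))
        start = end
    return out

print(offsets("bing bang boom"))
# → [('bing', 0, 4), ('bang', 5, 9), ('boom', 10, 14)]
```

The result matches the example above: "bing"[0,4], "bang"[5,9], "boom"[10,14].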

Trim Filter

This filter trims leading and/or trailing whitespace from tokens. Most tokenizers break tokens at whitespace, so this filter is most often used for special situations.
Factory class: solr.TrimFilterFactory
Arguments:
updateOffsets: if luceneMatchVersion is 4.3 or earlier and updateOffsets="true", trimmed tokens' start and end offsets will be updated to those of the first and last characters (plus one) remaining in the token. This argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
The PatternTokenizerFactory configuration used here splits the input on simple commas; it does not remove whitespace.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
In: "one, two , three ,four "
Tokenizer to Filter: "one", " two ", " three ", "four "
Out: "one", "two", "three", "four"


This filter removes leading and trailing whitespace from tokens.
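The comma-split-then-trim pipeline from the example can be sketched in two lines; this is an illustration, not Solr code:

```python
# Minimal sketch (not Solr code) of the PatternTokenizer + TrimFilter
# pipeline above: split on commas, then strip surrounding whitespace.

def trim_pipeline(text):
    tokens = text.split(",")            # PatternTokenizer with pattern=","
    return [t.strip() for t in tokens]  # TrimFilter

print(trim_pipeline("one, two , three ,four "))
# → ['one', 'two', 'three', 'four']
```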

Type As Payload Filter

This filter adds the token's type, as an encoded byte sequence, as its payload.
Factory class: solr.TypeAsPayloadTokenFilterFactory
Arguments: None
Example:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.TypeAsPayloadTokenFilterFactory"/>
</analyzer>
In: "Pay Bob's I.O.U."
Tokenizer to Filter: "Pay", "Bob's", "I.O.U."
Out: "Pay"[<ALPHANUM>], "Bob's"[<APOSTROPHE>], "I.O.U."[<ACRONYM>]


Adds each token's type as its payload.

Type Token Filter

This filter blacklists or whitelists a specified list of token types, assuming the tokens have type metadata
associated with them. For example, the UAX29 URL Email Tokenizer emits "<URL>" and "<EMAIL>" typed
tokens, as well as other types. This filter would allow you to pull out only e-mail addresses from text as tokens, if
you wish.
Factory class: solr.TypeTokenFilterFactory
Arguments:
types: Defines the location of a file of types to filter.

useWhitelist: If true, the file defined in types should be used as an include list. If false, or undefined, the file defined in types is used as a blacklist.
enablePositionIncrements: if luceneMatchVersion is 4.3 or earlier and enablePositionIncrements="false", no position holes will be left by this filter when it removes tokens. This argument is invalid if luceneMatchVersion is 5.0 or later.
Example:
<analyzer>
<filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
useWhitelist="true"/>
</analyzer>


Uses the given list of token types as a blacklist or whitelist when filtering tokens.
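The whitelist/blacklist behavior can be sketched over (token, type) pairs. This is not Solr's implementation, and the sample tokens and types are made up for illustration:

```python
# Minimal sketch (not Solr code) of TypeTokenFilter: keep or drop tokens
# by comparing their type metadata against a type set.

def type_token_filter(tokens, types, use_whitelist=False):
    """tokens: list of (token, type) pairs."""
    if use_whitelist:
        return [(t, ty) for t, ty in tokens if ty in types]      # include list
    return [(t, ty) for t, ty in tokens if ty not in types]      # blacklist

# Hypothetical output of the UAX29 URL Email Tokenizer:
tokens = [("bob@example.com", "<EMAIL>"), ("visit", "<ALPHANUM>"),
          ("http://example.com", "<URL>")]

# useWhitelist="true" with types containing only <EMAIL>:
print(type_token_filter(tokens, {"<EMAIL>"}, use_whitelist=True))
# → [('bob@example.com', '<EMAIL>')]
```

With useWhitelist off, the same set would instead remove the e-mail token and keep the rest.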
Word Delimiter Filter
This filter splits tokens at word delimiters. The rules for determining delimiters are determined as follows:
A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting splitOnCaseChange="0".
A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL". This can be disabled by setting splitOnNumerics="0".
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
A trailing "'s" is removed: "O'Reilly's" -> "O", "Reilly"
Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
Factory class: solr.WordDelimiterFilterFactory


A filter that splits words at delimiters. It has too many rules to list them all here.
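A few of the splitting rules above (case changes, alpha/digit transitions, non-alphanumeric delimiters) can be approximated with a single regular expression. This is a rough sketch, not Solr's implementation, and it does not cover every rule (for example, the trailing "'s" handling):

```python
# Minimal sketch (not Solr code) of some WordDelimiterFilter rules:
# split on case changes, alpha<->digit transitions, and treat any
# non-alphanumeric character as a discarded delimiter.
import re

def word_delimiter(token):
    # uppercase runs not followed by lowercase | capitalized/lowercase word | digits
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)

print(word_delimiter("CamelCase"))     # → ['Camel', 'Case']
print(word_delimiter("Gonzo5000"))     # → ['Gonzo', '5000']
print(word_delimiter("4500XL"))        # → ['4500', 'XL']
print(word_delimiter("--hot-spot--"))  # → ['hot', 'spot']
```

Leading and trailing delimiters disappear naturally because the regex only ever matches alphanumeric runs.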
