Class TokenizersDescriptor
Inheritance
TokenizersDescriptor
Assembly: OpenSearch.Client.dll
Syntax
public class TokenizersDescriptor : IsADictionaryDescriptorBase<TokenizersDescriptor, ITokenizers, string, ITokenizer>, IDescriptor, IPromise<ITokenizers>
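A minimal sketch of where this descriptor fits (the `client` instance and the index name "my-index" are illustrative, not from this page): tokenizers are registered by name on an index's analysis settings, and each method below adds one named tokenizer to that collection.

```csharp
// Illustrative sketch: `client` is an assumed, already-configured OpenSearchClient.
// Each call on the TokenizersDescriptor registers a named tokenizer in the
// index's analysis settings.
var createIndexResponse = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizers(t => t
                .Whitespace("ws")   // whitespace tokenizer registered as "ws"
                .Standard("std")    // standard tokenizer registered as "std"
            )
        )
    )
);
```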
Constructors
TokenizersDescriptor()
Declaration
public TokenizersDescriptor()
Methods
CharGroup(string, Func<CharGroupTokenizerDescriptor, ICharGroupTokenizer>)
Tokenizes the string on a configured set of characters. Whenever a character from this set is encountered, a
new token is started. The set accepts either single characters, e.g. -, or character groups: whitespace, letter, digit,
punctuation, symbol.
Declaration
public TokenizersDescriptor CharGroup(string name, Func<CharGroupTokenizerDescriptor, ICharGroupTokenizer> selector)
Parameters
Returns
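A sketch of configuring a char_group tokenizer through this method (the tokenizer name "on_punct" and the chosen character set are illustrative):

```csharp
// Fragment for use inside an Analysis(...) configuration; tokenizes whenever
// '-', ',' or any whitespace character is encountered.
.Tokenizers(t => t
    .CharGroup("on_punct", cg => cg
        .TokenizeOnCharacters("-", ",", "whitespace")
    )
)
```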
EdgeNGram(string, Func<EdgeNGramTokenizerDescriptor, IEdgeNGramTokenizer>)
A tokenizer of type edge_ngram that emits n-grams anchored to the start of each word.
Declaration
public TokenizersDescriptor EdgeNGram(string name, Func<EdgeNGramTokenizerDescriptor, IEdgeNGramTokenizer> selector)
Parameters
Returns
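A sketch of a typical autocomplete-style configuration using this method (the name "autocomplete" and the gram lengths are illustrative):

```csharp
// Fragment for use inside an Analysis(...) configuration; emits front-anchored
// grams of length 2 to 10, e.g. "quick" -> "qu", "qui", "quic", "quick".
.Tokenizers(t => t
    .EdgeNGram("autocomplete", e => e
        .MinGram(2)
        .MaxGram(10)
        .TokenChars(TokenChar.Letter, TokenChar.Digit)
    )
)
```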
Icu(string, Func<IcuTokenizerDescriptor, IIcuTokenizer>)
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much
like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach
to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer
text into syllables.
Part of the analysis-icu plugin.
Declaration
public TokenizersDescriptor Icu(string name, Func<IcuTokenizerDescriptor, IIcuTokenizer> selector)
Parameters
Returns
Keyword(string, Func<KeywordTokenizerDescriptor, IKeywordTokenizer>)
A tokenizer of type keyword that emits the entire input as a single token.
Declaration
public TokenizersDescriptor Keyword(string name, Func<KeywordTokenizerDescriptor, IKeywordTokenizer> selector)
Parameters
Returns
Kuromoji(string, Func<KuromojiTokenizerDescriptor, IKuromojiTokenizer>)
A tokenizer that uses the Kuromoji morphological analyzer to tokenize Japanese text.
Part of the analysis-kuromoji plugin.
Declaration
public TokenizersDescriptor Kuromoji(string name, Func<KuromojiTokenizerDescriptor, IKuromojiTokenizer> selector)
Parameters
Returns
Letter(string, Func<LetterTokenizerDescriptor, ILetterTokenizer>)
A tokenizer of type letter that divides text at non-letters; that is, it defines tokens as maximal strings of
adjacent letters.
Note that this does a decent job for most European languages, but a terrible job for some Asian languages, where
words are not separated by spaces.
Declaration
public TokenizersDescriptor Letter(string name, Func<LetterTokenizerDescriptor, ILetterTokenizer> selector)
Parameters
Returns
Lowercase(string, Func<LowercaseTokenizerDescriptor, ILowercaseTokenizer>)
A tokenizer of type lowercase that performs the function of the letter tokenizer and the lowercase token filter together.
It divides text at non-letters and converts the resulting tokens to lower case.
While it is functionally equivalent to combining the letter tokenizer with the lowercase token filter,
performing both tasks in a single pass gives a performance advantage, hence this (redundant) implementation.
Declaration
public TokenizersDescriptor Lowercase(string name, Func<LowercaseTokenizerDescriptor, ILowercaseTokenizer> selector)
Parameters
Returns
NGram(string, Func<NGramTokenizerDescriptor, INGramTokenizer>)
A tokenizer of type ngram that breaks text into n-grams of configurable length.
Declaration
public TokenizersDescriptor NGram(string name, Func<NGramTokenizerDescriptor, INGramTokenizer> selector)
Parameters
Returns
Nori(string, Func<NoriTokenizerDescriptor, INoriTokenizer>)
A tokenizer that uses the Nori morphological analyzer to tokenize Korean text.
Part of the analysis-nori plugin.
Declaration
public TokenizersDescriptor Nori(string name, Func<NoriTokenizerDescriptor, INoriTokenizer> selector)
Parameters
Returns
PathHierarchy(string, Func<PathHierarchyTokenizerDescriptor, IPathHierarchyTokenizer>)
The path_hierarchy tokenizer takes something like this:
/something/something/else
And produces tokens:
/something
/something/something
/something/something/else
Declaration
public TokenizersDescriptor PathHierarchy(string name, Func<PathHierarchyTokenizerDescriptor, IPathHierarchyTokenizer> selector)
Parameters
Returns
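A sketch of configuring the tokenizer described above (the name "paths" is illustrative; '/' is the delimiter used in the example tokens):

```csharp
// Fragment for use inside an Analysis(...) configuration;
// "/something/something/else" -> "/something", "/something/something",
// "/something/something/else".
.Tokenizers(t => t
    .PathHierarchy("paths", p => p
        .Delimiter('/')
    )
)
```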
Pattern(string, Func<PatternTokenizerDescriptor, IPatternTokenizer>)
A tokenizer of type pattern that can flexibly separate text into terms via a regular expression.
Declaration
public TokenizersDescriptor Pattern(string name, Func<PatternTokenizerDescriptor, IPatternTokenizer> selector)
Parameters
Returns
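A sketch of configuring a pattern tokenizer through this method (the name "comma_split" and the pattern are illustrative):

```csharp
// Fragment for use inside an Analysis(...) configuration; the regular
// expression acts as the split pattern, so input is split on commas.
.Tokenizers(t => t
    .Pattern("comma_split", p => p
        .Pattern(",")
    )
)
```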
Standard(string, Func<StandardTokenizerDescriptor, IStandardTokenizer>)
A tokenizer of type standard, providing a grammar-based tokenizer that works well for most European-language
documents.
It implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Declaration
public TokenizersDescriptor Standard(string name, Func<StandardTokenizerDescriptor, IStandardTokenizer> selector = null)
Parameters
Returns
UaxEmailUrl(string, Func<UaxEmailUrlTokenizerDescriptor, IUaxEmailUrlTokenizer>)
A tokenizer of type uax_url_email that works exactly like the standard tokenizer, but tokenizes emails and URLs as
single tokens.
Declaration
public TokenizersDescriptor UaxEmailUrl(string name, Func<UaxEmailUrlTokenizerDescriptor, IUaxEmailUrlTokenizer> selector)
Parameters
Returns
UserDefined(string, ITokenizer)
Declaration
public TokenizersDescriptor UserDefined(string name, ITokenizer analyzer)
Parameters
Returns
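This overload registers an already-constructed ITokenizer instance rather than building one through a descriptor, which is useful for configurations the fluent methods above do not cover. A sketch (the name "my_keyword" and the KeywordTokenizer settings are illustrative):

```csharp
// Fragment for use inside an Analysis(...) configuration; registers a
// pre-built tokenizer instance under the name "my_keyword".
.Tokenizers(t => t
    .UserDefined("my_keyword", new KeywordTokenizer { BufferSize = 256 })
)
```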
Whitespace(string, Func<WhitespaceTokenizerDescriptor, IWhitespaceTokenizer>)
A tokenizer of type whitespace that divides text at whitespace.
Declaration
public TokenizersDescriptor Whitespace(string name, Func<WhitespaceTokenizerDescriptor, IWhitespaceTokenizer> selector = null)
Parameters
Returns
Implements
Extension Methods