Interface IIcuTokenizer
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
Namespace: OpenSearch.Client
Assembly: OpenSearch.Client.dll
Syntax
public interface IIcuTokenizer : ITokenizer
Remarks
Requires the analysis-icu plugin to be installed.
Properties
RuleFiles
You can customize the icu_tokenizer behavior by specifying per-script rule files; see the RBBI rules syntax reference for a more detailed explanation.
Declaration
[DataMember(Name = "rule_files")]
string RuleFiles { get; set; }
Property Value
| Type | Description |
| --- | --- |
| string | |
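As a rough sketch, the tokenizer might be wired into an index's analysis settings through the client's fluent API. The index name, tokenizer name, analyzer name, and the `Latn:my_rules.rbbi` rule-file mapping below are all illustrative assumptions, as is the exact shape of the fluent descriptor (assumed here to mirror the `RuleFiles` property); consult the analysis-icu plugin documentation for the supported `script:rulefile` format.

```csharp
using OpenSearch.Client;

var client = new OpenSearchClient();

// Hypothetical example: create an index whose analyzer uses an ICU
// tokenizer with a custom per-script rule file (assumed names throughout).
var createIndexResponse = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizers(t => t
                // "my_icu_tokenizer" is an arbitrary key for this tokenizer
                .Icu("my_icu_tokenizer", icu => icu
                    // Per-script rule file in "script:rulefile" form;
                    // the file is expected in the node's config directory
                    .RuleFiles("Latn:my_rules.rbbi")))
            .Analyzers(an => an
                .Custom("my_icu_analyzer", ca => ca
                    .Tokenizer("my_icu_tokenizer"))))));
```

The rule file itself is written in ICU RBBI (rule-based break iterator) syntax and must be available on every node of the cluster running the analysis-icu plugin.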