Interface IIcuTokenizer
Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
Namespace: OpenSearch.Client
Assembly: OpenSearch.Client.dll
Syntax
public interface IIcuTokenizer : ITokenizer
Remarks
Requires the analysis-icu plugin to be installed.
Properties
RuleFiles
You can customize the icu_tokenizer behavior by specifying per-script rule files; see the RBBI rules syntax reference for a more detailed explanation.
Declaration
[DataMember(Name = "rule_files")]
string RuleFiles { get; set; }
Property Value
| Type | Description |
| --- | --- |
| string | |
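As a rough sketch, the tokenizer might be wired into an index's analysis settings through the client's fluent API. The index name, tokenizer name, analyzer name, and the `Latn:my_rules.rbbi` rule-file mapping below are all illustrative assumptions, as is the exact shape of the fluent descriptor (assumed here to mirror the `RuleFiles` property); consult the analysis-icu plugin documentation for the supported `script:rulefile` format.

```csharp
using OpenSearch.Client;

var client = new OpenSearchClient();

// Hypothetical example: create an index whose analyzer uses an ICU
// tokenizer with a custom per-script rule file (assumed names throughout).
var createIndexResponse = client.Indices.Create("my-index", c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizers(t => t
                // "my_icu_tokenizer" is an arbitrary key for this tokenizer
                .Icu("my_icu_tokenizer", icu => icu
                    // Per-script rule file in "script:rulefile" form;
                    // the file is expected in the node's config directory
                    .RuleFiles("Latn:my_rules.rbbi")))
            .Analyzers(an => an
                .Custom("my_icu_analyzer", ca => ca
                    .Tokenizer("my_icu_tokenizer"))))));
```

The rule file itself is written in ICU RBBI (rule-based break iterator) syntax and must be available on every node of the cluster running the analysis-icu plugin.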