    Interface IIcuTokenizer

    Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

    Inherited Members
    ITokenizer.Type
    ITokenizer.Version
    Namespace: OpenSearch.Client
    Assembly: OpenSearch.Client.dll
    Syntax
    public interface IIcuTokenizer : ITokenizer
    Remarks

    Requires the analysis-icu plugin to be installed.

    Properties

    RuleFiles

    You can customize the ICU tokenizer's behavior by specifying per-script rule files. See the RBBI rules syntax reference for a more detailed explanation.

    Declaration
    [DataMember(Name = "rule_files")]
    string RuleFiles { get; set; }
    Property Value
    Type Description
    string
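
    As a rough illustration of what this property serializes to, the fragment below sketches an index settings body with a custom ICU tokenizer. It assumes the value format described in the analysis-icu plugin documentation: a comma-separated list of script-code:rule-file pairs, where the script code is a four-letter ISO 15924 code. The tokenizer name my_icu_tokenizer and the file name KeywordTokenizer.rbbi are hypothetical placeholders, and the rule file is assumed to already exist in the node's config directory.

    ```json
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "my_icu_tokenizer": {
              "type": "icu_tokenizer",
              "rule_files": "Latn:KeywordTokenizer.rbbi"
            }
          }
        }
      }
    }
    ```

    In this sketch, setting RuleFiles to "Latn:KeywordTokenizer.rbbi" would correspond to the rule_files entry, so the custom rules would apply only to Latin-script text and other scripts would keep the default behavior.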

    Extension Methods

    SuffixExtensions.Suffix(object, string)