Custom Analyzers

Analyzers

An analyzer is the component that splits a piece of text into tokens. It is made up of three parts:

  • char_filter: pre-processes the raw text, e.g. removing certain symbols or stripping HTML
  • tokenizer: splits the text produced by the previous step into tokens according to its rules
  • token filter: post-processes the resulting tokens, e.g. lowercasing, removing stop words, adding synonyms, and so on

Analysis always runs strictly in the order char_filter → tokenizer → token filter, as sketched below.
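As a quick illustration of the three stages, the _analyze API lets you combine a char filter, a tokenizer, and a token filter ad hoc and inspect the resulting tokens. A minimal sketch using only built-in components (html_strip, standard, lowercase):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello WORLD</p>"
}

The char filter strips the HTML tags, the tokenizer splits the remaining text into Hello and WORLD, and the token filter lowercases them to hello and world.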

1. Make king's and kings score the same

Method 1: use a char_filter to strip the ' character

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "' => "
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "king's"
}

POST my_index/_doc/2
{
  "name": "kings"
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "king's"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "kings"
    }
  }
}
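To check that the char filter really makes the two values identical, the tokens can be inspected directly. This verification step is a small addition to the original example: both king's and kings should end up as the single term kings, so the two match queries above should return both documents with the same score.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "king's"
}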

Method 2: use a synonym token filter

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "king's => kings"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "king's"
}

POST my_index/_doc/2
{
  "name": "kings"
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "king's"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "kings"
    }
  }
}
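Here the standard tokenizer keeps king's as a single token and the synonym rule then rewrites it to kings. A verification sketch, same idea as above: analyzing king's should output the single term kings, so both match queries should again score the two documents identically.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "king's"
}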

2. Make dog & cat and dog and cat score the same in match_phrase queries

The idea is similar to the previous example: either use a char_filter to map & to and, or simply define them as synonyms.

char_filter mapping:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "dog & cat"
}

POST my_index/_doc/2
{
  "name": "dog and cat"
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog & cat"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog and cat"
    }
  }
}
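A verification sketch: analyzing dog & cat with this analyzer should produce the tokens dog, and, cat, exactly the same as analyzing dog and cat, which is why the two match_phrase queries above should score both documents the same.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "dog & cat"
}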

Synonym filter:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "dog & cat => dog and cat"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "dog & cat"
}

POST my_index/_doc/2
{
  "name": "dog and cat"
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog & cat"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog and cat"
    }
  }
}
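Note that the synonym rule's left-hand side is itself run through the standard tokenizer, which drops the &, so the rule effectively maps the token sequence dog cat to dog and cat. A verification sketch: analyzing dog & cat should again yield dog, and, cat.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "dog & cat"
}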

3. Make oa, oA, OA, Oa, and dingding score the same

This again comes down to using synonyms:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "oa,oA,OA,Oa,dingding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
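The original example does not index any test documents. Following the pattern of the earlier examples, here is a small verification sketch (the document IDs and field values are only illustrative):

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "OA"
}

POST my_index/_doc/1
{
  "name": "OA"
}

POST my_index/_doc/2
{
  "name": "dingding"
}

Because this is an equivalence-style synonym list (no =>), each variant should be expanded into all five terms at the same position, so the match query below for any one of the variants should return both documents with the same score.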


GET my_index/_search
{
  "query": {
    "match": {
      "name": "oa"
    }
  }
}

4. Remove useless characters

This is a problem I actually ran into. The application uses an external tokenizer, and ES only stores the pre-split tokens, but the tokenization interface currently returns a string that looks like "['张三','李四','新冠肺炎','感染者']". Only the Chinese terms should be kept, and they must not be split any further. So I used the following approach:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "[ => ",
            "] => ",
            "' => "
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": " ['张三','李四','新冠肺炎','感染者']"
}

5. Limit Chinese token length: single characters are not wanted

The solution is shown below, but use it with caution, because it can hurt query accuracy: if analysis can only produce single-character tokens for a given piece of text, this filter will leave nothing behind and the query will return no results. Also note that with both min and max set to 2, the length filter keeps only two-character tokens; single characters are dropped, and so are longer tokens such as four-character words.

Just compare the output of the two analyzers below to see the difference.

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_filter"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "length",
          "min": 2,
          "max": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": " 我在人民广场吃炸鸡,喝啤酒,好生快乐"
}

GET my_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": " 我在人民广场吃炸鸡,喝啤酒,好生快乐"
}
