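Analyzers
An analyzer is the component that splits a piece of text into terms. It consists of three parts:
char_filter: pre-processes the raw text, for example stripping certain symbols or HTML markup.
tokenizer: splits the text produced by the previous step into tokens according to its rules.
token filter: post-processes the tokens, for example lowercasing them, removing stop words, or adding synonyms.
Analysis always runs strictly in the order char_filter -> tokenizer -> token filter.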
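To watch the three stages work together, the _analyze API accepts an ad hoc chain without creating an index. The request below is a minimal sketch: the sample text is made up for illustration, while html_strip, standard and lowercase are Elasticsearch built-ins.

```console
# char_filter strips the HTML, the tokenizer splits into words, the token filter lowercases
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>The King's <b>Speech</b></p>"
}
```

Each token in the response carries its position and offsets, which makes it easier to see what each stage changed.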
1. Give king's and kings the same score
Option 1: strip the ' character with a char_filter
DELETE my_index
# Custom analyzer: the mapping char_filter deletes apostrophes before tokenizing
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "' => "
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
# Index one document per spelling
POST my_index/_doc/1
{
  "name": "king's"
}
POST my_index/_doc/2
{
  "name": "kings"
}
# Both queries should return both documents with the same score
GET my_index/_search
{
  "query": {
    "match": {
      "name": "king's"
    }
  }
}
GET my_index/_search
{
  "query": {
    "match": {
      "name": "kings"
    }
  }
}
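A quick way to confirm the effect is to run both spellings through the analyzer and compare the tokens. This is a sketch that assumes the my_index definition above has been created; both requests should come back with the single token kings.

```console
# The apostrophe is removed by the char_filter before the standard tokenizer runs
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "king's"
}
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "kings"
}
```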
Option 2: use a synonym token filter
DELETE my_index
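A minimal sketch of this option follows. The rule king's => kings and the names my_token_filter and my_custom_analyzer are chosen for illustration and are not taken from the original listing.

```console
# Assumed rule: rewrite the token king's to kings after tokenizing
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_token_filter"]
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": ["king's => kings"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
```

The same two documents and match queries as in Option 1 can then be used to compare the scores.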
2. Make match_phrase queries for dog & cat and dog and cat score the same
The idea is much the same as above: either use a char_filter to map & to and, or define the two as synonyms.
char_filter mapping:
DELETE my_index
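A minimal sketch of the char_filter variant, reusing the index, field and analyzer names from the other examples. The rule & => and is an illustration rather than the original listing.

```console
# Assumed rule: replace & with the word and before tokenizing
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
# Inspect the tokens produced for the & spelling
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "dog & cat"
}
```

With this mapping, dog & cat is rewritten to dog and cat before the standard tokenizer runs, so both phrasings should index the same three tokens.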
Synonyms:
DELETE my_index
# Custom analyzer: the synonym token filter rewrites "dog & cat" to "dog and cat"
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "dog & cat => dog and cat"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
# Index one document per phrasing
POST my_index/_doc/1
{
  "name": "dog & cat"
}
POST my_index/_doc/2
{
  "name": "dog and cat"
}
# Both phrase queries should return both documents with the same score
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog & cat"
    }
  }
}
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog and cat"
    }
  }
}
3. Give oa, oA, OA, Oa and dingding the same query score
This is another exercise in synonyms:
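DELETE my_index
# Custom analyzer: all five spellings are declared equivalent
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "oa,oA,OA,Oa,dingding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
# Sample documents, added here for illustration so the query has something to match
POST my_index/_doc/1
{
  "name": "OA"
}
POST my_index/_doc/2
{
  "name": "dingding"
}
# Any of the synonyms can be used as the query text
GET my_index/_search
{
  "query": {
    "match": {
      "name": "oa"
    }
  }
}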
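4. Strip unwanted characters
This is a problem I actually ran into. The application uses an external tokenizer and ES only stores the tokens it has already produced, but the tokenization API currently returns a single string in a format like "['张三','李四','新冠肺炎','感染者']". Only the Chinese tokens should be kept, and they must not be split any further. So I used the following approach:
DELETE my_index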
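One way to get this behaviour, as a sketch and not necessarily the original configuration, is a pattern tokenizer that treats every run of non-Chinese characters as a separator. The name my_tokenizer and the character range \u4e00-\u9fa5 (the common CJK ideographs only) are assumptions made for illustration.

```console
# Assumed approach: split on anything that is not a Chinese character,
# so quotes, commas and brackets act as separators and each word stays whole
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\u4e00-\\u9fa5]+"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
# Check the tokens produced for the sample string
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "['张三','李四','新冠肺炎','感染者']"
}
```

Under these assumptions the _analyze call should yield the four tokens 张三, 李四, 新冠肺炎 and 感染者, with no token filter applied afterwards.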
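5. Restrict Chinese token length and drop single characters
The solution is shown below, but use it with care because it can hurt search accuracy: if a piece of text can only be tokenized into single characters, this filter removes every token and queries for that text return no results.
Compare the output of the two analyzers to verify: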
DELETE my_index
# Custom analyzer: the length filter keeps only tokens of exactly two characters
# (ik_max_word comes from the IK analysis plugin)
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_filter"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "length",
          "min": 2,
          "max": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
# With the length filter
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": " 我在人民广场吃炸鸡,喝啤酒,好生快乐"
}
# Plain ik_max_word, for comparison
GET my_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": " 我在人民广场吃炸鸡,喝啤酒,好生快乐"
}
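If the goal is only to drop single-character tokens while keeping longer words, one option is to set min alone and leave max at its default. The sketch below defines the filter inline in _analyze, so no index change is needed; it assumes the IK plugin that provides ik_max_word is installed.

```console
# Assumed variant: no upper bound, so tokens longer than two characters survive
GET _analyze
{
  "tokenizer": "ik_max_word",
  "filter": [
    { "type": "length", "min": 2 }
  ],
  "text": "我在人民广场吃炸鸡,喝啤酒,好生快乐"
}
```

Comparing this output with the two requests above shows which tokens the max setting removes.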