Custom Analyzers

Analyzers

An analyzer is the component that splits a piece of text into tokens. It is made up of three parts:

  • char_filter: pre-processes the raw text, e.g. removing certain symbols or stripping HTML
  • tokenizer: splits the text produced by the previous step into tokens according to its rules
  • token filter: post-processes the resulting tokens, e.g. lowercasing, removing stop words, adding synonyms, and so on

Analysis always runs strictly in the order char_filter → tokenizer → token filter, as sketched below.
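As a quick illustration of the three stages, the _analyze API lets you combine a char filter, a tokenizer, and a token filter ad hoc and inspect the resulting tokens. A minimal sketch using only built-in components (html_strip, standard, lowercase):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello WORLD</p>"
}

The char filter strips the HTML tags, the tokenizer splits the remaining text into Hello and WORLD, and the token filter lowercases them to hello and world.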

1. Make king's and kings score the same

Method 1: use a char_filter to strip the ' character

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "' => "
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "king's"
}

POST my_index/_doc/2
{
  "name": "kings"
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "king's"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "kings"
    }
  }
}
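To check that the char filter really makes the two values identical, the tokens can be inspected directly. This verification step is a small addition to the original example: both king's and kings should end up as the single term kings, so the two match queries above should return both documents with the same score.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "king's"
}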

Method 2: use a synonym token filter

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "king's => kings"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "king's"
}

POST my_index/_doc/2
{
  "name": "kings"
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "king's"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "kings"
    }
  }
}
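Here the standard tokenizer keeps king's as a single token and the synonym rule then rewrites it to kings. A verification sketch, same idea as above: analyzing king's should output the single term kings, so both match queries should again score the two documents identically.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "king's"
}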

2. Make dog & cat and dog and cat score the same in match_phrase queries

The idea is similar to the previous example: either use a char_filter to map & to and, or simply define them as synonyms.

char_filter mapping:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "dog & cat"
}

POST my_index/_doc/2
{
  "name": "dog and cat"
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog & cat"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog and cat"
    }
  }
}
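A verification sketch: analyzing dog & cat with this analyzer should produce the tokens dog, and, cat, exactly the same as analyzing dog and cat, which is why the two match_phrase queries above should score both documents the same.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "dog & cat"
}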

Synonym filter:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "dog & cat => dog and cat"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

POST my_index/_doc/1
{
  "name": "dog & cat"
}

POST my_index/_doc/2
{
  "name": "dog and cat"
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog & cat"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "name": "dog and cat"
    }
  }
}
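Note that the synonym rule's left-hand side is itself run through the standard tokenizer, which drops the &, so the rule effectively maps the token sequence dog cat to dog and cat. A verification sketch: analyzing dog & cat should again yield dog, and, cat.

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "dog & cat"
}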

3. Make oa, oA, OA, Oa, and dingding score the same

This again comes down to using synonyms:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "filter": ["my_token_filter"],
          "tokenizer": "standard"
        }
      },
      "filter": {
        "my_token_filter": {
          "type": "synonym",
          "synonyms": [
            "oa,oA,OA,Oa,dingding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
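The original example does not index any test documents. Following the pattern of the earlier examples, here is a small verification sketch (the document IDs and field values are only illustrative):

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "OA"
}

POST my_index/_doc/1
{
  "name": "OA"
}

POST my_index/_doc/2
{
  "name": "dingding"
}

Because this is an equivalence-style synonym list (no =>), each variant should be expanded into all five terms at the same position, so the match query below for any one of the variants should return both documents with the same score.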


GET my_index/_search
{
  "query": {
    "match": {
      "name": "oa"
    }
  }
}

4. Remove useless characters

This is a problem I actually ran into. The application uses an external tokenizer, and ES only stores the pre-split tokens, but the tokenization interface currently returns a string that looks like "['张三','李四','新冠肺炎','感染者']". Only the Chinese terms should be kept, and they must not be split any further. So I used the following approach:

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "[ => ",
            "] => ",
            "' => "
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": " ['张三','李四','新冠肺炎','感染者']"
}

5. Limit Chinese token length: single characters are not wanted

The solution is shown below, but use it with caution, because it can hurt query accuracy: if analysis can only produce single-character tokens for a given piece of text, this filter will leave nothing behind and the query will return no results. Also note that with both min and max set to 2, the length filter keeps only two-character tokens; single characters are dropped, and so are longer tokens such as four-character words.

Just compare the output of the two analyzers below to see the difference.

DELETE my_index

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_filter"]
        }
      },
      "filter": {
        "my_filter": {
          "type": "length",
          "min": 2,
          "max": 2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": " 我在人民广场吃炸鸡,喝啤酒,好生快乐"
}

GET my_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": " 我在人民广场吃炸鸡,喝啤酒,好生快乐"
}
