A Brief Introduction to NLTK

1. Overview

NLTK does not support Chinese word segmentation. For English, it provides basic text-processing tools, a rich set of corpora, and pre-trained models.
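
Most of the examples below depend on data packages that are downloaded separately from the library itself. A minimal setup sketch (resource names can vary slightly across NLTK versions):

import nltk
nltk.download('punkt')                       # tokenizer models used by word_tokenize / sent_tokenize
nltk.download('averaged_perceptron_tagger')  # the default model behind pos_tag
nltk.download('wordnet')                     # lexical database used by WordNetLemmatizer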

2. Common Text Processing

2.1 Word and Sentence Tokenization

from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

Output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']

from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))

Output: ['God is Great!', 'I won a lottery.']

2.2 Part-of-Speech (POS) Tagging

A POS tagging example:

Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
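
These tags can be reproduced by running pos_tag over the tokenized sentence; note that the trailing period also receives a tag, which the example above omits:

from nltk import word_tokenize, pos_tag
text = "Everything to permit us."
print(pos_tag(word_tokenize(text)))

Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]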

Some example POS tags:

Abbreviation   Meaning
CC             coordinating conjunction
CD             cardinal number
DT             determiner
EX             existential there
FW             foreign word
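
The full Penn Treebank tagset, with definitions and examples, can also be printed from NLTK itself; a small sketch (this requires the separately downloaded 'tagsets' resource):

import nltk
nltk.download('tagsets')       # documentation strings for the tagset
nltk.help.upenn_tagset('CC')   # prints the definition and examples for CC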

2.3 Chunking

Chunking is a shallow parse of sentence structure built on top of POS tagging, and it can be used for entity recognition (see the extraction sketch after the chunk output below). The result of the process is a set of chunks. In shallow parsing there is at most one level between roots and leaves, while deep parsing comprises more than one level.

Chunk rules

There are no predefined rules; you compose them to suit your needs, for example:

chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

The symbols ., * and ? mean the following:

Symbol   Description
.        any character except a new line
*        match 0 or more repetitions
?        match 0 or 1 repetitions

Chunk rule code example

from nltk import pos_tag
from nltk import RegexpParser
text = "learn php from guru99 and make study easy".split()
print("After Split:", text)
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)
output = chunker.parse(tokens_tag)
print("After Chunking", output)

Output

After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
<ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk php/NN)
  from/IN
  (mychunk guru99/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))
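
Because chunker.parse returns an nltk.Tree, the chunks can also be pulled out programmatically, which is what the entity-recognition use case mentioned above needs. A short sketch, reusing the output variable from the example:

# iterate over only the subtrees produced by the mychunk rule
for subtree in output.subtrees(filter=lambda t: t.label() == 'mychunk'):
    print(" ".join(word for word, tag in subtree.leaves()))

This prints one chunk per line: learn, php, guru99 and, study easy.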

Chunking code example

import nltk
text = "learn php from guru99"
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw() # draws the noun-phrase chunk tree graphically in a separate window

Output

['learn', 'php', 'from', 'guru99'] -- These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')] -- These are the pos_tag
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN)) -- Noun Phrase Chunking

(Figure: the noun-phrase chunk tree drawn by result.draw())

2.4 Stemming and Lemmatization

There are two situations to handle:

  1. one meaning, many surface expressions;
  2. one word, many inflected forms.

In both cases the text needs to be 'normalized'.

There are two approaches, stemming and lemmatization; lemmatization generally produces better results than stemming.

Stemming

from nltk.stem import PorterStemmer
e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)  # reduce each word to its stem
    print(rootWord)

Output
wait
wait
wait
wait
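
PorterStemmer is only one of the stemmers NLTK ships; for example, the language-aware SnowballStemmer can be swapped in with the same stem() interface:

from nltk.stem import SnowballStemmer
snow = SnowballStemmer("english")  # Snowball supports several languages
print(snow.stem("waiting"))        # wait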

Lemmatization

from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict
# map the first letter of a Penn Treebank tag to a WordNet POS; default to noun
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = "guru99 is a totally new kind of learning experience."
tokens = word_tokenize(text)
lemma_function = WordNetLemmatizer()
for token, tag in pos_tag(tokens):
    lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)

Output

guru99 => guru99
is => be
a => a
totally => totally
new => new
kind => kind
of => of
learning => learn
experience => experience
. => .

Why is Lemmatization better than Stemming?

A stemming algorithm works by cutting the suffix from the word; in a broader sense, it cuts either the beginning or the end of the word.

In contrast, lemmatization is a more powerful operation that takes the morphological analysis of the word into account. It returns the lemma, which is the base form of all of a word's inflectional forms. In-depth linguistic knowledge is required to create the dictionaries and look up the proper form of the word. Stemming is a general operation, while lemmatization is an intelligent one that looks the proper form up in a dictionary. Hence, lemmatization helps in building better machine-learning features.
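
A small comparison sketch makes the difference concrete; note that WordNetLemmatizer defaults to treating words as nouns unless a POS is passed in:

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

ps = PorterStemmer()
lem = WordNetLemmatizer()

# stemming just cuts suffixes and can yield non-words; lemmatization returns the dictionary form
print(ps.stem("studies"), lem.lemmatize("studies"))         # studi study
print(ps.stem("was"), lem.lemmatize("was", pos='v'))        # wa be
print(ps.stem("better"), lem.lemmatize("better", pos='a'))  # better good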
