1. Overview
NLTK does not support Chinese word segmentation. For English, it provides basic text-processing tools, rich corpora, and pretrained models.
2. Common Text Processing
2.1 Word and Sentence Tokenization
```python
# requires: nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "God is Great! I won a lottery."
print(sent_tokenize(text))  # ['God is Great!', 'I won a lottery.']
print(word_tokenize(text))  # ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
```
2.2 Part-of-Speech Tagging
POS tagging example

```python
# requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "Everything to permit us."
print(pos_tag(word_tokenize(text)))
# [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]
```
Example POS tags

| Abbreviation | Meaning |
|---|---|
| CC | coordinating conjunction |
| CD | cardinal digit |
| DT | determiner |
| EX | existential there |
| FW | foreign word |
2.3 Chunking
Chunking is a shallow parse of sentence structure built on top of POS tagging, and it can be used for entity recognition. The grouped results are called chunks. In shallow parsing there is at most one level between roots and leaves, while deep parsing comprises more than one level.
Chunking rules
There are no predefined rules, but you can combine rules to suit your needs, for example:
```
chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
```
The symbols ., *, and ? have the following meanings:
| Name of symbol | Description |
|---|---|
| . | Any character except new line |
| * | Match 0 or more repetitions |
| ? | Match 0 or 1 repetitions |
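These are the standard regular-expression quantifiers, so their behavior can be checked with Python's built-in re module (a quick illustration, independent of NLTK):

```python
import re

# "." matches any single character except a newline;
# "?" matches 0 or 1 of the preceding token;
# "*" matches 0 or more of the preceding token.
tag = re.compile(r"NN.?")            # matches NN, NNS, NNP, ...
print(bool(tag.fullmatch("NN")))     # True: ".?" matched nothing
print(bool(tag.fullmatch("NNS")))    # True: "." matched "S"
print(bool(tag.fullmatch("NNPS")))   # False: at most one extra character

seq = re.compile(r"(NN.? )*")        # "*": zero or more noun tags in a row
print(bool(seq.fullmatch("NN NNS ")))  # True
print(bool(seq.fullmatch("")))         # True: zero repetitions also match
```

In an NLTK chunk grammar such as {<NN.?>*}, the same quantifiers apply, but to whole POS tags rather than to single characters.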
Chunking rule code example

```python
from nltk import pos_tag, RegexpParser

text = "learn php from guru99 and make study easy".split()
print("After Split:", text)
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)
patterns = "chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"
chunker = RegexpParser(patterns)
print("After Regex:", chunker.parse(tokens_tag))
```

Output

```
After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
```
Chunking code example (the sentence and NP rule follow the classic noun-phrase chunking example from the NLTK book)

```python
import nltk
from nltk.tokenize import word_tokenize

sentence = "the little yellow dog barked at the cat"
grammar = "NP: {<DT>?<JJ>*<NN>}"  # noun phrase: optional determiner, adjectives, noun
tokens = nltk.pos_tag(word_tokenize(sentence))
tree = nltk.RegexpParser(grammar).parse(tokens)
print(tree)  # a Tree whose NP subtrees are the detected chunks
```
2.4 Stemming and Lemmatization
Two situations arise:
- one meaning, multiple expressions;
- one word, multiple word forms;
so text, too, needs to be normalized.
There are two approaches, stemming and lemmatization; lemmatization produces better results than stemming.
Stemming
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["wait", "waiting", "waited", "waits"]:
    print(stemmer.stem(w))  # all four stem to "wait"
```
Lemmatization
```python
# requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("corpora"))             # corpus
print(lemmatizer.lemmatize("better", pos=wn.ADJ))  # good
```
Why is Lemmatization better than Stemming?
A stemming algorithm works by cutting the suffix off a word; in a broader sense, it cuts either the beginning or the end of the word.
By contrast, lemmatization is a more powerful operation that takes the morphological analysis of words into account. It returns the lemma, which is the base form of all of a word's inflectional forms. In-depth linguistic knowledge is required to build the dictionaries and look up the proper form of a word. Stemming is a general operation, while lemmatization is an intelligent operation that looks the proper form up in a dictionary. Hence, lemmatization helps in forming better machine learning features.
3. References
- NLTK tutorial (English; very detailed)
- Python · Category: nltk (English; Python community blog)
- NLTK text analysis + jieba Chinese text mining (a brief introduction to NLTK and jieba)