The diversity of Natural Language Processing (NLP) is truly amazing: things we could never have imagined before are now just a few lines of code away.
It really is delightful.

Removing Stop Words and Text Normalization in Python with NLTK and spaCy

But working with text data brings its own set of challenges.
Machines have a hard time processing raw text.
Before we can work on text data with NLP techniques, we need to perform a few steps known as preprocessing.

Skip these steps and we end up with a poor model.
These are essential NLP techniques you will need to incorporate into your code, frameworks, and projects.

In this article, we will discuss how to remove stop words and perform text normalization in Python using a few very popular NLP libraries: NLTK, spaCy, Gensim, and TextBlob.

Table of Contents

- What are stop words?
- Why do we need to remove stop words?
- When should we remove stop words?
- Different methods to remove stop words
  - Using NLTK
  - Using spaCy
  - Using Gensim
- Introduction to text normalization
- What are stemming and lemmatization?
- Methods to perform stemming and lemmatization
  - Using NLTK
  - Using spaCy
  - Using TextBlob

1. What are stop words?

Stop words are the most common words in any natural language.
For the purpose of analyzing text data and building NLP models, these stop words may not add much value to the meaning of a document.

Generally, the most common words used in English text are "the", "is", "in", "for", "where", "when", "to", "at", and so on.

Consider the text "There is a pen on the table".
Now, the words "is", "a", "on", and "the" add no meaning to the statement when we parse it.
Whereas words like "there", "pen", and "table" are the keywords and tell us what the sentence is about.

Generally speaking, tokenization is performed before removing stop words, as in the quick sketch below.
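As an illustration of that order of operations, here is a minimal sketch using NLTK on the example sentence from above (the nltk.download calls are one-time setup, and the exact output may vary with the corpus version):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# one-time downloads, if the data is not already present
# nltk.download('punkt')
# nltk.download('stopwords')

sentence = "There is a pen on the table"
tokens = word_tokenize(sentence)                         # step 1: tokenize
stops = set(stopwords.words('english'))
content = [t for t in tokens if t.lower() not in stops]  # step 2: filter
print(content)  # keywords such as 'pen' and 'table' remain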

Here is a list of stop words that might be useful to you:

a about after all also always am an and any are at be been being but by came can cant come
could did didn't do does doesn't doing don't else for from get give goes going had happen
has have having how i if ill i'm in into is isn't it its i've just keep let like made make
many may me mean more most much no not now of only or our really say see some something
take tell than that the their them then they thing this to try up us use used uses very
want was way we what when where which who why will with without wont you your youre

2. Why do we need to remove stop words?

This is a very important question that you must keep in mind:

Removing stop words is not a hard-and-fast rule in NLP.
It depends on the task we are working on.
For tasks like text classification, where the text is to be classified into different categories, removing stop words from the given text lets us focus more on the words that define the meaning of the text.

As we saw in the previous section, words like "there" and "pen" add more meaning than words like "is" and "on".

However, in tasks like machine translation and text summarization, removing stop words is not advisable.

Here are some of the main benefits of removing stop words:

- When stop words are removed, the dataset size decreases, and the time to train the model decreases with it.
- Removing stop words can potentially help improve performance, as fewer and only meaningful tokens are left. Thus, it could increase classification accuracy.
- Even search engines like Google remove stop words in order to retrieve data from their databases quickly.

3. When should we remove stop words?

I have summarized this into two parts: when to remove stop words and when to avoid removing them.

Remove stop words

We can remove stop words when performing the following tasks:

- Text classification
- Spam filtering
- Language classification
- Genre classification
- Caption generation
- Auto-tag generation

Avoid removing stop words

- Machine translation
- Language modeling
- Text summarization
- Question-answering (QA) systems

4. Different methods to remove stop words

4.1. Removing stop words with NLTK

NLTK, the Natural Language Toolkit, is a library for text preprocessing.
It is one of my favorite Python libraries.
NLTK has stop word lists for 16 different languages.

You can view the list of stop words in NLTK with the following code:

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # one-time download, if the corpus is missing
set(stopwords.words('english'))
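If you are curious which of the 16 languages are covered, the corpus reader's fileids() method lists them (a quick check; the exact list depends on your NLTK data version):

from nltk.corpus import stopwords
print(stopwords.fileids())  # e.g. ['arabic', 'danish', 'dutch', 'english', ...]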

Now, to remove stop words with NLTK, you can use the following code block:

# The code below removes stop words from a sentence using nltk

# import the packages
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

set(stopwords.words('english'))

# sample text (the spelling mistakes are part of the original sample)
text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# set of stop words
stop_words = set(stopwords.words('english'))

# tokenize the text
word_tokens = word_tokenize(text)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print("\n\nOriginal Sentence \n\n")
print(" ".join(word_tokens))

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))

This is our sentence after tokenization:

He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were.

After removing the stop words:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

Notice that the size of the text has been almost cut in half!
Can you picture just how useful removing stop words can be?

4.2. Removing stop words with spaCy

spaCy is one of the most versatile and widely used libraries in NLP.
We can quickly and efficiently remove stop words from a given text with spaCy.
It has its own list of stop words, which can be imported from the spacy.lang.en.stop_words module.

Here is how you can remove stop words using spaCy in Python:

from spacy.lang.en import English

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = English()

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

# The "nlp" object is used to create documents with linguistic annotations
my_doc = nlp(text)

# build the list of tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)

from spacy.lang.en.stop_words import STOP_WORDS

# create the list of words left after removing stop words
filtered_sentence = []

for word in token_list:
    lexeme = nlp.vocab[word]
    if lexeme.is_stop == False:
        filtered_sentence.append(word)

print(token_list)
print(filtered_sentence)

This is the list we obtained after tokenization:

He determined to drop his litigation with the monastry and relinguish his claims to the wood-cuting and \n fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had \n indeed the vaguest idea where the wood and river in question were.

And the list after removing stop words:

determined drop litigation monastry, relinguish claims wood-cuting \n fishery rihgts. ready becuase rights become valuable, \n vaguest idea wood river question

One thing to note here is that removing stop words does not remove punctuation or newline characters; we need to remove those manually, as in the sketch below.
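Here is a minimal sketch of that manual cleanup, using spaCy's per-token flags is_stop, is_punct, and is_space to drop stop words, punctuation, and whitespace in a single pass:

from spacy.lang.en import English

nlp = English()
doc = nlp("There is a pen on the table.\n")

# keep only tokens that are not stop words, punctuation, or whitespace
cleaned = [t.text for t in doc if not (t.is_stop or t.is_punct or t.is_space)]
print(cleaned)  # e.g. ['pen', 'table']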

4.3. Removing stop words with Gensim

Gensim is a pretty handy library for NLP tasks.
It also offers a way to remove stop words during preprocessing.
We can easily import the remove_stopwords method from the gensim.parsing.preprocessing module.

Let's try removing stop words with Gensim:

# The following code removes stop words using Gensim
from gensim.parsing.preprocessing import remove_stopwords

# pass the sentence to the remove_stopwords function
result = remove_stopwords("""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were.""")

print('\n\n Filtered Sentence \n\n')
print(result)

Output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts once. He ready becuase rights valuable, vaguest idea wood river question were.

When removing stop words with Gensim, we can work directly on the raw text.
There is no need to perform tokenization before removing stop words.
This can save us a lot of time.

5. Introduction to text normalization

In any natural language, words can be written or spoken in more than one form depending on the situation.
That is part of what makes language so beautiful.
For example:

- Lisa ate the food and washed the dishes.
- They were eating noodles at a cafe.
- Don't you want to eat before we leave?
- We have just eaten our breakfast.
- It also eats fruit and vegetables.

In all these sentences, we can see that the word "eat" appears in multiple forms.
For us, it is easy to understand that eating is the activity in question.
So whether it is 'eat', 'ate', or 'eaten' makes no difference to us, because we know what is happening.

Unfortunately, that is not the case with machines.
They treat these words differently.
Therefore, we need to normalize them to their root word, which is "eat" in our example.

Text normalization, then, is the process of transforming a word into a single canonical form.
This can be done through two processes, stemming and lemmatization.
Let's understand what they mean in detail.

6. What are stemming and lemmatization?

Stemming and lemmatization are simply forms of word normalization, which means reducing a word to its root form.

In most natural languages, a root word can have many variants.
For example, the word "play" can appear as "playing", "played", "plays", and so on.
You can think up similar examples of your own (and there is no shortage of them).

Stemming

Let's first understand stemming:

Stemming is a text normalization technique that cuts off the end or beginning of a word, based on a list of common prefixes and suffixes that can be found in that word.
It is a basic rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word.
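A tiny sketch of this rule-based behavior with NLTK's PorterStemmer; note how blindly chopping suffixes can also produce non-words such as 'studi':

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["playing", "played", "plays", "studies"]:
    print(w, "->", ps.stem(w))  # play, play, play, studi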

Lemmatization

Lemmatization, on the other hand, is a structured procedure for obtaining the root form of a word.
It makes use of vocabulary (the dictionary importance of words) and morphological analysis (word structure and grammatical relations).

Why do we need to perform stemming or lemmatization?

Consider the following two sentences:

- He was driving
- He went for a drive

We can easily tell that both sentences convey the same meaning: a driving activity in the past.
A machine, however, will treat the two sentences differently.
So, to make the text understandable to the machine, we need to perform stemming or lemmatization.

Another benefit of text normalization is that it reduces the vocabulary size of the text data.
This helps bring down the training time of machine learning models.
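A quick sketch of that vocabulary-reduction effect, counting distinct words before and after stemming the "play" variants from earlier:

from nltk.stem import PorterStemmer

words = ["play", "playing", "played", "plays"]
ps = PorterStemmer()
print(len(set(words)))                   # 4 distinct surface forms
print(len({ps.stem(w) for w in words}))  # 1 after stemming: 'play'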

Which one should we choose?

Stemming algorithms work by cutting suffixes or prefixes off the word.
Lemmatization is a more powerful operation because it takes the word's morphological analysis into account.

Lemmatization returns the lemma, the root word shared by all of a word's inflected forms.

We can say that stemming is a quick but rather rough way of chopping words down to their root form, whereas lemmatization is an intelligent operation that uses dictionaries built with in-depth linguistic knowledge.
Hence, lemmatization helps produce better results.
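To make the contrast concrete, here is a minimal comparison using NLTK's PorterStemmer and WordNetLemmatizer (the wordnet corpus may need a one-time nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lem = WordNetLemmatizer()

print(ps.stem("flies"))              # 'fli'  -- crude suffix chopping
print(lem.lemmatize("flies", "v"))   # 'fly'  -- a real dictionary word
print(lem.lemmatize("better", "a"))  # 'good' -- aware of inflected forms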

7. Methods to perform text normalization

7.1. Text normalization with NLTK

The NLTK library has many amazing methods for performing the different steps of data preprocessing.
Methods such as PorterStemmer() and WordNetLemmatizer() perform stemming and lemmatization, respectively.

Let's see them in action.

Stemming

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

set(stopwords.words('english'))

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

Stem_words = []
ps = PorterStemmer()
for w in filtered_sentence:
    rootWord = ps.stem(w)
    Stem_words.append(rootWord)

print(filtered_sentence)
print(Stem_words)

Output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determin drop litig monastri, relinguish claim wood-cut fisheri rihgt. He readi becuas right become much less valuabl, inde vaguest idea wood river question.

We can clearly see the difference here. Now let's perform lemmatization on the same text.

Lemmatization

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
from nltk.stem import WordNetLemmatizer

set(stopwords.words('english'))

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(text)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(filtered_sentence)

lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for w in filtered_sentence:
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

Output:

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase right become much le valuable, indeed vaguest idea wood river question.

Here, "v" stands for verb, "a" stands for adjective, and "n" stands for noun.
The lemmatizer only lemmatizes those words that match the pos parameter of the lemmatize method.

Lemmatization is done on the basis of part-of-speech (POS) tagging.
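If you would rather not hard-code the pos argument, one common pattern is to derive the WordNet POS from an automatic tagger. This is a sketch, assuming NLTK's pos_tag and its Penn Treebank tag set (nltk.download('averaged_perceptron_tagger') may be needed once):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # map a Penn Treebank tag to the matching WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in nltk.pos_tag(["He", "was", "driving"]):
    print(lemmatizer.lemmatize(word.lower(), wordnet_pos(tag)))  # he, be, drive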

7.2. Text normalization with spaCy

As we saw earlier, spaCy is an excellent NLP library.
It provides many industrial-grade methods for performing lemmatization.
Unfortunately, spaCy has no method for stemming.
To perform lemmatization, check out the code below:

# make sure to download the English model first with "python -m spacy download en"
import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp(u"""He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were.""")

lemma_word1 = []
for token in doc:
    lemma_word1.append(token.lemma_)
lemma_word1

Output:

-PRON- determine to drop -PRON- litigation with the monastry, and relinguish -PRON- claim to the wood-cuting and fishery rihgts at once. -PRON- be the more ready to do this becuase the right have become much less valuable, and -PRON- have indeed the vague idea where the wood and river in question be.

Here, -PRON- is the notation for a pronoun, and it can easily be removed with regular expressions.
One advantage of spaCy is that we do not have to pass any pos parameter to perform lemmatization.
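A minimal sketch of that regex cleanup, assuming the lemma_word1 list produced by the code above:

import re

lemma_text = " ".join(lemma_word1)
lemma_text = re.sub(r"-PRON-", "", lemma_text)         # drop the pronoun placeholders
lemma_text = re.sub(r"\s+", " ", lemma_text).strip()   # collapse leftover whitespace
print(lemma_text)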

7.3. Text normalization with TextBlob

TextBlob is a Python library made especially for preprocessing text data.
It is based on the NLTK library.
We can use TextBlob to perform lemmatization.
However, there is no module for stemming in TextBlob.

So let's see how to perform lemmatization with TextBlob in Python:

# import the Word method from the textblob library
from textblob import Word

text = """He determined to drop his litigation with the monastry, and relinguish his claims to the wood-cuting and fishery rihgts at once. He was the more ready to do this becuase the rights had become much less valuable, and he had indeed the vaguest idea where the wood and river in question were."""

lem = []
for i in text.split():
    word1 = Word(i).lemmatize("n")
    word2 = Word(word1).lemmatize("v")
    word3 = Word(word2).lemmatize("a")
    lem.append(Word(word3).lemmatize())

print(lem)

Output:

He determine to drop his litigation with the monastry, and relinguish his claim to the wood-cuting and fishery rihgts at once. He wa the more ready to do this becuase the right have become much le valuable, and he have indeed the vague idea where the wood and river in question were.

Just as we saw in the NLTK section, TextBlob also uses POS tagging to perform lemmatization.

8. Closing notes

Keep in mind that stop words actually play an important role in problems such as sentiment analysis and question-answering systems.
That is why removing them can severely affect our model's accuracy in those cases.