

TextaCy in Python: Usage and Code Examples


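In this article, we introduce the textacy module for Python, which is commonly used to perform a variety of NLP tasks on text. It is built on top of the spaCy library.

Some of textacy's features are:

  • Text cleaning and preprocessing functions that replace or remove punctuation, extra whitespace, numbers, and so on before the text is processed with spaCy.
  • Automatic language detection, tokenization and vectorization of documents, and training and interpreting topic models.
  • Custom extensions that extend spaCy's core functionality for working with one or more documents.
  • Ready-made datasets containing both text content and metadata, such as Reddit comments, Congressional speeches, and historical books.
  • Tools for extracting features from processed documents as structured data: n-grams, entities, acronyms, key phrases, and SVO (subject-verb-object) triples.
  • A variety of similarity metrics for comparing strings and sequences.
  • Text readability and lexical diversity statistics, such as the type-token ratio, multilingual Flesch Reading Ease, and Flesch-Kincaid grade level.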
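Of the statistics named in the last item, the type-token ratio is simple enough to work out by hand: it is the number of distinct words (types) divided by the total number of words (tokens). The snippet below is a minimal standard-library sketch of that definition, not textacy's own implementation. The function name `type_token_ratio` and the simple regex tokenizer are assumptions made for illustration.

```python
import re


def type_token_ratio(text):
    """Return distinct words / total words for a lowercased, regex-tokenized text."""
    # Hypothetical tokenizer: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


line = "Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings"
ttr = type_token_ratio(line)
# "our" occurs twice, so there are 13 distinct words among 14 tokens.
print(round(ttr, 3))
```

A value close to 1 means few words are repeated. Longer texts tend to score lower simply because common words recur.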

Installing textacy:

We can install the textacy module with pip.

pip install textacy

If you use conda, run the following command instead:

conda install -c conda-forge textacy
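The examples in this article call functions such as `preprocessing.remove.punctuation`, a layout that, to the best of my knowledge, belongs to newer textacy releases (0.11 and later), so it can help to check which version is installed. The sketch below uses only the standard library. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("textacy")
print(version if version else "textacy is not installed")
```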

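Examples of some of its functions:

Here we look at a few notable functions of the textacy module.

Removing punctuation

Using textacy's preprocessing module, we can easily remove punctuation from text.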

Python3


from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
print(rm_punc)

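The sample text used here was obtained from an external website. First we import textacy's preprocessing module, then call the punctuation function from its remove submodule to strip the punctuation. As the output shows, each punctuation mark is replaced by a space rather than deleted outright, which is why lour'd becomes lour d.

Output: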

Now is the winter of our discontent
Made glorious summer by this sun of York 
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried 
Now are our brows bound with victorious wreaths 
Our bruised arms hung up for monuments 
Our stern alarums changed to merry meetings 
Our dreadful marches to delightful measures 
Grim visaged war hath smooth d his wrinkled front 
And now  instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I  that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I  that am rudely stamp d  and want love s majesty
To strut before a wanton ambling nymph
I  that am curtail d of this fair proportion
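The output above suggests that every punctuation character is swapped for a single space. A rough standard-library approximation of that behavior is sketched below. It is not textacy's implementation, and it covers only ASCII punctuation (`string.punctuation`), which is an assumption made for brevity. The name `remove_punctuation_sketch` is hypothetical.

```python
import string

# Translation table mapping each ASCII punctuation character to one space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_punctuation_sketch(text):
    """Replace every ASCII punctuation character in text with a space."""
    return text.translate(PUNCT_TO_SPACE)


line = "Grim-visaged war hath smooth'd his wrinkled front;"
cleaned = remove_punctuation_sketch(line)
print(repr(cleaned))
```

Because marks are replaced rather than removed, a trailing semicolon leaves a trailing space, and a comma followed by a space leaves two spaces. This is why a whitespace cleanup step is usually applied afterwards.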

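Removing unnecessary whitespace

We can also remove unnecessary whitespace from text. This collapses every run of extra spaces so that only a single space remains between words.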

Python3


from textacy import preprocessing
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
print(rm_wsp)

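Here we used the whitespace function from the normalize submodule to remove the extra whitespace.

Output:

In the output, all the extra spaces are gone, but the punctuation is still there. If we want to remove that as well, we can combine the two operations.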

Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
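To make the collapsing behavior concrete, here is a minimal regex-based sketch using only the standard library. It approximates what the output above shows (runs of spaces become one space, leading and trailing whitespace is stripped) and additionally collapses runs of line breaks. That last step, like the function name `normalize_whitespace_sketch`, is an assumption for illustration and may differ in detail from textacy's behavior.

```python
import re


def normalize_whitespace_sketch(text):
    """Collapse runs of line breaks and runs of other whitespace, then strip the ends."""
    # Runs of line breaks become a single "\n".
    text = re.sub(r"(\r\n|[\n\v])+", "\n", text)
    # Runs of any other whitespace (spaces, tabs) become a single space.
    text = re.sub(r"[^\S\n]+", " ", text)
    return text.strip()


messy = "\nNow is the winter of our      discontent\n\n\nMade glorious   summer\n"
tidy = normalize_whitespace_sketch(messy)
print(tidy)
```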

Removing punctuation and whitespace together

Python3


from textacy import preprocessing
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
# Remove Punctuation and Whitespace both
rm_all = preprocessing.normalize.whitespace(rm_punc)
print(rm_all)

Output:

Now is the winter of our discontent
Made glorious summer by this sun of York
And all the clouds that lour d upon our house
In the deep bosom of the ocean buried
Now are our brows bound with victorious wreaths
Our bruised arms hung up for monuments
Our stern alarums changed to merry meetings
Our dreadful marches to delightful measures
Grim visaged war hath smooth d his wrinkled front
And now instead of mounting barded steeds
To fright the souls of fearful adversaries
He capers nimbly in a lady s chamber
To the lascivious pleasing of a lute
But I that am not shaped for sportive tricks
Nor made to court an amorous looking glass
I that am rudely stamp d and want love s majesty
To strut before a wanton ambling nymph
I that am curtail d of this fair proportion
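The order of the two steps matters: removing punctuation leaves extra spaces behind, so whitespace normalization has to run last. One way to make that ordering explicit is to compose the steps into a single callable. The sketch below does this with the standard library only. `make_pipeline_sketch`, `strip_punctuation`, and `collapse_spaces` are hypothetical helpers, not textacy functions, although recent textacy releases appear to ship a similar `preprocessing.make_pipeline` helper that is worth checking in the documentation of your installed version.

```python
import re
import string
from functools import reduce


def make_pipeline_sketch(*funcs):
    """Return a function that applies each text -> text function in order."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline


def strip_punctuation(text):
    """Replace ASCII punctuation with spaces (illustrative stand-in)."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def collapse_spaces(text):
    """Collapse runs of non-newline whitespace and strip the ends."""
    return re.sub(r"[^\S\n]+", " ", text).strip()


# Punctuation first, whitespace second, so leftover gaps are cleaned up.
clean = make_pipeline_sketch(strip_punctuation, collapse_spaces)
result = clean("But I,       that am not shaped for sportive tricks,")
print(result)
```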

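Partitioning the text

The text we receive or work with is sometimes "raw", meaning unstructured, messy, and so on. Before analysis, during preprocessing, we may therefore need to clean it and partition it according to some criterion.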

Python3


from textacy import preprocessing
from textacy import extract
ex = """
Now is the winter of our      discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the         deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern        alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of       mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I,       that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut        before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
# Remove Punctuation and Whitespace both
rm_all = preprocessing.normalize.whitespace(rm_punc)
# Extracting text
ext = list(extract.keyword_in_context(
    rm_all, 'I', window_width=20, pad_context=True))
print(ext)

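Output:

The output looks cluttered because the sample text is not well suited to this purpose: as the results show, the keyword is matched case-insensitively and as a substring, so every "i" inside a word is reported as well. Because the input had already been stripped of punctuation and extra whitespace, neither appears in the context strings. The runs of spaces at the start and end of the output are padding caused by window_width together with pad_context=True; the extra whitespace in the text itself was removed along with the punctuation.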

[('                Now ', 'i', 's the winter of our '), 
('        Now is the w', 'i', 'nter of our disconte'), 
(' the winter of our d', 'i', 'scontent\nMade glorio'), 
('discontent\nMade glor', 'i', 'ous summer by this s'), 
('lorious summer by th', 'i', 's sun of York \nAnd a'), 
('ur d upon our house\n', 'I', 'n the deep bosom of '), 
('som of the ocean bur', 'i', 'ed \nNow are our brow'), 
('re our brows bound w', 'i', 'th victorious wreath'), 
('r brows bound with v', 'i', 'ctorious wreaths \nOu'), 
('ws bound with victor', 'i', 'ous wreaths \nOur bru'), 
('ous wreaths \nOur bru', 'i', 'sed arms hung up for'), 
('hanged to merry meet', 'i', 'ngs \nOur dreadful ma'), 
('adful marches to del', 'i', 'ghtful measures \nGri'), 
('ightful measures \nGr', 'i', 'm visaged war hath s'), 
('ful measures \nGrim v', 'i', 'saged war hath smoot'), 
(' war hath smooth d h', 'i', 's wrinkled front \nAn'), 
('hath smooth d his wr', 'i', 'nkled front \nAnd now'), 
('kled front \nAnd now ', 'i', 'nstead of mounting b'), 
('now instead of mount', 'i', 'ng barded steeds\nTo '), 
(' barded steeds\nTo fr', 'i', 'ght the souls of fea'), 
(' of fearful adversar', 'i', 'es \nHe capers nimbly'), 
('rsaries \nHe capers n', 'i', 'mbly in a lady s cha'), 
('s \nHe capers nimbly ', 'i', 'n a lady s chamber\nT'), 
(' chamber\nTo the lasc', 'i', 'vious pleasing of a '), 
('hamber\nTo the lasciv', 'i', 'ous pleasing of a lu'), 
('the lascivious pleas', 'i', 'ng of a lute \nBut I '), 
('sing of a lute \nBut ', 'I', ' that am not shaped '), 
('not shaped for sport', 'i', 've tricks \nNor made '), 
('aped for sportive tr', 'i', 'cks \nNor made to cou'), 
('ourt an amorous look', 'i', 'ng glass \nI that am '), 
('rous looking glass \n', 'I', ' that am rudely stam'), 
('before a wanton ambl', 'i', 'ng nymph \nI that am '), 
('nton ambling nymph \n', 'I', ' that am curtail d o'), 
('mph \nI that am curta', 'i', 'l d of this fair pro'), 
('t am curtail d of th', 'i', 's fair proportion   '), 
('curtail d of this fa', 'i', 'r proportion        '), 
('of this fair proport', 'i', 'on                  ')]

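The excerpt below shows the result when punctuation and whitespace are not removed beforehand. I have not included the full output because it is long, and with all the punctuation and extra whitespace left in it is hard to read.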

[('               \nNow ', 'i', 's the winter of our '), 
('       \nNow is the w', 'i', 'nter of our      dis'), 
('winter of our      d', 'i', 'scontent\nMade glorio'), 
('discontent\nMade glor', 'i', 'ous summer by this s'), 
('lorious summer by th', 'i', 's sun of York;\nAnd a'), 
("ur'd upon our house\n", 'I', 'n the         deep b'), 
('som of the ocean bur', 'i', 'ed.\nNow are our brow').......]
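To see why so many lowercase "i" matches appear, and how a stricter pattern avoids them, the following standard-library sketch reimplements the keyword-in-context idea with `re.finditer`. It is an illustration of the technique, not textacy's code. The name `kwic_sketch` and its parameters are hypothetical, and unlike the textacy call above it does not pad short contexts.

```python
import re


def kwic_sketch(text, pattern, window_width=20, flags=0):
    """Yield (left context, match, right context) for each regex match in text."""
    for m in re.finditer(pattern, text, flags):
        left = text[max(0, m.start() - window_width):m.start()]
        right = text[m.end():m.end() + window_width]
        yield left, m.group(), right


sample = "But I that am not shaped for sportive tricks"

# Case-insensitive substring search: also hits the "i" inside other words.
loose = list(kwic_sketch(sample, "i", flags=re.IGNORECASE))

# Case-sensitive search with word boundaries: only the standalone pronoun "I".
strict = list(kwic_sketch(sample, r"\bI\b"))

print(len(loose), len(strict))
print(strict)
```

If the goal is to find the word "I" rather than the letter, a word-boundary pattern like the second call is the usual fix. Whether a compiled pattern can be passed as the keyword to textacy's function should be confirmed against the documentation for the installed version.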

Replacing URLs in text with other text

We can strip any unneeded URLs out of the text and replace them with other text:

Python3


from textacy import preprocessing
# Replace URLs
txt = "https://www.geeksforgeeks.org/ is the best place to learn anything"
rm_url = preprocessing.replace.urls(txt,"GeeksforGeeks")
print(rm_url)

Output:

Replacing email addresses with other text

Python3


from textacy import preprocessing
# Replace Emails
mail = "Send me a mail in the following address - example@gmail.com"
rm_mail = preprocessing.replace.emails(mail,"UserMail")
print(rm_mail)

Output:

Replacing phone numbers

Python3


from textacy import preprocessing
# Replace phone number
num = "Call me at 12345678910"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
print(rm_num)

Output:

If we pass several phone numbers, all of them are replaced with NUM.

Python3


from textacy import preprocessing
# Replace phone number
num = "Call me at 12345678910 or 7896451235"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
print(rm_num)

Output:

Replacing any numbers

Python3


from textacy import preprocessing
# Replace Number
n = "Any number like 12 or 86 , maybe 100 etc"
rm_n = preprocessing.replace.numbers(n,"Numbers")
print(rm_n)

Output:

Removing text enclosed in parentheses and brackets:

Python3


from textacy import preprocessing
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica (CWI) in the Netherlands 
as a successor to the ABC programming language, which was inspired by SETL, 
capable of exception handling (from the start plus new capabilities in Python 3.11)"""
print(preprocessing.remove.brackets(txt))

Output:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica  in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling

We can also pass a keyword argument named only, giving a list of the bracket types we want to remove. It supports three values: square, curly, and round.

Python3


from textacy import preprocessing
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica (CWI) in the Netherlands 
as a successor to the [ABC programming language], which was inspired by SETL, 
capable of exception handling {from the start plus new capabilities in Python 3.11}"""
print(preprocessing.remove.brackets(txt,only=["round","square"]))

Output:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde 
& Informatica  in the Netherlands
as a successor to the , which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}
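The selective behavior of only can be pictured as one regular expression per bracket type, applied only for the types requested. The sketch below illustrates that idea with the standard library. It is an assumption about how such a filter could work, not textacy's implementation, and the names `BRACKET_PATTERNS` and `remove_brackets_sketch` are hypothetical. It does not handle nested brackets.

```python
import re

# One pattern per bracket type; each matches a non-nested bracketed span.
BRACKET_PATTERNS = {
    "round": re.compile(r"\([^()]*\)"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "curly": re.compile(r"\{[^{}]*\}"),
}


def remove_brackets_sketch(text, only=None):
    """Remove bracketed spans; restrict to the bracket types listed in only, if given."""
    if only is None:
        kinds = list(BRACKET_PATTERNS)
    elif isinstance(only, str):
        kinds = [only]
    else:
        kinds = list(only)
    for kind in kinds:
        text = BRACKET_PATTERNS[kind].sub("", text)
    return text


sample = "Informatica (CWI) and the [ABC language] {from the start}"
result = remove_brackets_sketch(sample, only=["round", "square"])
print(repr(result))
```

As in the textacy output above, the removed spans leave their surrounding spaces behind, so a whitespace normalization pass afterwards tidies the result.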




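Note: This article was selected and compiled by 纯净天空 from the original English work "TextaCy module in Python" by dwaipayan_bandyopadhyay. Unless otherwise stated, copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.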