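This article introduces textacy, a Python library for performing a variety of NLP tasks on text. It is built on top of the spaCy library.
Some of the features of textacy are:
- It provides functions for cleaning and preprocessing text, such as replacing or removing punctuation, extra whitespace, and numbers, before the text is processed with spaCy.
- It supports automatic language identification, tokenizing and vectorizing documents, and then training and interpreting topic models.
- Custom extensions can be added to extend spaCy's core functionality for working with one or more documents.
- It can load prepared datasets that contain both text content and metadata, such as Reddit comments, Congressional speeches, and historical books.
- It provides tools for extracting structured data from processed documents, such as n-grams, entities, acronyms, key phrases, and SVO (subject-verb-object) triples.
- Strings and sequences can be compared using a variety of similarity metrics.
- It computes text readability and lexical diversity statistics, such as the type-token ratio, multilingual Flesch Reading Ease, and Flesch-Kincaid Grade Level.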
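To make one of the statistics in the list above concrete, here is a minimal standard-library sketch of the type-token ratio: the number of distinct words divided by the total number of words. The type_token_ratio helper and its naive regex tokenizer are assumptions made for illustration only. textacy computes its statistics on spaCy-processed documents, so its numbers may differ.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct-word count divided by total word count (hypothetical helper)."""
    # Naive tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sample = "Now is the winter of our discontent Made glorious summer by this sun of York"
print(round(type_token_ratio(sample), 3))
```

A ratio close to 1 means few words are repeated; longer texts tend to score lower because common words recur.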
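Installing textacy:
textacy can be installed with pip:
pip install textacy
If you use conda, run the following command instead:
conda install -c conda-forge textacy
Examples of some of its functions:
This section walks through some of the most useful functions in textacy.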
Removing punctuation
With the preprocessing module of textacy, we can easily remove punctuation from text.
Python3
from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
print(rm_punc)
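The sample text was taken from an external website (it is the opening soliloquy of Shakespeare's Richard III). We first import the preprocessing module from textacy, then call the punctuation function in its remove submodule to strip the punctuation.
Output: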
Now is the winter of our discontent Made glorious summer by this sun of York And all the clouds that lour d upon our house In the deep bosom of the ocean buried Now are our brows bound with victorious wreaths Our bruised arms hung up for monuments Our stern alarums changed to merry meetings Our dreadful marches to delightful measures Grim visaged war hath smooth d his wrinkled front And now instead of mounting barded steeds To fright the souls of fearful adversaries He capers nimbly in a lady s chamber To the lascivious pleasing of a lute But I that am not shaped for sportive tricks Nor made to court an amorous looking glass I that am rudely stamp d and want love s majesty To strut before a wanton ambling nymph I that am curtail d of this fair proportion
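The output above shows lour'd turning into "lour d", which suggests that each punctuation mark is replaced with a space rather than deleted outright. The following standard-library sketch imitates that behaviour. The strip_punctuation helper is hypothetical and only covers ASCII punctuation; it is not textacy's implementation, which may handle a wider set of characters.

```python
import string


def strip_punctuation(text: str) -> str:
    """Replace each ASCII punctuation character with a space (illustrative helper)."""
    # Build a translation table mapping every punctuation character to " ".
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(table)


print(strip_punctuation("that lour'd upon our house; Grim-visaged war"))
```

Replacing with a space instead of an empty string keeps hyphenated words such as Grim-visaged from being fused together, at the cost of leaving extra spaces behind.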
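Removing unnecessary whitespace
We can also remove unnecessary whitespace from text. Every run of extra spaces is collapsed so that words are separated by a single space.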
Python3
from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
print(rm_wsp)
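Here we use the whitespace function from the normalize submodule to remove the extra whitespace. In the output, all the extra whitespace has been removed, but the punctuation is still present. To remove the punctuation as well, we can combine the two operations.
Output: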
Now is the winter of our discontent Made glorious summer by this sun of York; And all the clouds that lour'd upon our house In the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings, Our dreadful marches to delightful measures. Grim-visaged war hath smooth'd his wrinkled front; And now, instead of mounting barded steeds To fright the souls of fearful adversaries, He capers nimbly in a lady's chamber To the lascivious pleasing of a lute. But I, that am not shaped for sportive tricks, Nor made to court an amorous looking-glass; I, that am rudely stamp'd, and want love's majesty To strut before a wanton ambling nymph; I, that am curtail'd of this fair proportion,
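As a rough picture of what whitespace normalization involves, the sketch below collapses runs of spaces and tabs to one space, collapses runs of line breaks to one newline, and trims both ends. The collapse_whitespace helper is an assumption written for illustration, not textacy's code. Keeping single line breaks is consistent with the \n characters that appear in the keyword-in-context output later in this article.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse whitespace runs and trim the ends (illustrative helper)."""
    # Runs of spaces and tabs become a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Any run of whitespace that contains a line break becomes one newline.
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


print(repr(collapse_whitespace("  Now   is the winter \n\n  of our   discontent  ")))
```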
Removing punctuation and whitespace together
Python3
from textacy import preprocessing
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
# Remove Punctuation and Whitespace both
rm_all = preprocessing.normalize.whitespace(rm_punc)
print(rm_all)
Output:
Now is the winter of our discontent Made glorious summer by this sun of York And all the clouds that lour d upon our house In the deep bosom of the ocean buried Now are our brows bound with victorious wreaths Our bruised arms hung up for monuments Our stern alarums changed to merry meetings Our dreadful marches to delightful measures Grim visaged war hath smooth d his wrinkled front And now instead of mounting barded steeds To fright the souls of fearful adversaries He capers nimbly in a lady s chamber To the lascivious pleasing of a lute But I that am not shaped for sportive tricks Nor made to court an amorous looking glass I that am rudely stamp d and want love s majesty To strut before a wanton ambling nymph I that am curtail d of this fair proportion
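Chaining the two calls by hand works, but with more steps it can be tidier to compose them into a single function. The sketch below shows the idea with a hypothetical compose helper and two stand-in cleaning steps written for this example. As far as I know, textacy also ships preprocessing.make_pipeline for the same purpose, so check the documentation of your installed version before rolling your own.

```python
from functools import reduce
from typing import Callable

TextFunc = Callable[[str], str]


def compose(*funcs: TextFunc) -> TextFunc:
    """Return a function that applies funcs left to right (hypothetical helper)."""
    def run(text: str) -> str:
        return reduce(lambda acc, func: func(acc), funcs, text)
    return run


# Stand-in steps for demonstration only; in practice pass the textacy functions.
def drop_commas(text: str) -> str:
    return text.replace(",", " ")


def squeeze_spaces(text: str) -> str:
    return " ".join(text.split())


clean = compose(drop_commas, squeeze_spaces)
print(clean("But I,  that am not shaped,  for sportive tricks"))
```

Order matters here: removing punctuation first leaves extra spaces behind, which the whitespace step then cleans up.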
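Partitioning the text
Sometimes the text we receive or work with is "raw", meaning unstructured, messy, and so on. Before analysis, during the preprocessing stage, we may need to clean it up and partition it according to some criterion.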
Python3
from textacy import preprocessing
from textacy import extract
ex = """
Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,
"""
# Remove Punctuation
rm_punc = preprocessing.remove.punctuation(ex)
# Remove Whitespace
rm_wsp = preprocessing.normalize.whitespace(ex)
# Remove Punctuation and Whitespace both
rm_all = preprocessing.normalize.whitespace(rm_punc)
# Extracting text
ext = list(extract.keyword_in_context(
rm_all, 'I', window_width=20, pad_context=True))
print(ext)
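The output looks cluttered because this text is not well suited to the example: the keyword 'I' is matched case-insensitively anywhere in the text, so every letter i inside a word counts as a match, not only the pronoun. Because punctuation and extra whitespace were removed beforehand, the contexts contain no punctuation or runs of spaces, although line breaks (\n) remain. The spaces at the edges of some contexts are padding added by pad_context=True to fill each side out to window_width characters.
Output: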
[(' Now ', 'i', 's the winter of our '), (' Now is the w', 'i', 'nter of our disconte'), (' the winter of our d', 'i', 'scontent\nMade glorio'), ('discontent\nMade glor', 'i', 'ous summer by this s'), ('lorious summer by th', 'i', 's sun of York \nAnd a'), ('ur d upon our house\n', 'I', 'n the deep bosom of '), ('som of the ocean bur', 'i', 'ed \nNow are our brow'), ('re our brows bound w', 'i', 'th victorious wreath'), ('r brows bound with v', 'i', 'ctorious wreaths \nOu'), ('ws bound with victor', 'i', 'ous wreaths \nOur bru'), ('ous wreaths \nOur bru', 'i', 'sed arms hung up for'), ('hanged to merry meet', 'i', 'ngs \nOur dreadful ma'), ('adful marches to del', 'i', 'ghtful measures \nGri'), ('ightful measures \nGr', 'i', 'm visaged war hath s'), ('ful measures \nGrim v', 'i', 'saged war hath smoot'), (' war hath smooth d h', 'i', 's wrinkled front \nAn'), ('hath smooth d his wr', 'i', 'nkled front \nAnd now'), ('kled front \nAnd now ', 'i', 'nstead of mounting b'), ('now instead of mount', 'i', 'ng barded steeds\nTo '), (' barded steeds\nTo fr', 'i', 'ght the souls of fea'), (' of fearful adversar', 'i', 'es \nHe capers nimbly'), ('rsaries \nHe capers n', 'i', 'mbly in a lady s cha'), ('s \nHe capers nimbly ', 'i', 'n a lady s chamber\nT'), (' chamber\nTo the lasc', 'i', 'vious pleasing of a '), ('hamber\nTo the lasciv', 'i', 'ous pleasing of a lu'), ('the lascivious pleas', 'i', 'ng of a lute \nBut I '), ('sing of a lute \nBut ', 'I', ' that am not shaped '), ('not shaped for sport', 'i', 've tricks \nNor made '), ('aped for sportive tr', 'i', 'cks \nNor made to cou'), ('ourt an amorous look', 'i', 'ng glass \nI that am '), ('rous looking glass \n', 'I', ' that am rudely stam'), ('before a wanton ambl', 'i', 'ng nymph \nI that am '), ('nton ambling nymph \n', 'I', ' that am curtail d o'), ('mph \nI that am curta', 'i', 'l d of this fair pro'), ('t am curtail d of th', 'i', 's fair proportion '), ('curtail d of this fa', 'i', 'r proportion '), ('of this fair proport', 'i', 'on ')]
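The following shows part of the result when punctuation and whitespace are not removed beforehand. The full output is omitted because it is very long, and with all the punctuation and extra whitespace left in, it looks cluttered.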
[(' \nNow ', 'i', 's the winter of our '), (' \nNow is the w', 'i', 'nter of our dis'), ('winter of our d', 'i', 'scontent\nMade glorio'), ('discontent\nMade glor', 'i', 'ous summer by this s'), ('lorious summer by th', 'i', 's sun of York;\nAnd a'), ("ur'd upon our house\n", 'I', 'n the deep b'), ('som of the ocean bur', 'i', 'ed.\nNow are our brow').......]
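The many matches inside words in the output above are what a case-insensitive search for a bare letter produces. The standard-library sketch below reproduces the keyword-in-context idea with a hypothetical kwic helper and shows how a word-boundary pattern narrows the matches to the pronoun. With textacy itself, a similar effect may be achievable by passing a regular-expression keyword and disabling case-insensitive matching, assuming the installed version supports both.

```python
import re
from typing import Iterator, Tuple


def kwic(text: str, pattern: str, width: int = 20,
         flags: int = 0) -> Iterator[Tuple[str, str, str]]:
    """Yield (left context, match, right context) for each regex match."""
    for match in re.finditer(pattern, text, flags):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        yield left, match.group(), right


line = "But I that am not shaped for sportive tricks"
# Case-insensitive bare letter: also matches the "i" in "sportive" and "tricks".
print(len(list(kwic(line, "I", flags=re.IGNORECASE))))
# Word boundaries plus case sensitivity: matches only the pronoun.
print(list(kwic(line, r"\bI\b")))
```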
Replacing URLs in text with other text
We can replace any unwanted URLs in the text with a string of our choice:
Python3
from textacy import preprocessing
# Replace URLs
txt = "https://www.geeksforgeeks.org/ is the best place to learn anything"
rm_url = preprocessing.replace.urls(txt,"GeeksforGeeks")
print(rm_url)
Output: the printed sentence should have the URL replaced with GeeksforGeeks.
Replacing email addresses with other text
Python3
from textacy import preprocessing
# Replace Emails
mail = "Send me a mail in the following address - example@gmail.com"
rm_mail = preprocessing.replace.emails(mail,"UserMail")
print(rm_mail)
Output: the printed sentence should have the email address replaced with UserMail.
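Conceptually, both replacements are pattern substitutions. The sketch below shows the idea with two deliberately simplified regular expressions and a hypothetical mask helper. The patterns and placeholder names are assumptions for illustration; real URLs and addresses are messier, which is a good reason to rely on the library functions in practice.

```python
import re

# Deliberately simplified patterns (assumptions); real-world inputs need more care.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask(text: str, url_repl: str = "_URL_", email_repl: str = "_EMAIL_") -> str:
    """Replace URLs and email addresses with placeholder strings."""
    text = URL_RE.sub(url_repl, text)
    return EMAIL_RE.sub(email_repl, text)


print(mask("Write to example@gmail.com or visit https://www.geeksforgeeks.org/ today"))
```

Replacing such tokens with a fixed placeholder keeps the sentence structure intact while removing values that would otherwise inflate the vocabulary.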
Replacing phone numbers
Python3
from textacy import preprocessing
# Replace phone number
num = "Call me at 12345678910"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
print(rm_num)
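Output: the printed sentence should have the phone number replaced with NUM.
If the text contains several phone numbers, all of them are replaced with NUM.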
Python3
from textacy import preprocessing
# Replace phone number
num = "Call me at 12345678910 or 7896451235"
rm_num = preprocessing.replace.phone_numbers(num,"NUM")
print(rm_num)
Output: both phone numbers should be replaced with NUM.
Replacing arbitrary numbers
Python3
from textacy import preprocessing
# Replace Number
n = "Any number like 12 or 86 , maybe 100 etc"
rm_n = preprocessing.replace.numbers(n,"Numbers")
print(rm_n)
Output: each number in the sentence should be replaced with Numbers.
Removing text enclosed in parentheses and brackets:
Python3
from textacy import preprocessing
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the ABC programming language, which was inspired by SETL,
capable of exception handling (from the start plus new capabilities in Python 3.11)"""
print(preprocessing.remove.brackets(txt))
Output:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica in the Netherlands as a successor to the ABC programming language, which was inspired by SETL, capable of exception handling
We can also pass a keyword argument named only, listing just the bracket types we want to remove. It supports three values: square, curly, and round.
Python3
from textacy import preprocessing
txt = """Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands
as a successor to the [ABC programming language], which was inspired by SETL,
capable of exception handling {from the start plus new capabilities in Python 3.11}"""
print(preprocessing.remove.brackets(txt,only=["round","square"]))
Output:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica in the Netherlands as a successor to the , which was inspired by SETL, capable of exception handling {from the start plus new capabilities in Python 3.11}
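The selective behaviour shown above can be pictured as one pattern per bracket type, with the only argument choosing which patterns to apply. The following standard-library sketch uses a hypothetical drop_brackets helper to illustrate this. It does not handle nested brackets and is not textacy's implementation.

```python
import re
from typing import Iterable, Optional

# One non-nested pattern per bracket type (simplification: nesting is not handled).
BRACKET_PATTERNS = {
    "round": re.compile(r"\([^()]*\)"),
    "square": re.compile(r"\[[^\[\]]*\]"),
    "curly": re.compile(r"\{[^{}]*\}"),
}


def drop_brackets(text: str, only: Optional[Iterable[str]] = None) -> str:
    """Delete bracketed spans, optionally limited to the bracket types in `only`."""
    kinds = list(only) if only is not None else list(BRACKET_PATTERNS)
    for kind in kinds:
        text = BRACKET_PATTERNS[kind].sub("", text)
    return text


print(drop_brackets("Informatica (CWI) and [ABC] and {3.11}", only=["round", "square"]))
```

Deleting a bracketed span leaves the surrounding spaces in place, so a whitespace normalization step is a natural follow-up.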
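Note: This article was curated and translated by 純淨天空 from the original English article "TextaCy module in Python" by dwaipayan_bandyopadhyay. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.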