Cleaning Text Data with Regular Expressions

Regular Expressions

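A regular expression is a text pattern made up of ordinary characters (for example, the letters a through z) and special characters (called "metacharacters").

A regular expression uses a single string to describe and match a whole set of strings that follow a given syntactic rule.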

Python's re module supports regular expressions and ships with many built-in functions for different purposes, such as searching, splitting, and replacing.
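As a quick illustration of those three operations, here is a minimal sketch using re.findall, re.split, and re.sub. The sample string and variable names are made up for this example.

```python
import re

# Hypothetical sample string, used only for illustration.
sample = "apple, banana;cherry  42 kiwi"

# Searching: collect every run of digits.
numbers = re.findall(r"\d+", sample)

# Splitting: break on one or more commas, semicolons, or whitespace characters.
parts = re.split(r"[,;\s]+", sample)

# Replacing: collapse each run of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", sample)

print(numbers)    # ['42']
print(parts)      # ['apple', 'banana', 'cherry', '42', 'kiwi']
print(collapsed)  # apple, banana;cherry 42 kiwi
```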

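Regular expressions are used so widely that they are well worth learning: data analysis and processing, as well as front-end and back-end development, all rely on them.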

Common Regular Expressions

Validating numbers

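Digits: ^[0-9]*$
n-digit number: ^\d{n}$
Number with at least n digits: ^\d{n,}$
Number with m to n digits: ^\d{m,n}$
Zero, or a number that does not start with zero: ^(0|[1-9][0-9]*)$
Number that does not start with zero, with at most two decimal places: ^[1-9][0-9]*(\.[0-9]{1,2})?$
Positive or negative number with 1 to 2 decimal places: ^(\-)?\d+(\.\d{1,2})$
Positive numbers, negative numbers, and decimals: ^(\-|\+)?\d+(\.\d+)?$
Non-negative real number, optionally followed by exactly two decimal places: ^[0-9]+(\.[0-9]{2})?$
Non-negative real number, optionally followed by 1 to 3 decimal places: ^[0-9]+(\.[0-9]{1,3})?$
Nonzero positive integer: ^[1-9]\d*$ or ^([1-9][0-9]*){1,3}$ or ^\+?[1-9][0-9]*$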
Nonzero negative integer: ^\-[1-9][0-9]*$ or ^-[1-9]\d*$
Non-negative integer: ^\d+$ or ^([1-9]\d*|0)$
Non-positive integer: ^(-[1-9]\d*|0)$ or ^((-\d+)|(0+))$
Non-negative floating-point number: ^\d+(\.\d+)?$ or ^([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$
Non-positive floating-point number: ^((-\d+(\.\d+)?)|(0+(\.0+)?))$ or ^(-([1-9]\d*\.\d*|0\.\d*[1-9]\d*)|0?\.0+|0)$
Positive floating-point number: ^([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$ or ^(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*))$
Negative floating-point number: ^-([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$ or ^(-(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*)))$
Floating-point number: ^(-?\d+)(\.\d+)?$ or ^-?([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$
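To try patterns like these in Python, a minimal sketch along the following lines may help. The dictionary keys and the is_valid helper are hypothetical names introduced for illustration. The non-negative integer pattern is written in its grouped form, ^([1-9]\d*|0)$, because an ungrouped | splits the whole pattern, anchors included, into two alternatives.

```python
import re

# A few number patterns, keyed by hypothetical names.
PATTERNS = {
    "digits": r"^[0-9]*$",
    "non_negative_int": r"^([1-9]\d*|0)$",
    "signed_decimal": r"^(\-|\+)?\d+(\.\d+)?$",
}

def is_valid(kind, value):
    """Return True if the whole of value matches the named pattern."""
    return re.fullmatch(PATTERNS[kind], value) is not None

print(is_valid("non_negative_int", "120"))   # True
print(is_valid("non_negative_int", "012"))   # False: leading zero
print(is_valid("signed_decimal", "-3.14"))   # True
print(is_valid("signed_decimal", "3."))      # False: no digits after the dot

# Without the group, the pattern means "^[1-9]\d*" OR "0$",
# so any string that merely ends in 0 is accepted by re.search.
ungrouped_hit = re.search(r"^[1-9]\d*|0$", "abc0") is not None
print(ungrouped_hit)                         # True
```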

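Validating strings

Chinese characters: ^[\u4e00-\u9fa5]{0,}$
English letters and digits: ^[A-Za-z0-9]+$ or ^[A-Za-z0-9]{4,40}$
Any characters, 3 to 20 in length: ^.{3,20}$
String made up of the 26 English letters: ^[A-Za-z]+$
String made up of the 26 uppercase English letters: ^[A-Z]+$
String made up of the 26 lowercase English letters: ^[a-z]+$
String made up of digits and the 26 English letters: ^[A-Za-z0-9]+$
String made up of digits, the 26 English letters, or underscores: ^\w+$ or ^\w{3,20}$
Chinese characters, English letters, and digits, including underscores: ^[\u4E00-\u9FA5A-Za-z0-9_]+$
Chinese characters, English letters, and digits, excluding underscores and other symbols: ^[\u4E00-\u9FA5A-Za-z0-9]+$ or ^[\u4E00-\u9FA5A-Za-z0-9]{2,20}$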
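Strings that contain none of the characters %&',;=?$\": [^%&',;=?$\x22]+
Strings that contain no ~ (this pattern also excludes the double quote, \x22): [^~\x22]+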
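One thing to watch when using these string patterns in Python 3: for str patterns, \w is Unicode-aware by default, so ^\w+$ also accepts Chinese characters unless the re.ASCII flag is passed. The sketch below shows the difference. The variable names are made up for illustration, and the Chinese-only pattern uses + rather than {0,} so that it rejects the empty string.

```python
import re

chinese_only = re.compile(r"^[\u4e00-\u9fa5]+$")
word_unicode = re.compile(r"^\w+$")            # default: \w matches Unicode word characters
word_ascii = re.compile(r"^\w+$", re.ASCII)    # re.ASCII restricts \w to [A-Za-z0-9_]

print(bool(chinese_only.match("正则表达式")))   # True
print(bool(chinese_only.match("正则regex")))    # False: contains Latin letters
print(bool(word_unicode.match("用户_01")))      # True: Chinese characters count as \w
print(bool(word_ascii.match("用户_01")))        # False under re.ASCII
print(bool(word_ascii.match("user_01")))        # True
```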

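Cleaning Text with Regular Expressions

How it works

The principle behind cleaning text data with regular expressions is simple: run a regex match against a string, match the characters you do not want, and replace them with a space or some other content. The natural tool for this is the sub function in the re library.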
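The match-and-replace idea can be sketched as a small pipeline of re.sub calls applied in order. The rules, the sample sentence, and the clean helper below are assumptions made for illustration, not a fixed recipe.

```python
import re

# Hypothetical cleaning rules: each pair is (pattern, replacement), applied in order.
RULES = [
    (re.compile(r"https?://\S+"), ""),   # drop URLs
    (re.compile(r"[^\w\s]"), " "),       # replace punctuation with a space
    (re.compile(r"\s+"), " "),           # collapse repeated whitespace
]

def clean(text):
    """Apply every rule with sub(), then trim leading and trailing whitespace."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(clean("Great post!!!  See https://example.com/a?b=1 for more..."))
# Great post See for more
```

Order matters here: the URL rule runs first, because once punctuation has been replaced the URL would no longer be recognizable.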

Code tests

Extracting text from HTML tags

import re
text = "<div><p>Python是一种跨平台的计算机程序设计语言。 </p><p><br></p><p>是一个高层次的结合了解释性、编译性、互动性和面向对象的脚本语言。&nbsp;</p><br><a>快来学习Python吧！</a></div>"
# Remove every tag (non-greedy <...>), every &nbsp; entity, and every newline.
result = re.sub(r"<.*?>|&nbsp;|\n", "", text)
print(result)

Output

Python是一种跨平台的计算机程序设计语言。 是一个高层次的结合了解释性、编译性、互动性和面向对象的脚本语言。快来学习Python吧！

Removing non-letter symbols from English text

import re
text = "Loveliest of trees, the cherry now Is hung with bloom along the bough, And stands about the woodland ride Wearing white for Eastertide. Now, of my threescore years and ten,Twenty will not come again,And take from seventy springs a score,It only leaves me fifty more.And since to look at things in bloomFifty springs are little room,About the woodlands I will goTo see the cherry hung with snow."
# Replace every character that is not an ASCII letter with a space.
result = re.sub(r"[^A-Za-z]", " ", text)
print(result)

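Output

The result contains only letters and spaces: each punctuation mark has been replaced by a space, so a comma or period that was followed by a space leaves two spaces in a row.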

Complete cleaning code

# -*- coding: utf-8 -*-
# @Time : 2022/5/1 11:52
# @Author : MinChess
# @File : 正则表达式.py
# @Software: PyCharm
import re

def textParse(str_doc):
    """Remove ASCII letters, digits, the listed punctuation, and all spaces."""
    # Character class of everything to strip: letters, digits, and common
    # ASCII and Chinese punctuation marks.
    r1 = '[a-zA-Z%0-9’!"#$&\'()*+,-./:：;；|<=>?@—★、…【】《》？“”‘’！[\\]^_`{|}~]+'
    # Replace each matched run with a space, then drop every space.
    str_doc = re.sub(r1, ' ', str_doc)
    str_doc = str_doc.replace(' ', '')
    return str_doc


def readFile(path):
    """Read the whole file at path as UTF-8 text."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

if __name__ == '__main__':
    path = r'评论清洗2.txt'
    str_doc = readFile(path)
    cleaned_text = textParse(str_doc)
    print(cleaned_text)

    # Append the cleaned text to the result file.
    with open('清洗结果2.txt', 'a', encoding='utf-8') as file_handle:
        file_handle.write(cleaned_text)
        file_handle.write('\n')
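The script above removes a fixed list of unwanted characters. A possible alternative, sketched here under the assumption that only Chinese characters should survive, is to remove everything outside the \u4e00-\u9fa5 range instead. The keep_chinese helper and the sample review are hypothetical.

```python
import re

# Assumption: "clean" means keeping only characters in the \u4e00-\u9fa5 range.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")

def keep_chinese(text):
    """Remove every run of characters that falls outside the Chinese range."""
    return NON_CHINESE.sub("", text)

# Hypothetical review text mixing Chinese, English, digits, and symbols.
sample = "这家店的服务very good！！！评分：5星★★★★★"
print(keep_chinese(sample))  # 这家店的服务评分星
```

The trade-off is that this version also drops Chinese punctuation such as 。 and ，, while the list-based version keeps any character it does not explicitly name.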
