How to remove Chinese punctuation in Python

I have sentences like the following, and I want to remove all punctuation from them.

首页 » 政策法规 » 正文吉林省实施《中华人民共和国老年人权益保障法》若干规定   发布时间: 2008-01-04              

I want to remove all Chinese punctuation, including spaces (" "). Below is my code:

line = line.decode("utf8")
line = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*():;《)《》“”()»〔〕-]+".decode("utf8"), "".decode("utf8"),line)

However, the spaces are still not removed. Is there an easier way to remove Chinese punctuation?

Engobe answered 15/4, 2016 at 7:15 Comment(2)
I wanted to add another sentence, 想做/ 兼_职/学生_/ 的 、加,我Q: 1 5. 8 0. !!?? 8 6 。0. 2。 3 有,惊,喜,哦, to my question, but I cannot post it. Engobe
Well, I think your example is enough, and that sentence could be flagged as spam, so don't add it to the question. Spillman
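One possible reason the spaces survive, offered as a guess rather than a diagnosis: in Python 2, `\s` in a pattern used without `re.UNICODE` matches only ASCII whitespace, so a no-break space (U+00A0) or an ideographic space (U+3000) stays in place. In Python 3, `\s` in a `str` pattern matches Unicode whitespace by default. The sketch below is Python 3, and the sample string is made up for illustration.

```python
import re

# Hypothetical sample: an ASCII space, a no-break space (U+00A0)
# and an ideographic space (U+3000) between the words.
sample = "首页 »\u00a0政策法规\u3000正文"

# With re.ASCII, \s matches only [ \t\n\r\f\v], which mimics Python 2
# without re.UNICODE; the two non-ASCII spaces are left behind.
ascii_only = re.sub(r"\s+", "", sample, flags=re.ASCII)

# Default Python 3 behaviour for str patterns: \s matches Unicode whitespace.
unicode_aware = re.sub(r"\s+", "", sample)

print(repr(ascii_only))
print(repr(unicode_aware))
```

If this is what is happening, passing `flags=re.UNICODE` to `re.sub` in Python 2 should make `\s` cover those characters as well.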
D
9

Because most Chinese punctuation marks are non-ASCII characters, in Python 2 we have to convert the byte string to unicode before we can remove them.

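#!/usr/bin/env python2
# -*- coding: utf-8 -*-


import re

# A unicode literal; the inner double quote and the backslash are escaped
# so the string is not cut short after the first two characters.
punc = u"!?。。\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
line = "测试。。去除标点。。,、!"
# re.escape keeps ], ^, - and \ from being read as character-class syntax.
pattern = u"[%s]+" % re.escape(punc)
print re.sub(pattern, u"", line.decode("utf-8"))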
Dialogue answered 24/5, 2019 at 7:28 Comment(2)
That's good! Just remember to replace ur with r in Python 3. A similar strategy works in pandas: comments['chinese_review'].str.replace(r"[%s]+" % punc, "", regex=True).astype(str) Innermost
And in Python 3 there is no need to call .decode() either. Syringa
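Following the two comments above, a Python 3 version of the answer might look like the sketch below. The names `PUNC`, `PUNC_RE` and `strip_punctuation` are illustrative, not from the answer, and adding `\s` to the character class is an assumption based on the question's wish to drop spaces as well.

```python
import re

# Punctuation set copied from the answer; extend it to suit your data.
PUNC = "!?。。\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."

# re.escape stops ], ^, - and \ from acting as character-class syntax.
# \s is added (an assumption about the goal) so whitespace is removed too.
PUNC_RE = re.compile(r"[%s\s]+" % re.escape(PUNC))


def strip_punctuation(text):
    """Return text with every character in PUNC, and all whitespace, removed."""
    return PUNC_RE.sub("", text)


print(strip_punctuation("测试。。去除 标点、!"))
```

No `.decode()` call is needed here because Python 3 string literals are already Unicode `str` objects.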
F
-1

The signature of re.sub is sub(pattern, repl, string, count=0, flags=0).

In your code, pattern is unicode and repl is unicode too (decoding the empty string is not actually needed),

but string is a UTF-8 encoded byte string, not unicode.

Try this:

print re.sub(ur"[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*():;《)《》“”()»〔〕-]+", "", s.decode("utf8"))
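The same type-mismatch point can be seen more directly in Python 3, where mixing a `str` pattern with `bytes` input raises an error instead of silently failing to match. This is a small sketch with made-up variable names (`raw`, `mismatch`, `cleaned`), not part of the original code.

```python
import re

# bytes, for example as read from a file opened in binary mode
raw = "发布时间: 2008-01-04".encode("utf-8")

mismatch = None
try:
    re.sub(r"[:\s]+", "", raw)  # str pattern applied to bytes input
except TypeError as exc:
    mismatch = exc  # Python 3 refuses to mix str patterns and bytes

# Decode first, then substitute on the resulting str.
text = raw.decode("utf-8")
cleaned = re.sub(r"[:\s]+", "", text)

print(type(mismatch).__name__)
print(cleaned)
```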
Flesher answered 15/4, 2016 at 8:38 Comment(0)
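On the question of an easier way: one alternative that avoids maintaining a punctuation list by hand is to filter on Unicode general categories with the standard `unicodedata` module. The sketch below is Python 3; the function name and the `drop_symbols` flag are hypothetical. Note that this is broader than the regex approach: it also removes ASCII punctuation such as the hyphens in a date, and characters like `$`, `+` or `¥` count as symbols rather than punctuation, so they are only removed when `drop_symbols=True`.

```python
import unicodedata


def strip_punct_and_space(text, drop_symbols=False):
    """Remove whitespace plus characters whose Unicode general category is
    punctuation (P*) or separator (Z*); optionally also symbols (S*)."""
    prefixes = ("P", "Z", "S") if drop_symbols else ("P", "Z")
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith(prefixes)
    )


sample = "首页 » 政策法规 » 正文吉林省实施《中华人民共和国老年人权益保障法》若干规定   发布时间: 2008-01-04"
print(strip_punct_and_space(sample))
```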
