I want to keep only Arabic characters, with no numbers. I got this regex from GitHub.
    import os
    import re

    generalPath = "C:/Users/Desktop/Code/dataset/"
    outputPath = "C:/Users/Desktop/Code/output/"
    files = os.listdir(generalPath)
    for onefile in files:
        # relative or absolute file path, e.g.:
        localPath = generalPath + onefile
        localOutputPath = outputPath + onefile
        print(localPath)
        print(localOutputPath)
        with open(localPath, 'rb') as infile, open(localOutputPath, 'w') as outfile:
            data = infile.read().decode('utf-8')
            new_data = t = re.sub(r'[^0-9\u0600-\u06ff\u0750-\u077f\ufb50-\ufbc1\ufbd3-\ufd3f\ufd50-\ufd8f\ufd50-\ufd8f\ufe70-\ufefc\uFDF0-\uFDFD]+', ' ', data)
            outfile.write(new_data)
With this code I get this error:
    Traceback (most recent call last):
      File ".\cleanText.py", line 23, in <module>
        outfile.write(new_data)
      File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
My Arabic text is diacritised and I want to keep it that way.
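The traceback comes from outfile.write, so the problem is on the writing side: you already decode the input as UTF-8 yourself, but the output file is opened without an encoding, so on Windows Python falls back to CP1252, which cannot represent Arabic characters. Specify UTF-8 when opening the output file, as shown below. Also, since the input is a text file, you can open it with 'r' and an encoding instead of reading with 'rb' and decoding manually.
    with open(localPath, 'r', encoding='utf-8') as infile, open(localOutputPath, 'w', encoding='utf-8') as outfile:
        data = infile.read()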
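To make the encoding fix concrete, here is a minimal sketch of a helper that reads and writes with an explicit UTF-8 encoding. The name rewrite_file and its transform parameter are hypothetical, introduced only for illustration; they are not part of the original script.

```python
def rewrite_file(src_path, dst_path, transform=lambda text: text):
    """Read src_path as UTF-8, apply transform, write the result as UTF-8.

    Hypothetical helper: passing encoding='utf-8' to both open() calls
    avoids relying on the platform default (CP1252 on many Windows setups).
    """
    with open(src_path, 'r', encoding='utf-8') as infile, \
         open(dst_path, 'w', encoding='utf-8') as outfile:
        outfile.write(transform(infile.read()))
```

In the loop from the question, this would be called as rewrite_file(localPath, localOutputPath, transform=...) with whatever cleaning function you settle on.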
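As for your regex, note that 0-9 sits inside the negated character class, so your pattern actually keeps digits rather than removing them. If you just want to remove numbers, you can use:
    data = re.sub(r'[0-9]+', '', data)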
You don't need to specify the whole Arabic alphabet as characters to keep, and removing only the unwanted characters leaves the diacritics untouched. But it looks like you have strings like "(1/6)". To get rid of all parentheses and slashes as well, use:
    data = re.sub(r'[0-9()/]+', '', data)
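As a rough sketch of the removal approach, the snippet below strips digits, parentheses and slashes while leaving the diacritics in place. It rests on one assumption that goes beyond the question: that the text may also contain Arabic-Indic digits (U+0660 to U+0669) or Extended Arabic-Indic digits (U+06F0 to U+06F9), which [0-9] alone does not match. The names UNWANTED and strip_digits and the sample string are made up for illustration.

```python
import re

# ASCII digits, Arabic-Indic digits, Extended Arabic-Indic digits,
# parentheses and slashes. The two Arabic digit ranges are an assumption
# about the data; drop them if only ASCII digits occur.
UNWANTED = re.compile(r'[0-9\u0660-\u0669\u06F0-\u06F9()/]+')

def strip_digits(text):
    """Remove digits, parentheses and slashes; everything else is kept."""
    return UNWANTED.sub('', text)

sample = 'بِسْمِ (1/6) اللَّهِ ١٢٣'
cleaned = strip_digits(sample)
print(cleaned)
```

Removing whole runs can leave doubled spaces behind where a token such as "(1/6)" used to be; if that matters, a follow-up re.sub(r' {2,}', ' ', cleaned) collapses them.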