I am working on a Machine Learning Project which filters spam/phishing emails out of all emails. For this, I am using the SpamAssassin dataset. The dataset contains different mails in this format:
For identifying phishing emails, first thing I have to do is finding out how many web-links the email has. For doing that, I have written the following code:
wordsInLine = []
tempWord = []
urlList = []
base_dir = "C:/Users/keert/Downloads/Spam_Assassin/spam"
def count():
flag = 0
print("Reading all file names in sorted order")
for filename in sorted(os.listdir(base_dir)):
file=open(os.path.join(base_dir, filename))
count1 = 0
for line in file:
wordsInLine = line.split(' ')
for word in wordsInLine:
if re.search('href="http',word,re.I):
count1=count1+1
file.close()
urlList.append(count1)
if flag!=0:
print("File Name = " + filename)
print ("Number of links = ",count1)
flag = flag + 1
count()
final = urlList[1:]
print("List of number of links in each email")
print(final)
with open('count_links.csv', 'wb') as myfile:
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
for val in final:
wr.writerow([val])
print("CSV file generated")
But this code is giving me an error saying that: 'charmap' codec can't decode byte 0x81 in position 3124: character maps to
I have even tried opening the file by adding encoding = 'utf8'
option. But still, the clash remains and I got an error like: 'utf-8' codec can't decode byte 0x81 in position 3124: character maps to
I guess this is due to the special characters that are in the file. Is there any way to deal with this because I can not skip the special characters as they are also important. Please suggest me a way for doing this. Thank you in advance
You have to open and read the file using the same encoding that was used to write the file. In this case, that might be a bit difficult, since you are dealing with e-mails and they can be in any encoding, dependent on the sender. In the example file you showed, the message is encoded using 'iso-8859-1' encoding.
However, e-mails are a bit strange, since they consist of a header (which is in ASCII format as far as I know), followed by an empty line and the body. The body is encoded in the encoding that was specified in the header. So two different encodings could be used in the same file!
If you're sure that all the e-mails use iso-8859-1 encoding and you're looking for a quick-and-dirty solution, then you could also just open the file using 'iso-8859-1' encoding, since e-mail headers are compatible with iso-8859-1. However, be prepared that you will have to deal with other e-mail formatting/encoding/escaping issues as well, or your script might not work completely as expected.
I think the best solution would be to look for a Python module that can handle e-mails, so it will deal with all the decoding stuff and you don't have to worry about that. It will also solve other problems such as escape characters and line breaks.
I don't have experience with this myself, but it seems that Python has built-in support for parsing e-mails using the e-mail package. I recommend to take a look at that.