I am trying to segregate documents into different folders based on whether some keywords (keyword1 and keyword2) occur in the text present in the document or not. I am using regex for this purpose.
Case 1: If keyword1 occurs, create a folder named keyword1 and store that document in it.
Case 2: If keyword2 occurs, create a folder named keyword2 and store that document in it.
Case 3: If neither keyword occurs, create an unknown folder and store those documents in it.
The logic works for the first 2 cases but not for the last one: even when neither keyword appears, the documents are stored in the keyword2 folder.
Below is my Python implementation:
keyword = "keyword1"
for k, text_list in text_dict.items():
    file_name1 = k.split('.')[0]
    match = re.search(r"keyword1", text_list, flags = re.DOTALL|re.IGNORECASE)
    if match:
        print(f"The keyword '{keyword}' is present in the text.--->", k)
        os.makedirs('keyword1', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword1', file_name1), dirs_exist_ok=True)
    elif not match:
        print(f"The keyword '{keyword2}' is present in the text.--->", k)
        os.makedirs('keyword2', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword2', file_name1), dirs_exist_ok=True)
    else:
        print(f"The keywords '{keyword1}' and {keyword2} are not present in the text.--->", k)
        os.makedirs('unknown', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('unknown', file_name1), dirs_exist_ok=True)
text_dict is a dictionary where the keys are the filenames and the values are the text present in each file. The loop iterates through the dictionary and searches for the keywords in the values. If a keyword is found, i.e. match is truthy, it creates a folder of that name and stores the file in it.
The issue is in the last else: if neither of the keywords is found, an unknown folder should be created and those files should be stored in it. Instead, those files are being stored in the keyword2 folder.
Any help is appreciated!
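Your current logic cannot work because, as pointed out in the comments, there are only 2 possible cases: match is either truthy or falsy. The if match and elif not match branches together cover both, so the final else branch is never reached. Note also that only keyword1 is ever searched for, so every document without keyword1 ends up in the keyword2 folder.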
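To see why, here is a minimal sketch that mirrors the branch structure from the question without touching the file system. The helper name branch_taken and the sample strings are made up for illustration; only the if / elif not / else shape comes from the original code.

```python
import re

def branch_taken(text):
    """Hypothetical helper: report which branch the question's logic would take."""
    match = re.search(r"keyword1", text, flags=re.IGNORECASE)
    if match:
        return "keyword1"
    elif not match:
        # Taken for every text without keyword1, whether or not keyword2 is present.
        return "keyword2"
    else:
        # Unreachable: match is always either truthy or falsy.
        return "unknown"

samples = ["has keyword1", "has keyword2", "has neither"]
results = {text: branch_taken(text) for text in samples}
for text, folder in results.items():
    print(text, "->", folder)
```

No input can ever produce "unknown" here, which matches the behaviour described in the question.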
What you can do is create 2 regex searches, one for keyword1 and one for keyword2. Then you can check which of the conditions is satisfied.
Try this:
import os
import re
import shutil

keyword1 = "keyword1"
keyword2 = "keyword2"

# text_dict is the filename -> text dictionary from the question.
for k, text_list in text_dict.items():
    file_name1 = k.split('.')[0]
    # Search for each keyword separately so the cases can be told apart.
    match1 = re.search(r"keyword1", text_list, flags = re.DOTALL|re.IGNORECASE)
    match2 = re.search(r"keyword2", text_list, flags = re.DOTALL|re.IGNORECASE)
    if (match1) and (not match2):
        print(f"The keyword '{keyword1}' is present in the text.--->", k)
        os.makedirs('keyword1', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword1', file_name1), dirs_exist_ok=True)
    elif (not match1) and (match2):
        print(f"The keyword '{keyword2}' is present in the text.--->", k)
        os.makedirs('keyword2', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('keyword2', file_name1), dirs_exist_ok=True)
    elif (not match1) and (not match2):
        print(f"The keywords '{keyword1}' and '{keyword2}' are not present in the text.--->", k)
        os.makedirs('unknown', exist_ok = True)
        shutil.copytree(os.path.join('imgs', file_name1), os.path.join('unknown', file_name1), dirs_exist_ok=True)
    else:
        # Both keywords are present; decide how you want to handle this case.
        print('whatever')
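As a variation on the loop above, the folder decision could be pulled out into a small function so that the copy step is written only once. This is only a sketch: the name pick_folder, its parameters, and the sample dictionary are hypothetical, and it assumes that when both keywords are present the first one in the list should win, which the question does not specify.

```python
import re

def pick_folder(text, keywords=("keyword1", "keyword2"), default="unknown"):
    """Hypothetical helper: return the destination folder name for a document's text."""
    for kw in keywords:
        # re.escape treats the keyword as a literal string, not a pattern.
        if re.search(re.escape(kw), text, flags=re.IGNORECASE):
            return kw  # assumption: the first matching keyword wins
    return default

# Made-up sample data in the same shape as text_dict (filename -> text).
sample_text_dict = {
    "a.pdf": "This mentions Keyword1 once.",
    "b.pdf": "Only KEYWORD2 shows up here.",
    "c.pdf": "Nothing relevant in this one.",
}
folders = {name: pick_folder(text) for name, text in sample_text_dict.items()}
for name, folder in folders.items():
    print(name, "->", folder)
```

Inside the real loop, the returned name would replace the hard-coded folder in the os.makedirs and shutil.copytree calls, so the three near-identical branches collapse into one.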
Cheers!