I'm trying to compare two files and get the difference using a function.
The first file (engwrds.txt) contains English words, one after the other, and the second file (ws.txt) contains web-scraped text. I want to compare the two files, remove the matching words from ws.txt, and write them to a different file.
The web-scraped file contains both words and full sentences, whereas in the other file the words are placed one after the other.
I tried the following code but it creates a blank output file.
with open('ws.txt', 'r', encoding='utf-8') as file1:
    with open('engwrds.txt', 'r', encoding='utf-8') as file2:
        same = set(file1).intersection(file2)
with open('output_file.txt', 'w', encoding='utf-8') as file_out:
    for line in same:
        file_out.write(line)
Then I tried this one, which doesn't print any output at all.
import fileinput
from pathlib import Path
with open('engwrds.txt', 'r', encoding='utf-8') as fin:
    exclude = set(line.rstrip() for line in fin)
with fileinput.input('ws.txt', inplace=True) as f:
    for line in f:
        if not exclude.intersection(Path(line.rstrip()).parts):
            print(line, end='')
The following code also doesn't print any output.
with open('op11-Copy1.txt', 'r') as file1:
    with open('commonwords.txt', 'r') as file2:
        dif = set(file1).difference(file2)
with open('diff.txt', 'w') as file_out:
    for line in dif:
        file_out.write(line)
Can you please explain the mistakes I'm making here? I have looked at several similar examples, but I can't figure out the issue. Ideally, I want to come up with a function that achieves this task.
This is what the ws.txt file looks like.
Just read each file into its own variable and compare them. For example:
Suppose that the file ws.txt (scraped file) contains:
your world is beautiful
And the file engwrds.txt contains these words (one after the other):
while
world
want
wild
Open each one in a different variable:
with open('engwrds.txt', 'r', encoding='utf-8') as file:
    engwrds = file.read()
with open('ws.txt', 'r', encoding='utf-8') as file:
    ws = file.read()
From here engwrds and ws are strings, so you can compare them in many different ways:
differences = set(engwrds.split()).symmetric_difference(set(ws.split()))
Output: {'beautiful', 'is', 'want', 'while', 'wild', 'your'}
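For contrast, here is a small sketch of why the line-based attempts in the question are likely to come back empty. Iterating over a file object yields whole lines, trailing newline included, so a sentence from ws.txt never equals a single word from engwrds.txt. The sample contents are the ones used in this answer, engwrds.txt is assumed to hold one word per line, and io.StringIO stands in for the real files only to keep the example self-contained.

```python
import io

# In-memory stand-ins for the two files (assumed contents, taken from the example above).
ws_file = io.StringIO("your world is beautiful\n")
engwrds_file = io.StringIO("while\nworld\nwant\nwild\n")

# Iterating over a file yields whole lines, newline included.
ws_lines = set(ws_file)          # {'your world is beautiful\n'}
eng_lines = set(engwrds_file)    # {'while\n', 'world\n', 'want\n', 'wild\n'}

# No line of ws.txt is identical to a line of engwrds.txt, so the result is empty.
line_matches = ws_lines.intersection(eng_lines)
print(line_matches)              # set()

# Splitting the text into words first gives the sets something comparable.
ws_words = set("your world is beautiful\n".split())
eng_words = set("while\nworld\nwant\nwild\n".split())
word_matches = ws_words.intersection(eng_words)
print(word_matches)              # {'world'}
```

The same reasoning applies to difference(): compared line by line, every sentence counts as "different", so the comparison has to happen on words rather than on lines.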
Obviously, this comparison only works if your words are separated by whitespace; punctuation attached to a word (such as "world,") is not stripped by split(). Still, from here you will have a better idea of how to solve the problem.
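Since the question asks for a function, here is one possible sketch. It assumes the goal is to keep the words of ws.txt that do not appear in engwrds.txt, that matching should ignore case, and that a regular expression is an acceptable way to drop punctuation. The names remove_known_words and filter_file are made up for this example.

```python
import re
from pathlib import Path


def remove_known_words(text, known_words):
    """Return the words of `text` that are not in `known_words`, in order."""
    # Assumption: matching is case-insensitive.
    known = {w.lower() for w in known_words}
    # \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.lower() not in known]


def filter_file(ws_path, words_path, out_path):
    """Write the words of ws_path that do not appear in words_path to out_path."""
    known_words = Path(words_path).read_text(encoding="utf-8").split()
    text = Path(ws_path).read_text(encoding="utf-8")
    remaining = remove_known_words(text, known_words)
    # One remaining word per line in the output file.
    Path(out_path).write_text("\n".join(remaining), encoding="utf-8")
    return remaining


sample = remove_known_words("Your world, is beautiful!", ["while", "world", "want", "wild"])
print(sample)  # ['Your', 'is', 'beautiful']
```

With the file names from the question, the call would be filter_file('ws.txt', 'engwrds.txt', 'output_file.txt'). If you want the matching words instead, invert the condition in the list comprehension. Note that \w+ splits contractions such as "don't" into two tokens, so the pattern may need adjusting for your data.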