python utf-8 cyrillic

Remove the trailing hyphen (-) from every line of a text file with Cyrillic text


I have a .txt file with Cyrillic text in which many lines end with a hyphen (-). I want to remove these trailing hyphens without removing hyphens anywhere else in the file.

Here is what I have so far. The idea is to copy the text from f1 into f2 line by line, dropping the hyphen at the end of each line.

f2 = open('n_dim.txt','w')
with open('dim.txt','r',encoding='utf-8') as f1:
    for line in f1:
        f2.write(line.removesuffix('-'))

I get no errors, and the file content is copied, but the hyphens are still there. How can I remove them properly?


Solution

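  • This does not work as intended because each line you get while iterating over a file object still ends with its newline character (\n; in text mode Python translates \r\n to \n by default). The line therefore ends with \n rather than -, so removesuffix('-') finds nothing to remove and returns the line unchanged. We can see this by printing the repr of each line while iterating over the file.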

    I will use the following example file content for the rest of the answer:

    Hello-there-
    Привет--
    Hello-
    

    If we print the repr of each line, we can see:

    with open('dim.txt', 'r', encoding='utf-8') as f_in:
        for line in f_in:
            print(repr(line))
    

    ->

    'Hello-there-\n'
    'Привет--\n'
    'Hello-\n'
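
The effect on removesuffix can be checked directly on one of the strings above. This is only a minimal sketch; it assumes Python 3.9 or later, where str.removesuffix was added, and the variable names are illustrative.

```python
line = 'Hello-\n'

# The string ends with '\n', not '-', so removesuffix leaves it unchanged.
unchanged = line.removesuffix('-')
print(repr(unchanged))  # 'Hello-\n'

# Once the trailing newline is stripped, the hyphen really is the suffix.
fixed = line.rstrip('\n').removesuffix('-')
print(repr(fixed))  # 'Hello'
```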
    

    To fix this, strip the trailing whitespace (including the newline) from each line before calling removesuffix (available since Python 3.9), then add the newline back when writing:

    with open('dim.txt', 'r', encoding='utf-8') as f_in:
        with open('n_dim.txt', 'w', encoding='utf-8') as f_out:
            for line in f_in:
                f_out.write(line.rstrip().removesuffix('-') + '\n')
    

    This results in the following:

    Hello-there
    Привет-
    Hello
    

    Note that removesuffix removes at most one hyphen. If a line may end with more than one hyphen and you want to remove all of them, use rstrip('-') instead:

    with open('dim.txt', 'r', encoding='utf-8') as f_in:
        with open('n_dim.txt', 'w', encoding='utf-8') as f_out:
            for line in f_in:
                f_out.write(line.rstrip().rstrip('-') + '\n')
    

    This results in the following:

    Hello-there
    Привет
    Hello
    

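    If the output must open correctly in older Windows programs that expect \r\n line endings, pass newline='\r\n' to open() for the output file and keep writing '\n'; Python then translates each '\n' to '\r\n'. Appending '\r\n' manually also works on Linux and macOS, but on Windows the default text mode would turn it into '\r\r\n'.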
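
One way to get \r\n line endings on any platform is the newline parameter of open(). The sketch below is an illustration under assumptions: the temporary path and the in-memory list of lines are hypothetical stand-ins for the real files, used only so the example runs on its own.

```python
import os
import tempfile

# Hypothetical temporary path, used only to keep the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'n_dim.txt')

# newline='\r\n' makes Python translate every '\n' written into '\r\n'.
with open(path, 'w', encoding='utf-8', newline='\r\n') as f_out:
    for line in ['Hello-there-', 'Привет--', 'Hello-']:
        f_out.write(line.rstrip('-') + '\n')

# Read the raw bytes back to confirm the line endings stored on disk.
with open(path, 'rb') as f_check:
    raw = f_check.read()
print(repr(raw.decode('utf-8')))  # 'Hello-there\r\nПривет\r\nHello\r\n'
```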

    If the input file is small enough to fit in memory, another approach is to read the whole file and call splitlines once instead of rstrip() on each line. splitlines removes only the line breaks, so any other trailing whitespace is preserved, whereas rstrip() removes it. Example:

    with open('dim.txt', 'r', encoding='utf-8') as f_in:
        with open('n_dim.txt', 'w', encoding='utf-8') as f_out:
            for line in f_in.read().splitlines():
                f_out.write(line.rstrip('-') + '\n')
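
The two approaches differ when a line has spaces after its hyphens. The following sketch compares them on an in-memory string; the sample text and variable names are assumptions made for illustration, not part of the original file.

```python
text = 'Hello-\nПривет--  \n'

# Approach 1: rstrip() drops ALL trailing whitespace first, then the hyphens.
via_rstrip = [line.rstrip().rstrip('-') for line in text.splitlines()]

# Approach 2: only the line break is removed, so trailing spaces survive
# and hyphens followed by spaces are no longer at the end of the line.
via_splitlines = [line.rstrip('-') for line in text.splitlines()]

print(via_rstrip)      # ['Hello', 'Привет']
print(via_splitlines)  # ['Hello', 'Привет--  ']
```

If hyphens followed by trailing spaces should also be removed, the rstrip() variant is probably the better fit; if trailing spaces must be kept exactly, the splitlines variant preserves them at the cost of leaving such hyphens in place.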