Tags: python, python-3.x, regex, winget

Regex - match strings ending in one or more + characters


I have a text file listing packages (name, id, current version, new version, source) taken from the output of winget upgrade. I removed the first two lines and the last line of that output.

Content of the text file:

Brave                        Brave.Brave         111.1.49.120         111.1.49.128        winget
Git                          Git.Git             2.39.2               2.40.0              winget
Notepad++ (64-bit x64)       Notepad++.Notepad++ 8.5                  8.5.1               winget
Spotify                      Spotify.Spotify     1.2.7.1277.g2b3ce637 1.2.8.907.g36fbeacc winget
Teams Machine-Wide Installer Microsoft.Teams     1.5.0.30767          1.6.00.4472         winget
PDFsam Basic                 PDFsam.PDFsam       5.0.3.0              5.1.1.0             winget

I am trying to use Python 3 to extract all package ids, because the output of winget upgrade is plain text.

What I have tried so far:

import re

with open(r"C:\Users\Username\Desktop\winget_upgrade.txt", "r") as f:
    for line in f:
        match = re.search(r"\b([a-zA-Z]+[a-zA-Z0-9!@#$%^&*()+\-.]*\.[a-zA-Z]+[a-zA-Z0-9!@#$%^&*()+\-.]*\+*)\b", line)
        if match:
            print(match.group(1))

The output is:

Brave.Brave
Git.Git
Notepad++.Notepad
Spotify.Spotify
Microsoft.Teams
PDFsam.PDFsam

The problem is that the Notepad++ package id is missing the two + characters at the end. How can I change my regex so that it displays:

Notepad++.Notepad++ instead of Notepad++.Notepad

I think I need to change something in the part that handles the +: ()+\-.]*\+*)
But I am not sure what.
Can you help me?


Solution

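  • The problem is caused by the trailing \b. A word boundary only exists between a word character (letter, digit or underscore) and a non-word character. Both + and the space are non-word characters, so there is no boundary after the final +, and the regex backtracks to the last position where \b does match: right after Notepad.

    Use a lookahead (?=\s) instead, which only requires that whitespace follows the id: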

    import re
    
    lines = [
        'Brave                        Brave.Brave         111.1.49.120         111.1.49.128        winget',
        'Git                          Git.Git             2.39.2               2.40.0              winget',
        'Notepad++ (64-bit x64)       Notepad++.Notepad++ 8.5                  8.5.1               winget',
        'Spotify                      Spotify.Spotify     1.2.7.1277.g2b3ce637 1.2.8.907.g36fbeacc winget',
        'Teams Machine-Wide Installer Microsoft.Teams     1.5.0.30767          1.6.00.4472         winget',
        'PDFsam Basic                 PDFsam.PDFsam       5.0.3.0              5.1.1.0             winget',
    ]
    
    for line in lines:
        # Same pattern as in the question, but ending in (?=\s) instead of \b
        match = re.search(r"\b([a-zA-Z]+[a-zA-Z0-9!@#$%^&*()+\-.]*\.[a-zA-Z]+[a-zA-Z0-9!@#$%^&*()+\-.]*\+*)(?=\s)", line)
        if match:
            print(match.group(1))
    

    Output:

    Brave.Brave
    Git.Git
    Notepad++.Notepad++
    Spotify.Spotify
    Microsoft.Teams
    PDFsam.PDFsam
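
To see the effect of \b in isolation, here is a minimal sketch that compares the two endings on the Notepad++ line. It uses a simplified character class (ID_PART) and a helper (find_id), both hypothetical names introduced only for this illustration and not the full pattern from the question. The last call tries (?=\s|$), which is an assumption on my part for the case where the id is the last thing in the string; the sample data never exercises it.

```python
import re

# Hypothetical, simplified stand-in for one half of a package id:
# a letter followed by word characters, '+' or '-'.
ID_PART = r"[A-Za-z][\w+\-]*"

def find_id(line, ending):
    """Return the first 'Part.Part' match that is followed by `ending`, or None."""
    match = re.search(rf"\b({ID_PART}\.{ID_PART})" + ending, line)
    return match.group(1) if match else None

line = "Notepad++ (64-bit x64)       Notepad++.Notepad++ 8.5                  8.5.1               winget"

# '+' followed by ' ' is not a word boundary, so the match backtracks to 'd|+'.
print(find_id(line, r"\b"))       # Notepad++.Notepad

# The lookahead only asks for whitespace after the id, so both '+' are kept.
print(find_id(line, r"(?=\s)"))   # Notepad++.Notepad++

# With no trailing whitespace, (?=\s) finds nothing; (?=\s|$) also accepts the end.
print(find_id("Notepad++.Notepad++", r"(?=\s)"))    # None
print(find_id("Notepad++.Notepad++", r"(?=\s|$)"))  # Notepad++.Notepad++
```

For the winget output shown above, the id is always followed by the version column, so plain (?=\s) is enough.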