
Python regex lookbehind to remove _sublabel1 in a string like "__label__label1_sublabel1"


I have a dataset prepared for training in fastText, and I want to remove the sublabels from it. For example:

__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data.

Any help is much appreciated, thanks.

I tried this:

r'(?<=__label__[^_]+)\w+'

It isn't working. The exact code:

ptrn = r'(?<=__label__[^_]+)\w+'

re.sub(ptrn, '', test_String)

and this error occurred:

error                                     Traceback (most recent call last)
c:\Users\THoseini\Desktop\projects\ensani_classification\tes4t.ipynb Cell 3 in <cell line: 3>()
      1 ptrn = r'(?<=__label__[^_]+)\w+'
----> 3 re.sub(ptrn, '', test_String)

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:209, in sub(pattern, repl, string, count, flags)
    202 def sub(pattern, repl, string, count=0, flags=0):
    203     """Return the string obtained by replacing the leftmost
    204     non-overlapping occurrences of the pattern in string by the
    205     replacement repl. repl can be either a string or a callable;
    206     if a string, backslash escapes in it are processed. If it is
    207     a callable, it's passed the Match object and must return
    208     a replacement string to be used."""
--> 209     return _compile(pattern, flags).sub(repl, string, count)

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:303, in _compile(pattern, flags)
    301 if not sre_compile.isstring(pattern):
    302     raise TypeError("first argument must be string or compiled pattern")
--> 303 p = sre_compile.compile(pattern, flags)
    304 if not (flags & DEBUG):
    305     if len(_cache) >= _MAXCACHE:
    306         # Drop the oldest item

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\sre_compile.py:792, in compile(p, flags)
--> 198     raise error("look-behind requires fixed-width pattern")
    199 emit(lo) # look behind
    200 _compile(code, av[1], flags)

error: look-behind requires fixed-width pattern


Solution

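  • Python's re module only supports fixed-width lookbehind, so capture the label in a group instead and put it back in the replacement. Try this regex: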

    (__label__[^_\s]+)\w*

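    Use * rather than + after \w so that a label without a sublabel is left intact. With +, the engine has to find at least one more word character, so it backtracks and strips the last character of such a label (__label__label3 would become __label__label).

    Sample code in Python: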

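    import re

    # Sample fastText-style line: labels first, then the text
    test_string = """__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data."""

    # Group 1 keeps '__label__' plus the label; \w* consumes the optional '_sublabel' suffix
    ptrn = r'(__label__[^_\s]+)\w*'
    print(re.sub(ptrn, r'\1', test_string))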
    

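    re.sub() performs a substitution: it returns a copy of the string in which every match of the pattern is replaced. Here the replacement \1 is the text captured by the first group, so the label is kept and the rest of the match is dropped. [^character_group] is a negated character class that matches any single character not in character_group, so [^_\s]+ stops at the first underscore or whitespace character. \w matches any word character (letters, digits, and the underscore), and \s matches any whitespace character.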

    The output is as expected:

    __label__label1 __label__label2 __label__label3 __label__label1 sometext some sentce som data.
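
The approach above can be wrapped in a small helper for cleaning a dataset line by line. This is only a sketch: the name strip_sublabels and the shortened sample line are made up for illustration, and it assumes that the labels themselves never contain an underscore. It also runs the + variant of the pattern side by side to show the difference on a label that has no sublabel.

```python
import re

# Pattern from the answer above, compiled once for reuse.
# Assumption: a label contains no underscore, so the first '_' after it starts the sublabel.
SUBLABEL_PATTERN = re.compile(r'(__label__[^_\s]+)\w*')


def strip_sublabels(line: str) -> str:
    """Keep each '__label__<label>' prefix and drop any '_<sublabel>' suffix.

    Hypothetical helper name, used here for illustration only.
    """
    return SUBLABEL_PATTERN.sub(r'\1', line)


sample = "__label__label1_sublabel1 __label__label3 sometext"
cleaned = strip_sublabels(sample)
print(cleaned)

# Same pattern with '+' instead of '*', for comparison only:
# the label without a sublabel loses its last character through backtracking.
plus_variant = re.sub(r'(__label__[^_\s]+)\w+', r'\1', sample)
print(plus_variant)
```

Text that contains no __label__ prefix is returned unchanged, so the helper can be applied to every line of the file.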