I have a dataset prepared for training with fastText, and I want to remove the sublabels from it. For example:
__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data.
Any help is much appreciated, thanks.
I tried this:
r'(?<=__label__[^_]+)\w+'
It isn't working. The exact code:
ptrn = r'(?<=__label__[^_]+)\w+'
re.sub(ptrn, '', test_String)
and this error occurred:
error                                     Traceback (most recent call last)
c:\Users\THoseini\Desktop\projects\ensani_classification\tes4t.ipynb Cell 3 in <cell line: 3>()
      1 ptrn = r'(?<=label[^_]+)\w+'
----> 3 re.sub(ptrn, '', test_String)

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:209, in sub(pattern, repl, string, count, flags)
    202 def sub(pattern, repl, string, count=0, flags=0):
    203     """Return the string obtained by replacing the leftmost
    204     non-overlapping occurrences of the pattern in string by the
    205     replacement repl.  repl can be either a string or a callable;
    206     if a string, backslash escapes in it are processed.  If it is
    207     a callable, it's passed the Match object and must return
    208     a replacement string to be used."""
--> 209     return _compile(pattern, flags).sub(repl, string, count)

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\re.py:303, in _compile(pattern, flags)
    301 if not sre_compile.isstring(pattern):
    302     raise TypeError("first argument must be string or compiled pattern")
--> 303 p = sre_compile.compile(pattern, flags)
    304 if not (flags & DEBUG):
    305     if len(_cache) >= _MAXCACHE:
    306         # Drop the oldest item

File c:\Users\THoseini\AppData\Local\Programs\Python\Python310\lib\sre_compile.py:792, in compile(p, flags)
--> 198     raise error("look-behind requires fixed-width pattern")
    199 emit(lo) # look behind
    200 _compile(code, av[1], flags)
error: look-behind requires fixed-width pattern
Try this regex:
(__label__[^_\s]+)\w*
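Use a star rather than a plus after \w so that a label with no sublabel is left intact. With \w+, the engine has to backtrack and give up the label's last character, so __label__label3 would become __label__label.
Here is sample code in Python: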
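To make the difference between the two quantifiers concrete, here is a minimal sketch that runs both variants on a shortened sample string. The names sample, with_plus and with_star are only illustrative and do not come from the question.

```python
import re

# Shortened sample: one label with a sublabel, one label without.
sample = "__label__label1_sublabel1 __label__label3 some text"

# \w+ must consume at least one character, so for a label without a
# sublabel the engine backtracks and eats the label's last character.
with_plus = re.sub(r'(__label__[^_\s]+)\w+', r'\1', sample)

# \w* may match nothing, so a label without a sublabel stays intact.
with_star = re.sub(r'(__label__[^_\s]+)\w*', r'\1', sample)

print(with_plus)  # __label__label1 __label__label some text
print(with_star)  # __label__label1 __label__label3 some text
```

In both cases the ordinary text is untouched, because the pattern only matches at tokens that start with __label__.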
import re
test_string = """__label__label1_sublabel1 __label__label2_sublabel1 __label__label3 __label__label1_sublabel4 sometext some sentce som data."""
ptrn = r'(__label__[^_\s]+)\w*'
# Replace each match with group 1, i.e. the label without its sublabel
result = re.sub(ptrn, r'\1', test_string)
print(result)
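re.sub() (short for substitute) returns a copy of the string in which every match of the pattern is replaced. Here each match is replaced by \1, the text captured by the parenthesized group.
[^character_group] is a negated character class: it matches any single character that is not in character_group.
\w matches any word character (letters, digits, and underscore), and \s matches any white-space character.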
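As a rough illustration of how these pieces combine, the sketch below lists, for each match, the full matched text next to what the capture group keeps. The names pattern, sample and pairs are hypothetical and used only for this example.

```python
import re

# Same pattern as above, compiled once for reuse.
pattern = re.compile(r'(__label__[^_\s]+)\w*')

sample = "__label__label1_sublabel1 __label__label3 sometext"

# group(0) is the whole match (label plus any sublabel);
# group(1) is only the part that re.sub() puts back via \1.
pairs = [(m.group(0), m.group(1)) for m in pattern.finditer(sample)]

for whole, kept in pairs:
    print(f"{whole!r} -> {kept!r}")
```

The word sometext produces no match at all, which is why re.sub() leaves the sentence part of each line unchanged.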
The output is as expected:
__label__label1 __label__label2 __label__label3 __label__label1 sometext some sentce som data.