I have a very specific function. I have two strings: one is a backup of the code's original input, and the other has been modified by steps such as replacing spaces and extracting information (the details are not important here).
I need to find a match between those strings even though one of them has been modified. Once the match is found, I need to store the match from the original string (without modification) and remove it from "sub_str"/"modified_sub_str".
import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)
    if match:
        start = match.start()
        end = match.end()
        count = 0
        original_start = 0
        original_end = 0
        # Map positions in the cleaned string back to positions in main_str
        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break
        original_sub_str = main_str[original_start:original_end]
        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)
        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found
But I have a specific problem with this code. For example, with inputs like
sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
and
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"
This code finds the match and returns "original_sub_str", but it does not remove the match from "modified_sub_str".
The same problem occurs for the following input pairs (each "sub_str" followed by its "main_str"):
"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"
"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”"
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"
"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
I can't find a solution, even with the help of AI, but I suspect the problem lies in the replace calls, special characters, or case sensitivity.
Your sub_str_mod was a regex-escaped string: "." is converted to "\.", so the comparison against the cleaned original_sub_str fails, because original_sub_str has no backslash. (Next time, use a debugger.)
I removed re and do everything with a literal string find.
I removed the else because the if test is always True.
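To make the diagnosis above concrete, here is a minimal sketch that reproduces the two failing steps in isolation, using the fourth input pair from the question. The variable names escaped, equal_escaped, equal_plain and removed are only illustrative. It also shows a second effect that is easy to miss: the fallback re.sub() call searches for the original-form match (with spaces and capitals) inside sub_str, which contains no spaces, so it cannot remove anything.

```python
import re

def clean_str(s):
    # Same normalisation as in the question: lowercase, drop spaces, commas, hyphens
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

sub_str = "73.zjazdchemikov,zborníkabstraktov"
original_sub_str = "73. Zjazd chemikov, zborník abstraktov"

# re.escape() inserts a backslash before regex metacharacters such as "."
escaped = re.escape(clean_str(sub_str))

# The escaped pattern no longer equals the cleaned match, so the "if" branch is skipped
equal_escaped = clean_str(original_sub_str) == escaped

# Comparing two plain cleaned strings works as intended
equal_plain = clean_str(original_sub_str) == clean_str(sub_str)

# The fallback looks for the spaced, capitalised text inside the space-free sub_str
removed = re.sub(re.escape(original_sub_str), "", sub_str, flags=re.IGNORECASE)

print(equal_escaped, equal_plain, removed == sub_str)
```

Keeping the escaped pattern only for the regex call and a separate plain cleaned string for comparisons would avoid the first problem; dropping the regex altogether avoids both.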
def clean_str(s) -> str:
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = clean_str(sub_str)
    main_str_mod = clean_str(main_str)
    # find the substring in the modified main string
    start = main_str_mod.find(sub_str_mod)
    if start == -1:
        return sub_str, None  # Returns sub_str as it was and None if no match is found
    end = start + len(sub_str_mod)
    count = 0
    original_start = 0
    original_end = 0
    for i, c in enumerate(main_str):
        if c not in [' ', ',', '-']:
            count += 1
        if count == start + 1:
            original_start = i
        if count == end:
            original_end = i + 1
            break
    original_sub_str = main_str[original_start:original_end]
    # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
    modified_sub_str = ""
    if clean_str(original_sub_str) == sub_str_mod:  # always True
        modified_sub_str = ""
    return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
Output for the four cases:
('', 'International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016')
('', 'ISNNM-2016, International Symposium on')
('', 'Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale”')
('', '73. Zjazd chemikov, zborník abstraktov')
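As an alternative to the counting loop, the mapping from the cleaned string back to the original can be recorded while cleaning. The following is only a sketch under that assumption, not the code used to produce the output above; the names normalize_with_map and find_original_span are hypothetical. Each kept character remembers the index it came from, so the original slice can be read directly from the match position.

```python
def normalize_with_map(s, ignored=" ,-"):
    """Return the cleaned string and, for each cleaned character,
    the index in s that it came from."""
    chars = []
    index_map = []
    for i, ch in enumerate(s):
        if ch in ignored:
            continue
        # lower() can yield more than one character for some letters,
        # so every resulting character is mapped to the same source index
        for low in ch.lower():
            chars.append(low)
            index_map.append(i)
    return "".join(chars), index_map

def find_original_span(sub_str, main_str):
    """Return the part of main_str that matches sub_str after cleaning,
    or None if there is no match."""
    needle, _ = normalize_with_map(sub_str)
    haystack, index_map = normalize_with_map(main_str)
    if not needle:
        return None
    pos = haystack.find(needle)
    if pos == -1:
        return None
    start = index_map[pos]
    end = index_map[pos + len(needle) - 1] + 1
    return main_str[start:end]

main_str = "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
print(find_original_span("73.zjazdchemikov,zborníkabstraktov", main_str))
# -> 73. Zjazd chemikov, zborník abstraktov
```

Because the map is built from the same pass that produces the cleaned string, the two cannot drift apart if the set of ignored characters changes later.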