In short, I have two files: one in Romanian and its English translation. Some tags in the RO file were never translated into EN. I want an HTML output that contains all the EN tags that have a corresponding tag in RO, plus the RO tags that do not appear in EN.
I have these files:
ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
en_file_path = r'd:\3\en\where-do-you-see-look.html'
Output = d:\3\Output\where-do-you-see-look.html
TASK: Compare the 3 tags below, in both files.
<p class="text_obisnuit">(.*?)</p>
<p class="text_obisnuit2">(.*?)</p>
<p class="text_obisnuit"><span class="text_obisnuit2">(.*?)</span>(.*?)</p>
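A minimal sketch of how the three patterns above could be applied to a single line, assuming each tag sits on its own line as in the samples. The names SPAN_P, TEXT2_P, TEXT_P and classify are made up for illustration. The span pattern is tested first because the plain text_obisnuit pattern would also accept the span form.

```python
import re

# Hypothetical names; the regexes are the three patterns listed above.
SPAN_P = re.compile(
    r'<p class="text_obisnuit"><span class="text_obisnuit2">(.*?)</span>(.*?)</p>')
TEXT2_P = re.compile(r'<p class="text_obisnuit2">(.*?)</p>')
TEXT_P = re.compile(r'<p class="text_obisnuit">(.*?)</p>')


def classify(line):
    """Return the tag type of one line, or None if no pattern matches."""
    line = line.strip()
    # Most specific pattern first: TEXT_P alone would also accept the span form.
    if SPAN_P.fullmatch(line):
        return 'span'
    if TEXT2_P.fullmatch(line):
        return 'text_obisnuit2'
    if TEXT_P.fullmatch(line):
        return 'text_obisnuit'
    return None


print(classify('<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>'))
```

Strict patterns like these are brittle: a line whose span class is misspelled, or whose span is never closed, falls through to the plain text_obisnuit case.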
Requirements: only the tags between <!-- ARTICOL START --> and <!-- ARTICOL FINAL --> are compared.
RO file:
<!-- ARTICOL START -->
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p>
<p class="text_obisnuit">Sunt un bun conducator auto, dar am facut si greseli din care am invatat.</p>
<p class="text_obisnuit">În fond, cele scrise de mine, sunt adevarate.</p>
<p class="text_obisnuit">Iubesc sa conduc masina.</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">Totul se repetă, chiar și ochii care nu se vad.</p>
<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>
<!-- ARTICOL FINAL -->
EN file:
<!-- ARTICOL START -->
<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>
<p class="text_obisnuit">I'm a good driver, but I've also made mistakes that I've learned from.</p>
<p class="text_obisnuit">Basically, what I wrote is true.</p>
<p class="text_obisnuit">I love driving.</p>
<p class="text_obisnuit"><span class="text_obisinuit2">I know it's difficult to drive at first, </span> but after 4-5 months you learn.</p>
<p class="text_obisnuit">Everything is repeated, even the eyes that can't see.</p>
<!-- ARTICOL FINAL -->
Expected Output:
<!-- ARTICOL START -->
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span> dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit2">I like going to school and learning, especially during the week.</p>
<p class="text_obisnuit">I'm a good driver, but I've also made mistakes that I've learned from.</p>
<p class="text_obisnuit">Basically, what I wrote is true.</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p>
<p class="text_obisnuit">I love driving.</p>
<p class="text_obisnuit"><span class="text_obisinuit2">I know it's difficult to drive at first, </span> but after 4-5 months you learn.</p>
<p class="text_obisnuit">Everything is repeated, even the eyes that can't see.</p>
<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>
<!-- ARTICOL FINAL -->
The Python code must compare the HTML tags in RO with the HTML tags in EN and write to Output the unique tags from both files, keeping in mind that most RO tags already have a translated counterpart in EN. The key point is that the code must also find the RO tags that were left out of the EN translation.
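One way to frame this requirement, offered only as a sketch and not as the approach used later in this post, is as a sequence alignment: reduce each file to its list of tag types and let Python's difflib line the two lists up. The function name unmatched_ro_positions and the sample type lists are made up for illustration.

```python
from difflib import SequenceMatcher


def unmatched_ro_positions(ro_types, en_types):
    """Return 0-based RO positions with no aligned EN tag of the same type."""
    matcher = SequenceMatcher(None, ro_types, en_types, autojunk=False)
    matched = set()
    # Every matching block is a run of RO tags paired with EN tags in order.
    for block in matcher.get_matching_blocks():
        matched.update(range(block.a, block.a + block.size))
    return [i for i in range(len(ro_types)) if i not in matched]


# Made-up type sequences: RO has one extra tag at the start and one at the end.
ro_types = ['span', 'text_obisnuit2', 'text_obisnuit', 'text_obisnuit', 'text_obisnuit2']
en_types = ['text_obisnuit2', 'text_obisnuit', 'text_obisnuit']
print(unmatched_ro_positions(ro_types, en_types))
```

Aligning on the type alone has a clear limit: when two neighbouring RO tags share a type and only one of them was translated, the type sequence cannot say which one is missing.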
Here's how I came up with the solution in Python code. I followed a simple calculation.
First method:
First, count all the tags in RO, then all the tags in EN. Record the type of each tag in RO and in EN. Then count the words in each RO tag and in each EN tag. Don't forget that two identical tags can appear on different lines, as happens in RO. Finally, compute the difference: how many tags are left when you subtract the EN tags from the RO tags?
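The counting step can be sketched roughly as follows, using a few of the sample lines from above. The helper names tag_signature and surplus_by_type are assumptions, and the type is detected with simple substring checks rather than a real HTML parser.

```python
import re
from collections import Counter


def tag_signature(p_html):
    """Return (tag type, word count) for one <p>...</p> string."""
    if '<span' in p_html:
        tag_type = 'span'
    elif 'class="text_obisnuit2"' in p_html:
        tag_type = 'text_obisnuit2'
    else:
        tag_type = 'text_obisnuit'
    # Replace markup with spaces so words on either side of a tag stay apart.
    text = re.sub(r'<[^>]+>', ' ', p_html)
    return tag_type, len(text.split())


def surplus_by_type(ro_paragraphs, en_paragraphs):
    """Count tags per type in RO and EN, then return RO minus EN per type."""
    ro_counts = Counter(tag_signature(p)[0] for p in ro_paragraphs)
    en_counts = Counter(tag_signature(p)[0] for p in en_paragraphs)
    # Counter subtraction keeps only the types where RO has more tags.
    return dict(ro_counts - en_counts)


ro = [
    '<p class="text_obisnuit"><span class="text_obisnuit2">Ma iubesti?</p>',
    '<p class="text_obisnuit">Iubesc sa conduc masina.</p>',
    '<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>',
]
en = ['<p class="text_obisnuit">I love driving.</p>']
print(surplus_by_type(ro, en))
```

The counts only say how many tags of each type are extra in RO, not which ones, so a second step is still needed to locate them.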
The second method, used to verify the output, is visual: take a screenshot (Print Screen), run the entire RO part and the entire EN part through OCR separately, then go line by line and see which RO tags are extra compared with the EN tags.
PYTHON CODE:
import re
import os


def extract_tags(content):
    # Only the section between the article markers is analysed.
    start = content.find('<!-- ARTICOL START -->')
    end = content.find('<!-- ARTICOL FINAL -->')
    if start == -1 or end == -1:
        raise ValueError("Marcajele 'ARTICOL START' sau 'ARTICOL FINAL' lipsesc.")
    section_content = content[start:end]
    pattern = re.compile(r'<p class="text_obisnuit(?:2)?">(?:<span class="text_obisnuit2">)?.*?</p>', re.DOTALL)
    tags = []
    for idx, match in enumerate(pattern.finditer(section_content), 1):
        tag = match.group(0)
        # Strip the markup to get the visible text.
        text = re.sub(r'<[^>]+>', '', tag).strip()
        if '<span class="text_obisnuit2">' in tag or '<span class="text_obisinuit2">' in tag:
            tag_type = 'span'
        elif 'class="text_obisnuit2"' in tag:
            tag_type = 'text_obisnuit2'
        else:
            tag_type = 'text_obisnuit'
        tags.append({
            'index': idx,
            'tag': tag,
            'text': text,
            'type': tag_type,
            'word_count': len(text.split())
        })
    return tags


def find_matching_pairs(ro_tags, en_tags):
    # A RO tag is matched with the first unused EN tag of the same type
    # whose word count differs by at most 3.
    matched_indices = set()
    used_en = set()
    for i, ro_tag in enumerate(ro_tags):
        for j, en_tag in enumerate(en_tags):
            if j in used_en:
                continue
            if ro_tag['type'] == en_tag['type']:
                word_diff = abs(ro_tag['word_count'] - en_tag['word_count'])
                if word_diff <= 3:
                    matched_indices.add(i)
                    used_en.add(j)
                    break
    return matched_indices


def fix_duplicates(output_content, ro_content):
    """Correct the position of duplicate tags."""
    ro_tags = extract_tags(ro_content)
    output_tags = extract_tags(output_content)
    # Find the tags that appear in both RO and OUTPUT
    for ro_idx, ro_tag in enumerate(ro_tags):
        for out_idx, out_tag in enumerate(output_tags):
            if ro_tag['tag'] == out_tag['tag'] and ro_idx != out_idx:
                # Found a tag that appears at different positions.
                # Check whether it is a duplicate that has to be moved.
                ro_lines = ro_content.split('\n')
                out_lines = output_content.split('\n')
                if ro_tag['tag'] in ro_lines[ro_idx+1] and out_tag['tag'] in out_lines[out_idx+1]:
                    # Move the tag to the correct position
                    out_lines.remove(out_tag['tag'])
                    out_lines.insert(ro_idx+1, out_tag['tag'])
                    output_content = '\n'.join(out_lines)
                    break
    return output_content


def generate_output(ro_tags, en_tags, original_content):
    start = original_content.find('<!-- ARTICOL START -->')
    end = original_content.find('<!-- ARTICOL FINAL -->')
    if start == -1 or end == -1:
        raise ValueError("Marcajele 'ARTICOL START' sau 'ARTICOL FINAL' lipsesc.")
    output_content = original_content[:start + len('<!-- ARTICOL START -->')] + "\n"
    matched_indices = find_matching_pairs(ro_tags, en_tags)
    en_index = 0
    for i, ro_tag in enumerate(ro_tags):
        if i in matched_indices:
            # Translated tag: emit the next EN tag.
            output_content += en_tags[en_index]['tag'] + "\n"
            en_index += 1
        else:
            # Untranslated tag: keep the RO tag.
            output_content += ro_tag['tag'] + "\n"
    # Append any EN tags that were not consumed above.
    while en_index < len(en_tags):
        output_content += en_tags[en_index]['tag'] + "\n"
        en_index += 1
    output_content += original_content[end:]
    return output_content


def main():
    try:
        ro_file_path = r'd:\3\ro\incotro-vezi-tu-privire.html'
        en_file_path = r'd:\3\en\where-do-you-see-look.html'
        output_file_path = r'd:\3\Output\where-do-you-see-look.html'
        with open(ro_file_path, 'r', encoding='utf-8') as ro_file:
            ro_content = ro_file.read()
        with open(en_file_path, 'r', encoding='utf-8') as en_file:
            en_content = en_file.read()
        ro_tags = extract_tags(ro_content)
        en_tags = extract_tags(en_content)
        # Generate the first output
        initial_output = generate_output(ro_tags, en_tags, en_content)
        # Correct the positions of duplicate tags
        final_output = fix_duplicates(initial_output, ro_content)
        with open(output_file_path, 'w', encoding='utf-8') as output_file:
            output_file.write(final_output)
        print(f"Output-ul a fost generat la {output_file_path}")
    except Exception as e:
        print(f"Eroare: {str(e)}")


if __name__ == "__main__":
    main()
My Python code is almost perfect, but not perfect. The problem occurs when I introduce other tags in RO, such as:
<!-- ARTICOL START -->
<p class="text_obisnuit">Laptopul meu este de culoare neagra.</p>
<p class="text_obisnuit2">Imi place sa merg la scoala si sa invat, mai ales in timpul saptamanii.</p>
<p class="text_obisnuit">Sunt un bun conducator auto, dar am facut si greseli din care am invatat.</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">În fond, cele scrise de mine, sunt adevarate.</p>
<p class="text_obisnuit">Iubesc sa conduc masina.</p>
<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>
<p class="text_obisnuit">Totul se repetă, chiar și ochii care nu se vad.</p>
<!-- ARTICOL FINAL -->
SECOND, and the BEST SOLUTION.
I finally solved the problem, but not with ChatGPT or Claude. No AI could find the solution, because none of them knew how to approach it.
In fact, to find the solution to this problem, you had to assign an identifier to each tag and run multiple searches.
ChatGPT or Claude, or other AIs, will have to seriously consider this type of solution for such problems.
Here are the specifications, that is, the way I thought about solving the problem. It's a different way of thinking about parsing.
https://pastebin.com/as2yw1UQ
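To make the identifier idea concrete, here is a small sketch. It reuses the type letters (A for a paragraph with a span, B for text_obisnuit2, C for text_obisnuit) and the word-count thresholds from the code further down; the function name label_tags and the identifier format (position, letter, Greek letter) are assumptions for illustration, not taken from the linked specification.

```python
import re


def label_tags(paragraphs):
    """Return one identifier per <p> string, e.g. '1Aβ' (position, type, size)."""
    labels = []
    for position, p_html in enumerate(paragraphs, 1):
        if '<span' in p_html:
            tag_type = 'A'
        elif 'class="text_obisnuit2"' in p_html:
            tag_type = 'B'
        else:
            tag_type = 'C'
        # Replace markup with spaces so words around a tag are not glued together.
        words = len(re.sub(r'<[^>]+>', ' ', p_html).split())
        if words < 7:
            greek = 'α'
        elif words <= 14:
            greek = 'β'
        else:
            greek = 'γ'
        labels.append(f"{position}{tag_type}{greek}")
    return labels


sample = [
    '<p class="text_obisnuit"><span class="text_obisnuit2">Stiu ca este dificil sa conduci la inceput, </span>dar dupa 4-5 luni inveti.</p>',
    '<p class="text_obisnuit2">BEE servesc o cafea 2 mai buna</p>',
    '<p class="text_obisnuit">Iubesc sa conduc masina.</p>',
]
print(label_tags(sample))
```

Once both files are reduced to lists of such identifiers, the searches run over short strings instead of raw HTML. Note that a translation can land in a different word-count bucket than its original, so the Greek letter is a hint rather than a strict key.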
The Python code was written by a friend of mine. I came up with the solution, he wrote the code:
from bs4 import BeautifulSoup
import re


def count_words(text):
    """Count the words in a text."""
    return len(text.strip().split())


def get_greek_identifier(word_count):
    """Pick the Greek identifier based on the word count."""
    if word_count < 7:
        return 'α'
    elif word_count <= 14:
        return 'β'
    else:
        return 'γ'


def get_tag_type(tag):
    """Determine the tag type (A, B, or C)."""
    if tag.find('span'):
        return 'A'
    elif 'text_obisnuit2' in tag.get('class', []):
        return 'B'
    return 'C'


def analyze_tags(content):
    """Analyse the tags and return information about each one."""
    soup = BeautifulSoup(content, 'html.parser')
    tags_info = []
    article_content = re.search(r'<!-- ARTICOL START -->(.*?)<!-- ARTICOL FINAL -->',
                                content, re.DOTALL)
    if article_content:
        # Re-parse only the section between the article markers.
        content = article_content.group(1)
        soup = BeautifulSoup(content, 'html.parser')
    for i, tag in enumerate(soup.find_all('p', recursive=False)):
        text_content = tag.get_text(strip=True)
        tag_type = get_tag_type(tag)
        word_count = count_words(text_content)
        greek_id = get_greek_identifier(word_count)
        tags_info.append({
            'number': i + 1,
            'type': tag_type,
            'greek': greek_id,
            'content': str(tag),
            'text': text_content
        })
    return tags_info


def compare_tags(ro_tags, en_tags):
    """Compare the tags and find the differences."""
    wrong_tags = []
    i = 0
    j = 0
    while i < len(ro_tags):
        ro_tag = ro_tags[i]
        if j >= len(en_tags):
            # EN is exhausted: every remaining RO tag is extra.
            wrong_tags.append(ro_tag)
            i += 1
            continue
        en_tag = en_tags[j]
        if ro_tag['type'] != en_tag['type']:
            # Type mismatch: treat the RO tag as untranslated and stay on the same EN tag.
            wrong_tags.append(ro_tag)
            i += 1
            continue
        i += 1
        j += 1
    return wrong_tags


def format_results(wrong_tags):
    """Format the results for display and saving."""
    type_counts = {'A': 0, 'B': 0, 'C': 0}
    type_content = {'A': [], 'B': [], 'C': []}
    for tag in wrong_tags:
        type_counts[tag['type']] += 1
        type_content[tag['type']].append(tag['content'])
    # Build the formatted result
    result = []
    # First line with the summary
    summary_parts = []
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            summary_parts.append(f"{type_counts[tag_type]} taguri de tipul ({tag_type})")
    result.append("In RO exista in plus fata de EN urmatoarele: " + " si ".join(summary_parts))
    # Details for each tag type
    for tag_type in ['A', 'B', 'C']:
        if type_counts[tag_type] > 0:
            result.append(f"\n{type_counts[tag_type]}({tag_type}) adica asta {'taguri' if type_counts[tag_type] > 1 else 'tag'}:")
            for content in type_content[tag_type]:
                result.append(content)
            result.append("")  # Blank line as a separator
    return "\n".join(result)


def merge_content(ro_tags, en_tags, wrong_tags):
    """Merge the RO and EN content, inserting the wrong tags at their original positions."""
    merged_tags = []
    # Build a dictionary of the wrong tags indexed by their original number
    wrong_dict = {tag['number']: tag for tag in wrong_tags}
    # Walk the positions and decide which tag goes in each one
    current_en_idx = 0
    for i in range(max(len(ro_tags), len(en_tags))):
        position = i + 1
        # Check whether this position belongs to a wrong tag
        if position in wrong_dict:
            merged_tags.append(wrong_dict[position]['content'])
        elif current_en_idx < len(en_tags):
            merged_tags.append(en_tags[current_en_idx]['content'])
            current_en_idx += 1
    return merged_tags


def save_results(merged_content, results, output_path):
    """Save the merged content and the results to the output file."""
    final_content = '<!-- REZULTATE ANALIZA -->\n'
    final_content += '<!-- ARTICOL START -->\n'
    # Add the merged content
    for tag in merged_content:
        final_content += tag + '\n'
    final_content += '<!-- ARTICOL FINAL -->\n'
    final_content += '<!-- FINAL REZULTATE ANALIZA -->\n'
    # Add the analysis results
    final_content += results
    # Save to file
    with open(output_path, 'w', encoding='utf-8') as file:
        file.write(final_content)


# Read the files
with open(r'd:/3/ro/incotro-vezi-tu-privire.html', 'r', encoding='utf-8') as file:
    ro_content = file.read()
with open(r'd:/3/en/where-do-you-see-look.html', 'r', encoding='utf-8') as file:
    en_content = file.read()
# Define the path of the output file
output_path = r'd:/3/Output/where-do-you-see-look.html'
# Analyse the tags
ro_tags = analyze_tags(ro_content)
en_tags = analyze_tags(en_content)
# Find the differences
wrong_tags = compare_tags(ro_tags, en_tags)
# Format the results
results = format_results(wrong_tags)
# Generate the merged content
merged_content = merge_content(ro_tags, en_tags, wrong_tags)
# Print the results to the console
print(results)
# Save the results to the output file
save_results(merged_content, results, output_path)