I'm trying to scrape some text from a website; the problem is its HTML formatting.
<div class="coptic-text html">
<div class="htmlvis"><t class="translation" title="The book of the genealogy of Jesus Christ, the son of David, the son of Abraham."><div class="verse" verse="1"><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲱⲱⲙⲉ' target='_new'>ϫⲱⲱⲙⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲙ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲡⲟ' target='_new'>ϫⲡⲟ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲓⲏⲥⲟⲩⲥ' target='_new'>ⲓⲏⲥⲟⲩⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲭⲣⲓⲥⲧⲟⲥ' target='_new'>ⲭⲣⲓⲥⲧⲟⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲇⲁⲩⲉⲓⲇ' target='_new'>ⲇⲁⲩⲉⲓⲇ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲁⲃⲣⲁϩⲁⲙ' target='_new'>ⲁⲃⲣⲁϩⲁⲙ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=.' target='_new'>.</a></span></span></div></t><!--
--></span></div></t></div>
My desired output:
1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ.
My output:
ⲡϫⲱⲱⲙⲉⲙⲡⲉϫⲡⲟⲛⲓ ⲏⲥⲟⲩⲥⲡⲉⲭⲣⲓ ⲥⲧⲟⲥⲡϣⲏⲣⲉⲛⲇⲁⲩⲉⲓ ⲇⲡϣⲏⲣⲉⲛⲁⲃⲣⲁϩⲁⲙ.
My code so far:
# coding: utf-8
import os.path
import signal
import sys

import requests
from bs4 import BeautifulSoup

# Exit quietly on Ctrl+C.
signal.signal(signal.SIGINT, lambda signum, frame: sys.exit(0))

if len(sys.argv) != 4:
    print("Usage: %s <book name> <first chapter> <last chapter>" % os.path.basename(__file__))
    sys.exit(1)

book_name = sys.argv[1]
start = int(sys.argv[2])
stop = int(sys.argv[3])

while start <= stop:
    url = f"https://data.copticscriptorium.org/texts/new-testament/{book_name}_{start}/sahidica"
    print(f"[{start}/{stop}] {url}")
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        # Each morpheme sits in its own <span class="norm">.
        content_list = soup.find_all("span", class_="norm")
        text = []
        for element in content_list:
            text.append(element.get_text())
        text = ''.join(text).strip()
        # Write as UTF-8 so the Coptic characters survive on any platform.
        with open(f"./{book_name}_{start}.txt", "a", encoding="utf-8") as out_file:
            out_file.write("%s\n" % text)
    except Exception:
        print("Error")
    start += 1
P.S. Language is old Coptic.
EDIT:
I think the problem is that the spacing and verse numbers are produced by CSS. Can I somehow apply the CSS styles with BeautifulSoup?
.word{ white-space: inherit; }
.word:after{content: " ";}
div.verse{display: block; padding-top: 6px; padding-bottom: 6px; text-indent: -15px; padding-left: 15px; }
div.verse:before{content: attr(verse)": "; font-weight:bold}
.norm a{text-decoration: none !important; color:inherit}
.norm a:hover{text-decoration: underline !important; color: blue}
EDIT:
It seems that
content_list = soup.find_all("span", class_="word")
gives the desired word spacing, but I still can't output the verse number.
Found the answer myself. I first have to select each div with class verse, then iterate over the span tags with class word inside it to get the text.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://data.copticscriptorium.org/texts/new-testament/40_matthew_1/sahidica")
soup = BeautifulSoup(r.content, "html.parser")
select_verse = soup.find_all("div", class_="verse")
for verse in select_verse:
    text = []
    content = verse.find_all("span", class_="word")
    for element in content:
        text.append(element.get_text())
    text = ' '.join(text).strip()
    print(f"{verse.get('verse')}: {text}")
Output:
1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ .
2: ⲁⲃⲣⲁϩⲁⲙ ⲁϥϫⲡⲟ ⲛⲓⲥⲁⲁⲕ ⲓⲥⲁⲁⲕ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲁⲕⲱⲃ ⲓⲁⲕⲱⲃ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲟⲩⲇⲁⲥ ⲙⲛⲛⲉϥⲥⲛⲏⲩ .
3: ⲓⲟⲩⲇⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲙⲫⲁⲣⲉⲥ ⲙⲛⲍⲁⲣⲁ ⲉⲃⲟⲗ ϩⲛⲑⲁⲙⲁⲣ ⲫⲁⲣⲉⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲥⲣⲱⲙ . ⲉⲥⲣⲱⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲣⲁⲙ .
4: ⲁⲣⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲙⲓⲛⲁⲇⲁⲃ . ⲁⲙⲓⲛⲁⲇⲁⲃ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲛⲁⲁⲥⲥⲱⲛ ⲛⲁⲁⲥⲥⲱⲛ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲥⲁⲗⲙⲱⲛ .
5: ⲥⲁⲗⲙⲱⲛ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲃⲟⲉⲥ ⲉⲃⲟⲗ ϩⲛϩⲣⲁⲭⲁⲃ . ⲃⲟⲉⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲃⲏⲇ ⲉⲃⲟⲗ ϩⲛϩⲣⲟⲩⲑ . ⲓⲱⲃⲏⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲉⲥⲥⲁⲓ .
6: ⲓⲉⲥⲥⲁⲓ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲇⲁⲩⲉⲓⲇ ⲡⲣⲣⲟ . ⲇⲁⲩⲉⲓⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲥⲟⲗⲟⲙⲱⲛ ⲉⲃⲟⲗ ϩⲛⲧϩⲓⲙⲉ ⲛⲟⲩⲣⲓⲁⲥ .
7: ⲥⲟⲗⲟⲙⲱⲛ ⲇⲉ ⲁϥϫⲡⲟ ⲛϩⲣⲟⲃⲟⲁⲙ ϩⲣⲟⲃⲟⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲃⲓⲁ ⲁⲃⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲥⲁⲫ .
8: ⲁⲥⲁⲫ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲥⲁⲫⲁⲧ ⲓⲱⲥⲁⲫⲁⲧ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲣⲁⲙ ⲓⲱⲣⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲟⲍⲉⲓⲁⲥ .
9: ⲟⲍⲉⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲁⲑⲁⲙ . ⲓⲱⲛⲁⲑⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲭⲁⲍ ⲁⲭⲁⲍ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲍⲉⲕⲉⲓⲁⲥ .
10: ⲉⲍⲉⲕⲉⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲙⲙⲁⲛⲁⲥⲥⲏ ⲙⲁⲛⲁⲥⲥⲏ ⲇⲉ ⲁϥϫⲡⲟ ⲛϩⲁⲙⲱⲥ . ϩⲁⲙⲱⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲥⲓⲁⲥ .
11: ⲓⲱⲥⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲉⲭⲟⲛⲓⲁⲥ ⲙⲛⲛⲉϥⲥⲛⲏⲩ ϩⲓⲡⲡⲱⲱⲛⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ .
12: ⲙⲛⲛⲥⲁⲡⲡⲱⲱⲛⲉ ⲇⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ ⲓⲉⲭⲟⲛⲓⲁⲥ ⲁϥϫⲡⲟ ⲛⲥⲁⲗⲁⲑⲓⲏⲗ ⲥⲁⲗⲁⲑⲓⲏⲗ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲍⲟⲣⲟⲃⲁⲃⲉⲗ .
13: ⲍⲟⲣⲟⲃⲁⲃⲉⲗ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲃⲓⲟⲩⲇ ⲁⲃⲓⲟⲩⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲗⲓⲁⲕⲓⲙ . ⲉⲗⲓⲁⲕⲓⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲍⲱⲣⲁ .
14: ⲁⲍⲱⲣⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲥⲁⲇⲱⲕ ⲥⲁⲇⲱⲕ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲭⲉⲓⲙ ⲁⲭⲉⲓⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲗⲓⲟⲩⲇ .
15: ⲉⲗⲓⲟⲩⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲗⲉⲁⲍⲁⲣ ⲉⲗⲉⲁⲍⲁⲣ ⲇⲉ ⲁϥϫⲡⲟ ⲙⲙⲁⲧⲑⲁⲙ ⲙⲁⲧⲑⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲁⲕⲱⲃ .
16: ⲓⲁⲕⲱⲃ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲥⲏⲫ ⲡϩⲁⲓ ⲙⲙⲁⲣⲓⲁ . ⲧⲁⲓ ⲛⲧⲁⲩϫⲡⲉ ⲓⲏⲥⲟⲩⲥ ⲉⲃⲟⲗ ⲛϩⲏⲧⲥ . ⲡⲁⲓ ⲛϣⲁⲩⲙⲟⲩⲧⲉ ⲉⲣⲟϥ ϫⲉⲡⲉⲭⲣⲓⲥⲧⲟⲥ .
17: ⲅⲉⲛⲉⲁ ϭⲉ ⲛⲓⲙ ϫⲓⲛⲁⲃⲣⲁϩⲁⲙ ϣⲁⲉϩⲣⲁⲓ ⲉⲇⲁⲩⲉⲓⲇ ⲙⲛⲧⲁϥⲧⲉ ⲛⲅⲉⲛⲉⲁ . ⲁⲩⲱ ϫⲓⲛⲇⲁⲩⲉⲓⲇ ϣⲁⲉϩⲣⲁⲓ ⲉⲡⲡⲱⲱⲛⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ ⲙⲛⲧⲁϥⲧⲉ ⲛⲅⲉⲛⲉⲁ . ⲁⲩⲱ ϫⲓⲛⲉⲡⲡⲱⲱⲛⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ ϣⲁⲉϩⲣⲁⲓ ⲉⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲙⲛⲧⲁϥⲧⲉ ⲛⲅⲉⲛⲉⲁ .
18: ⲡⲉϫⲡⲟ ⲇⲉ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲛⲉⲩⲧⲉⲓϩⲉ ⲡⲉ ⲛⲧⲉⲣⲟⲩϣⲡ ⲧⲟⲟⲧⲥ ⲛⲧⲉϥⲙⲁⲁⲩ ⲙⲁⲣⲓⲁ ⲛⲓⲱⲥⲏⲫ ⲉⲙⲡⲁⲧⲟⲩⲃⲱⲕ ⲉϩⲟⲩⲛ ϣⲁⲛⲉⲩⲉⲣⲏⲩ ⲁⲩϩⲉ ⲉⲣⲟⲥ ⲉⲥⲉⲉⲧ ⲉⲃⲟⲗ ϩⲛⲟⲩⲡⲛⲉⲩⲙⲁ ⲉϥⲟⲩⲁⲁⲃ .
19: ⲓⲱⲥⲏⲫ ⲇⲉ ⲡⲉⲥϩⲁⲓ ⲉⲛⲉⲩⲇⲓⲕⲁⲓⲟⲥ ⲡⲉ . ⲁⲩⲱ ⲛⲉϥⲟⲩⲱϣ ⲁⲛ ⲉϯ ⲙⲡⲉⲥⲥⲟⲉⲓⲧ ⲁϥⲟⲩⲱϣ ⲉⲛⲟϫⲥ ⲉⲃⲟⲗ ⲛϫⲓⲟⲩⲉ .
20: ⲛⲁⲓ ⲇⲉ ⲛⲧⲉⲣⲉϥⲙⲉⲉⲩⲉ ⲉⲣⲟⲟⲩ ⲉⲓⲥ ⲡⲁⲅⲅⲉⲗⲟⲥ ⲙⲡϫⲟⲉⲓⲥ ⲁϥⲟⲩⲱⲛϩ ⲛⲁϥ ⲉⲃⲟⲗ ϩⲛⲟⲩⲣⲁⲥⲟⲩ ⲉϥϫⲱ ⲙⲙⲟⲥ ϫⲉⲓⲱⲥⲏⲫ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲙⲡⲣⲣⲟ ϩⲟⲧⲉ ⲉϫⲓ ⲙⲙⲁⲣⲓⲁ ⲧⲉⲕⲥϩⲓⲙⲉ . ⲡⲉⲧⲟⲩⲛⲁϫⲡⲟϥ ⲅⲁⲣ ⲉⲃⲟⲗ ⲛϩⲏⲧⲥ ⲟⲩⲉⲃⲟⲗ ϩⲛⲟⲩⲡⲛⲉⲩⲙⲁ ⲉϥⲟⲩⲁⲁⲃ ⲡⲉ .
21: ⲥⲛⲁϫⲡⲟ ⲇⲉ ⲛⲟⲩϣⲏⲣⲉ . ⲛⲅⲙⲟⲩⲧⲉ ⲉⲡⲉϥⲣⲁⲛ ϫⲉⲓⲏⲥⲟⲩⲥ . ⲛⲧⲟϥ ⲅⲁⲣ ⲡⲉⲧⲛⲁⲧⲟⲩϫⲟ ⲙⲡⲉϥⲗⲁⲟⲥ ⲉⲃⲟⲗ ϩⲛⲛⲉⲩⲛⲟⲃⲉ .
22: ⲡⲁⲓ ⲇⲉ ⲧⲏⲣϥ ⲛⲧⲁϥϣⲱⲡⲉ ϫⲉⲕⲁⲁⲥ ⲉϥⲉϫⲱⲕ ⲉⲃⲟⲗ ⲛϭⲓⲡⲉⲛⲧⲁⲡϫⲟⲉⲓⲥ ϫⲟⲟϥ ϩⲓⲧⲙⲡⲉⲡⲣⲟⲫⲏⲧⲏⲥ ⲉϥϫⲱ ⲙⲙⲟⲥ .
23: ϫⲉⲉⲓⲥⲧⲡⲁⲣⲑⲉⲛⲟⲥ ⲛⲁⲱ ⲛⲥϫⲡⲟ ⲛⲟⲩϣⲏⲣⲉ ⲛⲥⲉⲙⲟⲩⲧⲉ ⲉⲡⲉϥⲣⲁⲛ ϫⲉⲉⲙⲙⲁⲛⲟⲩⲏⲗ ⲉⲧⲉⲡⲁⲓ ⲡⲉ ⲛϣⲁⲩⲟⲩⲁϩⲙⲉϥ ϫⲉⲡⲛⲟⲩⲧⲉ ⲛⲙⲙⲁⲛ .
24: ⲁϥⲧⲱⲟⲩⲛ ⲇⲉ ⲛϭⲓⲓⲱⲥⲏⲫ ⲉϥⲛⲕⲟⲧⲕ ⲁϥⲉⲓⲣⲉ ⲕⲁⲧⲁⲧϩⲉ ⲛⲧⲁϥϩⲱⲛ ⲉⲧⲟⲟⲧϥ ⲛϭⲓⲡⲁⲅⲅⲉⲗⲟⲥ ⲙⲡϫⲟⲉⲓⲥ . ⲁϥϫⲓ ⲙⲙⲁⲣⲓⲁ ⲧⲉϥⲥϩⲓⲙⲉ .
25: ⲙⲡⲉϥⲥⲟⲩⲱⲛⲥ ϣⲁⲛⲧⲉⲥϫⲡⲟ ⲙⲡⲉⲥϣⲏⲣⲉ . ⲁϥⲙⲟⲩⲧⲉ ⲉⲡⲉϥⲣⲁⲛ ϫⲉⲓⲏⲥⲟⲩⲥ .
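
One difference remains between this output and the desired output in the question: each full stop is preceded by a space, because the punctuation mark is its own span with class word. Here is a minimal sketch of one way to tidy that up after joining. The helper name join_words and the PUNCTUATION set are my own assumptions, not something taken from the site or from BeautifulSoup.

```python
import re

# Assumption: these marks should attach to the preceding word.
# Extend the set if the texts use other punctuation.
PUNCTUATION = ".,;:"


def join_words(words):
    """Join word strings with spaces, but without a space before punctuation."""
    # Drop empty entries and stray whitespace around each word.
    line = " ".join(w.strip() for w in words if w.strip())
    # Remove the whitespace that the join placed in front of a punctuation mark.
    return re.sub(rf"\s+([{re.escape(PUNCTUATION)}])", r"\1", line)


words = ["ⲡϫⲱⲱⲙⲉ", "ⲙⲡⲉϫⲡⲟ", "ⲛⲓⲏⲥⲟⲩⲥ", "."]
print(join_words(words))  # ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ.
```

In the loop above, text = ' '.join(text).strip() could then be replaced with text = join_words(text).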