python beautifulsoup python-requests

BeautifulSoup output not properly formatted


I'm trying to scrape some text from a website, but the way its HTML is structured is causing problems.

        <div class="coptic-text html">
            <div class="htmlvis"><t class="translation" title="The book of the genealogy of Jesus Christ, the son of David, the son of Abraham."><div class="verse" verse="1"><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲱⲱⲙⲉ' target='_new'>ϫⲱⲱⲙⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲙ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲡⲟ' target='_new'>ϫⲡⲟ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲓⲏⲥⲟⲩⲥ' target='_new'>ⲓⲏⲥⲟⲩⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲭⲣⲓⲥⲧⲟⲥ' target='_new'>ⲭⲣⲓⲥⲧⲟⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲇⲁⲩⲉⲓⲇ' target='_new'>ⲇⲁⲩⲉⲓⲇ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲁⲃⲣⲁϩⲁⲙ' target='_new'>ⲁⲃⲣⲁϩⲁⲙ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=.' target='_new'>.</a></span></span></div></t><!--
--></span></div></t></div>

My desired output:

1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ.

My output:

ⲡϫⲱⲱⲙⲉⲙⲡⲉϫⲡⲟⲛⲓ ⲏⲥⲟⲩⲥⲡⲉⲭⲣⲓ ⲥⲧⲟⲥⲡϣⲏⲣⲉⲛⲇⲁⲩⲉⲓ ⲇⲡϣⲏⲣⲉⲛⲁⲃⲣⲁϩⲁⲙ.

My code so far:

#coding: utf-8

import requests
from bs4 import BeautifulSoup
import signal
import sys
import os.path

signal.signal(signal.SIGINT, lambda x, y: sys.exit(0))

if len(sys.argv) != 4:
    print("Usage: %s <book name> <first chapter> <last chapter>" % os.path.basename(__file__))
    sys.exit(1)

book_name = sys.argv[1]
start = int(sys.argv[2])
stop = int(sys.argv[3])

for chapter in range(start, stop + 1):
    url = f"https://data.copticscriptorium.org/texts/new-testament/{book_name}_{chapter}/sahidica"
    print(f"[{chapter}/{stop}] {url}")

    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        content_list = soup.find_all("span", class_="norm")

        text = []
        for element in content_list:
            text.append(element.get_text())
        text = ''.join(text).strip()

        # Append to the chapter file; "with" closes it even if writing fails
        with open(f"./{book_name}_{chapter}.txt", "a", encoding="utf-8") as out_file:
            out_file.write("%s\n" % text)
    except requests.RequestException as exc:
        print(f"Error: {exc}")

P.S. The language is old Coptic.

EDIT:

I think the problem is that the spacing and the verse numbers are produced by CSS. Can I somehow apply the CSS styles with BeautifulSoup?

.word{ white-space: inherit; }
.word:after{content: " ";}
div.verse{display: block; padding-top: 6px; padding-bottom: 6px; text-indent: -15px; padding-left: 15px; }
div.verse:before{content: attr(verse)": "; font-weight:bold}
.norm a{text-decoration: none !important; color:inherit}
.norm a:hover{text-decoration: underline !important; color: blue}

EDIT:

It seems that content_list = soup.find_all("span", class_="word") gives the desired text, but I still can't output the verse number.


Solution

  • Found the answer myself. I first have to select every div with class verse, then iterate over those and, inside each one, collect the text of the span tags with class word.

    from bs4 import BeautifulSoup
    import requests
    
    r = requests.get("https://data.copticscriptorium.org/texts/new-testament/40_matthew_1/sahidica")
    soup = BeautifulSoup(r.content, "html.parser")
    # Each verse is a <div class="verse" verse="N">
    select_verse = soup.find_all("div", class_="verse")

    for verse in select_verse:
        # Collect the text of every word span inside this verse
        content = verse.find_all("span", class_="word")
        text = [element.get_text() for element in content]
        text = ' '.join(text).strip()
        # The verse number is stored in the div's "verse" attribute
        print(f"{verse.get('verse')}: {text}")


    Output:

    1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ .
    2: ⲁⲃⲣⲁϩⲁⲙ ⲁϥϫⲡⲟ ⲛⲓⲥⲁⲁⲕ ⲓⲥⲁⲁⲕ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲁⲕⲱⲃ ⲓⲁⲕⲱⲃ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲟⲩⲇⲁⲥ ⲙⲛⲛⲉϥⲥⲛⲏⲩ .
    3: ⲓⲟⲩⲇⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲙⲫⲁⲣⲉⲥ ⲙⲛⲍⲁⲣⲁ ⲉⲃⲟⲗ ϩⲛⲑⲁⲙⲁⲣ ⲫⲁⲣⲉⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲥⲣⲱⲙ . ⲉⲥⲣⲱⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲣⲁⲙ .
    4: ⲁⲣⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲙⲓⲛⲁⲇⲁⲃ . ⲁⲙⲓⲛⲁⲇⲁⲃ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲛⲁⲁⲥⲥⲱⲛ ⲛⲁⲁⲥⲥⲱⲛ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲥⲁⲗⲙⲱⲛ .
    5: ⲥⲁⲗⲙⲱⲛ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲃⲟⲉⲥ ⲉⲃⲟⲗ ϩⲛϩⲣⲁⲭⲁⲃ . ⲃⲟⲉⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲃⲏⲇ ⲉⲃⲟⲗ ϩⲛϩⲣⲟⲩⲑ . ⲓⲱⲃⲏⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲉⲥⲥⲁⲓ .
    6: ⲓⲉⲥⲥⲁⲓ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲇⲁⲩⲉⲓⲇ ⲡⲣⲣⲟ . ⲇⲁⲩⲉⲓⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲥⲟⲗⲟⲙⲱⲛ ⲉⲃⲟⲗ ϩⲛⲧϩⲓⲙⲉ ⲛⲟⲩⲣⲓⲁⲥ .
    7: ⲥⲟⲗⲟⲙⲱⲛ ⲇⲉ ⲁϥϫⲡⲟ ⲛϩⲣⲟⲃⲟⲁⲙ ϩⲣⲟⲃⲟⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲃⲓⲁ ⲁⲃⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲥⲁⲫ .
    8: ⲁⲥⲁⲫ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲥⲁⲫⲁⲧ ⲓⲱⲥⲁⲫⲁⲧ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲣⲁⲙ ⲓⲱⲣⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲟⲍⲉⲓⲁⲥ .
    9: ⲟⲍⲉⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲁⲑⲁⲙ . ⲓⲱⲛⲁⲑⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲭⲁⲍ ⲁⲭⲁⲍ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲍⲉⲕⲉⲓⲁⲥ .
    10: ⲉⲍⲉⲕⲉⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲙⲙⲁⲛⲁⲥⲥⲏ ⲙⲁⲛⲁⲥⲥⲏ ⲇⲉ ⲁϥϫⲡⲟ ⲛϩⲁⲙⲱⲥ . ϩⲁⲙⲱⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲥⲓⲁⲥ .
    11: ⲓⲱⲥⲓⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲉⲭⲟⲛⲓⲁⲥ ⲙⲛⲛⲉϥⲥⲛⲏⲩ ϩⲓⲡⲡⲱⲱⲛⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ .
    12: ⲙⲛⲛⲥⲁⲡⲡⲱⲱⲛⲉ ⲇⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ ⲓⲉⲭⲟⲛⲓⲁⲥ ⲁϥϫⲡⲟ ⲛⲥⲁⲗⲁⲑⲓⲏⲗ ⲥⲁⲗⲁⲑⲓⲏⲗ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲍⲟⲣⲟⲃⲁⲃⲉⲗ .
    13: ⲍⲟⲣⲟⲃⲁⲃⲉⲗ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲃⲓⲟⲩⲇ ⲁⲃⲓⲟⲩⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲗⲓⲁⲕⲓⲙ . ⲉⲗⲓⲁⲕⲓⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲍⲱⲣⲁ .
    14: ⲁⲍⲱⲣⲁⲥ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲥⲁⲇⲱⲕ ⲥⲁⲇⲱⲕ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲁⲭⲉⲓⲙ ⲁⲭⲉⲓⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲗⲓⲟⲩⲇ .
    15: ⲉⲗⲓⲟⲩⲇ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲉⲗⲉⲁⲍⲁⲣ ⲉⲗⲉⲁⲍⲁⲣ ⲇⲉ ⲁϥϫⲡⲟ ⲙⲙⲁⲧⲑⲁⲙ ⲙⲁⲧⲑⲁⲙ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲁⲕⲱⲃ .
    16: ⲓⲁⲕⲱⲃ ⲇⲉ ⲁϥϫⲡⲟ ⲛⲓⲱⲥⲏⲫ ⲡϩⲁⲓ ⲙⲙⲁⲣⲓⲁ . ⲧⲁⲓ ⲛⲧⲁⲩϫⲡⲉ ⲓⲏⲥⲟⲩⲥ ⲉⲃⲟⲗ ⲛϩⲏⲧⲥ . ⲡⲁⲓ ⲛϣⲁⲩⲙⲟⲩⲧⲉ ⲉⲣⲟϥ ϫⲉⲡⲉⲭⲣⲓⲥⲧⲟⲥ .
    17: ⲅⲉⲛⲉⲁ ϭⲉ ⲛⲓⲙ ϫⲓⲛⲁⲃⲣⲁϩⲁⲙ ϣⲁⲉϩⲣⲁⲓ ⲉⲇⲁⲩⲉⲓⲇ ⲙⲛⲧⲁϥⲧⲉ ⲛⲅⲉⲛⲉⲁ . ⲁⲩⲱ ϫⲓⲛⲇⲁⲩⲉⲓⲇ ϣⲁⲉϩⲣⲁⲓ ⲉⲡⲡⲱⲱⲛⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ ⲙⲛⲧⲁϥⲧⲉ ⲛⲅⲉⲛⲉⲁ . ⲁⲩⲱ ϫⲓⲛⲉⲡⲡⲱⲱⲛⲉ ⲉⲃⲟⲗ ⲛⲧⲃⲁⲃⲩⲗⲱⲛ ϣⲁⲉϩⲣⲁⲓ ⲉⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲙⲛⲧⲁϥⲧⲉ ⲛⲅⲉⲛⲉⲁ .
    18: ⲡⲉϫⲡⲟ ⲇⲉ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲛⲉⲩⲧⲉⲓϩⲉ ⲡⲉ ⲛⲧⲉⲣⲟⲩϣⲡ ⲧⲟⲟⲧⲥ ⲛⲧⲉϥⲙⲁⲁⲩ ⲙⲁⲣⲓⲁ ⲛⲓⲱⲥⲏⲫ ⲉⲙⲡⲁⲧⲟⲩⲃⲱⲕ ⲉϩⲟⲩⲛ ϣⲁⲛⲉⲩⲉⲣⲏⲩ ⲁⲩϩⲉ ⲉⲣⲟⲥ ⲉⲥⲉⲉⲧ ⲉⲃⲟⲗ ϩⲛⲟⲩⲡⲛⲉⲩⲙⲁ ⲉϥⲟⲩⲁⲁⲃ .
    19: ⲓⲱⲥⲏⲫ ⲇⲉ ⲡⲉⲥϩⲁⲓ ⲉⲛⲉⲩⲇⲓⲕⲁⲓⲟⲥ ⲡⲉ . ⲁⲩⲱ ⲛⲉϥⲟⲩⲱϣ ⲁⲛ ⲉϯ ⲙⲡⲉⲥⲥⲟⲉⲓⲧ ⲁϥⲟⲩⲱϣ ⲉⲛⲟϫⲥ ⲉⲃⲟⲗ ⲛϫⲓⲟⲩⲉ .
    20: ⲛⲁⲓ ⲇⲉ ⲛⲧⲉⲣⲉϥⲙⲉⲉⲩⲉ ⲉⲣⲟⲟⲩ ⲉⲓⲥ ⲡⲁⲅⲅⲉⲗⲟⲥ ⲙⲡϫⲟⲉⲓⲥ ⲁϥⲟⲩⲱⲛϩ ⲛⲁϥ ⲉⲃⲟⲗ ϩⲛⲟⲩⲣⲁⲥⲟⲩ ⲉϥϫⲱ ⲙⲙⲟⲥ ϫⲉⲓⲱⲥⲏⲫ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲙⲡⲣⲣⲟ ϩⲟⲧⲉ ⲉϫⲓ ⲙⲙⲁⲣⲓⲁ ⲧⲉⲕⲥϩⲓⲙⲉ . ⲡⲉⲧⲟⲩⲛⲁϫⲡⲟϥ ⲅⲁⲣ ⲉⲃⲟⲗ ⲛϩⲏⲧⲥ ⲟⲩⲉⲃⲟⲗ ϩⲛⲟⲩⲡⲛⲉⲩⲙⲁ ⲉϥⲟⲩⲁⲁⲃ ⲡⲉ .
    21: ⲥⲛⲁϫⲡⲟ ⲇⲉ ⲛⲟⲩϣⲏⲣⲉ . ⲛⲅⲙⲟⲩⲧⲉ ⲉⲡⲉϥⲣⲁⲛ ϫⲉⲓⲏⲥⲟⲩⲥ . ⲛⲧⲟϥ ⲅⲁⲣ ⲡⲉⲧⲛⲁⲧⲟⲩϫⲟ ⲙⲡⲉϥⲗⲁⲟⲥ ⲉⲃⲟⲗ ϩⲛⲛⲉⲩⲛⲟⲃⲉ .
    22: ⲡⲁⲓ ⲇⲉ ⲧⲏⲣϥ ⲛⲧⲁϥϣⲱⲡⲉ ϫⲉⲕⲁⲁⲥ ⲉϥⲉϫⲱⲕ ⲉⲃⲟⲗ ⲛϭⲓⲡⲉⲛⲧⲁⲡϫⲟⲉⲓⲥ ϫⲟⲟϥ ϩⲓⲧⲙⲡⲉⲡⲣⲟⲫⲏⲧⲏⲥ ⲉϥϫⲱ ⲙⲙⲟⲥ .
    23: ϫⲉⲉⲓⲥⲧⲡⲁⲣⲑⲉⲛⲟⲥ ⲛⲁⲱ ⲛⲥϫⲡⲟ ⲛⲟⲩϣⲏⲣⲉ ⲛⲥⲉⲙⲟⲩⲧⲉ ⲉⲡⲉϥⲣⲁⲛ ϫⲉⲉⲙⲙⲁⲛⲟⲩⲏⲗ ⲉⲧⲉⲡⲁⲓ ⲡⲉ ⲛϣⲁⲩⲟⲩⲁϩⲙⲉϥ ϫⲉⲡⲛⲟⲩⲧⲉ ⲛⲙⲙⲁⲛ .
    24: ⲁϥⲧⲱⲟⲩⲛ ⲇⲉ ⲛϭⲓⲓⲱⲥⲏⲫ ⲉϥⲛⲕⲟⲧⲕ ⲁϥⲉⲓⲣⲉ ⲕⲁⲧⲁⲧϩⲉ ⲛⲧⲁϥϩⲱⲛ ⲉⲧⲟⲟⲧϥ ⲛϭⲓⲡⲁⲅⲅⲉⲗⲟⲥ ⲙⲡϫⲟⲉⲓⲥ . ⲁϥϫⲓ ⲙⲙⲁⲣⲓⲁ ⲧⲉϥⲥϩⲓⲙⲉ .
    25: ⲙⲡⲉϥⲥⲟⲩⲱⲛⲥ ϣⲁⲛⲧⲉⲥϫⲡⲟ ⲙⲡⲉⲥϣⲏⲣⲉ . ⲁϥⲙⲟⲩⲧⲉ ⲉⲡⲉϥⲣⲁⲛ ϫⲉⲓⲏⲥⲟⲩⲥ .
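The output above still has a space before each full stop, because the page wraps punctuation in its own span with class word, while the desired output in the question has none. Below is a minimal sketch of one way to close that gap. It runs offline on a shortened copy of the markup from the question, so `SAMPLE_HTML` and `format_verses` are names made up for this illustration, and the set of punctuation marks in the regular expression is an assumption that may need extending for other chapters. BeautifulSoup only parses the markup and does not apply stylesheets, so the two CSS rules that generate the spaces and the verse prefix are reproduced by hand.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the markup shown in the question (links stripped down).
SAMPLE_HTML = """
<div class="htmlvis"><div class="verse" verse="1"><span class="word"><span class="norm"><a>ⲡ</a></span><!--
--><span class="norm"><a>ϫⲱⲱⲙⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a>ⲛ</a></span><!--
--><span class="norm"><a>ⲁⲃⲣⲁϩⲁⲙ</a></span></span><!--
--><span class="word"><span class="norm"><a>.</a></span></span></div></div>
"""


def format_verses(html):
    """Return one 'number: text' line per div.verse found in html."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for verse in soup.find_all("div", class_="verse"):
        # Stand-in for .word:after{content: " "}: one space between word spans.
        words = [w.get_text(strip=True)
                 for w in verse.find_all("span", class_="word")]
        text = " ".join(words)
        # Assumption: these marks should attach to the preceding word.
        text = re.sub(r"\s+([.,;:!?])", r"\1", text)
        # Stand-in for div.verse:before{content: attr(verse)": "}.
        lines.append(f"{verse.get('verse')}: {text}")
    return lines


if __name__ == "__main__":
    for line in format_verses(SAMPLE_HTML):
        print(line)
```

To use it on the live page, pass `r.content` from the request above to `format_verses` in place of `SAMPLE_HTML`.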