python html pandas jupyter

Combine multiple HTML files into one HTML file using Python


I'm working in Jupyter and need to combine (merge) multiple HTML files into a single HTML file.

Any ideas how?

I did this with Excel files, but the same approach doesn't work for HTML files:

import os
import pandas as pd

data_folder = 'C:\\Users\\hhhh\\Desktop\\test'


df = []
for file in os.listdir(data_folder):
    if file.endswith('.xlsx'):
        print('Loading file {0}...'.format(file))
        df.append(pd.read_excel(os.path.join(data_folder , file), sheet_name='sheet1'))

Solution

  • Sounds like a task for Beautiful Soup.

    I assume you want everything inside the <body> tag of each HTML document, combined into one document.

    Maybe something like:

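    import os
    from bs4 import BeautifulSoup

    # data_folder is the folder path from the question.
    data_folder = 'C:\\Users\\hhhh\\Desktop\\test'

    # Start from an empty document and give it an <html><body> skeleton.
    output_doc = BeautifulSoup("", "html.parser")
    output_doc.append(output_doc.new_tag("html"))
    output_doc.html.append(output_doc.new_tag("body"))

    for file in os.listdir(data_folder):
        if not file.lower().endswith('.html'):
            continue

        # os.listdir returns bare file names, so join them with the folder.
        with open(os.path.join(data_folder, file), 'r') as html_file:
            body = BeautifulSoup(html_file.read(), "html.parser").body

        # Skip files without a <body>; copy the child list before moving nodes.
        if body is not None:
            output_doc.body.extend(list(body.contents))

    print(output_doc.prettify())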
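
The snippet above only prints the merged markup. To end up with an actual single HTML file, the same idea can be wrapped in a function that writes the result to disk. This is a minimal sketch, not a definitive implementation: `combine_html_files`, `folder` and `output_path` are made-up names for illustration, and it assumes that keeping only the `<body>` contents (and dropping each file's `<head>`) is acceptable.

```python
import os

from bs4 import BeautifulSoup


def combine_html_files(folder, output_path):
    """Merge the <body> contents of every .html file in folder into one file.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    combined = BeautifulSoup(
        "<html><head></head><body></body></html>", "html.parser"
    )

    # Sort the names so the merge order is predictable across runs.
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".html"):
            continue
        path = os.path.join(folder, name)
        # Skip the output file itself if it lives in the same folder.
        if os.path.abspath(path) == os.path.abspath(output_path):
            continue
        # Reading bytes lets Beautiful Soup work out the encoding itself.
        with open(path, "rb") as handle:
            soup = BeautifulSoup(handle.read(), "html.parser")
        # Fall back to the whole document for fragments with no <body>.
        source = soup.body if soup.body is not None else soup
        # Copy the child list first, because extend() moves nodes out of source.
        combined.body.extend(list(source.contents))

    with open(output_path, "w", encoding="utf-8") as out:
        out.write(str(combined))
    return output_path
```

Calling `combine_html_files(data_folder, os.path.join(data_folder, 'combined.html'))` would then write one file next to the inputs. Styles, scripts and titles that live in each file's `<head>` are not carried over, so pages that depend on them may look different after merging.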