python html python-docx pypandoc

I want to replace ‘ and ’ with ' in HTML using Python. I tried multiple ways but failed


I wrote code that converts a Word document to HTML using pypandoc, because I also want to keep the images. The problem is that my docx file contains the characters ‘ and ’, which turn into something different in the HTML when it is sent as a mail body. I want ‘ and ’ to be replaced with ', a normal apostrophe.

The attached images show the difference.

[Image: source]

[Image: expected result]

I tried a few approaches, shown in the code below. The commented-out lines are earlier attempts that also failed.

# Read the HTML file
with open(html_file, 'r') as file:
    html_data = file.read()
            
    # Replace all occurrences of ‘ and ’ with '
    # print("called")
    html_data = re.sub("‘", "'", html_data)
    html_data = re.sub("’", "'", html_data)
    # html_data = re.sub(r'’', "'", html_data)
    # html_data =  re.sub(r'‘', "'", html_data)
    # html_data = re.sub(r'“', '"', html_data)
    # html_data = re.sub(r'”', '"', html_data)
    # html_data = html_data.replace("‘", "'")
    # html_data = html_data.replace("’", "'")
    # html_data = html_data.replace('“', "'")
    # html_data = html_data.replace("”", "'")

For example, my Word document contains the phrase "i’d like to", which should be converted to "i'd like to".


Solution

    import re

    # Read the HTML file
    with open(html_file, 'r') as file:
        html_data = file.read()

    # Replace ‘ and ’ with ', both as literal characters
    # and as the HTML entities &lsquo; and &rsquo;
    html_data = re.sub("‘", "'", html_data)
    html_data = re.sub("’", "'", html_data)
    html_data = re.sub("&lsquo;", "'", html_data)
    html_data = re.sub("&rsquo;", "'", html_data)
    

    Try this; it works. In HTML, ‘ is sometimes written as the entity &lsquo; and ’ as &rsquo;, so code that only looks for the literal characters does not replace them.
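
The same idea can be made a little more general. The sketch below is not part of the original answer: it assumes the quotes may appear as literal characters, as named entities, or as decimal or hexadecimal character references, and it rewrites only those that decode to ‘ or ’. The names `normalize_apostrophes`, `CURLY_TO_PLAIN`, and `ENTITY_RE` are made up for illustration.

```python
import html
import re

# Curly single quotes mapped to the plain ASCII apostrophe.
CURLY_TO_PLAIN = {"\u2018": "'", "\u2019": "'"}

# Matches any named, decimal, or hexadecimal character reference.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _replace_entity(match):
    """Return "'" for references that decode to a curly single quote."""
    decoded = html.unescape(match.group(0))
    # Leave every other reference (such as &amp; or &lt;) untouched,
    # so the markup itself is not altered.
    return CURLY_TO_PLAIN.get(decoded, match.group(0))


def normalize_apostrophes(html_data):
    """Replace ‘ and ’ with ', whether literal or written as entities."""
    html_data = ENTITY_RE.sub(_replace_entity, html_data)
    return html_data.translate(str.maketrans(CURLY_TO_PLAIN))
```

To apply it to the file from the question, read the file into `html_data` as before and call `normalize_apostrophes(html_data)`. Passing `encoding="utf-8"` to `open()` may also help, assuming the HTML was written as UTF-8 (pandoc's default output encoding), because the platform default encoding can otherwise garble the curly quotes before any replacement runs.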