
Python exception thrown by libtidy is amusingly impossible to catch


I am trying to use the tidy_document() function from tidylib to convert an HTML document to XHTML before posting it somewhere. A couple of steps up the stack, an exception is thrown. The call is wrapped in a try...except block with several increasingly generic except clauses, but the exception gets past all of them, and none of the code in the except bodies is executed.

The offending code:

from tidylib import tidy_document

...

try:
    xhtmlDoc, errors = tidy_document(htmlContent)
except UnicodeDecodeError as ude:
    print("Caught the exception")
except UnicodeError as ue:
    print("Caught the exception")
except Exception as ex:
    print("Caught the exception")
except:
    print("Caught the exception")

It doesn't matter whether htmlContent is passed as a str or as UTF-8 encoded bytes.

The resulting stack trace follows:

  File "_ctypes/callbacks.c", line 232, in 'calling callback function'
  File "/home/legend855/anaconda3/lib/python3.7/site-packages/tidylib/sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 232, in 'calling callback function'
  File "/home/legend855/anaconda3/lib/python3.7/site-packages/tidylib/sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 232, in 'calling callback function'
  File "/home/legend855/anaconda3/lib/python3.7/site-packages/tidylib/sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte

Wrapping the offending line in sink.py in a try...except solves the issue, but as I understand it, that shouldn't be the library's job. The client (my code) should be able to handle the exception as it sees fit, and I do not understand why it can't. None of the print statements in my except bodies is ever executed.

P.S. In my real code I return a false value to the calling function to remove the record from further processing; I've reduced the code here to the bare minimum required to reproduce the error.

The HTML snippet below is what's passed as htmlContent (either as str or as bytes) and triggers the exception.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" lang="ja" xml:lang="ja">

<head>
  <meta http-equiv="X-UA-Compatible" content="IE=8 ; IE=9" />
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Language" content="ja" />
  <meta name="viewport" content="width=1024, maximum-scale=1.0, user-scalable=0">
  <meta property="og:title" content="TECHNOLOGY MAKES HAPPINESS(テクノロジー メイクス ハピネス)- 先端地図技術が創るスマートライフ -|ゼンリン" />
  <meta property="og:type" content="article" />
  <meta property="og:description" content="ゼンリンが地図を制作する過程で培われた技術をアニメーションや解説を用いて紹介する特設サイトです。" />
  <meta property="og:url" content="http://www.zenrin.co.jp/create/technology/index.html" />
  <meta property="og:image" content="http://www.zenrin.co.jp/create/technology/images/ogp_image.jpg" />
  <meta property="og:site_name" content="TECHNOLOGY MAKES HAPPINESS(テクノロジー メイクス ハピネス)- 先端地図技術が創るスマートライフ -|ゼンリン" />
  <meta property="og:locale" content="ja_JP" />
  <meta property="fb:app_id" content="248887565152095" />

  <meta property="title" content="TECHNOLOGY MAKES HAPPINESS 先端地図技術が創るスマートライフ - ゼンリン" />
  <meta property="description" content="ビッグデータの世界を拓くゼンリンの先端技術で実現する“しあわせ”をご紹介します。" />
  <meta property="keywords" content="地図,住宅地図,カーナビソフト,GIS,ゼンリン,zenrin,map,地図ソフト,デジタルマップ" />

  <title>TECHNOLOGY MAKES HAPPINESS 先端地図技術が創るスマートライフ - ゼンリン</title>
  <link rel="stylesheet" type="text/css" href="common/css/common.css">
  <script type="text/javascript" src="common/js/jquery-1.9.1.min.js"></script>
  <script type="text/javascript" src="common/js/lib.js"></script>
  <script type="text/javascript" src="common/js/zenrin.js"></script>
</head>

<body style="overflow:hidden;">
  <noscript>
	<div class="noscript">
	<p>現在JavaScriptがOFFに設定されています。ゼンリンのすべての機能を使用するためには、JavaScriptの設定をONに変更してください。</p>
	</div>
</noscript>

  <div id="preloaderWrp">
    <p id="preloader">
      <img src="common/img/splash.gif" width="558" height="45">
      <img src="common/img/animation/preloader.gif" height="32" width="32" class="spinner">
    </p>
  </div>
  <script type="text/javascript">
    PreLoader.init();
  </script>
  <div id="spec_lightbox" class="lb_fit">
    <div class="inner lb_fit">
      <div class="modal_window">
        <p>
          <img src="common/img/spec_img.gif" alt="ご利用環境について" />
          <a class="closebtn" href="#">閉じる</a>
        </p>
      </div>
    </div>
  </div>
  <div id="light_box">
    <div class="inner">
      <div id="lb_bg"></div>
      <div id="modal_window">
        <div class="inner">
          <div id="spec_area">
            <img src="common/img/space.gif" id="info_spec" />
          </div>
          <div id="aniamtion_area">
            <img src="common/img/space.gif" id="info_anima" />
            <div class="preloader">
              <img src="common/img/animation/preloader.gif" height="32" width="32">
            </div>
          </div>
          <div id="last_area">
            <div id="net1_title">
              <img src="common/img/navi/happiness1.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 歩行者ネットワークが実現するしあわせ" />
            </div>
            <div id="net2_title">
              <img src="common/img/navi/happiness2.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 自動車ネットワークが実現するしあわせ" />
            </div>
            <div id="net3_title">
              <img src="common/img/navi/happiness3.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 付随情報が実現するしあわせ" />
            </div>
            <div id="lib1_title">
              <img src="common/img/navi/happiness4.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 高精度到着地点情報が実現するしあわせ" />
            </div>
            <div id="lib2_title">
              <img src="common/img/navi/happiness5.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 注記情報が実現するしあわせ" />
            </div>
            <div id="lib3_title">
              <img src="common/img/navi/happiness6.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 施設内・地下情報が実現するしあわせ" />
            </div>
            <div id="lib4_title">
              <img src="common/img/navi/happiness7.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 3次元コンテンツが実現するしあわせ" />
            </div>
            <div id="map1_title">
              <img src="common/img/navi/happiness8.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 地図データ提供技術が実現するしあわせ" />
            </div>
            <div id="mak1_title">
              <img src="common/img/navi/happiness15.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS マーケティング支援が実現するしあわせ" />
            </div>
            <div id="route_title">
              <img src="common/img/navi/happiness10.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 最適ルート案内を実現する技術" />
            </div>
            <div id="adas_title">
              <img src="common/img/navi/happiness11.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 自動車の安全運転支援を実現する技術" />
            </div>
            <div id="multi_title">
              <img src="common/img/navi/happiness12.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS ドアtoドアの誘導を実現する技術" />
            </div>
            <div id="hazard_title">
              <img src="common/img/navi/happiness13.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 事故・災害時の活用を実現する技術" />
            </div>
            <div id="area_title">
              <img src="common/img/navi/happiness14.png" height="70" width="690" alt="TECHNOLOGY MAKES HAPPINESS 営業活動支援を実現する技術" />
            </div>
            <ul>
              <li id="net1Btn">
                <a href="#network1_lightBox" class="trk_last_network1">
                  <img src="common/img/navi/btn1.jpg" height="300" width="340" alt="歩行者ネットワーク" />
                </a>
              </li>
              <li id="net2Btn">
                <a href="#network2_lightBox" class="trk_last_network2">
                  <img src="common/img/navi/btn2.jpg" height="300" width="340" alt="自動車ネットワーク" />
                </a>
              </li>
              <li id="net3Btn">
                <a href="#network3_lightBox" class="trk_last_network3">
                  <img src="common/img/navi/btn3.jpg" height="300" width="340" alt="付随情報" />
                </a>
              </li>
              <li id="lib1Btn">
                <a href="#lib1_lightBox" class="trk_last_lib1">
                  <img src="common/img/navi/btn4.jpg" height="300" width="340" alt="高精度到着地点情報" />
                </a>
              </li>
              <li id="lib2Btn">
                <a href="#lib3_lightBox" class="trk_last_lib3">
                  <img src="common/img/navi/btn5.jpg" height="300" width="340" alt="施設内・地下情報" />
                </a>
              </li>
              <li id="lib3Btn">
                <a href="#lib2_lightBox" class="trk_last_lib2">
                  <img src="common/img/navi/btn6.jpg" height="300" width="340" alt="注記情報" />
                </a>
              </li>
              <li id="map1Btn">
                <a href="#map1_lightBox" class="trk_last_map1">
                  <img src="common/img/navi/btn7.jpg" height="300" width="340" alt="地図データ提供技術" />
                </a>
              </li>
              <li id="mak1Btn">
                <a href="#mak1_lightBox" class="trk_last_mak1">
                  <img src="common/img/navi/btn8.jpg" height="300" width="340" alt="マーケティング支援" />
                </a>
              </li>
              <li id="routeBtn">
                <a href="#route_lightBox" class="trk_last_route">
                  <img src="common/img/navi/btn21.jpg" height="300" width="340" alt="最適ルート案内" />
                </a>
              </li>
              <li id="adasBtn">
                <a href="#adas_lightBox" class="trk_last_adas">
                  <img src="common/img/navi/btn22.jpg" height="300" width="340" alt="自動車の安全運転支援" />
                </a>
              </li>
              <li id="multiBtn">
                <a href="#multi_lightBox" class="trk_last_multi">
                  <img src="common/img/navi/btn23.jpg" height="300" width="340" alt="ドアtoドアの誘導" />
                </a>
              </li>
              <li id="hazardBtn">
                <a href="#hazard_lightBox" class="trk_last_hazard">
                  <img src="common/img/navi/btn24.jpg" height="300" width="340" alt="災害時の活用" />
                </a>
              </li>
              <li id="areaBtn">
                <a href="#area_lightBox" class="trk_last_area">
                  <img src="common/img/navi/btn25.jpg" height="300" width="340" alt="営業活動支援" />
                </a>
              </li>
              <li id="modal_close_Btn">
                <a href="#modal_close">
                  <img src="common/img/modal_close_btn.png" height="132" width="122">
                </a>
              </li>
            </ul>
          </div>
          <div id="trigger_area">
            <div class="trigger_inner">
              <div id="info_txt_wrp">
                <table cellpadding="0" cellspacing="0" width="730" height="150">
                  <tr>
                    <td id="info_txt"></td>
                  </tr>
                </table>
              </div>
              <div id="more_trigger">
                <a href="#" class="trk_more">
                  <div></div>
                </a>
              </div>
            </div>
          </div>
        </div>
      </div>
    </div>
  </div>


  <div id="wrapper">
    <div id="map">
      <img src="common/img/bg.jpg" alt="" id="defaultmap" />
      <img src="common/img/map/map1.jpg" alt="" id="map1" />
      <!-- <img src="common/img/map/map1.jpg" alt="" id="map1" /> -->
      <img src="common/img/map/target.png" height="131" width="226" id="target" />
      <img src="common/img/map/target.png" height="131" width="226" id="target2" />
    </div>
    <div id="slide_bg" class="clear">
      <div id="slide_content_area">
        <div id="slide1">
          <ul id="slide1_inner">
            <li class="li1">
              <a href="#skil1" class="trk_skil1">
                <img src="common/img/navi/navi1_off.jpg" height="230" width="230" alt="マーケティング支援">
              </a>
            </li>
            <li class="li2">
              <a href="#skil2" class="trk_skil2">
                <img src="common/img/navi/navi2_off.jpg" height="230" width="230" alt="ネットワーク情報">
              </a>
            </li>
            <li class="li3">
              <a href="#skil3" class="trk_skil3">
                <img src="common/img/navi/navi3_off.jpg" height="230" width="230" alt="高精度情報ライブラリ">
              </a>
            </li>
            <li class="li4">
              <a href="#skil4" class="trk_skil4">
                <img src="common/img/navi/navi4_off.jpg" height="230" width="230" alt="地図データ提供技術">
              </a>
            </li>
            <li class="li5">
              <a href="#skil1" class="trk_skil1">
                <img src="common/img/navi/navi1_off.jpg" height="230" width="230" alt="マーケティング支援">
              </a>
            </li>
            <li class="li6">
              <a href="#skil2" class="trk_skil2">
                <img src="common/img/navi/navi2_off.jpg" height="230" width="230" alt="ネットワーク情報">
              </a>
            </li>
            <li class="li7">
              <a href="#skil3" class="trk_skil3">
                <img src="common/img/navi/navi3_off.jpg" height="230" width="230" alt="高精度情報ライブラリ">
              </a>
            </li>
            <li class="li8">
              <a href="#skil4" class="trk_skil4">
                <img src="common/img/navi/navi4_off.jpg" height="230" width="230" alt="地図データ提供技術">
              </a>
            </li>
            <li class="li9">
              <a href="#skil1" class="trk_skil1">
                <img src="common/img/navi/navi1_off.jpg" height="230" width="230" alt="マーケティング支援">
              </a>
            </li>
          </ul>
        </div>


        <div id="slide2">
          <ul id="slide2_inner">
            <li class="li1">
              <a href="#route_lightBox" class="trk_route">
                <img src="common/img/navi/navi5_off.jpg" height="230" width="230" alt="Route Support 雨にぬれなくて階段がすくない行き方はないかな・・・">
              </a>
            </li>
            <li class="li2">
              <a href="#adas_lightBox" class="trk_adas">
                <img src="common/img/navi/navi6_off.jpg" height="230" width="230" alt="ADAS もしも、の時も心に余裕のある運転がしたいな">
              </a>
            </li>
            <li class="li3">
              <a href="#multi_lightBox" class="trk_multi">
                <img src="common/img/navi/navi7_off.jpg" height="230" width="230" alt="Multi Modal 車を降りてから目的地までの歩行経路が分からなくて困るな・・・">
              </a>
            </li>
            <li class="li4">
              <a href="#hazard_lightBox" class="trk_hazard">
                <img src="common/img/navi/navi8_off.jpg" height="230" width="230" alt="Hazard Database 事故や災害の時に警察や消防がすぐに駆けつけてくれるのはなぜだろう?">
              </a>
            </li>
            <li class="li5">
              <a href="#area_lightBox" class="trk_area">
                <img src="common/img/navi/navi9_off.jpg" height="230" width="230" alt="Business Support この商品が売れそうな60代女性が住む地域はどこかしら?">
              </a>
            </li>
            <li class="li6">
              <a href="#route_lightBox" class="trk_route">
                <img src="common/img/navi/navi5_off.jpg" height="230" width="230" alt="Route Support 雨にぬれなくて階段がすくない行き方はないかな・・・">
              </a>
            </li>
            <li class="li7">
              <a href="#adas_lightBox" class="trk_adas">
                <img src="common/img/navi/navi6_off.jpg" height="230" width="230" alt="ADAS もしも、の時も心に余裕のある運転がしたいな">
              </a>
            </li>
            <li class="li8">
              <a href="#multi_lightBox" class="trk_multi">
                <img src="common/img/navi/navi7_off.jpg" height="230" width="230" alt="Multi Modal 車を降りてから目的地までの歩行経路が分からなくて困るな・・・">
              </a>
            </li>
            <li class="li9">
              <a href="#hazard_lightBox" class="trk_hazard">
                <img src="common/img/navi/navi8_off.jpg" height="230" width="230" alt="Hazard Database 事故や災害の時に警察や消防がすぐに駆けつけてくれるのはなぜだろう?">
              </a>
            </li>
            <li class="li10">
              <a href="#area_lightBox" class="trk_area">
                <img src="common/img/navi/navi9_off.jpg" height="230" width="230" alt="Business Support この商品が売れそうな60代女性が住む地域はどこかしら?">
              </a>
            </li>

          </ul>
        </div>
        <div id="title">
          <img src="common/img/title.png" height="104" width="554" alt="TECHNOLOGY MAKES HAPPINESS 先端地図技術が創るスマートライフ POWERD BY ZENRIN" />
        </div>
        <div id="slash1">
          <img src="common/img/slash01.png" height="230" width="585" alt="TECHNOLOGY ビッグデータの世界を拓くゼンリンの先端技術 ADVANCED TECHNOLOGIES AND DATA SOLUTIONS." />
        </div>
        <div id="slash3">
          <img src="common/img/slash03.png" height="230" width="230" alt="" />
        </div>
        <div id="slash5">
          <img src="common/img/slash05.png" height="126" width="356" alt="" />
        </div>

        <div id="slash2">
          <img src="common/img/slash02.png" height="230" width="232" alt="" />
        </div>
        <div id="slash6">
          <img src="common/img/slash06.png" height="126" width="358" alt="" />
        </div>
        <div id="slash4">
          <img src="common/img/slash04.png" height="230" width="587" alt="HAPPINESS ゼンリンの技術で実現するしあわせ MAP TECHNOLOGY REALIZES SMART LIFE." />
        </div>
      </div>


    </div>


    <div id="content_page">
      <div id="header">
        <div class="inner">
          <div class="backto">
            <a href="#" id="backto">
              <img src="common/img/back_btn_off.png" height="49" width="204">
            </a>
          </div>
          <div id="typ1">
            <img src="common/img/typ1_header.png" height="239" width="240">
          </div>
          <div id="typ2">
            <img src="common/img/typ2_header.png" height="239" width="240">
          </div>
        </div>
      </div>



      <div id="network_navi_area">
        <div class="menuClick">
          <img class="ipad_conv" src="common/img/left_menu_hover.gif" src_i="common/img/i_left_menu_hover.gif" height="78" width="70" alt="メニューをクリック" />
        </div>
        <div class="title">
          <img src="common/img/skil_title.png" height="15" width="210" alt="ゼンリンの技術1 ネットワーク情報" />
        </div>
        <ul>
          <li>
            <a href="#network1_lightBox" class="trk_network1"><img src="common/img/left_navi01_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#network2_lightBox" class="trk_network2"><img src="common/img/left_navi02_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#network3_lightBox" class="trk_network3"><img src="common/img/left_navi03_off.png" height="70" width="211"></a>
          </li>
        </ul>
        <div class="cover"></div>
      </div>

      <div id="lib_navi_area">
        <div class="menuClick">
          <img class="ipad_conv" src="common/img/left_menu_hover.gif" src_i="common/img/i_left_menu_hover.gif" height="78" width="70" alt="メニューをクリック" />
        </div>
        <div class="title">
          <img src="common/img/skil2_title.png" height="14" width="211" alt="ゼンリンの技術2 高精度情報ライブラリ" />
        </div>
        <ul>
          <li>
            <a href="#lib1_lightBox" class="trk_lib1"><img src="common/img/left_navi04_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#lib2_lightBox" class="trk_lib2"><img src="common/img/left_navi05_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#lib3_lightBox" class="trk_lib3"><img src="common/img/left_navi06_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#lib4_lightBox" class="trk_lib4"><img src="common/img/left_navi07_off.png" height="70" width="211"></a>
          </li>
        </ul>
        <div class="cover"></div>
      </div>

      <div id="map_navi_area">
        <div class="menuClick">
          <img class="ipad_conv" src="common/img/left_menu_hover.gif" src_i="common/img/i_left_menu_hover.gif" height="78" width="70" alt="メニューをクリック" />
        </div>
        <div class="title">
          <img src="common/img/skil3_title.png" height="14" width="211" alt="ゼンリンの技術3 地図データ提供技術" />
        </div>
        <ul>
          <li>
            <a href="#map1_lightBox" class="trk_map1"><img src="common/img/left_navi08_off.png" height="70" width="211"></a>
          </li>
        </ul>
        <div class="cover"></div>
      </div>

      <div id="mak_navi_area">
        <div class="menuClick">
          <img class="ipad_conv" src="common/img/left_menu_hover.gif" src_i="common/img/i_left_menu_hover.gif" height="78" width="70" alt="メニューをクリック" />
        </div>
        <div class="title">
          <img src="common/img/skil4_title.png" height="14" width="211" alt="ゼンリンの技術4 マーケティング支援" />
        </div>
        <ul>
          <li>
            <a href="#mak1_lightBox" class="trk_mak1"><img src="common/img/left_navi09_off.png" height="70" width="211"></a>
          </li>
        </ul>
        <div class="cover"></div>
      </div>

      <div id="right_navi_area">
        <div class="menuClick_right">
          <img class="ipad_conv" src="common/img/right_menu_hover.gif" src_i="common/img/i_right_menu_hover.gif" height="78" width="70" alt="メニューをクリック" />
        </div>
        <div class="title" style="text-align:right;">
          <img src="common/img/happy_title.png" height="14" width="212" alt="この技術が実現するしあわせ" />
        </div>
        <ul>
          <li>
            <a href="#route_lightBox" class="trk_rnavi_route"><img src="common/img/right_navi_01_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#adas_lightBox" class="trk_rnavi_adas"><img src="common/img/right_navi_02_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#multi_lightBox" class="trk_rnavi_multi"><img src="common/img/right_navi_03_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#hazard_lightBox" class="trk_rnavi_hazard"><img src="common/img/right_navi_04_off.png" height="70" width="211"></a>
          </li>
          <li>
            <a href="#area_lightBox" class="trk_rnavi_area"><img src="common/img/right_navi_05_off.png" height="70" width="211"></a>
          </li>
        </ul>
        <div class="cover"></div>
      </div>
    </div>

    <div id="footer_area">
      <div class="inner">
        <div class="copyright">
          <a id="footerlogo" class="trk_footerlogo" href="http://www.zenrin.co.jp/" target="_blank"><img src="common/img/copyright.png" height="29" width="318窶?" alt="ZENRIN Maps to the Future COPYRIGHT c ZENRIN CO., LTD. ALL RIGHT RESERVED."></a>
        </div>
        <div class="spec">
          <a class="trk_spec" href="#spec_lightbox"><img src="common/img/spec_btn.gif" height="11" width="96" alt="ご利用環境について"></a>
        </div>
        <div id="social_area">
          <ul class="clearfix">
            <li>
              <a class="trk_twitter" href="http://twitter.com/share?count=horizontal&original_referer=http://www.zenrin.co.jp/create/technology/&text=TECHNOLOGY%20MAKES%20HAPPINESS%20%E5%85%88%E7%AB%AF%E5%9C%B0%E5%9B%B3%E6%8A%80%E8%A1%93%E3%81%8C%E5%89%B5%E3%82%8B%E3%82%B9%E3%83%9E%E3%83%BC%E3%83%88%E3%83%A9%E3%82%A4%E3%83%95%E3%80%90%E3%82%BC%E3%83%B3%E3%83%AA%E3%83%B3%E3%80%91%0A&url=http://www.zenrin.co.jp/create/technology/"
                onclick="window.open(this.href, 'tweetwindow', 'width=550, height=450,personalbar=0,toolbar=0,scrollbars=1,resizable=1'); return false;"><img src="common/img/twitter.png" width="30" height="20" /></a>
            </li>
            <li>
              <a class="trk_facebook" href="http://www.facebook.com/share.php?u=http://www.zenrin.co.jp/create/technology/" onclick="window.open(this.href, 'FBwindow', 'width=650, height=450, menubar=no, toolbar=no, scrollbars=yes'); return false;"><img src="common/img/facebook.png" width="25" height="20" /></a>
            </li>
          </ul>
        </div>
      </div>
    </div>
  </div>
  <div id="footer2">
    <a href="http://www.zenrin.co.jp/" target="_blank"><img src="common/img/copyright2.gif" height="60" width="363" alt="ZENRIN Maps to the Future COPYRIGHT c ZENRIN ALL RIGHT RESERVED."></a>
  </div>

  <div style="display:none;">
    <!-- for display network -->
    <script type="text/javascript" language="javascript" src="//b92.yahoo.co.jp/js/s_retargeting.js"></script>
    <script type="text/javascript">
      /* <![CDATA[ */
      var yahoo_ss_retargeting_id = 1000387951;
      var yahoo_sstag_custom_params = window.yahoo_sstag_params;
      var yahoo_ss_retargeting = true;
      /* ]]> */
    </script>
    <!-- for sponsored search -->
    <script type="text/javascript" src="//s.yimg.jp/images/listing/tool/cv/conversion.js">
    </script>
    <noscript>
<div style="display:inline;">
<img height="1" width="1" style="border-style:none;" alt="" src="//b97.yahoo.co.jp/pagead/conversion/1000387951/?guid=ON&script=0&disvt=false"/>
</div>
</noscript>
  </div>

</body>

</html>


Solution

  • I managed to reproduce the problem on Windows (with the HTML snippet saved in a file). Below is the last code variant.

    code00.py:

    #!/usr/bin/env python
    
    import sys
    import os
    import threading
    
    os.environ["PATH"] += os.pathsep + os.path.abspath(os.path.dirname(__file__))  # Built tidy.dll in the cwd, this is needed for it to be found
    from tidylib import tidy_document
    
    
    def main(*argv):
        print("main - TID: {0:d}".format(threading.get_ident()))
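        # Read the raw bytes, then decode them with the encoding given on the command line (default: UTF-8)
        with open("content.html", mode="rb") as fin:
            raw_content = fin.read()
        enc = "utf-8" if len(sys.argv) < 2 else sys.argv[1]
        html_content = raw_content.decode(enc)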
        print(html_content.encode(enc) == raw_content)
        with open("content_utf8.html", "w", encoding=enc) as fout:
            fout.write(html_content)
        try:
            xhtml_doc, errors = tidy_document(html_content)
        except UnicodeDecodeError as ude:
            print("Caught the exception:", ude)
        except UnicodeError as ue:
            print("Caught the exception:", ue)
        except Exception as ex:
            print("Caught the exception:", ex)
        except:
            print("Caught an exception")
    
    
    if __name__ == "__main__":
        print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
        rc = main(*sys.argv[1:])
        print("\nDone.")
        sys.exit(rc)
    

    Output:

    [cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q059054833]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
    Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 64bit on win32
    
    main - TID: 9528
    True
    Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
    Traceback (most recent call last):
    File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
        write_func(byte.decode('utf-8'))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 0: unexpected end of data
    Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
    Traceback (most recent call last):
    File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
        write_func(byte.decode('utf-8'))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 0: invalid start byte
    Exception ignored on calling ctypes callback function: <function Sink.__init__.<locals>.put_byte at 0x000002144F596940>
    Traceback (most recent call last):
    File "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\lib\site-packages\tidylib\sink.py", line 79, in put_byte
        write_func(byte.decode('utf-8'))
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 0: invalid start byte
    
    Done.
    

    I tested (with a temporarily modified sink.py) whether put_byte and main run in the same thread, and they do. Then I looked more closely at the stack trace and figured it out:

    1. PyTidyLib calls some C code from the backend Tidy library (tidy.dll), via CTypes
    2. The (above) C code calls some Python code (Sink.put_byte), as a callback that was passed to it together with the arguments
    3. The (Python) code from the previous step raises an exception, but the underlying C code (that calls it) doesn't "know" how to pass it back to #1., as it has no Python "knowledge" whatsoever (so the exception "dies" there)

    That's why you couldn't catch it in Python.
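
The three steps above can be reproduced without Tidy at all. The sketch below is not PyTidyLib code: PUT_BYTE, failing_put_byte and make_recording_callback are names made up for illustration. It relies on the CPython ctypes behaviour that calling a callback object goes through its C thunk, so an exception raised inside is reported on stderr and dropped. It also shows one common workaround, which is to record the exception inside the callback and re-raise it once the C call has returned.

```python
import ctypes

# C signature: void put_byte(unsigned char). Hypothetical, only similar in spirit to a byte sink.
PUT_BYTE = ctypes.CFUNCTYPE(None, ctypes.c_ubyte)


def failing_put_byte(byte):
    # A lone 0xE7 is an incomplete UTF-8 sequence, so this raises UnicodeDecodeError.
    bytes([byte]).decode("utf-8")


def make_recording_callback(func, errors):
    """Wrap func so that exceptions are stored in `errors` instead of being lost."""
    def wrapper(byte):
        try:
            func(byte)
        except Exception as exc:
            errors.append(exc)
    return PUT_BYTE(wrapper)


# 1. Plain callback: ctypes reports the exception on stderr and drops it.
plain_callback = PUT_BYTE(failing_put_byte)
caught_directly = False
try:
    plain_callback(0xE7)  # invoked through the C thunk, like a C library would do
except Exception:
    caught_directly = True
print("caught around the C call:", caught_directly)

# 2. Recording callback: the caller can re-raise once the C call has returned.
errors = []
recording_callback = make_recording_callback(failing_put_byte, errors)
recording_callback(0xE7)
caught_after = False
try:
    if errors:
        raise errors[0]
except UnicodeDecodeError:
    caught_after = True
print("caught after the C call:", caught_after)
```

The second variant only works if whoever owns the callback cooperates, which in this case would be the library (sink.py), not the client code.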

    I tried reading the file with several other encodings, but no luck. Then I did some more debugging, and it seems there are 3 bytes in your file (\xE7, \xAA, \xB6, the ones from the traceback) that are not valid UTF-8 when each is decoded on its own, although they can be valid when combined with other bytes.
    Of course, trying to decode a UTF-8 character out of a single byte seems strange to me, so that might be a PyTidyLib bug.
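
To illustrate why decoding byte by byte is fragile, here is a minimal sketch using only the standard library. It assumes the three bytes from the traceback arrived consecutively, which the traceback suggests but does not prove. It contrasts decoding each byte separately with an incremental decoder that buffers incomplete sequences. This is not how PyTidyLib is written or fixed, just a demonstration of the decoding behaviour.

```python
import codecs

# The three bytes reported in the traceback, in the order they appeared.
data = bytes([0xE7, 0xAA, 0xB6])

# Decoding each byte separately fails every time, as in put_byte.
failures = 0
for value in data:
    try:
        bytes([value]).decode("utf-8")
    except UnicodeDecodeError:
        failures += 1

# An incremental decoder keeps incomplete sequences buffered until enough bytes have arrived.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(bytes([value])) for value in data]
text = "".join(pieces) + decoder.decode(b"", final=True)

print(failures, len(text), hex(ord(text)))  # prints: 3 1 0x7ab6
```

Taken together, the three bytes form one valid three-byte UTF-8 sequence, so the failure comes from the byte-at-a-time decoding rather than from the bytes themselves.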



    Update #0

    Since I had to build tidy.dll to do all the tests (I didn't want to start Linux VMs or install the .whl under Cygwin), I also uploaded it (and other artifacts) to [GitHub]: CristiFati/Prebuilt-Binaries - Prebuilt-Binaries/HTML-Tidy/v5.7.28.