pythonregexunicodesoutheast-asian-languages

Regular expression with Lao?


In Python I would like to display only Lao Characters in this HTML code (just only in "textarea" tag):

<font color="Red">ພິມຄໍາສັບລາວ ຫຼື ອັງກິດແລ້ວກົດປຸ່ມຄົ້ນຫາ - Enter English or Lao Then Hit Search</font><br />
<center><table id='display' border='0' width='100%'>
  <tr>
    <td id='lao2' colspan='3' style='height: 18px; text-align: left'>
      <span style='color: #660033'><span style='font-size: 12pt'>&nbsp;&nbsp;&nbsp;</span></span>&nbsp;&nbsp;
    </td>
  </tr>
  <tr>
    <td style='width: 120px'>&nbsp;</td>
    <td style='width: 192px'>
      <textarea ID='lao' Font-Name='Phetsarath OT' Font-Size='12' rows='10' cols='84' readonly='readonly'>
    1.  (loved, loving)
      1. ຮັກ
      2. ມັກຫຼາຍ
      3. would love ຢາກໄດ້ຫຼາຍ, ຢາກເຮັດຫຼາຍ
      ປະເພດ: ຄໍາກໍາມະ
      ການອອກສຽງ: ເລັຟ

    2.
      1. ຄວາມຮັກ
      2. ຄົນຮັກ, ຄູ່ຮັກ, ສິ່ງທີ່ເຈົ້າຮັກ
      3. ທີ່ຮັກ, (ເທັນນິດ) ສູນ
      be in love with ຮັກຜູ້ໃດຜູ້ໜຶ່ງ
      make love ຮ່ວມປະເວນີ
      ປະເພດ: ຄຳນາມ
      ການອອກສຽງ: ເລັຟ
      </textarea>
    </td>
    <td style='width: 284px'>&nbsp;&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td>&nbsp;</td>
    <td>&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td id='lao1' align='center'>ກະຊວງ ໄປສະນີ, ໂທລະຄົມມະນາຄົມ ແລະ ການສື່ສານ</td><td>&nbsp;</td>
  </tr>
  <tr>
    <td>&nbsp;</td>
    <td id='lao1' align='center'>ສູນບໍລິຫາລັດດ້ວຍເອເລັກໂຕຣນິກ</td><td>&nbsp;</td>
  </tr>
</table></center><br />

I just want the value in the "textarea". What should I do?


Solution

  • Don't use a regular expression. Use a HTML parser. BeautifulSoup makes the task easy:

    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(htmltext)
    text = soup.find('textarea', id='lao').string
    

    If you then need to limit the result to just Lao characters, you can further process the text variable.

    However, the Python re module isn't that strong (yet) when it comes to Unicode. Your options are to use a regular expression to just grab code points in the range 0E80–0EFF, use the unicodedata module and filter on the unicode codepoint name, or use the regex library to only match Lao characters.