[SOLVED] How to strip SGML tags from a text file using Python?

I recently came across the Standard Generalized Markup Language (SGML). I have obtained a corpus in SGML format from the EMILLE/CIIL Corpus. This is the documentation for the corpus:

EMILLE Corpus Documentation

I want to extract just the text in the file. The documentation describes the encoding and markup of the corpus as follows:

The text is encoded as two-byte Unicode text. For more information on Unicode. The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text.

I am having a hard time stripping these tags. I tried regular expressions as well as Beautiful Soup, but neither worked. This is the sample text file. The language I want to preserve is Punjabi.

Solution

Try the following:

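from bs4 import BeautifulSoup
import requests

# Assuming this is the URL where the file is
html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/MANUAL.htm').content

# Name the parser explicitly so the result does not depend on which parsers are installed
bsObj = BeautifulSoup(html, 'html.parser')

# Collect every <p> element in the document
textData = bsObj.find_all('p')

# Print the text of each paragraph with the tags removed (Python 3 print function)
for item in textData:
    print(item.get_text())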
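
The snippet above fetches an HTML page over HTTP, whereas the question is about a local corpus file encoded as two-byte Unicode. Below is a minimal sketch of that case using only the standard library's html.parser instead of Beautiful Soup. It assumes that "two-byte Unicode" means UTF-16 and that the file header sits in a cesHeader element. Neither assumption is confirmed by the question, and the names TextExtractor and strip_sgml_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, skipping everything inside the header element."""

    def __init__(self, skip_tag="cesheader"):
        # Assumption: the header element is named cesHeader.
        # HTMLParser lowercases tag names, hence "cesheader".
        super().__init__(convert_charrefs=True)
        self.skip_tag = skip_tag
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.skip_tag:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside the header.
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_sgml_tags(raw_bytes, encoding="utf-16"):
    """Decode the raw bytes and return the text with all tags removed."""
    # Assumption: the corpus file is UTF-16; pass another encoding if not.
    text = raw_bytes.decode(encoding)
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered character data
    # Drop blank lines and surrounding whitespace left behind by the markup.
    lines = (line.strip() for line in "".join(parser.chunks).splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical usage with a local file (the path is a placeholder):
# with open("sample.txt", "rb") as f:
#     print(strip_sgml_tags(f.read()))
```

Reading the file in binary mode and decoding it explicitly keeps the encoding step visible. A wrong encoding is a common reason for tag stripping to fail on files like this, since the parser then never sees the angle brackets as ordinary characters.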