python regex unicode beautifulsoup sgml

How to strip SGML tags from a text file using Python?


I recently came across Standard Generalized Markup Language (SGML). I have obtained a corpus in SGML format from the EMILLE/CIIL Corpus. This is the documentation for the corpus:

EMILLE Corpus Documentation

I want to extract just the text in the file. The documentation describes the encoding and markup of the corpus as follows:

The text is encoded as two-byte Unicode text. For more information on Unicode. The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text.

I am having a hard time stripping these tags. I tried both regular expressions and Beautiful Soup, but neither worked. This is the sample text file. The language I want to preserve is Punjabi.

Sample text file


Solution

  • Try the following, which downloads the page and prints the text of every <p> element:

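    from bs4 import BeautifulSoup
    import requests

    # Assuming this is the URL where the file is
    html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/MANUAL.htm').content

    # Name the parser explicitly so bs4 does not have to guess one
    bsObj = BeautifulSoup(html, 'html.parser')

    # Collect every <p> element in the document
    textData = bsObj.find_all('p')

    for item in textData:
        # print is a function in Python 3
        print(item.get_text())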
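
  • If the corpus file is already on disk, a standard-library-only approach may be enough. The following is a minimal sketch, not a tested solution for the EMILLE files: it assumes that "two-byte Unicode" means UTF-16, that the header is wrapped in a `cesHeader` element (as in CES markup), and that no tag contains a literal `>` inside an attribute or comment. The names `strip_sgml` and `read_corpus_file` are hypothetical helpers introduced here for illustration.

```python
import re

# Assumption: the provenance header sits inside <cesHeader>...</cesHeader>.
HEADER_RE = re.compile(r"<cesHeader\b.*?</cesHeader>", re.DOTALL | re.IGNORECASE)
# Anything between '<' and the next '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_sgml(text):
    """Return the text with the header block and all tags removed."""
    text = HEADER_RE.sub("", text)   # drop the header together with its contents
    text = TAG_RE.sub("", text)      # drop the remaining tags, keep their contents
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)  # discard empty lines


def read_corpus_file(path, encoding="utf-16"):
    """Decode a corpus file (assumed to be UTF-16) and strip its markup."""
    with open(path, encoding=encoding) as f:
        return strip_sgml(f.read())
```

    Calling `read_corpus_file` with the path to a downloaded sample returns a plain string, so the Punjabi text stays intact as long as the output is written back with a Unicode encoding such as UTF-8. If decoding fails, the file may use a different encoding than assumed here.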