I recently came across the Standard Generalized Markup Language (SGML). I have acquired a corpus in SGML format from the EMILLE/CIIL Corpus. This is the documentation for this corpus:
I want to extract just the text present in the file. The documentation describes the encoding and markup of the corpus as follows:
The text is encoded as two-byte Unicode text. For more information on Unicode. The texts are marked up in SGML using level 1 CES-compliant markup. Each file also includes a full header, which specifies the provenance of the text.
I am having a hard time stripping these tags. I tried both regular expressions and Beautiful Soup, but neither worked. This is the sample text file. The language I want to preserve is Punjabi.
Try the following:
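from bs4 import BeautifulSoup
import requests

# Assuming this is the URL where the file is
html = requests.get('http://www.lancaster.ac.uk/fass/projects/corpus/emille/MANUAL.htm').content
# Name the parser explicitly to avoid bs4's "no parser specified" warning
bsObj = BeautifulSoup(html, 'html.parser')
# Collect every <p> element and print only its text content
textData = bsObj.find_all('p')
for item in textData:
    print(item.get_text())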
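If the corpus file is on disk rather than at a URL, the "two-byte Unicode" note in the documentation suggests decoding it as UTF-16 before stripping any tags. Below is a minimal sketch of that idea using only the standard library. The function name `extract_text` is made up for illustration, the choice of the `utf-16` codec is an assumption based on the documentation quote, and the header element name `cesHeader` is assumed from the CES convention, so check it against your actual files.

```python
import re


def extract_text(raw_bytes, encoding="utf-16"):
    """Return the plain text of an SGML document given as raw bytes.

    Assumptions (not confirmed by the corpus documentation):
    - the file decodes with Python's "utf-16" codec;
    - the provenance header is wrapped in <cesHeader>...</cesHeader>.
    """
    text = raw_bytes.decode(encoding)
    # Drop the whole header block so its metadata does not end up in the output.
    text = re.sub(r"<cesHeader\b.*?</cesHeader>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove every remaining tag, keeping only the character data between tags.
    text = re.sub(r"<[^>]+>", "", text)
    # Trim whitespace and discard lines left empty by the removed tags.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

To use it, read the file in binary mode, for example `extract_text(open(path, 'rb').read())`, where `path` is a placeholder for one of your corpus files. A plain regular expression is enough here only because the goal is to discard all markup; character entities such as `&amp;` are left untouched by this sketch.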