I have 528k documents in SGML format. An example of one of the documents is as follows:
<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT> "jpuma009__l94008" </HT>
<HEADER>
<AU> JPRS-UMA-94-009-L </AU>
JPRS
Central Eurasia
</HEADER>
<ABS> Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 & 2, </ABS>
<TEXT>
1993
<DATE1> 17 June 1994 </DATE1>
<F P=100></F>
<F P=101> Arms, Military Equipment </F>
<H3> <TI> `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE> `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
Cooperation
<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>
22-28--FOR OFFICIAL USE ONLY
<F P=103> 94UM0312D </F>
<F P=104> Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>
<F P=105> Russian </F>
CSO
<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........
</TEXT>
</DOC>
I want to extract the plain text between <TEXT> and </TEXT>. The desired result is as follows:
1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
94UM0312D Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA, KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........
Is there a library or tool in Python or Java that can do this?
You could use BeautifulSoup in Python.
I tried the following code on the sample document and got the required output.
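from bs4 import BeautifulSoup

with open('file.txt', 'r') as fo:
    sgml = fo.read()

# html.parser lowercases tag names, so <TEXT> is matched as 'text'
soup = BeautifulSoup(sgml, 'html.parser')
text_list = soup.find_all('text')
for item in text_list:
    # .text drops the nested tags and keeps only their text content
    lines_in_item = item.text.split('\n')
    for x in lines_in_item:
        if x.strip() != "":
            print(x.strip())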
Output:
1993
17 June 1994
Arms, Military Equipment
`Vympel' State Machinebuilding Design Bureau Proposes
`Vympel' State Machinebuilding Design Bureau Proposes
Cooperation
94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp
22-28--FOR OFFICIAL USE ONLY
94UM0312D
Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA
Russian
CSO
[Article by "Vympel" State Machinebuilding Design Bureau
Lorem ipsum ........
Contents of file.txt:
<DOC>
<DOCNO> FBIS4-46571 </DOCNO>
<HT> "jpuma009__l94008" </HT>
<HEADER>
<AU> JPRS-UMA-94-009-L </AU>
JPRS
Central Eurasia
</HEADER>
<ABS> Military Affairs ARMAMENTS, POLITICS, CONVERSION Nos 1 & 2, </ABS>
<TEXT>
1993
<DATE1> 17 June 1994 </DATE1>
<F P=100></F>
<F P=101> Arms, Military Equipment </F>
<H3> <TI> `Vympel' State Machinebuilding Design Bureau Proposes </TI></H3>
<HT><F P=107><PHRASE> `Vympel' State Machinebuilding Design Bureau Proposes </PHRASE></F></HT>
Cooperation
<F P=102> 94UM0312D Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA in Russian No 2, 1993 (Signed to press 12 May 93) pp </F>
22-28--FOR OFFICIAL USE ONLY
<F P=103> 94UM0312D </F>
<F P=104> Moscow VOORUZHENIYE, POLITIKA,
KONVERSIYA </F>
<F P=105> Russian </F>
CSO
<F P=106> [Article by "Vympel" State Machinebuilding Design Bureau </F>
Lorem ipsum ........
</TEXT>
</DOC>
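If installing a third-party package is not an option, a similar result can probably be obtained with the standard library's html.parser.HTMLParser. The following is only a minimal sketch under some assumptions: the names TextExtractor and extract_text are made up for illustration, each <DOC> is assumed to have its <DOCNO> before its <TEXT>, and one input may contain several <DOC> blocks. It also collapses line breaks inside a single text node, so a wrapped <F> entry comes out as one line, as in the desired result in the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found inside <TEXT>...</TEXT>, grouped by <DOCNO>."""

    def __init__(self):
        super().__init__()
        self.docs = {}          # DOCNO -> list of text chunks
        self._docno = None      # DOCNO of the document currently being parsed
        self._in_docno = False
        self._in_text = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TEXT> arrives as 'text'.
        if tag == 'docno':
            self._in_docno = True
        elif tag == 'text':
            self._in_text = True

    def handle_endtag(self, tag):
        if tag == 'docno':
            self._in_docno = False
        elif tag == 'text':
            self._in_text = False

    def handle_data(self, data):
        # Collapse line breaks and runs of spaces into single spaces.
        chunk = ' '.join(data.split())
        if not chunk:
            return
        if self._in_docno:
            self._docno = chunk
            self.docs[chunk] = []
        elif self._in_text and self._docno is not None:
            self.docs[self._docno].append(chunk)


def extract_text(sgml):
    """Return a dict mapping each DOCNO to the list of text chunks in its <TEXT>."""
    parser = TextExtractor()
    parser.feed(sgml)
    parser.close()  # flush any text still buffered by the parser
    return parser.docs
```

Calling extract_text on the contents of file.txt should return a dict with the single key 'FBIS4-46571'. For a collection of this size it is probably better to process one file at a time with a fresh parser than to load everything into memory at once.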