I have the OCR'd text of a bibliography of periodicals that contains structured entries. I would like to use the Invisible XML standard to extract and parse the entries.
Example input:
1 2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge,
NJ. Published by Word Up! Video, Inc. Last issue 66 pages.
Height 28 cm. Line drawings; Photographs (some in color);
Commercial advertising; Table of contents. Previous editor(s):
Marica A. Cole. ISSN 1056-4632. LC card no. sn91-1965.
OCLC no. 23715422. Subject focus and/or Features: Hip hop
culture, Music, Rap music.
WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993
6561 The Zora Neale Hurston Forum. 1986-. Frequency:
Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston
Forum, P.O. Box 550, Morgan State University, Baltimore,
MD 21239. $15 for individuals and institutions. Telephone:
(301) 444-3435. Published by Zora Neale Hurston Society.
Last issue 69 pages. Last volume 142 pages. Height 23 cm.
Photographs; Table of contents. ISSN 1051-6867. LC card no.
90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale, Literature, Literary criticism.
MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994
TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987
WU v.l, n.l- AP/Z893/N345 Fall, 1986
6562 Zwanna: Son of Zulu. 1993-. Frequency: Unknown.
Nabile P. Hage, Editor, Zwanna, P.O. Box 38261, Atlanta, GA
30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32
pages. Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961. Subject focus and/or
Features: Comic books, strips, etc.
WHi v.l, n.l Pam 00-305 Apr/May, 1993
Each entry begins with an entry number, followed by one or more whitespace characters, followed by descriptive text split over newlines.
iXML grammar:
data: entry+ .
entry: -#a, entrynum, " "+, content .
entrynum: -digit+ .
digit: ["1"-"9"] .
content: ~[]+; -#a+ .
This initial attempt at an iXML grammar produces an ambiguous parse (using the CoffeePot iXML processor).
Output:
<data xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
<entry>
<entrynum>1</entrynum>
<content>2 Hype. 1990?- 1993. Frequency: Bimonthly. River Edge, NJ. Published by Word Up! Video,
Inc. Last issue 66 pages. Height 28 cm. Line drawings; Photographs (some in color); Commercial
advertising; Table of contents. Previous editor(s): Marica A. Cole. ISSN 1056-4632. LC card
no. sn91-1965. OCLC no. 23715422. Subject focus and/or Features: Hip hop culture, Music, Rap
music. WHi v.l, n.6; v.2, n.5 Pam 01-5450 Aug, 1992; Aug, 1993 6561 The Zora Neale Hurston
Forum. 1986-. Frequency: Semiannual. Ruth T. Sheffey, Editor, The Zora Neale Hurston Forum,
P.O. Box 550, Morgan State University, Baltimore, MD 21239. $15 for individuals and
institutions. Telephone: (301) 444-3435. Published by Zora Neale Hurston Society. Last issue
69 pages. Last volume 142 pages. Height 23 cm. Photographs; Table of contents. ISSN 1051-6867.
LC card no. 90-649339. OCLC no. 15610848. Subject focus and/or Features: Hurston, Zora Neale,
Literature, Literary criticism. MdBMC v.l, n.l-v.8, n.2 Special Collections Fall, 1986-Spring,
1994 TxDw v.l, n.l; v.2, n.l Woman’s Collection Fall, 1986; Fall, 1987 WU v.l, n.l-
AP/Z893/N345 Fall, 1986</content>
</entry>
<entry>
<entrynum>6562</entrynum>
<content>Zwanna: Son of Zulu. 1993-. Frequency: Unknown. Nabile P. Hage, Editor, Zwanna, P.O.
Box 38261, Atlanta, GA 30334. Published by Dark Zulu Lies Comics, Inc. Last issue 32 pages.
Height 28 cm. Line drawings (some in color); Commercial advertising. OCLC no. 28389961.
Subject focus and/or Features: Comic books, strips, etc. WHi v.l, n.l Pam 00-305 Apr/May, 1993
</content>
</entry>
</data>
As a start, I would like to understand how to chunk the entries, and then begin to parse the content: e.g., each entry number is followed by one or more spaces, then an alphanumeric title, which is followed by period, etc.
Your grammar is highly ambiguous: "~[]" matches any character, including #a, so content can run across newlines and there are dozens of ways to parse the input. You have to decide how to identify the start of an entry unambiguously. If the rule is "it starts with a number", then you also have to prevent lines that begin with a number from being recognised as content, for example:
content: line+.
line: ~["0"-"9"], ~[#a]*, #a.
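Folded into your grammar, that idea might look like the sketch below. I have not run it through a processor, and the rule names first-line and line are only illustrative. It assumes the input ends with a newline and contains no blank lines. It also relaxes the rule slightly: an entry starts only where a line begins with digits followed by a space, so a continuation line such as "30334. Published by ..." remains content.

```ixml
data: entry+ .
entry: entrynum, -" "+, content .
entrynum: ["0"-"9"]+ .
content: first-line, line* .

{ the rest of the line that carries the entry number }
-first-line: ~[" "; #a], ~[#a]*, #a .

{ a continuation line: anything that does not begin with digits plus a space }
-line: ~["0"-"9"; #a], ~[#a]*, #a ;
       ["0"-"9"]+, ~["0"-"9"; " "; #a], ~[#a]*, #a ;
       ["0"-"9"]+, #a .
```

The three alternatives of line are mutually exclusive and none can match the start of an entry, which is what removes the ambiguity. A continuation line that happens to begin with digits and a space (say "69 pages. ...") would still be taken as a new entry, so the OCR text may need a stricter test for what counts as an entry number.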
If you want to track down ambiguity, you can try my implementation (https://homepages.cwi.nl/~steven/ixml/tutorial/run.html), which is much slower than Norm's (CoffeePot) but gives potentially useful information about the source of the ambiguity.
Here is a reasonable first try for your content, but note that the lone 1994 in the content gets treated as an entry number:
ocr: entry+.
entry: numbered, unnumbered*.
-numbered: number, (line*; -#a), blank-line.
-blank-line: -#a.
-line: ~[#a]+, -#a.
@number: ["0"-"9"]+, -" ".
-unnumbered: ~["0"-"9"; #a], line+, blank-line.
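For the next step in the question, pulling the title out of the line that carries the entry number, a minimal sketch could look like this. It is a separate toy grammar for header lines only (one per line, each ending in a newline), I have not run it, and the rule names are my own. It assumes the title contains no period and is followed by a period and a space.

```ixml
headers: header+ .
header: number, -" "+, title, -". ", rest, -#a .
number: ["0"-"9"]+ .

{ the title runs up to, but not including, the first period }
title: ~[" "; "."; #a], ~["."; #a]* .

{ everything after the title on the same line }
rest: ~[#a]* .
```

For the line "6561 The Zora Neale Hurston Forum. 1986-. Frequency:" this should give number 6561, title "The Zora Neale Hurston Forum" and rest "1986-. Frequency:". A title that itself contains a period would need a different terminator, for example the date that follows it.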