I'm trying to use the npm SGML library here to parse OFX data. OFX v1-1.6 are based on SGML, and later version on XML.
My plan thus far is to use SGML to convert all OFX files into fully normalised XML (proper end tags etc) and the use the xml2json library to convert the xml into json objects i can use in javascript.
This is what i have so far, but this throws an error "content must start with document element when document type isn't specified".
var entitymanager = new sgml.NoopEntitymanager();
var errorhandler = new sgml.Errorhandler();
var resolver = new sgml.Resolver();
var parser = new sgml.Parser();
const readableStream = new Stream.Readable();
let outputhandler = new sgml.Outputhandler(readableStream, entitymanager);
outputhandler.output_format = "xml";
parser.documentHandler = outputhandler;
parser.dtdHandler = outputhandler;
parser.errorHandler = errorhandler;
parser.lexicalHandler = outputhandler;
parser.entityResolver = resolver ;
let recordmanager = new sgml.PlatformStringRecordmanager(errorhandler, parser);
let fileData = await fs.readFile("MY FILE PATH");
recordmanager.set_input(fileData);
parser.recordManager = recordmanager;
recordmanager.start_records(); // throws.
Following along in the debugger i can see it starts to read the file and gets a few tags in before the error is thrown.
Here are some examples of the files i wish to process https://github.com/actualbudget/actual/tree/master/packages/loot-core/src/mocks/files
Thanks!
You didn't include the particular data file to parse so I just picked
data.ofx
from https://github.com/actualbudget/actual/blob/master/packages/loot-core/src/mocks/files/data.ofx.
Looking at that file reveals the following two issues:
A document type declaration (DOCTYPE) is missing; without markup
declarations, an SGML parser can't figure out eg. missing end-element
tags (which are seemingly used a lot with OFX, or more specifically
on nearly every non-container element). Apparently, an official OFX DTD
can be downloaded from https://financialdataexchange.org/common/Uploaded%20files/OFX%20files/OFX1.6.zip so download and unzip that file which will place
ofx160.dtd
in your directory. Now that ofx160.dtd
file isn't actually
a DTD as commonly understood since it contains a <!DOCTYPE ... [
line itself at the begin and a DOCTYPE-closing ]>
line at the end
when a DTD is supposed to only contain the markup declarations contained
within these two lines. So that file is intended as a fragment to
manually prepend in front of an OFX data file, which we're going
to do here. ofx160.dtd
also seems to contain (in lines 1342 and 2032,
resp.) invalid SGML comments with excess space characters in the comment
close marker -->
, and also contains Windows CR/LF sequences in unexpected
places. Long story short, I've prepended the data file with a version
of ofx160.dtd
with all comments (everything between <!--
and -->
) removed and all Windows CR/LF sequences converted to just plain linefeed (LF) characters. Unfortunately, due to space restrictions I can't post the complete working DTD here.
Moreover, the data file starts with a couple "file header" lines such as
OFXHEADER:100
, etc.; these aren't normal SGML (though there's a way to
make those parse using the SGML SHORTREF feature, but for that to work,
the DTD needs extra declarations), so we're going to remove those lines.
So this is how data.ofx
should look like at this point:
<!DOCTYPE OFX [
<!-- actual markup declarations omitted due to space constraints on StackOverflow -->
]>
<OFX>
<SIGNONMSGSRSV1>
<SONRS>
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<DTSERVER>20190124212851.000[0:UTC]
<LANGUAGE>ENG
<DTACCTUP>20190124212851.000[0:UTC]
<FI>
<ORG>Bank of America
<FID>5959
</FI>
<INTU.BID>6526
<INTU.USERID>jlongster03
</SONRS>
</SIGNONMSGSRSV1>
<BANKMSGSRSV1>
<STMTTRNRS>
<TRNUID>0
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<STMTRS>
<CURDEF>USD
<BANKACCTFROM>
<BANKID>012345678
<ACCTID>123456789123
<ACCTTYPE>CHECKING
</BANKACCTFROM>
<BANKTRANLIST>
<DTSTART>20190119120000
<DTEND>20190124120000
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190123120000
<TRNAMT>-30.00
<FITID>00092990122-30.00019012312798.01
<NAME>PATIENT FIRST TOKEN 01/22 PURCHA
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190123120000
<TRNAMT>-3.77
<FITID>00092990121-3.77019012312828.01
<NAME>STARBUCKS STORE 07604 01/21 PURC
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190123120000
<TRNAMT>-9.62
<FITID>00092990121-9.62019012312831.78
<NAME>STARBUCKS STORE 07604 01/21 PURC
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-115.99
<FITID>00090231800-115.99019012212841.40
<NAME>VERIZON DES:PAYMENTREC ID:XXXXX3
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-10.34
<FITID>00092990120-10.34019012212957.39
<NAME>URBAN FARMHOUSE NO 2 01/19 PURCH
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-22.53
<FITID>00092990120-22.53019012212967.73
<NAME>URBAN FARMHOUSE NO 2 01/19 PURCH
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-108.71
<FITID>00092990119-108.71019012212990.26
<NAME>TMOBILE*AUTO PAY 01/19 PURCHASE
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-25.00
<FITID>00092990118-25.00019012213098.97
<NAME>COUNTY WASTE 01/18 PURCHASE 804-
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-32.38
<FITID>00092990118-32.38019012213123.97
<NAME>REGENCY MART CITGO 01/18 PURCHAS
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20190122120000
<TRNAMT>-6.07
<FITID>00092990117-6.07019012213156.35
<NAME>CHICK-FIL-A #01342 01/17 PURCHAS
</STMTTRN>
</BANKTRANLIST>
<LEDGERBAL>
<BALAMT>12798.01
<DTASOF>20190124212851
</LEDGERBAL>
</STMTRS>
</STMTTRNRS>
</BANKMSGSRSV1>
</OFX>
If you invoke sgmlproc
(the command-line utility coming with the sgml
npm package), it will (rightly) complain about the data not matching
the schema
"data.ofx": line 1250: fatal: element 'INTU.BID' not accepted here
but if you remove the lines containing <INTU.BID>
and <INTU.USERID>
or adapt the content model declaration for the SONRS
element accordingly
to accept INTU.BID
and INTU.USERID
, then sgmlproc
will be able to
successfully parse data.ofx
, and the equivalent setup of an SGML parsing
pipeline via JavaScript sketched in your question should be as well.
You'll have to manage removing the extra "header" lines, prepending
the (compact) DTD, and removing/accepting those IMTU....
elements
programmatically, though.