sgml

How can I use the sgml npm library to parse OFX files


I'm trying to use the npm SGML library here to parse OFX data. OFX v1-1.6 are based on SGML, and later version on XML.

My plan thus far is to use SGML to convert all OFX files into fully normalised XML (proper end tags etc) and the use the xml2json library to convert the xml into json objects i can use in javascript.

This is what i have so far, but this throws an error "content must start with document element when document type isn't specified".

var entitymanager = new sgml.NoopEntitymanager();
var errorhandler = new sgml.Errorhandler();
var resolver = new sgml.Resolver();
var parser = new sgml.Parser();

const readableStream = new Stream.Readable();

let outputhandler = new sgml.Outputhandler(readableStream, entitymanager);
outputhandler.output_format = "xml";
  
parser.documentHandler = outputhandler;
parser.dtdHandler = outputhandler;
parser.errorHandler = errorhandler;
parser.lexicalHandler = outputhandler;
parser.entityResolver = resolver ;
  
let recordmanager = new sgml.PlatformStringRecordmanager(errorhandler, parser);

let fileData = await fs.readFile("MY FILE PATH");

recordmanager.set_input(fileData);
  
parser.recordManager = recordmanager;
  
recordmanager.start_records(); // throws.

Following along in the debugger i can see it starts to read the file and gets a few tags in before the error is thrown.

Here are some examples of the files i wish to process https://github.com/actualbudget/actual/tree/master/packages/loot-core/src/mocks/files

Thanks!


Solution

  • You didn't include the particular data file to parse so I just picked data.ofx from https://github.com/actualbudget/actual/blob/master/packages/loot-core/src/mocks/files/data.ofx.

    Looking at that file reveals the following two issues:

    A document type declaration (DOCTYPE) is missing; without markup declarations, an SGML parser can't figure out eg. missing end-element tags (which are seemingly used a lot with OFX, or more specifically on nearly every non-container element). Apparently, an official OFX DTD can be downloaded from https://financialdataexchange.org/common/Uploaded%20files/OFX%20files/OFX1.6.zip so download and unzip that file which will place ofx160.dtd in your directory. Now that ofx160.dtd file isn't actually a DTD as commonly understood since it contains a <!DOCTYPE ... [ line itself at the begin and a DOCTYPE-closing ]> line at the end when a DTD is supposed to only contain the markup declarations contained within these two lines. So that file is intended as a fragment to manually prepend in front of an OFX data file, which we're going to do here. ofx160.dtd also seems to contain (in lines 1342 and 2032, resp.) invalid SGML comments with excess space characters in the comment close marker -->, and also contains Windows CR/LF sequences in unexpected places. Long story short, I've prepended the data file with a version of ofx160.dtd with all comments (everything between <!-- and -->) removed and all Windows CR/LF sequences converted to just plain linefeed (LF) characters. Unfortunately, due to space restrictions I can't post the complete working DTD here.

    Moreover, the data file starts with a couple "file header" lines such as OFXHEADER:100, etc.; these aren't normal SGML (though there's a way to make those parse using the SGML SHORTREF feature, but for that to work, the DTD needs extra declarations), so we're going to remove those lines.

    So this is how data.ofx should look like at this point:

    <!DOCTYPE OFX [
    
    <!-- actual markup declarations omitted due to space constraints on StackOverflow -->
    
    ]>
    <OFX>
    <SIGNONMSGSRSV1>
    <SONRS>
    <STATUS>
    <CODE>0
    <SEVERITY>INFO
    </STATUS>
    <DTSERVER>20190124212851.000[0:UTC]
    <LANGUAGE>ENG
    <DTACCTUP>20190124212851.000[0:UTC]
    <FI>
    <ORG>Bank of America
    <FID>5959
    </FI>
    <INTU.BID>6526
    <INTU.USERID>jlongster03
    </SONRS>
    </SIGNONMSGSRSV1>
    <BANKMSGSRSV1>
    <STMTTRNRS>
    <TRNUID>0
    <STATUS>
    <CODE>0
    <SEVERITY>INFO
    </STATUS>
    <STMTRS>
    <CURDEF>USD
    <BANKACCTFROM>
    <BANKID>012345678
    <ACCTID>123456789123
    <ACCTTYPE>CHECKING
    </BANKACCTFROM>
    <BANKTRANLIST>
    <DTSTART>20190119120000
    <DTEND>20190124120000
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190123120000
    <TRNAMT>-30.00
    <FITID>00092990122-30.00019012312798.01
    <NAME>PATIENT FIRST TOKEN 01/22 PURCHA
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190123120000
    <TRNAMT>-3.77
    <FITID>00092990121-3.77019012312828.01
    <NAME>STARBUCKS STORE 07604 01/21 PURC
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190123120000
    <TRNAMT>-9.62
    <FITID>00092990121-9.62019012312831.78
    <NAME>STARBUCKS STORE 07604 01/21 PURC
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-115.99
    <FITID>00090231800-115.99019012212841.40
    <NAME>VERIZON DES:PAYMENTREC ID:XXXXX3
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-10.34
    <FITID>00092990120-10.34019012212957.39
    <NAME>URBAN FARMHOUSE NO 2 01/19 PURCH
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-22.53
    <FITID>00092990120-22.53019012212967.73
    <NAME>URBAN FARMHOUSE NO 2 01/19 PURCH
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-108.71
    <FITID>00092990119-108.71019012212990.26
    <NAME>TMOBILE*AUTO PAY 01/19 PURCHASE 
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-25.00
    <FITID>00092990118-25.00019012213098.97
    <NAME>COUNTY WASTE 01/18 PURCHASE 804-
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-32.38
    <FITID>00092990118-32.38019012213123.97
    <NAME>REGENCY MART CITGO 01/18 PURCHAS
    </STMTTRN>
    <STMTTRN>
    <TRNTYPE>DEBIT
    <DTPOSTED>20190122120000
    <TRNAMT>-6.07
    <FITID>00092990117-6.07019012213156.35
    <NAME>CHICK-FIL-A #01342 01/17 PURCHAS
    </STMTTRN>
    </BANKTRANLIST>
    <LEDGERBAL>
    <BALAMT>12798.01
    <DTASOF>20190124212851
    </LEDGERBAL>
    </STMTRS>
    </STMTTRNRS>
    </BANKMSGSRSV1>
    </OFX>
    

    If you invoke sgmlproc (the command-line utility coming with the sgml npm package), it will (rightly) complain about the data not matching the schema

    "data.ofx": line 1250: fatal: element 'INTU.BID' not accepted here
    

    but if you remove the lines containing <INTU.BID> and <INTU.USERID> or adapt the content model declaration for the SONRS element accordingly to accept INTU.BID and INTU.USERID, then sgmlproc will be able to successfully parse data.ofx, and the equivalent setup of an SGML parsing pipeline via JavaScript sketched in your question should be as well.

    You'll have to manage removing the extra "header" lines, prepending the (compact) DTD, and removing/accepting those IMTU.... elements programmatically, though.