php xml domdocument sgml ofx

How to parse an OFX (Version 1.0.2) file in PHP?


I have an OFX file downloaded from Citibank. The format has a DTD defined at http://www.ofx.net/DownloadPage/Files/ofx102spec.zip (file OFXBANK.DTD), and the OFX file appears to be valid SGML. I'm trying to parse it with DOMDocument in PHP 5.4.13, but I get several warnings and the file is not parsed. My code is:

$file = "source/ACCT_013.OFX";
$dtd = "source/ofx102spec/OFXBANK.DTD";
$doc = new DomDocument();
$doc->loadHTMLFile($file);
$doc->schemaValidate($dtd);
$dom->validateOnParse = true;

The OFX file starts as:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE

<OFX>
<SIGNONMSGSRSV1>
<SONRS>
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<DTSERVER>20130331073401
<LANGUAGE>SPA
</SONRS>
</SIGNONMSGSRSV1>
<BANKMSGSRSV1>
<STMTTRNRS>
<TRNUID>0
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<STMTRS>
<CURDEF>COP
<BANKACCTFROM> ...

I'm open to installing and using any program on the server (CentOS) that can be called from PHP.

PS: This class http://www.phpclasses.org/package/5778-PHP-Parse-and-extract-financial-records-from-OFX-files.html doesn't work for me.


Solution

  • First of all, even though XML is a subset of SGML, a valid SGML file is not necessarily a well-formed XML file. XML is stricter and does not support all the features that SGML offers, such as omitted closing tags.

    As DOMDocument is XML (and not SGML) based, this is not really compatible.
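
    To illustrate the incompatibility, here is a minimal sketch that feeds a small SGML-style fragment straight to DOMDocument. The fragment is a hypothetical excerpt modeled on the question's file, and the variable names are chosen only for this example. Leaf elements such as <CODE> have no closing tag, so the XML parser should reject the input:

```php
<?php
// Hypothetical fragment in the SGML style used by OFX 1.x:
// leaf elements (<CODE>, <SEVERITY>) carry a value but no closing tag.
$sgml = "<STATUS>\n<CODE>0\n<SEVERITY>INFO\n</STATUS>";

// Collect libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadXML($sgml); // strict parsing, no recover mode

$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($loaded); // false, because the fragment is not well-formed XML
foreach ($errors as $error) {
    echo trim($error->message), "\n"; // e.g. a tag mismatch message
}
```

    The exact error messages depend on the libxml version, so treat them as diagnostic output only.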

    Besides that problem, see section 2.2, Open Financial Exchange Headers, in Ofexfin1.doc, which explains that

    The contents of an Open Financial Exchange file consist of a simple set of headers followed by contents defined by that header

    and further on:

    A blank line follows the last header. Then (for type OFXSGML), the SGML-readable data begins with the <OFX> tag.

    So locate the first blank line and strip everything up to it. Then convert the SGML part into XML and load the result into DOMDocument:

    $source = fopen('file.ofx', 'r');
    if (!$source) {
        throw new Exception('Unable to open OFX file.');
    }
    
    // skip headers of OFX file
    $headers = array();
    // map the OFX CHARSET header value to an iconv encoding name
    $charsets = array(
        1252 => 'WINDOWS-1252',
    );
    while(!feof($source)) {
        $line = trim(fgets($source));
        if ($line === '') {
            break;
        }
        list($header, $value) = explode(':', $line, 2);
        $headers[$header] = $value;
    }
    
    $buffer = '';
    
    // dead-cheap SGML to XML conversion
    // see as well http://www.hanselman.com/blog/PostprocessingAutoClosedSGMLTagsWithTheSGMLReader.aspx
    while(!feof($source)) {
    
        $line = trim(fgets($source));
        if ($line === '') continue;
    
        $line = iconv($charsets[$headers['CHARSET']], 'UTF-8', $line);
        if (substr($line, -1, 1) !== '>') {
            list($tag) = explode('>', $line, 2);
            $line .= '</' . substr($tag, 1) . '>';
        }
        $buffer .= $line ."\n";
    }
    
    // use DOMDocument with non-standard recover mode
    $doc = new DOMDocument();
    $doc->recover = true;
    $doc->preserveWhiteSpace = false;
    $doc->formatOutput = true;
    $save = libxml_use_internal_errors(true);
    $doc->loadXML($buffer);
    libxml_use_internal_errors($save);
    
    echo $doc->saveXML();
    

    This code example then outputs the following (re-formatted) XML, which also shows that DOMDocument loaded the data properly:

    <?xml version="1.0"?>
    <OFX>
      <SIGNONMSGSRSV1>
        <SONRS>
          <STATUS>
            <CODE>0</CODE>
            <SEVERITY>INFO</SEVERITY>
          </STATUS>
          <DTSERVER>20130331073401</DTSERVER>
          <LANGUAGE>SPA</LANGUAGE>
        </SONRS>
      </SIGNONMSGSRSV1>
      <BANKMSGSRSV1>
        <STMTTRNRS>
          <TRNUID>0</TRNUID>
          <STATUS>
            <CODE>0</CODE>
            <SEVERITY>INFO</SEVERITY>
          </STATUS>
          <STMTRS><CURDEF>COP</CURDEF><BANKACCTFROM> ...</BANKACCTFROM>
    </STMTRS>
        </STMTTRNRS>
      </BANKMSGSRSV1>
    </OFX>
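
    Once the data is in a DOMDocument, individual values can be read with DOMXPath. The following is only a sketch: it embeds a shortened copy of the XML shown above so that it runs on its own, and the variable names are illustrative.

```php
<?php
// Shortened copy of the converted XML (sign-on response only).
$xml = '<OFX><SIGNONMSGSRSV1><SONRS>'
     . '<STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>'
     . '<DTSERVER>20130331073401</DTSERVER>'
     . '<LANGUAGE>SPA</LANGUAGE>'
     . '</SONRS></SIGNONMSGSRSV1></OFX>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// string(...) yields the text content of the first matching node,
// or an empty string if nothing matches.
$base     = '/OFX/SIGNONMSGSRSV1/SONRS';
$code     = $xpath->evaluate("string($base/STATUS/CODE)");
$severity = $xpath->evaluate("string($base/STATUS/SEVERITY)");
$dtserver = $xpath->evaluate("string($base/DTSERVER)");

echo "$code $severity $dtserver\n"; // prints: 0 INFO 20130331073401
```

    Element names in XPath are case-sensitive, so the paths use the upper-case names exactly as they appear in the OFX file.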
    

    I do not know whether this can then be validated against the DTD. Maybe it works. Additionally, this conversion is fragile: it assumes that each line holds exactly one element and that an element's value sits on the same line as its opening tag. If the SGML is laid out differently, the conversion will break.
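
    A less layout-dependent variant is to work on the whole SGML body as one string instead of line by line. The sketch below is an alternative to the loop above, not a full SGML parser. The function name ofxSgmlToXml is hypothetical, and it assumes that the input is already UTF-8, that only leaf elements carry text, and that tag names consist of letters, digits, dots and underscores:

```php
<?php
/**
 * Close every opening tag that is directly followed by text.
 * Hypothetical helper; assumes UTF-8 input and text only in leaf elements.
 */
function ofxSgmlToXml($sgml)
{
    // <TAG>, then text up to the next "<", then an optional explicit </TAG>.
    $pattern = '~<([A-Z0-9_.]+)>([^<]+)(</\1>)?~i';

    return preg_replace_callback($pattern, function ($m) {
        $text = trim($m[2]);
        if ($text === '') {
            return $m[0]; // only whitespace: an aggregate tag, leave it alone
        }
        // Escape bare "&" and "<"/">" but keep existing entities intact.
        $text = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8', false);
        return '<' . $m[1] . '>' . $text . '</' . $m[1] . '>';
    }, $sgml);
}

// Everything on one line, which the line-based conversion cannot handle.
$xml = ofxSgmlToXml('<STATUS><CODE>0<SEVERITY>INFO</STATUS>');
echo $xml, "\n";
// prints: <STATUS><CODE>0</CODE><SEVERITY>INFO</SEVERITY></STATUS>

$doc = new DOMDocument();
var_dump($doc->loadXML($xml)); // bool(true)
```

    Because an already present closing tag is consumed by the optional group and written back once, elements that are explicitly closed in the source are not closed twice.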