apache-tika tika-python

Date format in Tika output from XLSX


I have an XLSX file with this content: [image: excel simple]

I have downloaded tika-app for testing:

java -jar tika-app-2.9.2.jar --metadata test.xlsx

Content-Length: 9217
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By: org.apache.tika.parser.microsoft.ooxml.OOXMLParser
X-TIKA:origResourceName: C:\Users\users\Documents\
dc:creator: daniele grillo
dc:publisher:
dcterms:created: 2024-04-17T07:44:01Z
dcterms:modified: 2024-04-17T13:58:35Z
extended-properties:AppVersion: 16.0300
extended-properties:Application: Microsoft Excel
extended-properties:Company:
extended-properties:DocSecurityString: None
meta:last-author: daniele grillo
protected: false
resourceName: test.xlsx

Then I ran the command

java -jar tika-app-2.9.2.jar --text test.xlsx

and this is the output:

Foglio1
        date    name
        2/9/72  one
        2/10/98 two
        1/3/09  three
        1/1/00  four
        4/11/00 five

I have read that it is possible to pass a tika-config.xml to configure the parser, like this:

java -jar /tika-app-2.9.2.jar --text test.xlsx --config=tika-config.xml

I would like the dates in the output to be formatted as dd/mm/yyyy, as they appear in the XLSX file.

Is this possible? If so, how?

I tried this tika-config.xml, but the output is the same:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
            <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
        </parser>
    </parsers>
    <dateFormats>
        <dateFormat>dd/MM/yyyy</dateFormat>
    </dateFormats>
</properties>

Solution

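  • OOXMLParser has a setDateFormatOverride(String) method, inherited from AbstractOfficeParser.

    This parameter can be set within the <params> of a parser. The value is an Excel-style format string rather than a Java SimpleDateFormat pattern, so mm in dd/mm/yyyy is read as the month: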

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
        <parsers>
            <parser class="org.apache.tika.parser.DefaultParser"/>
            <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
                <params>
                    <param name="dateFormatOverride" type="string">dd/mm/yyyy</param>
                </params>
            </parser>
        </parsers>
    </properties>
    

    Note: the --config option should be specified before the --text option:

    java -jar tika-app-2.9.2.jar --config=tika-config.xml --text test.xlsx
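
Since the question is also tagged tika-python, here is a minimal Python sketch of the same two steps: generating the config with dateFormatOverride and calling tika-app with --config placed before --text. The helper names (build_config, build_command, extract_text) are hypothetical and not part of Tika or tika-python, and actually running extract_text assumes that java is on the PATH and that the jar path you pass exists.

```python
import subprocess
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Class names taken from the config shown above.
DEFAULT_PARSER = "org.apache.tika.parser.DefaultParser"
OOXML_PARSER = "org.apache.tika.parser.microsoft.ooxml.OOXMLParser"


def build_config(date_format):
    """Return tika-config.xml text that sets dateFormatOverride on OOXMLParser."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")
    ET.SubElement(parsers, "parser", {"class": DEFAULT_PARSER})
    ooxml = ET.SubElement(parsers, "parser", {"class": OOXML_PARSER})
    params = ET.SubElement(ooxml, "params")
    param = ET.SubElement(
        params, "param", {"name": "dateFormatOverride", "type": "string"}
    )
    param.text = date_format
    body = ET.tostring(properties, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_command(jar, config, document):
    """Build the tika-app argument list with --config before --text."""
    return ["java", "-jar", jar, f"--config={config}", "--text", document]


def extract_text(jar, document, date_format="dd/mm/yyyy"):
    """Write a temporary config and run tika-app (needs Java and the jar)."""
    with tempfile.TemporaryDirectory() as tmp:
        config_path = Path(tmp) / "tika-config.xml"
        config_path.write_text(build_config(date_format), encoding="utf-8")
        result = subprocess.run(
            build_command(jar, str(config_path), document),
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if tika-app fails
        )
    return result.stdout
```

With this in place, extract_text("tika-app-2.9.2.jar", "test.xlsx") should return the same text that the command line above prints. Keeping build_command separate makes it easy to check the argument order without starting a JVM.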