I will be receiving a .dat file that contains multiple PDF files encoded as base64 strings, separated by a newline (or some other character).
The initial approach is: read -> payload splitBy "\n" -> for each -> decode base64 -> save as .pdf.
It works fine when the .dat file is small. However, it started throwing a heap memory error; my hunch is that splitBy loads the entire content into memory as a string.
How can this be fixed? Is there a better way to solve this problem?
<flow name="dat-to-pdfFlow" doc:id="7f23d7a6-7187-454b-bd60-8e0319b52028">
    <file:listener doc:name="Read .DAT" doc:id="aba64085-5b24-48b6-a6d6-f10658c991f1" config-ref="File_Config" directory="/Users/test/Work/POC/input" autoDelete="true" recursive="false" outputMimeType="application/octet-stream; streaming=true">
        <scheduling-strategy>
            <fixed-frequency />
        </scheduling-strategy>
    </file:listener>
    <logger level="INFO" doc:name="Logger" doc:id="ebd52647-c467-459f-bdeb-a30a997aba76" message="Read .DAT from #[attributes.path]"/>
    <ee:transform doc:name="Transform Message" doc:id="dbc765a9-ba1f-49be-b996-a7a883c8a6c5">
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
output application/java
---
payload splitBy "\n"]]></ee:set-payload>
        </ee:message>
    </ee:transform>
    <parallel-foreach doc:name="Parallel For Each" doc:id="7ba70e6b-630a-4259-9ad6-fb4b5c197402">
        <vm:publish doc:name="Publish" doc:id="7d1c9b34-6150-4195-b142-45ef43a9e2db" config-ref="VM_Config" queueName="write" />
    </parallel-foreach>
    <logger level="INFO" doc:name="Logger" doc:id="72bc4cfc-383d-4c95-add9-409bd4fdfeeb" message="Completed" />
</flow>
<flow name="consume-pdf" doc:id="9e531c3b-b4b6-4348-9848-bef76df138bb">
    <vm:listener doc:name="Listener" doc:id="5907406d-d4cf-4d5b-a82a-8a3829fcc425" config-ref="VM_Config" queueName="write" outputMimeType="text/plain"/>
    <ee:transform doc:name="Transform Message" doc:id="aacf98b0-b212-46bf-b615-abadfddc87f7">
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
import * from dw::core::Binaries
output multipart/form-data
---
{
    parts: {
        base64Content: {
            headers: {
                "Content-Type": "application/pdf"
            },
            content: fromBase64(payload)
        }
    }
}
]]></ee:set-payload>
        </ee:message>
    </ee:transform>
    <set-payload value="#[payload]" doc:name="Set Payload" doc:id="215630ca-ebd1-4eb7-9325-536f034eaff3" mimeType="application/pdf" />
    <file:write doc:name="Write" doc:id="87623dd2-6051-4918-ab36-f76bf1c9544e" config-ref="File_Config" path="#['/Users/test/Work/POC/output/' ++ uuid() ++ '.pdf']" mode="APPEND" />
</flow>
**
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /Applications/AnypointStudio.app/Contents/Eclipse/plugins/org.mule.tooling.server.4.9.ee_7.21.0.202502030106/mule/logs/dump_mule-393ef4bd-6139-49d5-bc8a-3401c8045277.hprof ...
JVM received a signal SIGKILL (9).
Heap dump file created [419386973 bytes in 0.255 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError=""/Applications/AnypointStudio.app/Contents/Eclipse/plugins/org.mule.tooling.server.4.9.ee_7.21.0.202502030106/mule/bin/kill.sh" %p"
# Executing ""/Applications/AnypointStudio.app/Contents/Eclipse/plugins/org.mule.tooling.server.4.9.ee_7.21.0.202502030106/mule/bin/kill.sh" 66579"...
JVM process is gone.
JVM process exited with a code of 1, setting the Wrapper exit code to 1.
JVM exited unexpectedly.
Automatic JVM Restarts disabled. Shutting down.
<-- Wrapper Stopped
**
`payload splitBy "\n"` loads the entire content into memory as a string, which is what causes the heap memory error.
It was solved by passing the stream to a Java class that processes the stream line by line and writes the decoded PDFs to the /tmp directory without blowing up the heap.
Inspiration was taken from Mule's file-based repeatable streaming strategy.
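The Java approach described above could be sketched roughly as follows. This is a minimal, self-contained sketch, not the actual class used in the fix: the class name `StreamingPdfWriter`, the method name, and the output-directory handling are all illustrative assumptions. The key point is that only one line (one base64-encoded PDF) is ever held in memory at a time, instead of the whole file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;
import java.util.UUID;

public class StreamingPdfWriter {

    // Reads the .dat stream line by line; each non-empty line is assumed to be
    // one base64-encoded PDF. Returns the number of PDF files written.
    public static int decodeToDir(InputStream in, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        int written = 0;
        // The MIME decoder is tolerant of embedded line breaks in the base64 data
        Base64.Decoder decoder = Base64.getMimeDecoder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank separator lines
                }
                byte[] pdfBytes = decoder.decode(line.trim());
                Files.write(outDir.resolve(UUID.randomUUID() + ".pdf"), pdfBytes);
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Two fake "PDFs", one base64 string per line, mimicking the .dat layout
        Base64.Encoder enc = Base64.getEncoder();
        String dat = enc.encodeToString("%PDF-1.4 one".getBytes(StandardCharsets.UTF_8))
                + "\n"
                + enc.encodeToString("%PDF-1.4 two".getBytes(StandardCharsets.UTF_8));
        InputStream in = new ByteArrayInputStream(dat.getBytes(StandardCharsets.UTF_8));
        Path outDir = Files.createTempDirectory("pdf-out");
        System.out.println(decodeToDir(in, outDir)); // prints 2
    }
}
```

In the flow, the file:listener's streamed payload would be handed to such a class (for example via a `java:invoke` operation or a custom SDK component) instead of going through `splitBy`, so the heap footprint stays proportional to a single entry rather than the whole file.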