One of my requirements is to decode a byte array using the cp1047
code page. A sample of my code:
ebcdic_str = input_bytes.decode('cp1047')
The code above works correctly in plain Python, but when I execute the same operation as part of PySpark code (by creating a UDF wrapping it), I get the following error:
ebcdic_str = input_bytes.decode('cp1047')
LookupError: unknown encoding: cp1047
I have successfully done the same operation in PySpark using the code page cp037, but faced other issues with that code page. Per a suggestion from IBM, the code page was changed to cp1047. Unfortunately, the code is now failing.
Can anybody suggest a reason for the failure?
The issue occurs because Python's standard library does not include the cp1047 codec; it is provided by the third-party Python package ebcdic, which registers the codec when imported. Once that package is imported in our code, the issue is resolved.
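The fix works because importing the ebcdic package registers its codecs with Python's codec machinery. Here is a minimal, illustrative sketch of that registration mechanism; the alias `my_cp1047` and its mapping to the stdlib cp037 codec are assumptions for demonstration only, not the real cp1047 table:

```python
import codecs

# Sketch of the mechanism the ebcdic package relies on: importing it
# calls codecs.register(...), which is how 'cp1047' becomes resolvable.

def _search(name: str):
    # Codec names are normalized to lowercase with '-' replaced by '_'.
    if name == "my_cp1047":
        # Stand-in for a real cp1047 table; cp037 ships with CPython.
        return codecs.lookup("cp037")
    return None

codecs.register(_search)

# "Hello" in EBCDIC (the letters share the same code points in
# cp037 and cp1047):
print(b"\xc8\x85\x93\x93\x96".decode("my_cp1047"))  # -> Hello
```

This also explains why the import must happen on the worker that executes the UDF, not just on the driver: the registration is per Python process.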
As a side note, since the ebcdic package is not widely used, it might not be pre-installed on all your worker/edge nodes. You should validate that the package is available on each node; otherwise you may receive a ModuleNotFoundError: No module named 'ebcdic' error.
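One way to validate this is a small helper that probes the codec registry on whichever node runs it. This is a sketch; running it across executors (for example via `spark.sparkContext.parallelize(range(n)).map(lambda _: cp1047_ready()).collect()`) is an assumed usage pattern, not taken from the original post:

```python
import codecs

def codec_available(name: str) -> bool:
    """Return True if the named codec can be resolved on this node."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

def cp1047_ready() -> bool:
    # Importing ebcdic registers cp1047 and related code pages; swallow
    # the ImportError so the probe reports False instead of crashing.
    try:
        import ebcdic  # noqa: F401
    except ImportError:
        return False
    return codec_available("cp1047")

# cp037 ships with CPython's standard library, so this is always True:
print(codec_available("cp037"))  # -> True
```

Collecting `cp1047_ready()` from every executor tells you which nodes still need the package installed.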