python azure-data-lake azure-synapse-analytics

How can I read pdf or pptx or docx files in python from ADLS gen2 using Synapse?

I am looking to read files of different formats with Python in a Synapse notebook. These include .pdf, .pptx, .docx, .msg, and .eml. I would like to read the files in, then parse and manipulate them with Python. I was able to do this in Databricks using different Python libraries.

This is how I accomplished it in Databricks:

from pptx import Presentation
prs = Presentation(file_name)

# for pdf
from pypdf import PdfReader
reader = PdfReader(open(file_name, 'rb'))

# word docs
import docx
doc = docx.Document(file_name)

# .eml files
import email
msg = email.message_from_file(open(file_name))

# .msg files
import extract_msg
msg = extract_msg.Message(file_name)

In Synapse I have been getting an error: FileNotFoundError: [Errno 2] No such file or directory.

The same file paths work for reading csv, Excel or txt data with Spark or pandas, so I don't think there is an authorization or connectivity issue. The path format is: abfs[s]://file_system_name@account_name.dfs.core.windows.net/file_path

I also tried mounting the storage location, following "Mounting Storage locations in Synapse". This helped with reading text files but not with the other formats.

Solution

Mounting was the right approach, as this answer explains. I was using Synapse Studio. The key was to use the path format returned by the getMountPath command for the mounted storage. Apart from that, I could use essentially the same code as in my question. The only change was for PDFs, where I had to switch from the pypdf library to PyPDF2.

The format that worked was:
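A likely explanation for the original error is that these libraries open files through Python's ordinary local file API, which treats the string as a path on the local disk rather than as a storage URL. The sketch below illustrates this with a made-up URL; the container, account and file names are placeholders, and `try_local_open` is a hypothetical helper written only for this demonstration.

```python
# Placeholder URL: container, account and file names are made up for illustration.
url = "abfss://mycontainer@myaccount.dfs.core.windows.net/docs/report.pdf"

def try_local_open(path):
    """Try the built-in open() on `path`.

    Returns None if the file opened, otherwise the name of the OSError raised.
    """
    try:
        with open(path, "rb"):
            return None
    except OSError as exc:
        return type(exc).__name__

# The URL is interpreted as a local relative path, so no such file is found.
print(try_local_open(url))
```

On a Linux-based pool this would be expected to report the same FileNotFoundError as in the question, whereas Spark and pandas readers resolve the URL through their own storage connectors.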

path = mssparkutils.fs.getMountPath("/mounted_name") 
# this gave me this format '/synfs/{jobId}/mounted_path/{filename}'

What did not work was the format returned by mssparkutils.fs.ls:

mssparkutils.fs.ls("synfs:/{jobId}/mounted_path/") 
# this gave a different format which did not work   'synfs:/{jobId}/mounted_path/{filename}'
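Since the two formats shown above differ only in their prefix, a listing result can be converted into a path that `open()` accepts. The helper below is a hypothetical sketch: `to_local_path` is not part of mssparkutils, the job id `49` is a made-up value, and the code assumes that swapping `synfs:/` for `/synfs/` is the only difference between the two formats.

```python
def to_local_path(synfs_uri):
    """Convert 'synfs:/{jobId}/...' into the local form '/synfs/{jobId}/...'.

    Hypothetical helper, not part of mssparkutils. Assumes the only
    difference between the two formats is the leading prefix.
    """
    prefix = "synfs:/"
    if not synfs_uri.startswith(prefix):
        raise ValueError(f"not a synfs URI: {synfs_uri!r}")
    # Drop the scheme and any extra leading slashes, then anchor at /synfs/.
    return "/synfs/" + synfs_uri[len(prefix):].lstrip("/")

# Made-up job id and file name, for illustration only.
print(to_local_path("synfs:/49/test/report.pdf"))  # /synfs/49/test/report.pdf
```

This keeps `mssparkutils.fs.ls` usable for discovering files while still handing local-style paths to the parsing libraries.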

Here is the whole process:

First install the library you will need. Mounting the storage is described here. Then read the file using the PyPDF2 library.

!pip install PyPDF2

# Then mount the storage location
from notebookutils import mssparkutils
mssparkutils.fs.mount("abfss://mycontainer@<accountname>.dfs.core.windows.net", "/test", {"LinkedService": "mygen2account"})

# get mounted path
path = mssparkutils.fs.getMountPath("/test")
file_name = path + '/filename'

# now read the file
from PyPDF2 import PdfReader

reader = PdfReader(open(file_name, 'rb'))
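The same mounted path should work for the other formats from the question, since each library only needs a local file path. Below is a minimal sketch, assuming the relevant packages (PyPDF2, python-pptx, python-docx, extract-msg) are installed on the pool. The `open_document` function and its dispatch by file extension are illustrative and not something these libraries provide. For .eml files it uses `email.message_from_binary_file` instead of `message_from_file`, so the text encoding does not have to be guessed.

```python
import email
import os

def open_document(file_name):
    """Open a file from a local (mounted) path with a parser chosen by extension.

    Illustrative helper. Third-party libraries are imported lazily, so only
    the ones actually needed have to be installed.
    """
    ext = os.path.splitext(file_name)[1].lower()
    if ext == ".pdf":
        from PyPDF2 import PdfReader
        return PdfReader(file_name)
    if ext == ".pptx":
        from pptx import Presentation
        return Presentation(file_name)
    if ext == ".docx":
        import docx
        return docx.Document(file_name)
    if ext == ".eml":
        # Binary mode lets the email parser handle encodings itself.
        with open(file_name, "rb") as fh:
            return email.message_from_binary_file(fh)
    if ext == ".msg":
        import extract_msg
        return extract_msg.Message(file_name)
    raise ValueError(f"unsupported file type: {ext!r}")

# Example usage with the mounted path from above (file name is a placeholder):
# doc = open_document(path + "/filename.docx")
```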