I need to read an excel file in ".xlsx" format that is stored in a ftp server as a pandas DataFrame. I need to do this inside a docker container running in heroku and I don't think is a good practice downloading the file into the container and then read it from the local path. Instead I would like to save the bytes in a local variable and pass it to the pandas.read_excel() function to get my DataFrame. According to the documentation of pandas it is possible but I have tried all the forms of calling the function that I could understand and still error messages are raised. I have found answers to the .csv case containing text passed as bytes, but not to the excel file passed as bytes.
Here are some code corresponding to what I think is my best güess of the solution, but still doesn´t work. Maybe you can help me understand if there is a step I'm missing:
from ftplib import FTP
import pandas as pd
from io import BytesIO
# stablish ftp connection
ftp = FTP()
ftp.connect(host='myhost',port=myport)
ftp.login(user='myuser',passwd='mypassword')
# create a class with the __call__ method that can be used to store the bytes
class BinaryReader:
def __init__(self):
self.data = BytesIO()
def __call__(self,s):
self.data = BytesIO(s)
# Create an instance of that class and read the bytes into it
myBinary = BinaryReader()
folder = 'myFolder/'
file = 'myFile.xlsx'
ftp.retrbinary('RETR '+ folder + file, myBinary)
# Read my file usind pandas
pd.read_excel(myBinary.data, engine='openpyxl')
I have to use an engine because pandas cannot asume any extension of the file that I am passing as io. The code from above raises the following exception exactly in the last line:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_1022/801386693.py in <module>
----> 1 pd.ExcelFile(myBinary.data,engine='openpyxl')
~/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/io/excel/_base.py in __init__(self, path_or_buffer, engine, storage_options)
1417 self.storage_options = storage_options
1418
-> 1419 self._reader = self._engines[engine](self._io, storage_options=storage_options)
1420
1421 def __fspath__(self):
~/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/io/excel/_openpyxl.py in __init__(self, filepath_or_buffer, storage_options)
523 """
524 import_optional_dependency("openpyxl")
--> 525 super().__init__(filepath_or_buffer, storage_options=storage_options)
526
527 @property
~/anaconda3/envs/py310/lib/python3.10/site-packages/pandas/io/excel/_base.py in __init__(self, filepath_or_buffer, storage_options)
516 self.handles.handle.seek(0)
517 try:
--> 518 self.book = self.load_workbook(self.handles.handle)
519 except Exception:
520 self.close()
...
--> 744 self._file.seek(self._pos)
745 data = self._file.read(n)
746 self._pos = self._file.tell()
ValueError: negative seek value -9342652
I have tried passing the bytes and specifying the 'openpyxl' engine that is the one that can read '.xlsx' format, but the "negative seek value" error has been shown. I would like to pass pandas.read_excel function the bytes and that the functions to know the format of the file and return a DataFrame, maybe via some other argument that specifyes the file extension, or the engine. Another excel reader that works for reading the bytes can work for me as well.
Test server setup with pyftpdlib
to serve files in current working directory
from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer
import os
HOST = "localhost"
PORT = 21
authorizer = DummyAuthorizer()
authorizer.add_anonymous(os.getcwd())
handler = FTPHandler
handler.authorizer = authorizer
server = FTPServer((HOST, PORT), handler)
server.serve_forever()
Example code to read excel file called test.xlsx
from ftplib import FTP
from io import BytesIO
from pandas import read_excel
HOST = "localhost"
PORT = 21
ftp = FTP()
ftp.connect(host=HOST, port=PORT)
ftp.login()
f = BytesIO()
ftp.retrbinary("RETR " + "test.xlsx", f.write)
df = read_excel(f.getvalue())
print(df.info)
ftp.quit()