python django csv chunked

Iterate over and validate large uploaded CSV files in Django


I'm using the Django module django-chunked-upload to receive potentially large CSV files. I can assume the CSVs are properly formatted, but I can't assume what the delimiter is.

Upon completion of the upload, an UploadedFile object is returned. I need to validate that the correct columns are included in the uploaded CSV and that the data types in each column are correct.

Loading the file with csv.reader() doesn't work:

reader = csv.reader(uploaded_file)
next(reader)
# _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

This might be because uploaded_file.content_type and uploaded_file.charset are both coming through as None.

I've come up with a fairly inelegant solution to grab the header and iterate over the rows:

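import csv
from io import StringIO

header = ""
# enumerate() supplies the row index, so the header branch runs only once
for i, line in enumerate(uploaded_file):
    if i == 0:
        header = line.decode('utf-8')
        header_list = list(csv.reader(StringIO(header)))
        print(header_list[0])
        # validate column names
    else:
        # prepend the header so DictReader can map values to column names
        tiny_csv = StringIO(header + line.decode('utf-8'))
        reader = csv.DictReader(tiny_csv)
        print(next(reader))
        # validate column types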

I also considered trying to load the path of the actual saved file:

path = ...  # figure out the path of the temp file
f = open(path, "r")
reader = csv.reader(f)

But I wasn't able to get the temp file path from the UploadedFile object.

Ideally I would like to create a normal reader or DictReader from the UploadedFile object, but I haven't found a way to do it. Does anyone have any ideas? Thanks.


Solution

  • The answer lies in chunked_upload/models.py, which defines this method:

    def get_uploaded_file(self):
        self.file.close()
        self.file.open(mode='rb')  # mode = read+binary
        return UploadedFile(file=self.file, name=self.filename,
                            size=self.offset)
    

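    The file is reopened in binary mode, which is why csv.reader() receives bytes. When you define your own upload model, you can override get_uploaded_file to open the file with mode='r' (text mode) instead: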

    #myapp/models.py
    
    from django.db import models
    from chunked_upload.models import ChunkedUpload
    from django.core.files.uploadedfile import UploadedFile
    class FileUpload(ChunkedUpload):
        def get_uploaded_file(self):
            self.file.close()
            self.file.open(mode='r')  # mode = read+text
            return UploadedFile(file=self.file, name=self.filename,
                                size=self.offset)
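
To see why the mode matters, here is a minimal standalone sketch using plain files rather than Django. The temporary file and the first_row helper are made up for illustration; the file only stands in for the stored upload. Opening it with "rb" reproduces the error from the question, while "r" gives csv.reader the strings it expects.

```python
import csv
import os
import tempfile

# Write a small CSV to a temporary file that stands in for the stored upload.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("id,name\n1,alice\n")
    path = tmp.name


def first_row(mode):
    """Open the file in the given mode and return the first parsed row."""
    with open(path, mode) as f:
        return next(csv.reader(f))


# Binary mode yields bytes, which csv.reader rejects.
try:
    first_row("rb")
    binary_error = None
except csv.Error as exc:
    binary_error = str(exc)

# Text mode yields str, so parsing succeeds.
text_row = first_row("r")

os.remove(path)
print(binary_error)
print(text_row)
```

With the built-in open(), text mode decodes using the platform default encoding unless one is given, so uploads in another encoding may still need explicit handling.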
    

    This lets you pass the returned UploadedFile instance straight to csv.reader:

    def on_completion(self, uploaded_file, request):
        reader = csv.reader(uploaded_file)
        ...
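
The question also mentions an unknown delimiter and the need to check column names and types. One possible way to handle that once the file is readable as text is sketched below, using csv.Sniffer from the standard library. The validate_csv function, the EXPECTED_COLUMNS schema, and the candidate delimiter list are all assumptions for illustration, not part of django-chunked-upload. An io.StringIO stands in for the text-mode upload, and the sketch assumes the file object supports read() and seek().

```python
import csv
import io

# Hypothetical schema: column name -> converter that raises ValueError on bad input.
EXPECTED_COLUMNS = {"id": int, "name": str, "score": float}


def validate_csv(text_file, expected=EXPECTED_COLUMNS, sample_size=4096):
    """Return a list of error messages; an empty list means the file passed."""
    # Guess the delimiter from the start of the file, then rewind.
    sample = text_file.read(sample_size)
    text_file.seek(0)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")

    reader = csv.DictReader(text_file, dialect=dialect)

    # Check that every expected column is present in the header.
    missing = set(expected) - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    # Stream the rows one at a time and try to convert each value.
    errors = []
    for row in reader:
        for column, convert in expected.items():
            try:
                convert(row[column])
            except (TypeError, ValueError):
                errors.append(
                    f"line {reader.line_num}: bad {column} value {row[column]!r}"
                )
    return errors


# Usage: a semicolon-delimited file with one bad value in the score column.
upload = io.StringIO("id;name;score\n1;alice;9.5\n2;bob;oops\n")
errors = validate_csv(upload)
print(errors)
```

Because DictReader reads row by row, only the sniffing sample is held in memory at once, which should suit large uploads. Sniffer works on heuristics, so restricting it to a known set of candidate delimiters, as done here, makes a wrong guess less likely.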