I am using Vue 3 and installed papaparse to parse a big CSV file (390 MB). I wrote a Flask API (Flask 10.0) that serves this big file. The frontend should read the file line by line and send only the first two columns of each received line back to the backend.
This works, even though it takes a long time. The only problem is that I get thousands of warnings stating that a header is being renamed to avoid header duplication, and I don't understand why and couldn't find anything on the web. How can I avoid this warning?
Also, this is a uni project, so I deliberately didn't account for security issues, etc.; the focus is on storing key-value pairs via the API.
My Vue Component:
<template>
  <h3>Input Data Reading</h3>
  <div class="buttonWrapper">
    <button class="storeData" id="storeInputData" @click="readInputFile">
      Read and Store Data
    </button>
  </div>
  <div v-if="fetchedResponse" class="response">
    {{ fetchedResponse }}
  </div>
</template>

<script>
import Papa from 'papaparse'
import { mapActions } from 'vuex';

export default {
  name: 'ReadInputData',
  props: {
    msg: String
  },
  data() {
    return {
      fetchedResponse: ''
    }
  },
  methods: {
    ...mapActions(['updateKeys']),
    readInputFile() {
      Papa.parse('http://127.0.0.1.nip.io/storage/api/download/inputfile', {
        header: true,
        download: true,
        worker: true,
        step: (row) => {
          this.storeData(row.data);
        },
        complete: () => {
          console.log('Successfully stored file.');
        },
        error: (e) => {
          console.error(`Error parsing the file: ${e}`)
        }
      });
    },
    storeData(data) {
      var key = data['id'];
      var value = data['title'];
      this.updateKeys(key)
      fetch('http://127.0.0.1.nip.io/storage/api/insert/', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({
          key: key,
          value: value
        })
      })
        .then(response => response.json())
        .then(message => this.fetchedResponse = message)
        .catch(e => console.error(e));
    }
  }
}
</script>

<style scoped>
</style>
My Flask API:
import asyncio
from flask import Flask, Response, jsonify, logging as flaskLogging, request
from flask_cors import CORS
import logging
from storage_handler import StorageHandler

logging.basicConfig(level=logging.DEBUG, format=f'%(asctime)s %(levelname)s %(name)s %(threadName)s : %(message)s')

if __name__ == '__main__':
    app = Flask(__name__)
    CORS(app, resources={r"/*": {"origins": "*"}}, expose_headers=['Content-Range'])
    event_loop = asyncio.get_event_loop()
    logger = flaskLogging.create_logger(app)
    storage_handler = StorageHandler(logger, event_loop)

    @app.route('/download/inputfile', methods=['GET'])
    def serve_input_file():
        while event_loop.is_running():
            asyncio.sleep(0.01)

        def readCSVfile():
            with open('./input_file/Imdb_Movie_Dataset.csv', 'r') as f:
                for line in f:
                    yield line

        response = Response(readCSVfile(), content_type='text/csv')
        return response

    @app.route('/insert/', methods=['POST'])
    def insert_pair():
        while event_loop.is_running():
            asyncio.sleep(0.01)
        pair = request.get_json()
        key = pair.get('key')
        value = pair.get('value')
        request_str = f'store {key if key else "EMPTY_KEY_STR_PASSED"} {value if value else "EMPTY_VALUE_PASSED"}'
        response = event_loop.run_until_complete(storage_handler.request_to_bucket(request_str))
        return jsonify(response.decode())

    app.run(host='0.0.0.0', port=5000)
What does your CSV file look like? It sounds like more than one header column shares the same name. The header is row 0 of your CSV when you set header: true in your Papa Parse config.
Header column names need to be unique when header: true is set, otherwise Papa Parse renames them automatically, because the resulting JSON object could not be built with duplicate keys.
So if your CSV looks like this:
"Header Col 1", "Header Col 2", "Header Col 1"
"Value 1", "Value 2", "Value 3"
you would theoretically get a JSON object like the one below with header: true:
{
  "Header Col 1": "Value 1",
  "Header Col 2": "Value 2",
  "Header Col 1": "Value 3"
}
This would cause trouble because the "Header Col 1" property would be overwritten.
To avoid this, Papa Parse renames duplicate header column names automatically and prints a warning.
So basically the header column names are used to build the JSON object keys, which is why they cannot be duplicated: every header column must have a unique name.
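With the renaming applied, the parsed row would instead come out with unique keys. The exact suffix depends on your Papa Parse version; assuming the usual numbered-suffix renaming, it would look roughly like this:
{
  "Header Col 1": "Value 1",
  "Header Col 2": "Value 2",
  "Header Col 1_1": "Value 3"
}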
One simple solution is to set header: false if you don't actually need the header.
This results in a plain array per row instead of an object, so there are no key conflicts:
[
  "Value 1",
  "Value 2",
  "Value 3"
]
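Applied to your component, a minimal sketch of readInputFile with header: false could look like the following. This assumes id and title really are the first two columns of your CSV; also note that with header: false the header line itself arrives as a normal row, so it has to be skipped. Passing the values as { id, title } keeps your existing storeData method unchanged.
readInputFile() {
  let isFirstRow = true;
  Papa.parse('http://127.0.0.1.nip.io/storage/api/download/inputfile', {
    header: false,
    download: true,
    worker: true,
    step: (row) => {
      // with header: false the first row is the header line itself, so skip it
      if (isFirstRow) {
        isFirstRow = false;
        return;
      }
      // row.data is now a plain array of column values
      this.storeData({ id: row.data[0], title: row.data[1] });
    },
    complete: () => {
      console.log('Successfully stored file.');
    },
    error: (e) => {
      console.error(`Error parsing the file: ${e}`)
    }
  });
}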
If you need header: true, make sure the column names are unique, or ignore the warning and handle the automatically renamed header names in your receiving API.
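If you want to keep header: true and silence the warning at the source, recent Papa Parse 5.x versions also offer a transformHeader callback that lets you rename headers yourself before rows are keyed by them. A rough sketch (the deduplication logic here is only an illustration, not something Papa Parse provides for you):
const seen = {};
Papa.parse('http://127.0.0.1.nip.io/storage/api/download/inputfile', {
  header: true,
  download: true,
  transformHeader: (header) => {
    // make empty or duplicate header names unique before rows are keyed by them
    const name = header.trim() || 'column';
    seen[name] = (seen[name] || 0) + 1;
    return seen[name] === 1 ? name : name + '_' + (seen[name] - 1);
  },
  step: (row) => {
    // row.data keys are now unique, so no renaming warnings are triggered
  }
});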
Remember that the result structure is different with header: true vs. header: false.
Hope that helps.