I'm new to Django and thought it would be interesting to use it to visualise some data I scraped from certain websites, something like this video:
(https://www.youtube.com/watch?v=TcnWEQMT3_A&list=PL-2EBeDYMIbTt63EH9ubqbvQqJkBP3vzy&index=1)
However, I'm finding it difficult to choose the right way to load the data.
Right now, I have a Python program that scrapes stats of players from the big five leagues from fbref.com
(a football-related website) and stores them in a CSV file.
I think I have two options now:
1. Creating a model in Django and reading the CSV file to store each row as a separate model instance. So basically, I'm storing the data in Django's database.
2. Not creating a separate Django model and working with a pandas DataFrame instead (not storing my data in the Django DB).
I feel like the first approach is less efficient because I'm thinking of adding further data analysis later on, so I would end up using a pandas DataFrame mainly anyway. However, I'm lowkey worried that something would go terribly wrong if I don't use the Django database.
Which approach is better, and are there better ways to do this? I initially tried the second approach, but I got worried about data management.
Typically you don't need to do data processing on all the data. In fact, you often need a (very) narrow subset of it, for example the goal stats for the last five seasons of the Spanish league. If you have to load all the data into a dataframe for that, something is usually wrong: as you keep scraping, the file grows bigger, processing takes longer because it requires more disk I/O, and eventually the server will run out of memory.
Databases are optimized to retrieve subsets of data: by using indexes they seldom need to go over all records. They can often determine in 𝓞(log n) which records are necessary, and then perform disk I/O to retrieve only those items, combined with some advanced caching mechanisms.
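To illustrate with the standard library's sqlite3 (the same principle applies behind Django's ORM on any backend; the table, index, and sample data here are invented for the demo): an index lets the engine jump to the matching rows instead of scanning everything.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE stats (player TEXT, league TEXT, season TEXT, goals INT)"
)
# The index is what makes "give me one league's rows" cheap.
con.execute("CREATE INDEX idx_league_season ON stats (league, season)")
con.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?)",
    [
        ("Benzema", "La Liga", "2021-2022", 27),
        ("Haaland", "Premier League", "2022-2023", 36),
        ("Lewandowski", "La Liga", "2022-2023", 23),
    ],
)

# EXPLAIN QUERY PLAN shows the index being used rather than a full scan,
# e.g. a "SEARCH ... USING INDEX idx_league_season" detail string.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(goals) FROM stats WHERE league = ?",
    ("La Liga",),
).fetchone()
print(plan)

total = con.execute(
    "SELECT SUM(goals) FROM stats WHERE league = ?", ("La Liga",)
).fetchone()[0]
print(total)  # -> 50
```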
To some extent, pandas does what a database does, except that it holds all the data in memory. That is fine if the data is reasonably small, but as the data grows, not all of it will fit in memory anymore. And even when it does fit, loading the file takes time linear in the total number of rows, not just in the rows you are interested in.
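A quick pandas sketch of that cost model: `read_csv` parses the entire file before you can filter, so the work scales with the whole file even when you want only one league (the inline string stands in for the scraped CSV file):

```python
import io

import pandas as pd

# Stand-in for a large scraped CSV file.
csv_data = (
    "player,league,goals\n"
    "Kane,Bundesliga,36\n"
    "Lewandowski,La Liga,23\n"
    "Bellingham,La Liga,19\n"
)

df = pd.read_csv(io.StringIO(csv_data))   # every row is parsed into memory
la_liga = df[df["league"] == "La Liga"]   # filtering only happens afterwards
print(len(df), len(la_liga))  # -> 3 2
```

With a database, the filter runs first and only the two matching rows ever leave the disk.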
So usually it is better to store the data in the database and use that for filtering and aggregating. If you need more advanced functionality, you can convert a Django QuerySet
to a pandas DataFrame
through django-pandas
[pypi.org] and do the extra processing there. Using the database first will typically reduce the amount of disk I/O and memory usage drastically.