python, bash, csv, join, dataset

How to join two large CSV files?


I have two large .csv files that I would like to join.

file1.csv has the following structure:

productcode; *many useless columns* ; startdate; enddate; *some other useless columns*

file2.csv has the following structure:

productcode; *many useless columns different from file1* ; page; startdate; enddate; *some other useless columns*

I would like to join the two files into a file (let's say, out.csv) with the same structure as file1.csv but with the "page" column from file2.csv, i.e.

productcode; *useless columns* ; page; startdate; enddate; *useless columns*

The join conditions are same productcode and overlapping dates, i.e.:

file1.productcode == file2.productcode

and

!(file1.enddate < file2.startdate or file2.enddate < file1.startdate)

However, I have no idea how to do that. One possibility would be to import the two CSVs into MySQL, process them there, and then export the result to a final CSV file. However, that takes time (and is resource-consuming).

I'm open to any suggestions.


Solution

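  • Load both files with pandas and combine them on productcode. Note that .join() (like .merge()) only matches on equal keys, so it cannot express the date-overlap condition by itself: merge on productcode first, then keep only the rows whose date ranges overlap.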
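
A minimal sketch of that approach, under some assumptions that are not stated in the question: both files have a header row with the column names shown above, they are semicolon-separated, the dates are in a format pandas can parse, and file1.csv has no "page" column of its own. The function names (join_on_overlap, join_files) and the chunk size are made up for illustration. file2.csv is loaded with only the four columns it needs, and file1.csv is read in chunks to keep memory use down.

```python
import pandas as pd


def join_on_overlap(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Add right's 'page' to left rows with the same productcode and overlapping dates."""
    # Only these four columns of file2 matter for the join.
    right = right[["productcode", "page", "startdate", "enddate"]]

    # Equality part of the join; file2's date columns get the "_f2" suffix.
    merged = left.merge(right, on="productcode", suffixes=("", "_f2"))

    # Overlap part: not (end1 < start2 or end2 < start1).
    overlap = ~(
        (merged["enddate"] < merged["startdate_f2"])
        | (merged["enddate_f2"] < merged["startdate"])
    )
    result = merged.loc[overlap].drop(columns=["startdate_f2", "enddate_f2"])

    # Same column order as file1, with "page" placed just before "startdate".
    cols = list(left.columns)
    cols.insert(cols.index("startdate"), "page")
    return result[cols]


def join_files(file1: str, file2: str, out: str, chunksize: int = 100_000) -> None:
    """Stream file1 in chunks, join each chunk against file2, append to out."""
    dates = ["startdate", "enddate"]
    right = pd.read_csv(
        file2,
        sep=";",
        usecols=["productcode", "page", "startdate", "enddate"],
        parse_dates=dates,
    )
    first = True
    for chunk in pd.read_csv(file1, sep=";", parse_dates=dates, chunksize=chunksize):
        joined = join_on_overlap(chunk, right)
        # Write the header once, then append the remaining chunks.
        joined.to_csv(out, sep=";", index=False, mode="w" if first else "a", header=first)
        first = False
```

It would be called as join_files("file1.csv", "file2.csv", "out.csv"). This is an inner join, so rows of file1.csv without a matching row in file2.csv are dropped, and a row that overlaps several file2.csv ranges appears once per match. If one productcode has very many rows in both files, the intermediate merge can get large, in which case a smaller chunksize helps.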