I have a single CSV file with following columns:
ActorName, Address, City, Movies
ActorName column has duplicate entries because there are multiple movies that actor acted in. Eg. Actor Clint Eastwood is listed 3 times because movies column has 3 movies he acted in.
Issue I am having is how to create a junction table or relation. If i create ActorID first then ActorName will still have duplicates. And if i create Movies table first and move Movies column, then delete the duplicates from Actor table then how will I associate movie with the actor?
An example. Loading the data into temporary table then copying into working tables.
CREATE PROCEDURE process_data ()
BEGIN
-- load data from CSV file into temptable
LOAD DATA INFILE 'x:/folder/new_data.csv' INTO TABLE raw_data;
-- insert actor names into actor table if absent
INSERT IGNORE INTO actor (actor_name) SELECT actor_name FROM raw_data;
-- insert movie names into movie table if absent
INSERT IGNORE INTO movie (movie_name) SELECT movie_name FROM raw_data;
-- insert junction data into actor_in_movie if absent
INSERT IGNORE INTO actor_in_movie (actor_id, movie_id)
SELECT actor_id, movie_id
FROM raw_data
JOIN actor USING (actor_name)
JOIN movie USING (movie_name);
-- clear imported data
TRUNCATE raw_data;
END