
Getting 10,000 Movie Plots with IMDbPY


I'm using IMDbPY together with the publicly available IMDb datasets (https://www.imdb.com/interfaces/) to build a custom dataset with pandas. The public datasets contain a lot of useful information but, as far as I can see, no plot information. IMDbPY does provide plot summaries, as well as plot synopses and keywords, through the plot, synopsis, and keywords keys of its dictionary-like movie object.

I can get the plot for an individual movie with one API call: ia.get_movie(movie_index[2:])['plot'][0]. I use [2:] because the first 2 characters of each ID in the public dataset are 'tt', and [0] because IMDbPY returns several plot summaries and I take the first one.

However, to get 10,000 plot summaries I would need to make 10,000 API calls, which would take about 7.5 hours at 2.7 seconds per call (the rate I measured with tqdm). One solution is to let it run overnight. Are there any other solutions? Also, is there a better way of doing this than my current approach, which builds a dictionary with the movie ID as key (e.g. tt0111161 for "The Shawshank Redemption") and the plot as value, and then converts that dictionary to a dataframe? Any insight is appreciated. My code is below:

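import pandas as pd
from imdb import IMDb, IMDbError
from tqdm import tqdm

ia = IMDb()
movie_dict = {}
# movies_index holds IDs such as 'tt0111161' taken from the public dataset
for movie_index in tqdm(movies_index[0:10]):
    try:
        # drop the leading 'tt' and keep the first plot summary
        movie_dict[movie_index] = ia.get_movie(movie_index[2:])['plot'][0]
    except (KeyError, IndexError, IMDbError):
        # no plot available, or the request failed
        movie_dict[movie_index] = ''

plots = pd.DataFrame.from_dict(movie_dict, orient='index')
plots.rename(columns={0:'plot'}, inplace=True)
plots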


             plot
tt0111161   Two imprisoned men bond over a number of years...
tt0468569   When the menace known as the Joker emerges fro...
tt1375666   A thief who steals corporate secrets through t...
tt0137523   An insomniac office worker and a devil-may-car...
tt0110912   The lives of two mob hitmen, a boxer, a gangst...
tt0109830   The presidencies of Kennedy and Johnson, the e...
tt0120737   A meek Hobbit from the Shire and eight compani...
tt0133093   A computer hacker learns from mysterious rebel...
tt0167260   Gandalf and Aragorn lead the World of Men agai...
tt0068646   The aging patriarch of an organized crime dyna...
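
On the question of building the dataframe: one possible alternative, offered only as a sketch, is to go through a named pandas Series, which avoids the separate rename step. The sample dictionary below stands in for the collected plots, and the index name "tconst" is an assumption (it is the ID column name used in the public IMDb datasets).

```python
import pandas as pd

# Sample data standing in for the collected plots (values abbreviated).
movie_dict = {
    "tt0111161": "Two imprisoned men bond over a number of years...",
    "tt0468569": "When the menace known as the Joker emerges...",
}

# A named Series supplies the column name directly, so no rename is needed.
# "tconst" is an assumed index name, chosen to match the public datasets.
plots = pd.Series(movie_dict, name="plot").rename_axis("tconst").to_frame()
print(plots)
```

A dataframe indexed this way can then be joined to the public datasets on the same ID.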

Solution

  • First of all, bear in mind that making so many queries in such a short time may be against IMDb's terms of service: https://www.imdb.com/conditions

    That said, 10,000 queries are unlikely to cause a major web site any real problem, especially if you wait a few seconds between calls to be polite. It will take longer, but that should not be a big deal in your case. Again, see the note above about the terms of service, which you must respect.
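
A minimal sketch of such a pause between calls is shown here. The names fetch_plots, fetch_plot, and delay_seconds are hypothetical and not part of IMDbPY; the fetcher is passed in as a function so the loop can be tried without touching the network. A real fetcher might wrap ia.get_movie(...) and would probably also need to catch network errors, which this sketch does not do.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_plots(
    movie_ids: Iterable[str],
    fetch_plot: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch one plot per ID, pausing between consecutive calls."""
    plots: Dict[str, str] = {}
    for position, movie_id in enumerate(movie_ids):
        if position > 0:
            sleep(delay_seconds)  # pause before every call except the first
        try:
            plots[movie_id] = fetch_plot(movie_id)
        except LookupError:
            # KeyError or IndexError: treat as "no plot available"
            plots[movie_id] = ""
    return plots


# Stand-in fetcher for demonstration only; it knows a single title.
def fake_fetch(movie_id: str) -> str:
    return {"tt0111161": "Two imprisoned men bond..."}[movie_id]


result = fetch_plots(["tt0111161", "tt0000000"], fake_fetch, delay_seconds=0.0)
print(result)
```

With a 2-second pause on top of the 2.7 seconds per call reported in the question, 10,000 calls would take roughly 13 hours instead of 7.5.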

    I can suggest two different options:

    1. Use the old dataset, which is free for personal and non-commercial use and which IMDbPY is able to parse. The drawback is that the data is a little outdated (end of 2017): https://imdbpy.readthedocs.io/en/latest/usage/ptdf.html
    2. Use an alternative source, such as https://www.omdbapi.com/ or https://www.themoviedb.org/, which should have public APIs and more permissive licenses.
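
As an illustration of the second option, here is a rough sketch of looking up a plot by IMDb ID through OMDb. The query parameters (i, plot, apikey) and the response fields (Plot, Response) reflect my understanding of that API and should be checked against its documentation. The helper names build_omdb_url and extract_plot are hypothetical, and the sample response is made up for the example. No request is actually sent.

```python
import json
from urllib.parse import urlencode

OMDB_ENDPOINT = "https://www.omdbapi.com/"


def build_omdb_url(imdb_id: str, api_key: str) -> str:
    """Build a lookup URL for one title, keyed by its full IMDb ID."""
    # Parameter names are assumptions based on the OMDb documentation.
    query = urlencode({"i": imdb_id, "plot": "full", "apikey": api_key})
    return OMDB_ENDPOINT + "?" + query


def extract_plot(response_text: str) -> str:
    """Return the 'Plot' field of a JSON response, or '' if the lookup failed."""
    payload = json.loads(response_text)
    if payload.get("Response") != "True":
        return ""
    return payload.get("Plot", "")


# Made-up response body, used here instead of a live request.
sample = '{"Title": "Example", "Plot": "Two imprisoned men bond...", "Response": "True"}'
print(build_omdb_url("tt0111161", "YOUR_KEY"))
print(extract_plot(sample))
```

Note that the ID keeps its 'tt' prefix here, so the values from the public datasets can be passed through unchanged.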

    Disclaimer: I'm one of the main authors of IMDbPY