I work in HDP 2.6.5 platformon using Hive (1.2.1000.2.6.5.0-292) on a simple database based on data from:
https://grouplens.org/datasets/movielens/100k/
I have 4 tables named: genre, movies, ratings, users as below:
CREATE TABLE genre(genre string, genre_id int);
CREATE TABLE movies (movie_id INT, title STRING, rel_date DATE, video_rel_date STRING,
imdb_url STRING, unknown INT, action INT, adventure INT, animation INT, childrens INT,
comedy INT, crime INT, documentary INT, drama INT, fantasy INT, noir INT, horror INT,
musical INT, mystery INT, romance INT, sci_fi INT, thriller INT, war INT, western INT)
CLUSTERED BY (movie_id) INTO 12 BUCKETS STORED AS ORC;
CREATE TABLE ratings(user_id int, movie_id int, rating int, rating_time int);
CREATE TABLE users(user_id int, age int, gender char(1), occupation string, zip int);
I would like to write a query returning which genre of movies was watched most often by women and which by men? But the problem for me is the structure of the movies table where the movie genre is located:
1|Toy Story (1995)|1995-01-01||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
The last 19 fields are the genres, a '1' indicates the movie is of that genre, a '0' indicates it is not. Additionally movies can be in several genres at once. The gender is represented in 'users' table as 'M' or 'F' char. The required tables can be easily joined, but how to return and group the genres which are the columns names?
SELECT m.title, r.rating, u.gender
FROM movies m INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id);
Make an array of genre columns placed in the order corresponding to genre_id, explode array and join by position in array with genre table. Like this(not tested):
select s.title, s.genre, s.gender, s.rating, s.cnt
from
(select s.title, s.gender, s.rating, s.cnt, s.genre,
rank() over (partition by s.gender order by s.cnt desc) as rnk
from
(
select m.title, u.gender, r.rating, g.genre, count(*) over(partition by u.gender) cnt
from
(select m.movie_id, m.title, e.id+1 as genre_id
from movies m
lateral view
posexplode (array(--place columns in a positions corresponding their genre_id
unknown, action, adventure, animation, childrens,
comedy, crime, documentary, drama, fantasy,
noir, horror, musical, mystery, romance,
sci_fi, thriller, war, western
)
)e as id, val
where e.val=1
) m
INNER JOIN ratings r ON (m.movie_id = r.movie_id)
INNER JOIN users u ON (u.user_id = r.user_id)
INNER JOIN genre g ON (g.genre_id = m.genre_id)
) s
) s where rnk = 1