hiveapache-spark-sqlhiveqlsparkcore

Hive/Spark SQL query approach


Problem statement: I have source data on HDFS which has approximately 400 columns, of which around 200 columns (I will call them VAR1..200) will have data optionally. Some rows might have data for V1..10 and some might have V34..78 and so on. I need to read this data and split them into 2 files (or tables if you will). Table A will have the other 200 columns and Table B will have information about the VAR1..200 columns for the rows in Table A. For example, if a row in Table A has data for VAR1..20 columns, then Table B should have 20 rows, one row each for VAR1..20, along with the columns to link the tables. So, essentially I need to check the VAR1..200 columns of the source dataset and then if any of those VAR has values, I need to create an entry in Table B.

In the source data, these VAR1..200 columns appear in groups i.e. VAR1..50 are placed together in the file, then after some other columns, VAR51..100 are stored together etc.

I have to implement this using HiveQL or Spark Core or Spark SQL, preferably Spark solutions.

The approach I thought was to use Spark Core, read the VAR1..200 by index, check the value and then move it to a different data set.

I wanted to know how best to implement such a solution. Please share your thoughts. Thanks.


Solution

  • Demo

    create table mytable (id bigint,var1 string,var2 string,var3 string,var4 string,var5 string,var6 string,var7 string,var8 string,var9 string,var10 string);
    
    insert into mytable values 
        (123,null,null,null,'DD','EE','FF','GG',null,null,null)
       ,(456,null,null,null,null,null,null,null,null,'II','JJ')
    ;   
    

    select * from mytable;
    

    +------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+
    | mytable.id | mytable.var1 | mytable.var2 | mytable.var3 | mytable.var4 | mytable.var5 | mytable.var6 | mytable.var7 | mytable.var8 | mytable.var9 | mytable.var10 |
    +------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+
    |        123 | NULL         | NULL         | NULL         | DD           | EE           | FF           | GG           | NULL         | NULL         | NULL          |
    |        456 | NULL         | NULL         | NULL         | NULL         | NULL         | NULL         | NULL         | NULL         | II           | JJ            |
    +------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+--------------+---------------+
    

    select  id
           ,v.i + 1 as i
           ,v.val
    
    from    mytable
            lateral view posexplode (array(var1,var2,var3,var4,var5,var6,var7,var8,var9,var10)) v as i,val
    
    where   v.val is not null
    ;
           
    

    +-----+----+-------+
    | id  | i  | v.val |
    +-----+----+-------+
    | 123 |  4 | DD    |
    | 123 |  5 | EE    |
    | 123 |  6 | FF    |
    | 123 |  7 | GG    |
    | 456 |  9 | II    |
    | 456 | 10 | JJ    |
    +-----+----+-------+