javaapache-sparkhivecontext

Ordering of a string column that contains numbers in it using hive context


I have a column called priority among other columns in a file and contains numbers For ex: 1, 2, 3, 4, 5, 6 etc. The file data is as follows

Department  Strength   Priority
--------------------------------
CS          Good       10
CS          Low        2
EC          Good       10
EC          Low        2
EC          Nil        3

I want to select the priority 2 records using a query using SQL hive context as below

select * from 
(
select testfile.*,row_number() over(partition by Department order by Priority asc) rn
from testfile
)ranked
where rn=1;

All the columns are defined as String in the spark code. I expect the code to select Priority 2 records because I have given the asc in order by clause. But is there a chance that the query selects priority other than 2 just because i didn't declare the priority column as int. Right now the query fetches the following data

 CS          Good       10
 EC          Good       10

However if I declare the Priority column as int, the records will be shown correctly as below

 CS          Low        2
 EC          Low        2

I would like to understand this. There are several instances where I declared the number column as string however got the ordering correctly.


Solution

  • The order by is using lexicographical order which is similar to alphabetical order but alphabetical order is actually a type of lexicographical order.

    It means that the following strings would be ordered as follows

    1 < 2
    10 < 2
    1000 < 2
    1 < 11
    10 < 11 
    21 < 3  
    

    In a simple way you could say it is ordering by each character position but you may read more from responses here and here

    Furthermore, you may optionally cast your columns to integer to achieve numerical ordering for eg.

    SELECT 
        *
    FROM (
        SELECT 
            testfile.*,
            row_number() over(partition by Department order by cast(Priority as int) asc) rn
        FROM 
            testfile
    ) ranked
    where rn=1;