I have a column called Priority, among other columns, in a file; it contains numbers, for example 1, 2, 3, 4, 5, 6, etc. The file data is as follows:
Department  Strength  Priority
------------------------------
CS          Good      10
CS          Low       2
EC          Good      10
EC          Low       2
EC          Nil       3
I want to select the Priority 2 records with a query through the Spark HiveContext, as below:
select * from
(
    select testfile.*, row_number() over (partition by Department order by Priority asc) rn
    from testfile
) ranked
where rn = 1;
All the columns are defined as String in the Spark code. I expect the query to select the Priority 2 records because I have specified asc in the order by clause. But is there a chance that the query selects a priority other than 2 just because I didn't declare the Priority column as int? Right now the query fetches the following data:
CS Good 10
EC Good 10
However, if I declare the Priority column as int, the records are returned correctly, as below:
CS Low 2
EC Low 2
I would like to understand this. There are several instances where I declared a numeric column as string and still got the ordering correct.
The order by clause is using lexicographical order, which is similar to alphabetical order (alphabetical order is in fact a special case of lexicographical order). This means the following strings are ordered as shown:
1 < 2
10 < 2
1000 < 2
1 < 11
10 < 11
21 < 3
Put simply, strings are compared character by character from the left, so "10" sorts before "2" because '1' < '2'; you may read more from the responses here and here.
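To see the difference directly, here is a minimal sketch against the same testfile table, assuming the sample data above; the inline result comments show what each ordering would produce:
-- string comparison: "10" sorts before "2" because '1' < '2'
SELECT Priority FROM testfile ORDER BY Priority ASC;           -- 10, 10, 2, 2, 3
-- numeric comparison after casting the string column to int
SELECT Priority FROM testfile ORDER BY cast(Priority AS int);  -- 2, 2, 3, 10, 10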
Furthermore, you can cast the column to integer to get numerical ordering, e.g.:
SELECT
    *
FROM (
    SELECT
        testfile.*,
        row_number() over (partition by Department order by cast(Priority as int) asc) rn
    FROM
        testfile
) ranked
WHERE rn = 1;
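With the cast in place, row_number() ranks each department by the numeric value of Priority, so rn = 1 should pick the CS Low 2 and EC Low 2 rows from the sample data, matching the output you got when Priority was declared as int.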