julia, julia-dataframe

Row-wise median for Julia DataFrames


I want to compute the median of every row in a dataframe. Some columns contain NaN values, and some rows consist entirely of NaN values. The problems with median are:

  1. If a vector contains any NaN values, it returns NaN. In this case I would like to skip the NaNs (as Pandas does).
  2. It is undefined for empty vectors (it throws an error). In this case I want to return NaN (as Pandas does).

I came up with the following solution:

using DataFrames, Statistics

df = DataFrame(rand(100, 10), :auto)
df[1, :x3] = NaN
df[20, [:x3, :x6]] .= NaN
df[5, :] .= NaN

safemedian(y) = all(isnan.(y)) ? NaN : median(filter(!isnan, y))
x = select(df, AsTable(:) => ByRow(safemedian∘collect) => "median")

This works, but it is rather slow.

Question 1) Is there a way to speed this up?

I think the collect call is causing the sluggish performance, but I need it; without it I get an error:

safemedian(y) = all(isnan.(y)) ? NaN : median(filter(!isnan, y))
x = select(df, AsTable(:) => ByRow(safemedian) => "median")

# results in ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

This is because AsTable(:) passes each row as a named tuple.

Question 2) Is there a way to pass rows as vectors instead?

This way I could pass the row to any function that expects a vector (for example, the nanmedian function from the NaNStatistics.jl package). Note that I would not need collect if the proposed AsVector(:) had been implemented (see [here]). Unfortunately it didn't get the go-ahead, and I'm not sure what the alternative is.

Question 3) This one is more philosophical. Coming from Python/Pandas, some operations in Julia are hard to figure out. Pandas, for example, handles NaNs seamlessly (for better or worse). In Julia I artificially replace the missing values in my dataframe using mapcols!(x -> coalesce.(x, NaN), df). This is because many package functions (and functions I've written) are implemented for AbstractArray{T} where {T<:Real} and not AbstractArray{Union{T, Missing}} where {T<:Real} (i.e. they don't propagate missings). But since Julia has a skipmissing function and no skipnan, I suspect I've got it all wrong. Is the idiomatic way to keep missing values in Julia and handle them where appropriate? Or is it OK to use NaNs (and keep the element type fixed as, say, Float64)?
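
For concreteness, the two conventions contrasted in Question 3 can be sketched on a single vector. This is only an illustration with made-up values (the names v, w, m1 and m2 are hypothetical), not a recommendation of either style:

```julia
using Statistics

# A vector that still carries `missing` entries.
v = [1.0, missing, 3.0, missing, 5.0]

# Convention 1: keep `missing` and skip it at the call site.
m1 = median(skipmissing(v))

# Convention 2: replace `missing` with NaN so the element type is plain Float64,
# then drop the NaNs by hand, since Base has no `skipnan`.
w = coalesce.(v, NaN)
m2 = median(filter(!isnan, w))
```

Both calls see the same three finite values. Neither guards against a vector in which every entry is missing or NaN; median throws on an empty collection, which is the case safemedian above handles.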


Solution

  • The best option is probably to redefine safemedian to work with NamedTuples (which iterate their values and not their keys).

    julia> safemedian(y) = all(isnan, y) ? NaN : median((x for x in y if !isnan(x)))
    safemedian (generic function with 1 method)
    
    julia> select(df, AsTable(:) => ByRow(safemedian) => "median")
    100×1 DataFrame
     Row │ median     
         │ Float64    
    ─────┼────────────
       1 │   0.326412
       2 │   0.61873
       3 │   0.672079
       4 │   0.405539
       5 │ NaN
       6 │   0.358769
       7 │   0.469862
       8 │   0.585866
      ⋮  │     ⋮
      94 │   0.512761
      95 │   0.43875
      96 │   0.463244
      97 │   0.380401
      98 │   0.45737
      99 │   0.456926
     100 │   0.195296
       85 rows omitted
    
    julia> @benchmark select(df, AsTable(:) => ByRow(safemedian) => "median")
    BenchmarkTools.Trial: 10000 samples with 1 evaluation.
     Range (min … max):  63.916 μs …  2.223 ms  ┊ GC (min … max): 0.00% … 95.97%
     Time  (median):     65.042 μs              ┊ GC (median):    0.00%
     Time  (mean ± σ):   68.599 μs ± 79.734 μs  ┊ GC (mean ± σ):  4.30% ±  3.59%
    
      ▁▅▇██▇▅▄▂▁ ▂▂▃▃▃▃▂▁                                         ▂
      █████████████████████▆▇███▇▇▇▆▅▅▅▆▄▄▅▅▅▁▄▅▄▅▅▅▄▅▅▅▅▁▅▄▃▄▅▃▄ █
      63.9 μs      Histogram: log(frequency) by time      79.1 μs <
    
     Memory estimate: 62.42 KiB, allocs estimate: 555.
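
Regarding Question 2, one possible way to hand each row to a function as a vector is to convert the data frame to a matrix and iterate over its rows. The sketch below assumes every column is Float64 and uses a small made-up data frame (the names df and rowmedians are illustrative). Matrix(df) copies the data, so whether this beats the ByRow version is something to benchmark rather than assume.

```julia
using DataFrames, Statistics

# Skip NaNs; return NaN when every value is NaN (same definition as above).
safemedian(y) = all(isnan, y) ? NaN : median(x for x in y if !isnan(x))

# Small example: row 2 has one NaN, row 3 is all NaN.
df = DataFrame(a = [1.0, NaN, NaN], b = [2.0, 5.0, NaN], c = [3.0, 7.0, NaN])

# Matrix(df) copies the columns into a Matrix{Float64};
# eachrow then yields each row as a vector view.
rowmedians = map(safemedian, eachrow(Matrix(df)))
```

Because each row arrives as an AbstractVector, any function that expects a vector can be substituted for safemedian here.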