juliadataframes.jl

Julia: Return Minimum Date in DataFrame


The question is fairly simple. How do I return the minimum purchase date for each customer using Tidier?

using Tidier, DataFrames, Plots, CSV


#params
f = "path"

df = CSV.File(f) |> DataFrame
df = @chain df begin
    @select(SHOPIFY_ORDER_ID, CUSTOMER_ID, SHIPMONTH, GROSS_REVENUE, Country)
    @rename(order_id = SHOPIFY_ORDER_ID,
            customer_id = CUSTOMER_ID,
            date = SHIPMONTH,
            revenue = GROSS_REVENUE,
            country = Country)
    @filter(country != "CA")
    @filter(!ismissing(date))
    @filter(revenue != 0.0)
end 


# logic to calculate summary stats
df_sum = @chain df begin
    @group_by(customer_id)
    @mutate(
        cohort = min(date)
    )
end

min(df[!, :date])

for df_sum I receive the following error:

ERROR: ArgumentError: argument is not a permutation Stacktrace: [1] invperm(a::Vector{Int64}) @ Base .\combinatorics.jl:282 [2] groupby(df::DataFrame, cols::Cols{Tuple{Symbol}}; sort::Bool, skipmissing::Bool) @ DataFrames C:path\.julia\packages\DataFrames\LteEl\src\groupeddataframe\groupeddataframe.jl:264 [3] top-level scope @ path.jl:453

When attemtping to identify the min date in the data.frame I receive the error:

ERROR: MethodError: no method matching min(::Vector{Union{Missing, Dates.Date}})

Closest candidates are: min(::Any, ::Missing) @ Base missing.jl:134 min(::Any, ::Any) @ Base operators.jl:481
min(::Any, ::Any, ::Any, ::Any...) @ Base operators.jl:578 ...

Stacktrace: [1] top-level scope @ c:\path\script.jl:28

Which indicates to me that min doesn't work where there is a Missing data type, but I'm not sure how to solve from there.


Solution

  • Elaborating on Bogumił Kamiński’s answer, you could try the following code:

    df_sum = @chain df begin
        @group_by(customer_id)
        @mutate(
            minimum_date = minimum(skipmissing(date))
        )
    end
    

    The other thing to consider is whether you want to add a column to your existing dataset, or whether you simply want to return the minimum date only for each customer.

    Here are two alternative approaches:

    The first one will return only the customer_id and the minimum date for each customer.

    df_sum = @chain df begin
        @group_by(customer_id)
        @summarize(
            minimum_date = minimum(skipmissing(date))
        )
    end
    

    In case you want to return the whole row, here’s the second approach:

    df_sum = @chain df begin
        @group_by(customer_id)
        @filter(
            date == minimum(skipmissing(date))
        )
        @ungroup
    end
    

    Without having access to the original dataset, it’s hard to confirm if these will work for you. If these don’t work, please let us know!