Tags: r, event-log, process-mining, bupar

R bupar: Get trace for each case


I use the bupaR package to do process analysis. Suppose my data, stored in a CSV file, looks like this (the file is already sorted by CASEID and timestamp):

STATUS;timestamp;CASEID
created;16-02-2023 09:46:32;1
revised;13-04-2023 23:58:59;1
accepted;13-04-2023 23:59:59;1
created;16-02-2023 09:46:32;2
accepted;13-04-2023 23:59:59;2
created;14-12-2022 13:17:54;3
revised;02-01-2023 23:59:59;3
accepted;28-02-2023 19:37:01;3
submitted;03-03-2023 23:59:59;3
created;02-01-2023 07:45:43;5
created;24-01-2022 16:05:58;6
accepted;03-02-2022 23:59:59;6
created;24-01-2022 15:52:53;7
accepted;03-02-2022 23:59:59;7
created;15-08-2022 12:54:23;8
rejected;18-08-2022 23:59:59;8
created;21-03-2022 15:32:05;9
accepted;26-04-2022 23:59:59;9
created;21-03-2022 15:42:39;10

The first case, with id 1, has the trace "created-revised-accepted": first comes the event created, then revised, and then accepted.

I now use the following code to create a process map:

library(bupaR)
library(processmapR)
library(edeaR)

datafile <- read.csv(file = "pathtofile\\testfile.csv", header = TRUE, sep = ";")
datafile$timestampcolumn <- as.POSIXct(datafile$timestamp, format="%d-%m-%Y %H:%M:%S")

mytest <- simple_eventlog(datafile, case_id = "CASEID", activity_id = "STATUS", timestamp = "timestampcolumn")

process_map(mytest, type = frequency("absolute"))

This gives:

[process map image]

Now I would like to add the trace for each case to my original file. The trace is of course always the same for a case. So the output should look like this (with the events in the trace separated by, for example, "-"):

STATUS;timestamp;CASEID;trace
created;16-02-2023 09:46:32;1;created-revised-accepted
revised;13-04-2023 23:58:59;1;created-revised-accepted
accepted;13-04-2023 23:59:59;1;created-revised-accepted
created;16-02-2023 09:46:32;2;created-accepted
accepted;13-04-2023 23:59:59;2;created-accepted
created;14-12-2022 13:17:54;3;created-revised-accepted-submitted
revised;02-01-2023 23:59:59;3;created-revised-accepted-submitted
accepted;28-02-2023 19:37:01;3;created-revised-accepted-submitted
submitted;03-03-2023 23:59:59;3;created-revised-accepted-submitted
created;02-01-2023 07:45:43;5;created
created;24-01-2022 16:05:58;6;created-accepted
accepted;03-02-2022 23:59:59;6;created-accepted
created;24-01-2022 15:52:53;7;created-accepted
accepted;03-02-2022 23:59:59;7;created-accepted
created;15-08-2022 12:54:23;8;created-rejected
rejected;18-08-2022 23:59:59;8;created-rejected
created;21-03-2022 15:32:05;9;created-accepted
accepted;26-04-2022 23:59:59;9;created-accepted
created;21-03-2022 15:42:39;10;created

I tried to play around with filter_activity, trace_list (from the edeaR package) and other commands, but I was not able to figure it out. I want to use the results from the process_map algorithm / bupaR event log code, so that the traces correspond to the output in the graph. Of course I could write my own algorithm that goes through each case and writes down the statuses, but I do not want to do that: this information must already be computed somewhere by the bupaR event log / process_map commands, otherwise the graph could not exist. I want to dig into the details and see which cases had a specific trace according to the graph, so it is important that the result is consistent with the bupaR output rather than calculated separately.

So how can I achieve this?


Solution

  • I have never worked with any of these packages, but I solved the problem like this:

    1. I looked at the class of mytest:
    class(mytest)
    # [1] "eventlog"   "log"        "tbl_df"     "tbl"        "data.frame"
    
    2. I looked at the methods that are defined for class eventlog:
    methods(class = "eventlog")
    # [1] act_collapse                     activities                       activity_frequency              
    # [4] activity_instance_id             activity_presence                add_end_activity                
    # [7] add_start_activity               arrange                          calculate_queuing_times         
    # [10] case_id                          case_list                        cases                           
    # [13] detect_resource_inconsistencies  dotted_chart                     durations                       
    # [16] end_activities                   events_to_activitylog            filter                          
    # [19] filter_activity_instance         filter_attributes                filter_endpoints_condition      
    # [22] filter_infrequent_flows          filter_lifecycle                 filter_lifecycle_presence       
    # [25] filter_precedence_resource       filter_time_period               filter_trim                     
    # [28] filter_trim_lifecycle            first_n                          fix_resource_inconsistencies    
    # [31] group_by                         group_by_activity                group_by_activity_instance      
    # [34] group_by_case                    group_by_resource                group_by_resource_activity      
    # [37] idle_time                        last_n                           lifecycle_id                    
    # [40] lifecycle_labels                 lifecycles                       lined_chart                     
    # [43] mapping                          mutate                           n_activity_instances            
    # [46] n_events                         number_of_repetitions            number_of_selfloops             
    # [49] process_map                      process_matrix                   processing_time                 
    # [52] redo_repetitions_referral_matrix redo_selfloops_referral_matrix   resource_frequency              
    # [55] resource_id                      resource_map                     resource_matrix                 
    # [58] resources                        sample_n                         select                          
    # [61] set_activity_instance_id         set_timestamp                    setdiff                         
    # [64] size_of_repetitions              size_of_selfloops                slice_activities                
    # [67] slice_events                     standardize_lifecycle            start_activities                
    # [70] summarise                        summary                          throughput_time                 
    # [73] timestamp                        timestamps                       to_activitylog                  
    # [76] trace_explorer                   trace_length                     trace_list                      
    # [79] ungroup_eventlog                 unite
    
    3. I played with several functions until I found the one that solves your problem: case_list (a quick look at its output follows the setup below).

    Setup

    library(bupaR)
    library(processmapR)
    library(edeaR)
    library(dplyr)
    
    d <- readr::read_delim(
    "STATUS;timestamp;CASEID
    created;16-02-2023 09:46:32;1
    revised;13-04-2023 23:58:59;1
    accepted;13-04-2023 23:59:59;1
    created;16-02-2023 09:46:32;2
    accepted;13-04-2023 23:59:59;2
    created;14-12-2022 13:17:54;3
    revised;02-01-2023 23:59:59;3
    accepted;28-02-2023 19:37:01;3
    submitted;03-03-2023 23:59:59;3
    created;02-01-2023 07:45:43;5
    created;24-01-2022 16:05:58;6
    accepted;03-02-2022 23:59:59;6
    created;24-01-2022 15:52:53;7
    accepted;03-02-2022 23:59:59;7
    created;15-08-2022 12:54:23;8
    rejected;18-08-2022 23:59:59;8
    created;21-03-2022 15:32:05;9
    accepted;26-04-2022 23:59:59;9
    created;21-03-2022 15:42:39;10", delim = ";")
    
    d$timestampcolumn <- as.POSIXct(d$timestamp, format="%d-%m-%Y %H:%M:%S")
    mytest <- simple_eventlog(d, 
                              case_id = "CASEID", 
                              activity_id = "STATUS", 
                              timestamp = "timestampcolumn")
    process_map(mytest, type = frequency("absolute"))
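
    Before joining, it can help to look at what case_list() returns on its own. This is just a quick check (assuming your bupaR version exposes case_list() as in the method listing above; the exact columns may vary slightly, but there should be one row per CASEID together with a trace column):

    # One row per case; the trace column holds the activity sequence
    # as a comma-separated string (e.g. "created,revised,accepted")
    case_list(mytest)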
    

    Solution

    d %>% 
      inner_join(case_list(mytest) %>% 
                   select(CASEID, trace),
                 by = "CASEID")
    # # A tibble: 19 × 5
    #    STATUS    timestamp           CASEID timestampcolumn     trace                             
    #    <chr>     <chr>                <dbl> <dttm>              <chr>                             
    #  1 created   16-02-2023 09:46:32      1 2023-02-16 09:46:32 created,revised,accepted          
    #  2 revised   13-04-2023 23:58:59      1 2023-04-13 23:58:59 created,revised,accepted          
    #  3 accepted  13-04-2023 23:59:59      1 2023-04-13 23:59:59 created,revised,accepted          
    #  4 created   16-02-2023 09:46:32      2 2023-02-16 09:46:32 created,accepted                  
    #  5 accepted  13-04-2023 23:59:59      2 2023-04-13 23:59:59 created,accepted                  
    #  6 created   14-12-2022 13:17:54      3 2022-12-14 13:17:54 created,revised,accepted,submitted
    #  7 revised   02-01-2023 23:59:59      3 2023-01-02 23:59:59 created,revised,accepted,submitted
    #  8 accepted  28-02-2023 19:37:01      3 2023-02-28 19:37:01 created,revised,accepted,submitted
    #  9 submitted 03-03-2023 23:59:59      3 2023-03-03 23:59:59 created,revised,accepted,submitted
    # 10 created   02-01-2023 07:45:43      5 2023-01-02 07:45:43 created                           
    # 11 created   24-01-2022 16:05:58      6 2022-01-24 16:05:58 created,accepted                  
    # 12 accepted  03-02-2022 23:59:59      6 2022-02-03 23:59:59 created,accepted                  
    # 13 created   24-01-2022 15:52:53      7 2022-01-24 15:52:53 created,accepted                  
    # 14 accepted  03-02-2022 23:59:59      7 2022-02-03 23:59:59 created,accepted                  
    # 15 created   15-08-2022 12:54:23      8 2022-08-15 12:54:23 created,rejected                  
    # 16 rejected  18-08-2022 23:59:59      8 2022-08-18 23:59:59 created,rejected                  
    # 17 created   21-03-2022 15:32:05      9 2022-03-21 15:32:05 created,accepted                  
    # 18 accepted  26-04-2022 23:59:59      9 2022-04-26 23:59:59 created,accepted                  
    # 19 created   21-03-2022 15:42:39     10 2022-03-21 15:42:39 created
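
    The trace column from case_list() uses "," as the separator. If you want the "-"-separated format from your question and want to write the result back to a semicolon-delimited file, a small follow-up could look like this (the output file name is just a placeholder):

    d %>% 
      inner_join(case_list(mytest) %>% 
                   select(CASEID, trace),
                 by = "CASEID") %>% 
      mutate(trace = gsub(",", "-", trace, fixed = TRUE)) %>%  # use "-" as separator, as requested
      select(STATUS, timestamp, CASEID, trace) %>%             # keep the original columns plus trace
      readr::write_delim("testfile_with_trace.csv", delim = ";")  # placeholder output path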