r dataframe arules market-basket-analysis

Correctly converting a data frame into transactions for arules in R


I need to mine association rules in R, and I found an example (http://www.salemmarafi.com/code/market-basket-analysis-with-r/). It works with data(Groceries), but the author also provides the original dataset, Groceries.csv.

structure(list(chocolate = structure(c(9L, 13L, 1L, 8L, 16L, 
2L, 14L, 11L, 7L, 15L, 17L, 5L, 10L, 4L, 3L, 6L, 2L, 18L, 12L
), .Label = c("bottled water", "canned beer", "chicken,citrus fruit,tropical fruit,root vegetables,whole milk,frozen fish,rollsbuns", 
"chicken,pip fruit,other vegetables,whole milk,dessert,yogurt,whippedsour cream,rollsbuns,pasta,soda,waffles", 
"citrus fruit,pip fruit,root vegetables,other vegetables,whole milk,cream cheese ,domestic eggs,brown bread,margarine,baking powder,waffles", 
"frankfurter,citrus fruit,onions,other vegetables,whole milk,rollsbuns,sugar,soda", 
"frankfurter,rollsbuns,bottled water,fruitvegetable juice,hygiene articles", 
"frankfurter,sausage,butter,whippedsour cream,rollsbuns,margarine,spices", 
"fruitvegetable juice", "hamburger meat,other vegetables,whole milk,curd,yogurt,rollsbuns,pastry,semi-finished bread,margarine,bottled water,fruitvegetable juice", 
"meat,citrus fruit,berries,root vegetables,whole milk,soda", 
"packaged fruitvegetables,whole milk,curd,yogurt,domestic eggs,brown bread,mustard,pickled vegetables,bottled water,misc. beverages", 
"pickled vegetables,coffee", "root vegetables", "tropical fruit,margarine,rum", 
"tropical fruit,pip fruit,onions,other vegetables,whole milk,domestic eggs,sugar,soups,tea,soda,hygiene articles,napkins", 
"tropical fruit,root vegetables,herbs,whole milk,butter milk,whippedsour cream,flour,hygiene articles", 
"turkey,pip fruit,salad dressing,pastry"), class = "factor")), .Names = "chocolate", class = "data.frame", row.names = c(NA, 
-19L))

I load this data:

g=read.csv("g.csv",sep=";")

Then I have to convert it to transactions, as arules requires:

library(arules)  # provides the "transactions" class and its coercion methods
#'@importClassesFrom arules transactions
trans = as(g, "transactions")

Let's examine data(Groceries):

> str(Groceries)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:43367] 13 60 69 78 14 29 98 24 15 29 ...
  .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
  .. .. ..@ Dim     : int [1:2] 169 9835
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 169 obs. of  3 variables:
  .. ..$ labels: chr [1:169] "frankfurter" "sausage" "liver loaf" "ham" ...
  .. ..$ level2: Factor w/ 55 levels "baby food","bags",..: 44 44 44 44 44 44 44 42 42 41 ...
  .. ..$ level1: Factor w/ 10 levels "canned food",..: 6 6 6 6 6 6 6 6 6 6 ...
  ..@ itemsetInfo:'data.frame': 0 obs. of  0 variables
>

And here is my data, converted from the original CSV:

> str(trans)
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:9835] 1265 6162 6377 4043 3585 6475 4431 3535 4401 6490 ...
  .. .. ..@ p       : int [1:9836] 0 1 2 3 4 5 6 7 8 9 ...
  .. .. ..@ Dim     : int [1:2] 7011 9835
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 7011 obs. of  3 variables:
  .. ..$ labels   : chr [1:7011] "tr=abrasive cleaner" "tr=abrasive cleaner,napkins" "tr=artif. sweetener" "tr=artif. sweetener,coffee" ...
  .. ..$ variables: Factor w/ 1 level "tr": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ levels   : Factor w/ 7011 levels "abrasive cleaner",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..@ itemsetInfo:'data.frame': 9835 obs. of  1 variable:
  .. ..$ transactionID: chr [1:9835] "1" "2" "3" "4" ...
> 

We can see that data(Groceries) holds

transactions in sparse format with
 9835 transactions (rows) and
 169 items (columns)

while my trans data holds

 9835 transactions (rows) and
 7011 items (columns)

That is, I get 7011 columns from Groceries.csv, whereas the built-in example has 169.

Why is that? How do I convert this file correctly? I need to understand this, because I can't work with my file as it is.

I tried to find a similar topic, but these two posts didn't help me: "How to prep transaction data into basket for arules" and "R (arules) Convert dataframe into transactions and remove NA".


Solution

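  • This happens because the downloaded file is comma-delimited, but g=read.csv("g.csv",sep=";") splits the data on semicolons. Each whole line then ends up as a single value, so every distinct basket becomes its own item. That is why you get 7011 items, with exactly one item per transaction, instead of 169. You should get the desired output if you split on commas instead of sep = ";".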

    See the following, which defines sep as ;:

    > trans <-  read.transactions("~/Downloads/groceries.csv", format = 'basket', sep = ';')
    > str(trans)
    Formal class 'transactions' [package "arules"] with 3 slots
      ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
      .. .. ..@ i       : int [1:9835] 1265 6162 6377 4043 3585 6475 4431 3535 4401 6490 ...
      .. .. ..@ p       : int [1:9836] 0 1 2 3 4 5 6 7 8 9 ...
      .. .. ..@ Dim     : int [1:2] 7011 9835
      .. .. ..@ Dimnames:List of 2
      .. .. .. ..$ : NULL
      .. .. .. ..$ : NULL
      .. .. ..@ factors : list()
      ..@ itemInfo   :'data.frame': 7011 obs. of  1 variable:
      .. ..$ labels: chr [1:7011] "abrasive cleaner" "abrasive cleaner,napkins" "artif. sweetener" "artif. sweetener,coffee" ...
      ..@ itemsetInfo:'data.frame': 0 obs. of  0 variables
    

    And this, which defines sep as ,:

    > trans <-  read.transactions("~/Downloads/groceries.csv", format = 'basket', sep = ',')
    > str(trans)
    Formal class 'transactions' [package "arules"] with 3 slots
      ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
      .. .. ..@ i       : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
      .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
      .. .. ..@ Dim     : int [1:2] 169 9835
      .. .. ..@ Dimnames:List of 2
      .. .. .. ..$ : NULL
      .. .. .. ..$ : NULL
      .. .. ..@ factors : list()
      ..@ itemInfo   :'data.frame': 169 obs. of  1 variable:
      .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
      ..@ itemsetInfo:'data.frame': 0 obs. of  0 variables