rdataframestatisticschi-squaredgoodness-of-fit

Chi-Square Test of Independence in r


I have a technical question related to my df structure. It looks like this:

    Month District   Age Gender Education Disability Religion                          Occupation JobSeekers GMI
1 2020-01      Dan   U17   Male      None       None   Jewish              Unprofessional workers          2   0
2 2020-01      Dan   U17   Male      None       None  Muslims          Sales and costumer service          1   0
3 2020-01      Dan   U17 Female      None       None    Other                           Undefined          1   0
4 2020-01      Dan 18-24   Male      None       None   Jewish         Production and construction          1   0
5 2020-01      Dan 18-24   Male      None       None   Jewish                     Academic degree          1   0
6 2020-01      Dan 18-24   Male      None       None   Jewish Practical engineers and technicians          1   0
  ACU NACU NewSeekers NewFiredSeekers
1   0    2          0               0
2   0    1          0               0
3   0    1          0               0
4   0    1          0               0
5   0    1          0               0
6   0    1          1               1

And I'm looking for a way to make an Chi-Square Test of Independence between 2 variables like district and JobSeekers so i can tell if Northern district related to jobseekers more than the southern for example. As far as i can tell, something is wrong with the data structure (District is a char and jobseekers is an integer which indicate how many jobseekers I have based on District, Gender, Occupation etc) I tried to subset it to district and jobseekers like this:

  Month   District  JobSeekers   GMI   ACU  NACU NewSeekers NewFiredSeekers
  <chr>   <chr>          <int> <int> <int> <int>      <int>           <int>
1 2020-01 Dan            33071  4694  9548 18829       6551            4682
2 2020-01 Jerusalem      21973  7665  3395 10913       3589            2260
3 2020-01 North          47589 22917  4318 20354       6154            3845
4 2020-01 Sharon         25403  6925  4633 13845       4131            2727
5 2020-01 South          37089 18874  2810 15405       4469            2342
6 2020-02 Dan            32660  4554  9615 18491       5529            3689

But it makes it harder to handle I'll accept any other test that will work of course.

Please help and let me know if there's more information that you need,

Moshe

Update

# t test for district vs new seekers

# sorting

dist.newseek <- Cdata %>% 
  group_by(Month,District) %>% 
  summarise(NewSeekers=sum(NewSeekers))

# performing a t test on the mini table we created

t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))

# results

Welch Two Sample t-test

data:  NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -119952.3  200737.3
sample estimates:
  mean in group Dan mean in group South 
74608.25            34215.75 

#wilcoxon test 

# filtering Cdata to New seekers based on month and age

age.newseek <- Cdata %>% 
  group_by(Month,Age) %>% 
  summarise(NewSeekers=sum(NewSeekers))

#performing a wilcoxon test on the subset 

wilcox.test(NewSeekers ~ Age,data=subset(age.newseek,Age %in% c("25-34","45-54")))

# Results

Wilcoxon rank sum exact test

data:  NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0

ANOVA test

# Sorting occupation and month by new seekers

occu.newseek <- Cdata %>% 
  group_by(Month,Occupation) %>% 
  summarise(NewSeekers=sum(NewSeekers))

## Make the Occupation as a factor

occu.newseek$District <- as.factor(occu.newseek$Occupation)

## Get the occupation group means and standart deviations

group.mean.sd <- aggregate(
  x = occu.newseek$NewSeekers, # Specify data column
  by = list(occu.newseek$Occupation), # Specify group indicator
  FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)

## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)

## Run the Tukey Test to compare the groups 
TukeyHSD(anova_one_way)

## Check the mean differences across the groups 

library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()

Plot


Solution

  • You can use an ANOVA test to compare multiple groups. If you find any statistically significant results with the omnibus ANOVA test, then you can check which district is different better or worse.

    You can also refer to UCLA's website that shows which tests one should use to test their data. The link is here.

    As a quick example, let me put here how to run an ANOVA test.

    Here is your data:

    head(df)
    
    r$> head(df)
        Month  District   Age Gender Education Disability Religion                          Occupation JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
    1 2020-01       Dan 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
    2 2020-01     North 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
    3 2020-01     North 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
    4 2020-01     South 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
    5 2020-01       Dan 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
    6 2020-01 Jerusalem 18-24   Male      None       Hard   Jewish Practical engineers and technicians          1   0   0    1          1               1
    

    Because I need more data points to carry out the test, I replicated your data by bootstrapping. I also increased the number of job seekers in the South and North regions. You do not need to do the following steps in your data. But this is how I did.

    # For the sake of this example, I increased the number of observation by bootstrapping the example data
    
    for(i in 1:20) df <- rbind(df[sample(6, 5), ],df)
    rownames(df) <- 1:nrow(df)
    df$District <- sample(c("Jerusalem", "North", "Sharon", "South", "Dan"), nrow(df),replace = T)
    df$JobSeekers[df$District == "North"] <- sample(1:3,length(df$JobSeekers[df$District == "North"]),replace=T,p=c(0.1,0.5,0.4))
    df$JobSeekers[df$District == "South"] <- sample(4:6,length(df$JobSeekers[df$District == "South"]),replace=T,p=c(0.1,0.5,0.4))
    

    It is preferable to make the character as a factor when you are analyzing categorical variables. By doing this, you can control the level of factors.

    ## Make the District as a factor
    
    df$District <- as.factor(df$District)
    

    Next, get the group means and standard deviations to see whether there are any meaningful differences across the group. As you see, I changed the south and north districts, so they have the highest mean scores compared to the rest districts.

    ## Get the group means and standart deviations
        group.mean.sd <- aggregate(
            x = df$JobSeekers, # Specify data column
            by = list(df$District), # Specify group indicator
            FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
        )
    
    r$> group.mean.sd
        Group.1    x.mean      x.sd
    1       Dan 1.1000000 0.3077935
    2 Jerusalem 1.0000000 0.0000000
    3     North 2.3225806 0.5992827
    4    Sharon 1.1363636 0.3512501
    5     South 5.2380952 0.4364358
    

    Finally, you can run an ANOVA test and Tukey test as follow.

    ## Run one way ANOVA test
        anova_one_way <- aov(JobSeekers~ District, data = df)
        summary(anova_one_way)
    
    r$> summary(anova_one_way)
                 Df Sum Sq Mean Sq F value Pr(>F)
    District      4 260.09   65.02   346.1 <2e-16 ***
    Residuals   101  18.97    0.19
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    ## Run the Tukey Test to compare the groups 
        TukeyHSD(anova_one_way)
    
    r$> Tukey multiple comparisons of means
        95% family-wise confidence level
    
    Fit: aov(formula = JobSeekers ~ District, data = df)
    
    $District
                            diff        lwr        upr     p adj
    Jerusalem-Dan    -0.10000000 -0.5396190  0.3396190 0.9695592
    North-Dan         1.22258065  0.8772809  1.5678804 0.0000000
    Sharon-Dan        0.03636364 -0.3356042  0.4083315 0.9987878
    South-Dan         4.13809524  3.7619337  4.5142567 0.0000000
    North-Jerusalem   1.32258065  0.9132542  1.7319071 0.0000000
    Sharon-Jerusalem  0.13636364 -0.2956969  0.5684241 0.9048406
    South-Jerusalem   4.23809524  3.8024191  4.6737714 0.0000000
    Sharon-North     -1.18621701 -1.5218409 -0.8505932 0.0000000
    South-North       2.91551459  2.5752488  3.2557803 0.0000000
    South-Sharon      4.10173160  3.7344321  4.4690311 0.0000000
    

    Lastly, you can plot which district has the highest job seekers with a bar plot.

    ## Check the mean differences across the groups 
    
    library(ggplot2)
    ggplot(df, aes(x = District, y = JobSeekers, fill = District)) +
        geom_boxplot() +
        geom_jitter(shape = 15,
            color = "steelblue",
            position = position_jitter(0.21)) +
        theme_classic()
    

    Job Seekers across Districts

    Update

    Based on your update, you can use the following syntax to abbreviate the x labels and change the legends.

    library(stringr)
    library(ggplot2)
    ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = str_wrap(Occupation,10))) +
        geom_boxplot() +
        geom_jitter(
            shape = 19,
            color = "black",
            position = position_jitter(0.21)
        ) +
         scale_x_discrete(
            labels =
                c(
                    "Academic degree" = "Academic",
                    "Practical engineers and technicians" = "Engineering",
                    'Production and construction'='Production',
                    "Sales and costumer service" = "Sales",
                    "Unprofessional workers" = "Unprofessional",
                    "Undefined" = "Undefined"
                )
        ) +
        labs(fill = "Occupation") +
            theme_classic()+
            theme(
                axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.key.height=unit(2, "cm")
                #legend.position = "top",
                
            )
    

    You should get a graph like that.

    Updated Graph