I have a technical question related to my df structure. It looks like this:
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI
1 2020-01 Dan U17 Male None None Jewish Unprofessional workers 2 0
2 2020-01 Dan U17 Male None None Muslims Sales and costumer service 1 0
3 2020-01 Dan U17 Female None None Other Undefined 1 0
4 2020-01 Dan 18-24 Male None None Jewish Production and construction 1 0
5 2020-01 Dan 18-24 Male None None Jewish Academic degree 1 0
6 2020-01 Dan 18-24 Male None None Jewish Practical engineers and technicians 1 0
ACU NACU NewSeekers NewFiredSeekers
1 0 2 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
5 0 1 0 0
6 0 1 1 1
And I'm looking for a way to make an Chi-Square Test of Independence between 2 variables like district and JobSeekers so i can tell if Northern district related to jobseekers more than the southern for example. As far as i can tell, something is wrong with the data structure (District is a char and jobseekers is an integer which indicate how many jobseekers I have based on District, Gender, Occupation etc) I tried to subset it to district and jobseekers like this:
Month District JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
<chr> <chr> <int> <int> <int> <int> <int> <int>
1 2020-01 Dan 33071 4694 9548 18829 6551 4682
2 2020-01 Jerusalem 21973 7665 3395 10913 3589 2260
3 2020-01 North 47589 22917 4318 20354 6154 3845
4 2020-01 Sharon 25403 6925 4633 13845 4131 2727
5 2020-01 South 37089 18874 2810 15405 4469 2342
6 2020-02 Dan 32660 4554 9615 18491 5529 3689
But it makes it harder to handle I'll accept any other test that will work of course.
Please help and let me know if there's more information that you need,
Moshe
Update
# t test for district vs new seekers
# sorting
dist.newseek <- Cdata %>%
group_by(Month,District) %>%
summarise(NewSeekers=sum(NewSeekers))
# performing a t test on the mini table we created
t.test(NewSeekers ~ District,data=subset(dist.newseek,District %in% c("Dan","South")))
# results
Welch Two Sample t-test
data: NewSeekers by District
t = 0.68883, df = 4.1617, p-value = 0.5274
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-119952.3 200737.3
sample estimates:
mean in group Dan mean in group South
74608.25 34215.75
#wilcoxon test
# filtering Cdata to New seekers based on month and age
age.newseek <- Cdata %>%
group_by(Month,Age) %>%
summarise(NewSeekers=sum(NewSeekers))
#performing a wilcoxon test on the subset
wilcox.test(NewSeekers ~ Age,data=subset(age.newseek,Age %in% c("25-34","45-54")))
# Results
Wilcoxon rank sum exact test
data: NewSeekers by Age
W = 11, p-value = 0.4857
alternative hypothesis: true location shift is not equal to 0
# Sorting occupation and month by new seekers
occu.newseek <- Cdata %>%
group_by(Month,Occupation) %>%
summarise(NewSeekers=sum(NewSeekers))
## Make the Occupation as a factor
occu.newseek$District <- as.factor(occu.newseek$Occupation)
## Get the occupation group means and standart deviations
group.mean.sd <- aggregate(
x = occu.newseek$NewSeekers, # Specify data column
by = list(occu.newseek$Occupation), # Specify group indicator
FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)
## Run one way ANOVA test
anova_one_way <- aov(NewSeekers~ Occupation, data = occu.newseek)
summary(anova_one_way)
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
## Check the mean differences across the groups
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = Occupation)) +
geom_boxplot() +
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
You can use an ANOVA test to compare multiple groups. If you find any statistically significant results with the omnibus ANOVA test, then you can check which district is different better or worse.
You can also refer to UCLA's website that shows which tests one should use to test their data. The link is here.
As a quick example, let me put here how to run an ANOVA test.
Here is your data:
head(df)
r$> head(df)
Month District Age Gender Education Disability Religion Occupation JobSeekers GMI ACU NACU NewSeekers NewFiredSeekers
1 2020-01 Dan 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
2 2020-01 North 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
3 2020-01 North 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
4 2020-01 South 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
5 2020-01 Dan 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
6 2020-01 Jerusalem 18-24 Male None Hard Jewish Practical engineers and technicians 1 0 0 1 1 1
Because I need more data points to carry out the test, I replicated your data by bootstrapping. I also increased the number of job seekers in the South and North regions. You do not need to do the following steps in your data. But this is how I did.
# For the sake of this example, I increased the number of observation by bootstrapping the example data
for(i in 1:20) df <- rbind(df[sample(6, 5), ],df)
rownames(df) <- 1:nrow(df)
df$District <- sample(c("Jerusalem", "North", "Sharon", "South", "Dan"), nrow(df),replace = T)
df$JobSeekers[df$District == "North"] <- sample(1:3,length(df$JobSeekers[df$District == "North"]),replace=T,p=c(0.1,0.5,0.4))
df$JobSeekers[df$District == "South"] <- sample(4:6,length(df$JobSeekers[df$District == "South"]),replace=T,p=c(0.1,0.5,0.4))
It is preferable to make the character as a factor when you are analyzing categorical variables. By doing this, you can control the level of factors.
## Make the District as a factor
df$District <- as.factor(df$District)
Next, get the group means and standard deviations to see whether there are any meaningful differences across the group. As you see, I changed the south and north districts, so they have the highest mean scores compared to the rest districts.
## Get the group means and standart deviations
group.mean.sd <- aggregate(
x = df$JobSeekers, # Specify data column
by = list(df$District), # Specify group indicator
FUN = function(x) c('mean'=mean(x),'sd'= sd(x))
)
r$> group.mean.sd
Group.1 x.mean x.sd
1 Dan 1.1000000 0.3077935
2 Jerusalem 1.0000000 0.0000000
3 North 2.3225806 0.5992827
4 Sharon 1.1363636 0.3512501
5 South 5.2380952 0.4364358
Finally, you can run an ANOVA test and Tukey test as follow.
## Run one way ANOVA test
anova_one_way <- aov(JobSeekers~ District, data = df)
summary(anova_one_way)
r$> summary(anova_one_way)
Df Sum Sq Mean Sq F value Pr(>F)
District 4 260.09 65.02 346.1 <2e-16 ***
Residuals 101 18.97 0.19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Run the Tukey Test to compare the groups
TukeyHSD(anova_one_way)
r$> Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = JobSeekers ~ District, data = df)
$District
diff lwr upr p adj
Jerusalem-Dan -0.10000000 -0.5396190 0.3396190 0.9695592
North-Dan 1.22258065 0.8772809 1.5678804 0.0000000
Sharon-Dan 0.03636364 -0.3356042 0.4083315 0.9987878
South-Dan 4.13809524 3.7619337 4.5142567 0.0000000
North-Jerusalem 1.32258065 0.9132542 1.7319071 0.0000000
Sharon-Jerusalem 0.13636364 -0.2956969 0.5684241 0.9048406
South-Jerusalem 4.23809524 3.8024191 4.6737714 0.0000000
Sharon-North -1.18621701 -1.5218409 -0.8505932 0.0000000
South-North 2.91551459 2.5752488 3.2557803 0.0000000
South-Sharon 4.10173160 3.7344321 4.4690311 0.0000000
Lastly, you can plot which district has the highest job seekers with a bar plot.
## Check the mean differences across the groups
library(ggplot2)
ggplot(df, aes(x = District, y = JobSeekers, fill = District)) +
geom_boxplot() +
geom_jitter(shape = 15,
color = "steelblue",
position = position_jitter(0.21)) +
theme_classic()
Based on your update, you can use the following syntax to abbreviate the x labels and change the legends.
library(stringr)
library(ggplot2)
ggplot(occu.newseek, aes(x = Occupation, y = NewSeekers, fill = str_wrap(Occupation,10))) +
geom_boxplot() +
geom_jitter(
shape = 19,
color = "black",
position = position_jitter(0.21)
) +
scale_x_discrete(
labels =
c(
"Academic degree" = "Academic",
"Practical engineers and technicians" = "Engineering",
'Production and construction'='Production',
"Sales and costumer service" = "Sales",
"Unprofessional workers" = "Unprofessional",
"Undefined" = "Undefined"
)
) +
labs(fill = "Occupation") +
theme_classic()+
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), legend.key.height=unit(2, "cm")
#legend.position = "top",
)
You should get a graph like that.