I want to plot the customer relationship between two variables, separated by a third (grouping) variable.
What are some options for displaying an x and a y variable grouped by a third variable?
If you give the plot function one argument, it draws a bar plot of counts for a factor, or the values plotted against their index for a numeric vector.
If you give the plot function two numeric vectors, it gives you a standard scatterplot of the first (horizontal axis) against the second (vertical axis).
If you have three variables (two numeric and one factor), you can plot the points and color-code them by the factor, then add a legend so the groups are obvious to the reader.
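Here is a small sketch of those cases with the built-in Orange data, so you can see what each call draws. The column choices are only for illustration.

```r
# Orange ships with R: Tree is a factor, age and circumference are numeric.
data(Orange)

# One factor: bar plot of how many rows fall in each level.
plot(Orange$Tree)

# One numeric vector: values plotted against their index (1, 2, 3, ...).
plot(Orange$age)

# Two numeric vectors: scatterplot of the second against the first.
plot(Orange$age, Orange$circumference)
```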
Using R's built-in dataset "Orange", you can make a plot like this:
> View(Orange)
> summary(Orange)
> plot(Orange$age, Orange$circumference, col = rainbow(5)[Orange$Tree], pch = 16, main = "Correlating Tree Age by Circumference")
> grid(nx = 25, ny = 25)
> legend("topleft", title = "Orange Trees", legend = levels(Orange$Tree), fill = rainbow(5))
Note: rainbow(5)? Why 5? Because the Tree column is a factor with five levels (trees 1 to 5), so you need one color per level. Since you have 3 different cosmetic brands, you should do rainbow(3).
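If you don't want to hard-code the number of colors, one option is to take it from the factor itself with nlevels. This is only a sketch; the variable names (groups, n_groups, group_cols) are mine, and you would swap in your own brand column for Orange$Tree.

```r
groups <- Orange$Tree          # your grouping factor goes here
n_groups <- nlevels(groups)    # number of distinct groups
group_cols <- rainbow(n_groups)

# Indexing the palette by the factor's integer codes gives one color per point.
plot(Orange$age, Orange$circumference,
     col = group_cols[as.integer(groups)], pch = 16,
     xlab = "Age", ylab = "Circumference")

# levels(groups) and group_cols are in the same order, so the legend matches.
legend("topleft", title = "Tree", legend = levels(groups), fill = group_cols)
```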
And this is how you add a linear regression line to that plot. You fit the line with the linear model (lm) function and then pass the model to abline:
> model <- lm(circumference ~ age, data = Orange)
> summary(model)
> abline(model)
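The single line above pools all five trees. If you would rather have one line per group in the same base plot, a rough sketch is to fit one lm per factor level and draw each with its group's color. The names orange, tree_levels, cols and fits are just mine for this example.

```r
# Plain data frame copy of the built-in data.
orange <- as.data.frame(Orange)
tree_levels <- levels(orange$Tree)
cols <- rainbow(length(tree_levels))

plot(orange$age, orange$circumference,
     col = cols[as.integer(orange$Tree)], pch = 16,
     xlab = "Age", ylab = "Circumference")

# Fit one simple linear model per tree.
fits <- lapply(tree_levels, function(lv) {
  lm(circumference ~ age, data = orange[orange$Tree == lv, ])
})

# Draw each fitted line in the color of its tree.
for (i in seq_along(fits)) {
  abline(fits[[i]], col = cols[i])
}
```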
You can also use xyplot from the lattice package.
> library(lattice)
> xyplot(circumference ~ age | Tree, data = Orange, type = c("p", "g", "r"), main = "Plots of Orange Age vs Circumference for 5 Orange Trees")
I didn't color-code my points, but I didn't need to, since each tree gets its own panel. While I like this plot, I think color-coding with the plot function is better for making statistical judgements, since it puts all the groups in the same graph.
Questions about how these functions work? Check their help pages:
> ?plot
> ?xyplot
> ?Orange
The scatterplot3d function (from the scatterplot3d package) is also pretty cool. You can make a three-dimensional plot with it, but how strong the correlation looks depends on the viewing angle you set.
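A rough sketch of that, assuming you have installed the scatterplot3d package (it does not ship with R). Putting the tree number on the y axis is my own choice for illustration. Drawing the same data at two angles shows how much the view changes the impression.

```r
# Convert the factor labels ("1".."5") to numbers so they can sit on an axis.
tree_id <- as.integer(as.character(Orange$Tree))

# Only runs if the package is installed: install.packages("scatterplot3d")
if (requireNamespace("scatterplot3d", quietly = TRUE)) {
  for (view_angle in c(30, 60)) {
    scatterplot3d::scatterplot3d(
      x = Orange$age, y = tree_id, z = Orange$circumference,
      color = rainbow(5)[as.integer(Orange$Tree)], pch = 16,
      angle = view_angle,
      xlab = "Age", ylab = "Tree", zlab = "Circumference",
      main = paste("View angle:", view_angle))
  }
}
```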
And you can also use the xyplot function to make a cooler graph: one with a separate regression line for each level of the factor, all in the same panel.
> xyplot(circumference ~ age, data = Orange, groups = Tree, type = c("p", "g", "r"), main = "Plots of Orange Age vs Circumference for 5 Orange Trees", pch = 16, auto.key = TRUE)
The legend I get from the auto.key argument is pretty terrible. It can be improved, I'm sure!
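One possible improvement, as a sketch rather than the definitive way: pass auto.key a list instead of TRUE, and set the point symbol through par.settings so the key picks up the same symbol as the plot. The particular choices here (key on the right, a "Tree" title) are my assumptions about what looks better.

```r
library(lattice)

p <- xyplot(circumference ~ age, data = Orange, groups = Tree,
            type = c("p", "g", "r"),
            # Set the symbol in the theme so the plot and the key agree.
            par.settings = list(superpose.symbol = list(pch = 16)),
            # Key on the right with a title, showing both points and lines.
            auto.key = list(space = "right", title = "Tree", cex.title = 1,
                            points = TRUE, lines = TRUE),
            main = "Plots of Orange Age vs Circumference for 5 Orange Trees")
print(p)
```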
If you want to plot two variables, one numeric and one factor, you can do it like this: use the tapply function to summarize the numeric variable within each group, then pass the result to the barplot function. Here I use tapply to sum all the circumferences for each Tree. This is probably what you had in mind.
> sum_table <- tapply(Orange$circumference, Orange$Tree, FUN = sum)
> sum_table <- sort(sum_table, decreasing = TRUE)
> barplot(sum_table, xlab = "Trees", ylab = "Circumference", main = "Sum of Circumferences for all 5 Orange Trees", col = "dodgerblue1")
Okay, never mind: the plot function defaults to making box plots when you give it a factor as the first argument and a numeric variable as the second.
> plot(Orange$Tree, Orange$circumference, main = "Boxplots of Orange Circumference vs Orange Trees", xlab = "Orange Trees", ylab = "Circumference")