r, r-markdown, data-manipulation

Error Forming a Boxplot and Scatterplot using 3 Variables in ggplot


I keep running into a problem in R when I try to create a side-by-side boxplot and a scatterplot using three variables. I am using the "Boston" dataset from the ISLR2 package, and I am unsure why the graphs look so weird. For the side-by-side boxplot, I am plotting 'medv' by two variables, 'cat_chas' and 'cat_rm', which are both created directly from the variables 'chas' and 'rm' in the Boston dataset using mutate(). For the scatterplot, I am using 'age' on the horizontal axis and 'medv' on the vertical axis, with the points colored by 'cat_rm'. Am I making a simple mistake somewhere?

Side-by-side Boxplot

library(ISLR2)
library(dplyr)
library(tidyverse)

data = data.frame(Boston)
data <- mutate(data, cat_chas = chas, cat_rm = rm)


ggplot(data, aes(x=cat_chas, y=medv)) +
  geom_boxplot(fill="green", color="black") +
  facet_wrap(~ cat_rm, ncol=3) +
  labs(title="Boxplot of medv by cat_chas and cat_rm",
       x="cat_chas",
       y="Median Value ($1000s)") +
  theme_minimal()

Boxplot output

Scatterplot

library(dplyr)
ggplot(BSTN, aes(x = age, y = medv, color = cat_rm)) +
  geom_point() +
  labs(title = "Scatter plot of MEDV", color = "Category RM", x = "Age", y = "MEDV") +
  theme(plot.title = element_text(color = "blue", size = 17), plot.background = element_rect(fill = "orange"))

Scatterplot output

I thought the problem could be that I am using a data frame instead of a data set, so I tried switching it around, but I still get the same result.


Solution

  • In the boxplot, your variables do not have a suitable type. For example, inspecting the data shows:

    $ cat_rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430,

    cat_rm is numeric, with too many distinct values to be used meaningfully inside facet_wrap() or as a color in the scatterplot.

    This is why you get a separate panel for each value of cat_rm.

    In the scatterplot, the dataset passed to ggplot() is BSTN. It is not defined in your example code and is not part of the packages you load. Should it be data? If I use data as the dataset within ggplot():

    ggplot(data, aes(x = age, y = medv, color = cat_rm)) +
          geom_point() +
          labs(title = "Scatter plot of MEDV", color = "Category RM", x = "Age", y = "MEDV") +
          theme(plot.title = element_text(color = "blue", size = 17), plot.background = element_rect(fill = "orange"))
    

    I get the following result:

    [Image: resulting scatterplot]
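
    To act on the point about cat_rm being numeric, one option is to turn both variables into real categories before plotting. The following is only a sketch: the cut points 6 and 7, the labels "Small"/"Medium"/"Large" and "No"/"Yes", and the name boston_cat are assumptions made for illustration, not something given in the question. Converting chas to a factor should also help the boxplot, since geom_boxplot() expects a discrete x (or an explicit group) to draw one box per level.

```r
library(ISLR2)    # provides the Boston data set
library(dplyr)
library(ggplot2)

# Assumption: the breaks (6 and 7) and all labels below are illustrative only;
# replace them with whatever definition of the categories you actually need.
boston_cat <- Boston %>%
  mutate(
    # chas is coded 0/1, so make it a two-level factor
    cat_chas = factor(chas, levels = c(0, 1), labels = c("No", "Yes")),
    # bin the continuous rm into three ordered groups
    cat_rm = cut(rm,
                 breaks = c(-Inf, 6, 7, Inf),
                 labels = c("Small", "Medium", "Large"))
  )

# Boxplot: one box per cat_chas level, one panel per cat_rm level
p_box <- ggplot(boston_cat, aes(x = cat_chas, y = medv)) +
  geom_boxplot(fill = "green", color = "black") +
  facet_wrap(~ cat_rm, ncol = 3) +
  labs(title = "Boxplot of medv by cat_chas and cat_rm",
       x = "cat_chas",
       y = "Median Value ($1000s)") +
  theme_minimal()

# Scatterplot: cat_rm is now a factor, so the color legend is discrete
p_scatter <- ggplot(boston_cat, aes(x = age, y = medv, color = cat_rm)) +
  geom_point() +
  labs(title = "Scatter plot of MEDV", color = "Category RM",
       x = "Age", y = "MEDV")

print(p_box)
print(p_scatter)
```

    With three levels in cat_rm, facet_wrap() produces three panels instead of one per distinct rm value, and the scatterplot gets three colors rather than a continuous gradient.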