r ggplot2 na

Keep NA values in ggplot2 in R


I thought this would be easy, but I couldn't find a solution. I am trying to make a ggplot2 scatter plot in R of col1 against col2, with the size of each dot set by col3 and its shape by col4. Both col3 and col4 have NA (missing) values. When I run the code below, ggplot2 removes the rows that have no col3 and/or col4 value; I want to keep these rows and color-code them. My code and an example data frame are at the end of the question. The output is:

Warning: Removed 3 rows containing missing values (geom_point).

  1. I tried to create another geom_point with is.na(df$col3 | df$col4), but that didn't work.
  2. I tried adding na.rm=FALSE in
geom_point(aes(size=df$col3, col=df$col4), na.rm=FALSE)

  3. I tried
scale_size(range = c(0.25,4), na.value = 0) #to give a 0 value to the na.value (although would rather not)

But I ended up with "Ignoring unknown aesthetics: na.rm" for #2 and #3, and #1 gave an error. Also, that doesn't fix the issue that the col4 shapes are being removed too.

ggplot(df, aes(x=df$col1, y=df$col2)) + 
    geom_point(aes(size=df$col3, col=df$col4), na.rm=FALSE) + 
    theme_classic() + 
    scale_size(range = c(0.25,4)) 
             
+-------------+-------------+-------------+----------+
|    col1     |    col2     |    col3     |   col4   |
+-------------+-------------+-------------+----------+
| 0.254393811 | 0.124242905 | NA          | NA       |
|  0.28223149 | 0.148601748 | 0.236953099 | CD8CTL   |
| 0.205945835 | 0.074541695 | NA          | NA       |
| 0.199758631 | 0.103369485 | NA          | CD8Mem   |
|   0.2798128 | 0.109511863 | 0.396113132 | CD8STAT1 |
| 0.254616042 | 0.059495241 | 0.479590212 | CD8CTL   |
| 0.197929395 |  0.10993698 | 0.272611442 | CD8CTL   |
| 0.294888359 |  0.12319682 | 0.16069263  | CD8CTL   |
| 0.191407446 | 0.086443936 | 0.36596486  | CD8CTL   |
| 0.267533392 |  0.11240525 | 0.344659516 | CD8CTL   |
+-------------+-------------+-------------+----------+

Of the 10 rows, only the complete ones (those with no NA values) are plotted.


Solution

  • There are a few things to note. I think I have understood what the OP is looking to do here: you want all points to plot. Here is how I think the plot should look:

    We have NA values in col3 and col4. So what to do with those? For color, I'm going to keep those points, color-code them, and include them in the legend labeled as "NA". What about size? size=NA doesn't make any sense, so I think the best thing to do for rows where col3 is NA (that is, is.na(col3)) is to change the shape. Here's what I've done:

    ggplot(df, aes(x=col1, y=col2, color=col4)) +
      geom_point(aes(size=col3, shape='Not NA')) +
      geom_point(data=subset(df, is.na(col3)), aes(shape='NA'), size=3) +
      scale_shape_manual(values=c('NA'=3, 'Not NA'=19)) +
      theme_classic()
    


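    First of all, it's bad form to reference columns via data.frame$column.name inside aes() - you should use just the column name itself. ggplot2 looks up bare column names in the data given to each layer, which matters as soon as a layer uses a different dataset, such as a subset.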
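
To see why the bare column name matters once a layer gets its own data, here is a minimal sketch. The data frame `toy` and its values are made up for illustration and are not the question's table. With bare names the subset layer builds; with `toy$col1` the aesthetic always has 4 values while the layer's data has 2 rows, so building the plot should fail with a length error (and recent ggplot2 versions also warn that `$` inside aes() is discouraged).

```r
library(ggplot2)

# Toy data (hypothetical values, not the question's table)
toy <- data.frame(
  col1 = c(0.25, 0.28, 0.21, 0.20),
  col2 = c(0.12, 0.15, 0.07, 0.10),
  col3 = c(NA, 0.24, NA, 0.40)
)

# Bare column names: the layer looks them up in its own (subset) data
p_ok <- ggplot(toy, aes(x = col1, y = col2)) +
  geom_point(data = subset(toy, is.na(col3)), shape = 3, size = 3)

# toy$col1 is always length 4, but this layer's data has only 2 rows
p_bad <- ggplot(toy, aes(x = toy$col1, y = toy$col2)) +
  geom_point(data = subset(toy, is.na(col3)), shape = 3, size = 3)

# Aesthetics are evaluated when the plot is built, so test the build step
ok_built  <- tryCatch({ ggplot_build(p_ok);  TRUE }, error = function(e) FALSE)
bad_built <- tryCatch({ ggplot_build(p_bad); TRUE }, error = function(e) FALSE)
```

Here `ok_built` and `bad_built` only record whether each plot could be built; printing `p_ok` draws the two rows where col3 is missing.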

    Color is easy - we just put color=col4 in the top aes() specification, since it's applied to every geom.

    For the shape, it's probably easiest to use two separate calls to geom_point(). The first one uses the full data and will naturally drop any rows where col3 is NA, since a point with size=NA can't be drawn. To "add back in" the NA points, we pull those rows out with subset() and set a fixed size outside aes(). Finally, to get the shape aesthetic into a legend, it has to go inside aes(). The general rule is that if you set an aesthetic equal to a column name inside aes(), the values in that column are used for the legend labels. If you instead put a constant character string inside aes(), as we did here, every point in that geom call gets that label, and a legend is still created. So we are basically building our own custom legend for shape.

    Then it's just a matter of using scale_shape_manual() and a named vector for the values argument to set the actual shape we want to use.
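
One way to check that the constant labels and the named vector line up is to look at the shape each layer ends up with. This is a minimal sketch on a made-up data frame (`toy` is hypothetical, not the question's data); it uses ggplot2's `layer_data()` to pull the computed aesthetics for a given layer.

```r
library(ggplot2)

# Toy data (hypothetical values, not the question's table)
toy <- data.frame(
  col1 = c(0.25, 0.28, 0.21, 0.20),
  col2 = c(0.12, 0.15, 0.07, 0.10),
  col3 = c(NA, 0.24, NA, 0.40)
)

p <- ggplot(toy, aes(x = col1, y = col2)) +
  # Layer 1: constant label "Not NA" inside aes() feeds the shape scale
  geom_point(aes(size = col3, shape = "Not NA")) +
  # Layer 2: only the rows with missing col3, fixed size, label "NA"
  geom_point(data = subset(toy, is.na(col3)), aes(shape = "NA"), size = 3) +
  # Named vector: label -> shape code
  scale_shape_manual(values = c("NA" = 3, "Not NA" = 19))

# Shape codes actually assigned to each layer after the scale is applied
shape_layer1 <- unique(unname(layer_data(p, 1)$shape))
shape_layer2 <- unique(unname(layer_data(p, 2)$shape))
```

If the names in `values` did not match the strings used inside aes(), the scale would have no shape to assign to those points, so comparing `shape_layer1` and `shape_layer2` against the intended codes is a quick sanity check.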

    EDIT

    Thinking about this a bit more, it doesn't make sense for NA to appear in the legend for color and shape, so let's remove it from color. That's done by completely separating the dataset that includes NAs in col3 from the one that doesn't:

    ggplot(df, aes(x=col1, y=col2, color=col4)) +
      geom_point(data=subset(df, !is.na(col3)), aes(size=col3, shape='Not NA')) +
      geom_point(data=subset(df, is.na(col3)), aes(shape='NA'), size=3) +
      scale_shape_manual(values=c('NA'=3, 'Not NA'=19)) +
      theme_classic()
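
For anyone who wants to run this, here is a self-contained sketch that rebuilds the example table from the question and applies the same two-layer plot. It assumes col1 to col3 are numeric, col4 is character, and the NA entries are real missing values rather than the string "NA"; the names `has_size` and `no_size` are mine, introduced only for readability.

```r
library(ggplot2)

# Example table from the question (column types are an assumption)
df <- data.frame(
  col1 = c(0.254393811, 0.28223149, 0.205945835, 0.199758631, 0.2798128,
           0.254616042, 0.197929395, 0.294888359, 0.191407446, 0.267533392),
  col2 = c(0.124242905, 0.148601748, 0.074541695, 0.103369485, 0.109511863,
           0.059495241, 0.10993698, 0.12319682, 0.086443936, 0.11240525),
  col3 = c(NA, 0.236953099, NA, NA, 0.396113132,
           0.479590212, 0.272611442, 0.16069263, 0.36596486, 0.344659516),
  col4 = c(NA, "CD8CTL", NA, "CD8Mem", "CD8STAT1",
           "CD8CTL", "CD8CTL", "CD8CTL", "CD8CTL", "CD8CTL")
)

# Split once so each layer gets exactly the rows it can draw
has_size <- subset(df, !is.na(col3))  # size can be mapped to col3
no_size  <- subset(df, is.na(col3))   # drawn at a fixed size instead

p <- ggplot(df, aes(x = col1, y = col2, color = col4)) +
  geom_point(data = has_size, aes(size = col3, shape = "Not NA")) +
  geom_point(data = no_size, aes(shape = "NA"), size = 3) +
  scale_shape_manual(values = c("NA" = 3, "Not NA" = 19)) +
  theme_classic()

# print(p) draws the plot
```

The two subsets together cover every row of df, so no point is lost to a missing col3; the three rows in `no_size` are the ones the original warning reported as removed.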
    
