r ggplot2 geom

geom_area: stacked y values exceed 1 when plotted


Hi all,

I am assigning a label to each sentence in an article. I am trying to generate a stacked area plot that shows, at each location in the article, the percentage of sentences carrying each label.

The location is calculated as sentence_index / total_number_of_sentences.

The percentage for label A at location X is calculated as (number of sentences with label A at X) / (total number of sentences at X).

Here is an example of my data: a complete subset for loc between 0.24 and 0.28. I have checked that at each location the percentages sum to 1.

> area_df[area_df$loc>0.24,]
    label percentage  loc
186   B1      0.195 0.25
187   C1      0.111 0.25
188   E1      0.006 0.25
189   G1      0.075 0.25
190   H1      0.008 0.25
191   M1      0.125 0.25
192   M2      0.064 0.25
193   M3      0.084 0.25
194   O1      0.070 0.25
195   O2      0.053 0.25
196   R1      0.209 0.25
197   B1      0.500 0.26
198   M2      0.250 0.26
199   M3      0.250 0.26
200   B1      0.166 0.27
201   C1      0.177 0.27
202   E1      0.015 0.27
203   G1      0.100 0.27
204   H1      0.011 0.27
205   M1      0.114 0.27
206   M2      0.048 0.27
207   M3      0.059 0.27
208   O1      0.074 0.27
209   O2      0.026 0.27
210   R1      0.210 0.27
211   B1      0.125 0.28
212   C1      0.250 0.28
213   G1      0.125 0.28
214   H1      0.125 0.28
215   M1      0.125 0.28
216   O1      0.125 0.28
217   O2      0.125 0.28

I want to create a stacked area plot to represent the overall percentage. I am expecting a solidly filled graph with y ranging over [0, 1]. However, in my geom_area plot there are some locations where sum(y) is greater than 1. When I try to set ylim(0, 1), strange blank (white) lines appear in the area plot.

I am not sure what causes this problem.

Here is my code without and with ylim:

# all data stored in area_df
# (assumes `data` has the columns normal_loc and normal_label)
normal_loc_uniq <- sort(unique(data$normal_loc))
area_df <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(area_df) <- c("loc", "label", "percentage")

# for each location, calculate the percentage of each label
for (one_loc in normal_loc_uniq) {
  # renamed from `subset` so base::subset() is not shadowed
  loc_rows <- data[data$normal_loc == one_loc, ]
  subset_count <- as.data.frame(round(prop.table(table(loc_rows$normal_label, useNA = "no")), 5))
  names(subset_count) <- c("label", "percentage")
  subset_count$loc <- as.numeric(one_loc)
  subset_count$percentage <- round(subset_count$percentage, 3)

  # flag locations whose percentages do not sum to (about) 1
  total <- sum(subset_count$percentage)
  if (total < 0.98 || total > 1.02) {
    print("error. total percentage is not 1")
  }
  area_df <- rbind(area_df, subset_count)
}

library(ggplot2)
colors <- c("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf", "#aaffc3")
ggplot(area_df, aes(x = loc, y = percentage, fill = label)) +
  geom_area(na.rm=TRUE,position="stack") + 
  scale_fill_manual(values=colors) + 
  labs(x = "Relative Location", y = "Percentage", fill = "Label") +
  theme_bw()

edit 1: added a complete subset of data


Solution

  • TL;DR - replace your geom_area(...) part with

    geom_area(na.rm=TRUE,position="fill")
    

    For more information...

    Position Adjustments

    What you are looking for are called position adjustments, which control how the elements of a plotting layer are arranged when they would otherwise overlap. You can set position= as an argument in any geom_*() function, and each geom has its own default. This link explains some of the options with examples, but I'll summarize here:

    Position Adjustments for X axis

    Position Adjustments for Y axis

    Applying Position Adjustments

    To apply a position adjustment, set the position= argument in a geom_*() function. To use an adjustment with its default settings, pass its name as a string, such as position="dodge" or position="fill". To fine-tune the adjustment, call the corresponding function instead, such as position=position_dodge(...) or position=position_stack(...).
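
    As a minimal sketch of the two forms, the snippet below builds the same bar chart with the string shorthand and with the function form. The toy data frame and the object names are made up for illustration; width and reverse are documented arguments of position_dodge() and position_stack().

    ```r
    library(ggplot2)

    # Toy data, made up for illustration only
    toy <- data.frame(
      group = rep(c("a", "b"), each = 3),
      kind  = rep(c("x", "y", "z"), times = 2),
      value = c(3, 5, 2, 4, 1, 6)
    )

    base <- ggplot(toy, aes(x = group, y = value, fill = kind))

    # String shorthand: the adjustment with its default settings
    p_default <- base + geom_col(position = "dodge")

    # Function form: the same adjustment with a tuned dodging width
    p_tuned <- base + geom_col(position = position_dodge(width = 0.5))

    # Function form of stacking, with the stacking order reversed
    p_reversed <- base + geom_col(position = position_stack(reverse = TRUE))
    ```

    Printing any of the three objects draws the plot. The string form is enough unless one of the adjustment's arguments needs changing.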

    Example and to Answer OP

    Here's an example area plot. geom_area() defaults to position="stack", so geom_area(position="stack") does the same thing as writing just geom_area().

    library(ggplot2)
    
    set.seed(8675309)
    df <- data.frame(
      x=1:20, 
      y=c(runif(20, min=0, max=100), runif(20, min=10, max=50),
        runif(20, min=5, max=20)),
      category=rep(LETTERS[1:3], each=20)
    )
    
    p <- ggplot(df, aes(x=x, y=y, fill=category))
    p + geom_area()
    

    Using position="fill" we get this:

    p + geom_area(position="fill")
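
    To make the difference concrete, here is a minimal base-R sketch of the arithmetic behind the two adjustments. It is not ggplot2's internal code, and the toy data frame and its numbers are made up for illustration: "stack" adds up the values at each x, while "fill" first divides each value by the total at its x.

    ```r
    # Made-up values: three categories observed at two x positions
    toy <- data.frame(
      x        = rep(1:2, each = 3),
      category = rep(c("A", "B", "C"), times = 2),
      y        = c(50, 30, 20, 10, 30, 10)
    )

    # "stack": the top of the stack at each x is the plain sum of y
    stack_top <- tapply(toy$y, toy$x, sum)

    # "fill": each y is divided by the total at its x,
    # so every stack tops out at exactly 1
    toy$share <- ave(toy$y, toy$x, FUN = function(v) v / sum(v))
    fill_top <- tapply(toy$share, toy$x, sum)

    print(stack_top)  # 100 at x = 1, 50 at x = 2
    print(fill_top)   # 1 at both x positions
    ```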
    

    Therefore, OP should change geom_area(...) in their code to be:

    geom_area(na.rm=TRUE,position="fill")
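
    A possible alternative to position="fill" is sketched below. It rests on an assumption that the question does not confirm: that the overshoot comes from labels with no row at some locations (in the sample data, loc 0.26 has rows only for B1, M2 and M3). If that is the cause, giving every label an explicit zero at every location keeps plain stacking at a total of 1. The area_toy data frame is a hypothetical stand-in for area_df with made-up numbers.

    ```r
    # Hypothetical stand-in for area_df: label "C" has no row at loc 0.26
    area_toy <- data.frame(
      loc        = c(0.25, 0.25, 0.25, 0.26, 0.26, 0.27, 0.27, 0.27),
      label      = c("A", "B", "C", "A", "B", "A", "B", "C"),
      percentage = c(0.5, 0.3, 0.2, 0.6, 0.4, 0.4, 0.3, 0.3)
    )

    # Every loc x label combination, so each label gets a value at each location
    grid <- expand.grid(
      loc   = sort(unique(area_toy$loc)),
      label = sort(unique(area_toy$label)),
      stringsAsFactors = FALSE
    )
    area_full <- merge(grid, area_toy, by = c("loc", "label"), all.x = TRUE)

    # Combinations that were absent get an explicit zero share
    area_full$percentage[is.na(area_full$percentage)] <- 0

    # The padding leaves the per-location totals unchanged
    totals <- tapply(area_full$percentage, area_full$loc, sum)
    print(totals)
    ```

    The padded data frame can then be passed to ggplot() in place of the original one, with geom_area() left at its default position="stack".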
    

    ... oh, and if you want the y axis labels to show actual percentages instead of 0, 0.25, 0.50, ..., then I would recommend adding:

    scale_y_continuous(labels=scales::percent_format())
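
    As a quick sketch of what that argument does: percent_format() returns a labelling function, and the scale applies it to the axis breaks. The name to_percent is made up for illustration, and the scales package is assumed to be installed (it is a dependency of ggplot2).

    ```r
    library(scales)

    # percent_format() builds a function that turns proportions into percent labels
    to_percent <- percent_format()

    axis_labels <- to_percent(c(0, 0.25, 0.5, 0.75, 1))
    print(axis_labels)  # "0%" "25%" "50%" "75%" "100%"
    ```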