I need to plot a huge dataset (1 million observations) against two variables. I want to use geom_point to see the distribution of my data.
My data is very concentrated in some areas of my graphs. However, since I have a lot of data, the geom_smooth should be valid across most areas of my graph (but not all).
I can use geom_point() for that, but it takes a really long time to plot and can lead to misinterpretation, since the graph needs to be zoomed in to see the real position of the points.
set.seed(1)
library(data.table)
library(ggplot2)
# Dummy data: a uniform background plus two dense clusters
d <- data.table(
  a = c(sample(seq(1, 1500, 1), 20000, replace = TRUE),
        sample(seq(1998, 2000, 1), 1000, replace = TRUE),
        sample(seq(1, 150, 1), 19000, replace = TRUE)),
  b = c(sample(seq(1, 2000, 1), 20000, replace = TRUE),
        sample(seq(150, 160, 1), 1000, replace = TRUE),
        sample(seq(1100, 1600, 1), 19000, replace = TRUE))
)
ggplot(d) + aes(x = a, y = b) +
  geom_point(shape = 1, alpha = 0.2) +
  geom_smooth(col = "black")
Here we have a zoomed-out plot: it gives the impression that the density in the left part of the graph is rather homogeneous, while in reality there are density variations inside this area.
To address this, and to address the speed problem of geom_point, I found the geom_hex() function.
ggplot(d) + aes(x=a,y=b)+
geom_hex(bins=70,col="white") +
geom_smooth(col="black")
Here, we can see that the hexagons to the right are very dense in terms of data, but we barely see that the left part is also denser than the rest of the graph.
To address this problem, I set another scale_fill_gradient(), as suggested here. I set the gradient's limits to 0-150, considering that a hexagon with more than 150 observations should be considered dense.
ggplot(d) + aes(x=a,y=b)+
geom_hex(bins=70,col="white") +
scale_fill_gradient(low="yellow", high="coral2",limits=c(0,150)) +
geom_smooth(col="black")
The problem is that the hexagons that exceed 150 observations are blue, but I need them to be red so that the graph can be interpreted. I still want to have some nuance in my graph and keep a gradient for the hexagons with fewer than 150 observations (I don't want to have just two colors).
Can someone help me with that?
PS: I used ggthemr::ggthemr("pale") to have prettier graphs, so it is normal if the formatting isn't the same for you.
PPS: this is dummy data. Obviously my data isn't as boring and homogeneous as this (and the distribution of points is more complicated); I just did what I could to recreate the problem.
If you want a bit more differentiation between lower values on the scale, you can use scale_fill_gradientn and play around with the colours and values arguments to get a result that works well with your data:
ggplot(d) +
aes(x = a, y = b) +
geom_hex(bins = 70, colour = "white") +
scale_fill_gradientn(colours = c("white", 'yellow',
'gold', 'coral2', 'red2'),
values = c(0, 0.01, 0.1, 0.9, 1)) +
geom_smooth(colour = "black") +
theme_minimal()
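If you would rather keep the two-colour gradient and the 0-150 limits from the question, another option is to change how out-of-range counts are handled. By default, values outside the limits are turned into NA and drawn in the scale's na.value colour, which is why those hexagons stand out. The sketch below passes oob = scales::squish so that counts above 150 are clamped to 150 and drawn in the high colour. It rebuilds the question's dummy data as a plain data.frame for self-containment (that choice, and the name p, are mine), and it assumes the scales and hexbin packages are installed.

```r
library(ggplot2)
library(scales)  # provides squish()

# Same dummy data as in the question, rebuilt as a data.frame (assumption)
set.seed(1)
d <- data.frame(
  a = c(sample(seq(1, 1500, 1), 20000, replace = TRUE),
        sample(seq(1998, 2000, 1), 1000, replace = TRUE),
        sample(seq(1, 150, 1), 19000, replace = TRUE)),
  b = c(sample(seq(1, 2000, 1), 20000, replace = TRUE),
        sample(seq(150, 160, 1), 1000, replace = TRUE),
        sample(seq(1100, 1600, 1), 19000, replace = TRUE))
)

# squish() clamps out-of-range values to the nearest limit
squish(c(10, 150, 400), range = c(0, 150))  # 10 150 150

p <- ggplot(d) +
  aes(x = a, y = b) +
  geom_hex(bins = 70, colour = "white") +
  # Counts above 150 are squished to 150, so they get the "high" colour
  scale_fill_gradient(low = "yellow", high = "coral2",
                      limits = c(0, 150), oob = squish) +
  geom_smooth(colour = "black")

print(p)
```

Note that with this approach every hexagon above 150 looks the same, so the legend's top value really means "150 or more". If you prefer a distinct fixed colour for those hexagons instead, setting na.value (for example na.value = "red2") on the scale should also work.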