rmass

Finding high density zones in a 2D Histogram in R


I have a series of 2D Histograms that I created using the kde2d function of MASS in the following way:

    # Loading libraries
    library(MASS)
    library(RcolorBrewer)
    # Loading data
    data <- as.matrix(read.table('data.dat'))
    # Create the 2dhist object      
    hist_2d <- kde2d(data[,1],data[,2],n = 60, lims=c(-180,180,-180,180))
    # Define the color palette
    rf <- colorRampPalette(rev(brewer.pal(11,'Spectral')))
    r <- rf(60)
    # Defining the axis
    at_x = seq(-180,180,by=30)
    at_y = seq(-180,180,by=30)
    # Plot the 2DHistogram
    image(hist_2d,col=r,cex.main=3,main='Q68L',axes=F)
    axis(1,lwd.ticks=2,at=at_x,labels=T,cex.axis=2)
    axis(2,lwd.ticks=2,at=at_y,labels=T,cex.axis=2)

The histogram generated looks like this. How I can identify all the zones with high density ( which I marked inside the white squares)? The ideal solution for this problem would be a function that throws an (x,y) range for every high density zone so that it can be applied in several datasets.

Thanks in advance and let me know if you need additional information


Solution

  • With the right representation of the data, this can be done with cluster analysis. Since you do not provide data, I will illustrate with the data used on the kde2d help page - the geyser data. This data is gives a pretty clean separation of "high density" areas (like your example pictures), so I will just use a simple k-means clustering.

    library(MASS)
    attach(geyser)
    f2 <- kde2d(duration, waiting, n = 50, lims = c(0.5, 6, 40, 100),
                h = c(width.SJ(duration), width.SJ(waiting)) )
    image(f2, zlim = c(0, 0.05))
    

    Heat Map

    We need to find the "hot spots". In order to get an idea about what values should be considered "high", we can look at a boxplot.

    boxplot(as.vector(f2$z))
    

    Box Plot to find outliers

    Based on this, I will somewhat arbitrarily use points where the z-value is greater than 0.012. You will need to tune this for your particular problem.

    Hot = which(f2$z > 0.012, arr.ind = TRUE)
    HotPoints = data.frame(x=f2$x[Hot[,1]], y=f2$y[Hot[,2]])
    plot(HotPoints, pch=20, xlim = c(0.5,6), ylim = c(40,100))
    

    Points in hot spots

    Now we need to cluster the points and find the x & y ranges for the clusters. First I do it simply and show that the results are reasonable.

    KM3 = kmeans(scale(HotPoints), 3)
    plot(HotPoints, pch=20, xlim = c(0.5,6), ylim = c(40,100))
    for(i in 1:3) {
        Rx = range(HotPoints[KM3$cluster == i,1])
        Ry = range(HotPoints[KM3$cluster == i,2])
        polygon(c(Rx, rev(Rx)), rep(Ry, each=2))
    }
    

    Boundaries for Hot regions

    I am not sure how you want the results presented to you, but one way to get them all in one place is this:

    XRanges = sapply(unique(KM3$cluster), 
        function(i) range(HotPoints[KM3$cluster == i,1]))
    XRanges
             [,1]     [,2]     [,3]
    [1,] 3.979592 3.867347 1.734694
    [2,] 4.877551 4.316327 2.071429
    YRanges = sapply(unique(KM3$cluster), 
        function(i) range(HotPoints[KM3$cluster == i,2]))
    YRanges
             [,1]     [,2]     [,3]
    [1,] 47.34694 70.61224 73.06122
    [2,] 62.04082 87.75510 95.10204
    

    This gives a min and max for x and y for each of the three clusters.

    However, I made a few choices here and I wish to point out that I still left some work for you. What you still need to do:
    1. You need to choose a cut-off point for how high the density needs to be to get a cluster.
    2. Given the points above your cut-off, you will need to say how many clusters you want to generate.

    The rest of the machinery is there.