r for-loop gps hierarchical-clustering

Calculating mean of coordinates for each unique value in large dataset for hierarchical analysis


I am a beginner with R, but I have been analysing a large GPS data set of 1,000,000+ rows covering approximately 100 unique individuals (name). Each individual has a varying number of coordinates (lat and lng) and belongs to either group a or group b. So far I have done a polygon point count analysis to compare use of the site between groups a and b. I now want to do a hierarchical cluster analysis within group a and within group b to analyse interactions within each group, and then between groups a and b.

I have been advised to use a for loop to get the mean of the coordinates for each unique 'name', and I think I can then use this data for a hierarchical cluster analysis (in either R or QGIS?). My data is below.

structure(list(lat = c(50.39761959, 50.39757382, 50.39760433, 
50.39742123, 50.39768063, 50.39740597, 50.39757382, 50.39769589, 
50.39763485, 50.39763485), lng = c(-4.888685435, -4.888639658, 
-4.888685435, -4.888746471, -4.88860914, -4.888883803, -4.888670176, 
-4.88860914, -4.888563363, -4.888181888), time_stamp = c("15/10/2021 00:21", 
"15/10/2021 00:50", "15/10/2021 01:51", "15/10/2021 02:21", "15/10/2021 02:51", 
"15/10/2021 03:21", "15/10/2021 03:51", "15/10/2021 04:21", "15/10/2021 04:51", 
"15/10/2021 05:21"), name = c("300005", "300005", "300005", "300005", 
"300005", "300005", "300005", "B100", "B100", "B100"), 
    breed = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b"
    )), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))

I'm especially struggling with the for-loop to get the mean coordinates.


Solution

  • You don't need a loop for aggregation. From the tags it's not quite clear whether you'd prefer a data.table-based solution (which might make sense for 1,000,000+ records), but with dplyr (or with dtplyr, for near-data.table performance) you can group by name (and breed) and summarise with something like this (the .by argument needs dplyr 1.1.0 or later):

    df <- structure(list(lat = c(50.39761959, 50.39757382, 50.39760433, 
    50.39742123, 50.39768063, 50.39740597, 50.39757382, 50.39769589, 
    50.39763485, 50.39763485), lng = c(-4.888685435, -4.888639658, 
    -4.888685435, -4.888746471, -4.88860914, -4.888883803, -4.888670176, 
    -4.88860914, -4.888563363, -4.888181888), time_stamp = c("15/10/2021 00:21", 
    "15/10/2021 00:50", "15/10/2021 01:51", "15/10/2021 02:21", "15/10/2021 02:51", 
    "15/10/2021 03:21", "15/10/2021 03:51", "15/10/2021 04:21", "15/10/2021 04:51", 
    "15/10/2021 05:21"), name = c("300005", "300005", "300005", "300005", 
    "300005", "300005", "300005", "B100", "B100", "B100"), 
        breed = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b"
        )), row.names = c(NA, -10L), class = c("data.table", "data.frame"
    ))
    
    df
    #>         lat       lng       time_stamp   name breed
    #> 1  50.39762 -4.888685 15/10/2021 00:21 300005     a
    #> 2  50.39757 -4.888640 15/10/2021 00:50 300005     a
    #> 3  50.39760 -4.888685 15/10/2021 01:51 300005     a
    #> 4  50.39742 -4.888746 15/10/2021 02:21 300005     a
    #> 5  50.39768 -4.888609 15/10/2021 02:51 300005     a
    #> 6  50.39741 -4.888884 15/10/2021 03:21 300005     a
    #> 7  50.39757 -4.888670 15/10/2021 03:51 300005     a
    #> 8  50.39770 -4.888609 15/10/2021 04:21   B100     b
    #> 9  50.39763 -4.888563 15/10/2021 04:51   B100     b
    #> 10 50.39763 -4.888182 15/10/2021 05:21   B100     b
    
    dplyr::summarise(df, lat = mean(lat), lng = mean(lng), .by = c(name, breed))
    #>     name breed      lat       lng
    #> 1 300005     a 50.39755 -4.888703
    #> 2   B100     b 50.39766 -4.888451
    

    Created on 2024-04-29 with reprex v2.1.0
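
Since the question's data is already a data.table, the same grouped mean can probably be written without dplyr. The following is a minimal sketch under that assumption: it rebuilds the ten sample rows (leaving out time_stamp, which the means don't need), and the names `dt`, `centroids` and `n_fixes` are made up for illustration, not taken from the question.

```r
library(data.table)

# The 10 sample rows from the question, without time_stamp
dt <- data.table(
  lat = c(50.39761959, 50.39757382, 50.39760433, 50.39742123, 50.39768063,
          50.39740597, 50.39757382, 50.39769589, 50.39763485, 50.39763485),
  lng = c(-4.888685435, -4.888639658, -4.888685435, -4.888746471, -4.88860914,
          -4.888883803, -4.888670176, -4.88860914, -4.888563363, -4.888181888),
  name  = c(rep("300005", 7), rep("B100", 3)),
  breed = c(rep("a", 7), rep("b", 3))
)

# One row per individual: mean position plus the number of GPS fixes.
# breed is included in `by` only so that it is carried through to the result.
# n_fixes is a hypothetical helper column, handy for spotting sparse individuals.
centroids <- dt[, .(lat = mean(lat), lng = mean(lng), n_fixes = .N),
                by = .(name, breed)]

centroids
```

The result has one row per name, so it could then be subset by breed and passed to base R's dist() and hclust(). Keep in mind that Euclidean distance on raw lat/lng degrees only approximates ground distance, and that hclust() needs at least two individuals per group.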