I am a beginner with R, but I have been analysing a large GPS data set covering approx. 100 unique individuals (name), with 1,000,000+ lines of data. Each unique name has a varying number of coordinates (lat and lng), and each unique name belongs to either group a or b. So far I have done a polygon point count analysis to analyse use of the site between groups a and b. I want to do a hierarchical cluster analysis within group a and within group b to analyse interactions within each group, and then between groups a and b.
I have been advised to use a for loop to get the mean of the coordinates for each unique 'name', and I think I can then use this data to do a hierarchical cluster analysis (either with R or QGIS?). My data is below.
structure(list(lat = c(50.39761959, 50.39757382, 50.39760433,
50.39742123, 50.39768063, 50.39740597, 50.39757382, 50.39769589,
50.39763485, 50.39763485), lng = c(-4.888685435, -4.888639658,
-4.888685435, -4.888746471, -4.88860914, -4.888883803, -4.888670176,
-4.88860914, -4.888563363, -4.888181888), time_stamp = c("15/10/2021 00:21",
"15/10/2021 00:50", "15/10/2021 01:51", "15/10/2021 02:21", "15/10/2021 02:51",
"15/10/2021 03:21", "15/10/2021 03:51", "15/10/2021 04:21", "15/10/2021 04:51",
"15/10/2021 05:21"), name = c("300005", "300005", "300005", "300005",
"300005", "300005", "300005", "B100", "B100", "B100"),
breed = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b"
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
I'm especially struggling with the for-loop to get the mean coordinates.
You don't need a loop for aggregation. From the tags used, it's not quite clear whether you'd prefer a solution based on data.table (which might make sense for 1,000,000+ records), but with dplyr (or with dtplyr, for near-data.table performance) you could group by name (and breed) and summarise with something like this:
df <- structure(list(lat = c(50.39761959, 50.39757382, 50.39760433,
50.39742123, 50.39768063, 50.39740597, 50.39757382, 50.39769589,
50.39763485, 50.39763485), lng = c(-4.888685435, -4.888639658,
-4.888685435, -4.888746471, -4.88860914, -4.888883803, -4.888670176,
-4.88860914, -4.888563363, -4.888181888), time_stamp = c("15/10/2021 00:21",
"15/10/2021 00:50", "15/10/2021 01:51", "15/10/2021 02:21", "15/10/2021 02:51",
"15/10/2021 03:21", "15/10/2021 03:51", "15/10/2021 04:21", "15/10/2021 04:51",
"15/10/2021 05:21"), name = c("300005", "300005", "300005", "300005",
"300005", "300005", "300005", "B100", "B100", "B100"),
breed = c("a", "a", "a", "a", "a", "a", "a", "b", "b", "b"
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
df
#> lat lng time_stamp name breed
#> 1 50.39762 -4.888685 15/10/2021 00:21 300005 a
#> 2 50.39757 -4.888640 15/10/2021 00:50 300005 a
#> 3 50.39760 -4.888685 15/10/2021 01:51 300005 a
#> 4 50.39742 -4.888746 15/10/2021 02:21 300005 a
#> 5 50.39768 -4.888609 15/10/2021 02:51 300005 a
#> 6 50.39741 -4.888884 15/10/2021 03:21 300005 a
#> 7 50.39757 -4.888670 15/10/2021 03:51 300005 a
#> 8 50.39770 -4.888609 15/10/2021 04:21 B100 b
#> 9 50.39763 -4.888563 15/10/2021 04:51 B100 b
#> 10 50.39763 -4.888182 15/10/2021 05:21 B100 b
# per-individual mean coordinates; the .by argument requires dplyr >= 1.1.0
dplyr::summarise(df, lat = mean(lat), lng = mean(lng), .by = c(name, breed))
#> name breed lat lng
#> 1 300005 a 50.39755 -4.888703
#> 2 B100 b 50.39766 -4.888451
Created on 2024-04-29 with reprex v2.1.0
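Since the sample data already carries the data.table class, the same grouped mean could probably be written directly in data.table syntax. This is only a minimal sketch: it rebuilds the sample rows from the question (leaving out time_stamp, which the aggregation does not use), and the object names dt and means are just placeholders chosen for illustration.

```r
library(data.table)

# Sample rows from the question; time_stamp is omitted because it is not
# needed for the aggregation.
dt <- data.table(
  lat = c(50.39761959, 50.39757382, 50.39760433, 50.39742123, 50.39768063,
          50.39740597, 50.39757382, 50.39769589, 50.39763485, 50.39763485),
  lng = c(-4.888685435, -4.888639658, -4.888685435, -4.888746471, -4.88860914,
          -4.888883803, -4.888670176, -4.88860914, -4.888563363, -4.888181888),
  name  = c(rep("300005", 7), rep("B100", 3)),
  breed = c(rep("a", 7), rep("b", 3))
)

# One row per individual: mean latitude and longitude, grouped by name and breed.
means <- dt[, .(lat = mean(lat), lng = mean(lng)), by = .(name, breed)]
means
```

This should give one row per name, matching the dplyr result above, and the resulting table of mean coordinates is what you would then pass on to the clustering step.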