I have two dataframes: df1 contains observations with lat-lon coordinates; df2 has names with lat-lon coordinates. I want to create a new variable df1$names which has for each observation the names of df2 that are within a specified distance to that observation.
Some sample data for df1:
df1 <- structure(list(lat = c(52.768, 53.155, 53.238, 53.253, 53.312, 53.21, 53.21, 53.109, 53.376, 53.317, 52.972, 53.337, 53.208, 53.278, 53.316, 53.288, 53.341, 52.945, 53.317, 53.249), lon = c(6.873, 6.82, 6.81, 6.82, 6.84, 6.748, 6.743, 6.855, 6.742, 6.808, 6.588, 6.743, 6.752, 6.845, 6.638, 6.872, 6.713, 6.57, 6.735, 6.917), cat = c(2L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 2L, 2L), diff = c(6.97305555555555, 3.39815972222222, 14.2874305555556, -0.759791666666667, 34.448275462963, 4.38783564814815, 0.142430555555556, 0.698599537037037, 1.22914351851852, 7.0008912037037, 1.3349537037037, 8.67978009259259, 1.6090162037037, 25.9466782407407, 9.45068287037037, 4.76284722222222, 1.79163194444444, 16.8280787037037, 1.01336805555556, 3.51240740740741)), .Names = c("lat", "lon", "cat", "diff"), row.names = c(125L, 705L, 435L, 682L, 186L, 783L, 250L, 517L, 547L, 369L, 618L, 280L, 839L, 614L, 371L, 786L, 542L, 100L, 667L, 785L), class = "data.frame")
Some sample data for df2:
df2 <- structure(list(latlonloc = structure(c(6L, 3L, 4L, 2L, 5L, 1L), .Label = c("Boelenslaan", "Borgercompagnie", "Froombosch", "Garrelsweer", "Stitswerd", "Tinallinge"), class = "factor"), lat = c(53.356789, 53.193886, 53.311237, 53.111339, 53.360848, 53.162031), lon = c(6.53493, 6.780792, 6.768608, 6.82354, 6.599604, 6.143804)), .Names = c("latlonloc", "lat", "lon"), class = "data.frame", row.names = c(NA, -6L))
Creating a distance matrix with the geosphere package:
library(geosphere)
mat <- distm(df1[,c('lon','lat')], df2[,c('lon','lat')], fun=distHaversine)
The resulting distances are in meters (at least I think they are, else something is wrong with the distance matrix).
The specified distance is calculated with ((df1$cat)^2)*1000. I tried df1$names <- df2$latlonloc[apply(distmat, 1, which(distmat < ((df1$cat)^2)*1000 ))], but get an error message:
Error in match.fun(FUN) :
'which(distmat < ((df1$cat)^2) * 1000)' is not a function, character or symbol
This is probably not the correct approach, but what I need is this:
df1$names <- #code or function which gives me a string of names which are within a specified distance of the observation
How can I create a string with the names that are within a specified distance of the observations?
You need to operate on each row of df1 (or mat) in order to figure out, for each row, how far away each object in df2 is. From that, you can pick the ones that meet your distance criterion.
I think you're getting a little confused about the use of apply and about the use of which. To really have which work for you, you need to apply it to each row of mat, whereas your current code applies it to the entire mat matrix. Also note that it is hard to use apply here because you're comparing each row of mat against a corresponding element of the vector defined by ((df1$cat)^2)*1000. So, I will instead show you examples using sapply and lapply. You could also use mapply here, but I think the sapply/lapply syntax is clearer.
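To make the apply issue concrete: the third argument of apply must be a function, whereas which(...) is evaluated immediately and hands apply an integer vector, which is what triggers the match.fun error. The following is only a minimal sketch on a made-up 3 x 2 matrix; toy_mat and toy_thr are hypothetical stand-ins for mat and ((df1$cat)^2)*1000. It also shows roughly what the mapply variant mentioned above could look like.

```r
# Toy stand-ins (hypothetical values, not the question's data)
toy_mat <- matrix(c(500, 3000, 900,
                    1500, 800, 4000), nrow = 3)   # 3 observations x 2 places
toy_thr <- c(1000, 1000, 5000)                    # one threshold per row

# apply() needs a function as its third argument; this works, but only
# with one fixed threshold shared by every row
fixed <- apply(toy_mat, 1, function(r) which(r < 1000))

# mapply() walks over the rows and the per-row thresholds in parallel
rows <- split(toy_mat, row(toy_mat))              # list with one vector per row
per_row <- mapply(function(d, thr) which(d < thr), rows, toy_thr,
                  SIMPLIFY = FALSE)
```

Here per_row is a list with one integer vector of column indices per observation, which is the same shape of result that the lapply version produces before the indices are used to look up names.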
To address your desired output, I show two examples. One returns a list containing, for each row in df1, the names of items in df2 that are within the distance threshold. This won't easily go back into your original df1 as a variable because each element in the list can contain multiple names. The second example pastes those names together as a single comma-separated character string in order to create the new variable you're looking for.
Example 1:
out1 <- lapply(1:nrow(df1), function(x) {
  # names in df2 whose distance to observation x is below that row's threshold
  df2[which(mat[x,] < (((df1$cat)^2)*1000)[x]),'latlonloc']
})
Result:
> str(out1)
List of 20
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..: 2
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..: 4
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..: 6 4 5
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..:
$ : Factor w/ 6 levels "Boelenslaan",..: 4
$ : Factor w/ 6 levels "Boelenslaan",..:
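Because each list element is a factor, str prints integer codes rather than names. As a small sketch of how such a list could be inspected, assuming a hypothetical toy_out that merely mimics the shape of out1 (the levels and values below are made up for illustration):

```r
# Hypothetical list shaped like out1: one factor vector per observation
lv <- c("Boelenslaan", "Garrelsweer", "Stitswerd")
toy_out <- list(factor(character(0), levels = lv),
                factor("Garrelsweer", levels = lv),
                factor(c("Stitswerd", "Garrelsweer"), levels = lv))

n_near <- lengths(toy_out)               # number of matches per observation
as_chr <- lapply(toy_out, as.character)  # replace factor codes with the names
```

An empty factor becomes character(0), so observations without any match stay in the list as empty elements.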
Example 2:
out2 <- sapply(1:nrow(df1), function(x) {
  # same selection as Example 1, collapsed into one comma-separated string
  paste(df2[which(mat[x,] < (((df1$cat)^2)*1000)[x]),'latlonloc'], collapse=',')
})
Result:
> out2
[1] "" ""
[3] "" ""
[5] "" ""
[7] "" "Borgercompagnie"
[9] "" "Garrelsweer"
[11] "" ""
[13] "" ""
[15] "Tinallinge,Garrelsweer,Stitswerd" ""
[17] "" ""
[19] "Garrelsweer" ""
I think the second of these is probably closest to what you're going for.
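If so, the result can be assigned directly to the new column, since it holds one string per row of df1. As a hedged alternative sketch (not the approach used in the examples), the per-row threshold comparison can also be done in a single step: comparing a matrix with a vector whose length equals the number of rows recycles the vector down each column, so element [i, j] is compared with the i-th threshold. The names toy_df1, toy_df2 and toy_mat are made-up stand-ins for df1, df2 and mat.

```r
# Hypothetical toy data standing in for df1, df2 and mat
toy_df1 <- data.frame(cat = c(1L, 2L, 1L))
toy_df2 <- data.frame(latlonloc = c("A", "B"), stringsAsFactors = FALSE)
toy_mat <- matrix(c(500, 3000, 1200,
                    1500, 3500, 4000), nrow = 3)  # distances in meters

thr <- (toy_df1$cat^2) * 1000     # 1000, 4000, 1000
is_near <- toy_mat < thr          # element [i, j] is compared with thr[i]

# One comma-separated string per observation, stored as a new column
toy_df1$names <- apply(is_near, 1, function(r) {
  paste(toy_df2$latlonloc[r], collapse = ",")
})
```

This relies on the threshold vector having exactly one element per row of the distance matrix; with any other length the recycling would silently compare against the wrong thresholds.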