rleft-join

many to many in left_join


Assume I have two dataframes that I want to join with each other. However they will end up in many-to many relations, which I do not want. If there are several matches of y to x, I want to consider other columns (in this case country as well). A solution, which could be extedet to more "conditional" columns" would be nice. However, in my mind I do have think a left_join passing several columns in the "by" argument, but the first by argument as a "nececceary condition" and the other by arguments as additional evidence for subseting the corretct columns.

If there is only one possible to match no further conditions should be checked. If the further column (in this case country) has NA, ignore this condition.

# Create the Questions DataFrame
questions_df <- data.frame(
  question_id = c(1, 2, 3, 4, 5),
  title = c("How to use Python?", "What is SQL?", "Django tutorial", "Data science with R", "Stackoverflow"),
  tag = c("python", "sql", "django", "r", "python"),
  country = c("NM", "TSE", "FR", "Z", "ZAF")
)

# Create the Tags DataFrame
tags_df <- data.frame(
  tag = c("python", "python", "sql", "django", "django", "r", "r"),
  expert = c("Expert A", "Expert B", "Expert C", "Expert D", "Expert E", "Expert F", "Expert G"),
  country = c("TGV", "NM", "TSE", "FR", "Z", "ZAF", NA)
)

# Perform a left join to illustrate the many-to-many relationship
result <- left_join(questions_df, tags_df, by = "tag")

Solution

  • # Perform the join based on the primary `tag` condition
    result <- left_join(questions_df, tags_df, by = "tag") %>%
      group_by(question_id) %>%
      # If there are multiple matches, filter by `country`
      filter(n() == 1 | (country.x == country.y | is.na(country.y))) %>%
      ungroup() %>%
      distinct()
    
    result