Here is my problem: I have an individual database: 1 row = 1 person. For each person, there is a unique identifier ("INAMI_key"), individual variables ("code_qualif" in the example below) and one or several addresses filled in through different columns. The number of addresses is indicated in the "n_addresses" variable: 1 for one address, 2 for two addresses, etc. The different addresses are indicated in the variables "travail_ruex" (street) and "travail_code_postalx" (postal code). This is what it looks like:
INAMI_key code_qualif n_adresses travail_rue1 travail_code_postal1 travail_rue2 travail_code_postal2 travail_rue3 travail_code_postal3 travail_rue4 travail_code_postal4 travail_rue5 travail_code_postal5
1 30000120 001 02 "RUE VAN ARTEVELDE " " 1000" "paul pastur " " 6180" "" "" "" "" "" ""
2 30000417 001 01 "av Margueritr depasse " " 6060" "" "" "" "" "" "" "" ""
3 37603435 007 01 "du Grand Veneur " " 1170" "" "" "" "" "" "" "" ""
4 38300152 007 02 "RUE WAUTERS 92-94" " 6040" "de châtelet " " 6120" "" "" "" "" "" ""
5 38707849 001 03 "de Campine " " 4000" "de Chestret " " 4000" "de la Gare " " 4020" "" "" "" ""
6 38813856 001 03 "Torhoutste steenweg " " 8400" "Lage Kaart " " 2930" "De Vrièrestraat " " 8301" "" "" "" ""
7 38811084 001 04 "chaussée de Waterloo " " 1180" "avenue Napoléon " " 1420" "rue Léon Théodor " " 1090" "avenue de la Basilique " " 1081" "" ""
8 39105054 001 04 "EMILE CLAUS " " 1050" "RUE DU FOYER SCHAERBEEKOIS " " 1030" "RUE XAVIER DE BUE " " 1180" "BV LAMBERMONT " " 1030" "" ""
9 39117031 001 05 "KERKSTRAAT " " 3850" "Pater Richard van de Wouwerstraa" " 3271" "Wilderenlaan " " 3803" "Gyzevennestraat " " 3560" "Molenveldstraat " " 3500"
10 31823918 070 05 "Route de l'Etat " " 1380" "Avenue Paul Hymans " " 1200" "Avenue WInston Churchill " " 1180" "Avenue Winston Churchill " " 1180" "avenue hippocrate " " 1200"
Here you have the code to import this example in R :
structure(list(INAMI_key = c("30000120", "30000417", "37603435",
"38300152", "38707849", "38813856", "38811084", "39105054", "39117031",
"31823918"), code_qualif = c("001", "001", "007", "007", "001",
"001", "001", "001", "001", "070"), n_adresses = c("02", "01",
"01", "02", "03", "03", "04", "04", "05", "05"), travail_rue1 = c("RUE VAN ARTEVELDE ",
"av Margueritr depasse ", "du Grand Veneur ",
"RUE WAUTERS 92-94", "de Campine ",
"Torhoutste steenweg ", "chaussée de Waterloo ",
"EMILE CLAUS ", "KERKSTRAAT ",
"Route de l'Etat "), travail_code_postal1 = c(" 1000",
" 6060", " 1170", " 6040", " 4000", " 8400", " 1180", " 1050",
" 3850", " 1380"), travail_rue2 = c("paul pastur ",
"", "", "de châtelet ", "de Chestret ",
"Lage Kaart ", "avenue Napoléon ",
"RUE DU FOYER SCHAERBEEKOIS ", "Pater Richard van de Wouwerstraa",
"Avenue Paul Hymans "), travail_code_postal2 = c(" 6180",
"", "", " 6120", " 4000", " 2930", " 1420", " 1030", " 3271",
" 1200"), travail_rue3 = c("", "", "", "", "de la Gare ",
"De Vrièrestraat ", "rue Léon Théodor ",
"RUE XAVIER DE BUE ", "Wilderenlaan ",
"Avenue WInston Churchill "), travail_code_postal3 = c("",
"", "", "", " 4020", " 8301", " 1090", " 1180", " 3803", " 1180"
), travail_rue4 = c("", "", "", "", "", "", "avenue de la Basilique ",
"BV LAMBERMONT ", "Gyzevennestraat ",
"Avenue Winston Churchill "), travail_code_postal4 = c("",
"", "", "", "", "", " 1081", " 1030", " 3560", " 1180"), travail_rue5 = c("",
"", "", "", "", "", "", "", "Molenveldstraat ",
"avenue hippocrate "), travail_code_postal5 = c("",
"", "", "", "", "", "", "", " 3500", " 1200")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
What I would like to do is multiply the number of rows per individual to show the different addresses on different rows, but in the same fields. For example, if an individual has 3 addresses, create 3 rows for the same individual, keep the individual variables, but reorganize the addresses into columns that have the same name: "travail_rue_total" and "travail_code_postal_total" in the example below. If the individual has 1 adress, create one row, if he has 5 adresses, create 5 rows, etc. :
INAMI_key code_qualif n_adresses travail_rue_total travail_code_postal_total
1 30000120 1 2 RUE VAN ARTEVELDE 1000
2 30000120 1 2 paul pastur 6180
3 30000417 1 1 av Margueritr depasse 6060
4 37603435 7 1 du Grand Veneur 1170
5 38300152 7 2 RUE WAUTERS 92-94 6040
6 38300152 7 2 de châtelet 6120
7 38707849 1 3 de Campine 4000
8 38707849 1 3 de Chestret 4000
9 38707849 1 3 de la Gare 4020
10 38813856 1 3 Torhoutste steenweg 8400
11 38813856 1 3 Lage Kaart 2930
12 38813856 1 3 De Vrièrestraat 8301
13 38811084 1 4 chaussée de Waterloo 1180
14 38811084 1 4 avenue Napoléon 1420
15 38811084 1 4 rue Léon Théodor 1090
16 38811084 1 4 avenue de la Basilique 1081
17 39105054 1 4 EMILE CLAUS 1050
18 39105054 1 4 RUE DU FOYER SCHAERBEEKOIS 1030
19 39105054 1 4 RUE XAVIER DE BUE 1180
20 39105054 1 4 BV LAMBERMONT 1030
21 39117031 1 5 KERKSTRAAT 3850
22 39117031 1 5 Pater Richard van de Wouwerstraa 3271
23 39117031 1 5 Wilderenlaan 3803
24 39117031 1 5 Gyzevennestraat 3560
25 39117031 1 5 Molenveldstraat 3500
26 31823918 70 5 Route de l'Etat 1380
27 31823918 70 5 Avenue Paul Hymans 1200
28 31823918 70 5 Avenue WInston Churchill 1180
29 31823918 70 5 Avenue Winston Churchill 1180
30 31823918 70 5 avenue hippocrate 1200
That's a simplified version of the data. In the whole database, i have 40 adresses x 15 variables for each (street, number, city, postal code, institution...).
Thank you !
You can use pivot_longer
to get data in long format and filter
to drop the empty values.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('travail'),
names_to = '.value',
names_pattern = 'travail_(.*?)\\d+') %>%
filter(rue != '')
# INAMI_key code_qualif n_adresses rue code_postal
# <chr> <chr> <chr> <chr> <chr>
# 1 30000120 001 02 "RUE VAN ARTEVELDE " " 1000"
# 2 30000120 001 02 "paul pastur " " 6180"
# 3 30000417 001 01 "av Margueritr depasse " " 6060"
# 4 37603435 007 01 "du Grand Veneur " " 1170"
# 5 38300152 007 02 "RUE WAUTERS 92-94" " 6040"
# 6 38300152 007 02 "de châtelet " " 6120"
# 7 38707849 001 03 "de Campine " " 4000"
# 8 38707849 001 03 "de Chestret " " 4000"
# 9 38707849 001 03 "de la Gare " " 4020"
#10 38813856 001 03 "Torhoutste steenweg " " 8400"
# … with 20 more rows