I am using an algorithm to lemmatize a text vector. The output is a .txt file in the format shown below.
The original word is listed in the first column, whilst the various lemmas are listed in the second column, followed by some grammatical classifications. I want to read this into R, but have no idea how to do this. I have tried various forms of separators, but none seem to work.
Ideally, I want the data frame in R to look as follows, where I only read the first occurrence of each lemma:
Perhaps the best option would be to read the data, keep only the first occurrence (i.e. da da adv), then do something like text to columns and keep only the first two columns.
Output from lemmatization algorithm:
"<da>"
"da" adv
"da" sbu
"da" subst fork
"<dette>"
"dette" det dem nøyt ent
"dette" pron nøyt ent pers 3
"dette" verb inf
"<er>"
"være" verb pres <aux1/perf_part>
"<den>"
"den" det dem fem ent
"den" det dem mask ent
"den" pron mask fem ent pers 3
Wanted structure:
da da
dette dette
er være
den den
Here's an interesting result: You can read the file quite nicely with read.table:
s <- '"<da>"
"da" adv
"da" sbu
"da" subst fork
"<dette>"
"dette" det dem nøyt ent
"dette" pron nøyt ent pers 3
"dette" verb inf
"<er>"
"være" verb pres <aux1/perf_part>
"<den>"
"den" det dem fem ent
"den" det dem mask ent
"den" pron mask fem ent pers 3
'
x <- read.table(sep='', text=s, colClasses=c('character','character'), flush=TRUE, fill=TRUE)
> x
V1 V2 V3
1 <da>
2 da adv
3 da sbu
4 da subst fork
5 <dette>
6 dette det dem
7 dette pron nøyt
8 dette verb inf
9 <er>
10 være verb pres
11 <den>
12 den det dem
13 den det dem
14 den pron mask
Using the dplyr and tidyr packages, we can unpack it into:
library(dplyr)
library(tidyr)
(y <- x %>% mutate(a=grepl('<', V1, fixed=TRUE), b=cumsum(a)) %>%
group_by(b) %>%
summarise(verbs=list(t(unique(V1)))) %>%
unnest(cols=c(verbs)))
# A tibble: 4 x 2
b verbs[,1] [,2]
<int> <chr> <chr>
1 1 <da> da
2 2 <dette> dette
3 3 <er> være
4 4 <den> den
result <- y$verbs
result[,1] <- gsub('(<|>)', '', result[,1])
> result
[,1] [,2]
[1,] "da" "da"
[2,] "dette" "dette"
[3,] "er" "være"
[4,] "den" "den"
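For comparison, the approach outlined in the question (keep only the first lemma line under each word, then keep two columns) can probably also be done in base R without any packages. The following is a minimal sketch, not a tested solution for the full file: it assumes every "<word>" line is immediately followed by at least one lemma line, and the names lines, is_header, first_lemma and result are made up for illustration. In practice lines would come from readLines() on the .txt file.

```r
# Sample input; in practice: lines <- readLines("lemmas.txt", encoding = "UTF-8")
# ("lemmas.txt" is a hypothetical file name)
lines <- c('"<da>"', '"da" adv', '"da" sbu', '"da" subst fork',
           '"<dette>"', '"dette" det dem nøyt ent', '"dette" verb inf',
           '"<er>"', '"være" verb pres <aux1/perf_part>',
           '"<den>"', '"den" det dem fem ent', '"den" pron mask fem ent pers 3')

# A header line holds the original word in the form "<word>"
is_header <- grepl('^"<.*>"$', lines)

# Assumption: the first lemma line sits directly after each header line
first_lemma <- lines[which(is_header) + 1]

result <- data.frame(
  # Strip the leading "< and the trailing >" from the header
  word  = gsub('^"<|>"$', '', lines[is_header]),
  # Keep only the quoted lemma at the start of the line, drop the grammar tags
  lemma = sub('^"([^"]*)".*$', '\\1', first_lemma),
  stringsAsFactors = FALSE
)
```

This avoids read.table's guess at the number of columns, which is why some of the grammatical tags are cut off in x above. That truncation does not matter for the wanted structure, but it would if the tags were needed later.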