text-mining, quanteda, udpipe

How to find the co-occurrences of a specific term with udpipe in R?


I am new to the udpipe package, and I think it has great potential for the social sciences.

A current project of mine is to study how news articles write about networks and networking (i.e. the people kind, not computer networks). For this, I web-scraped 500 articles matching the search string "network" from a Dutch site for news about the flexible economy (a major source of news and discussion about e.g. self-employment). The data is in Dutch, but that should not matter for my question.

What I would like to use udpipe for is to find out in what context the noun "netwerk" or the verb "netwerken" is used. I tried kwic (from quanteda) to get this, but that only gives me the "window" in which the term occurs.

I would like to use the lemma (netwerk/netwerken) with the co-occurrence operator, but without specifying a second term: limited to that specific lemma, rather than calculating all co-occurrences.

Is this possible, and how? Two natural-language examples: "In my network, I contact a lot of people through Facebook" -> here I would like to get the co-occurrence of "network" and "contact" (a verb). "I found most of my clients through my network" -> here I would like "my network" + "found my clients".

Any help is mightily appreciated!


Solution

  • It looks like udpipe makes more sense of "context" than kwic does. If sentence-level granularity, lemmas, and restricting word types suffice, it should be rather straightforward. udpipe also has a prebuilt Dutch model available.

    #install.packages("udpipe")
    library(udpipe)
    #dl <- udpipe_download_model(language = "english")  # use language = "dutch" for the real data
    # Check the file name in the download result
    udmodel_en <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
    
    # Single and multisentence samples
    txt <- c("Is this possible, and how? A normal language example: In
    my network, I contact a lot of people through Facebook -> I would like to get co-occurrence of
    network and contact (a verb) I found most of my clients through my network")
    txtb <- c("I found most of my clients through my network")
    x <- udpipe_annotate(udmodel_en, x = txt)
    x <- as.data.frame(x)
    xb <- udpipe_annotate(udmodel_en, x = txtb)
    xb <- as.data.frame(xb)
    
    # Raw preview
    table(x$sentence[x$lemma == 'network'])
    
    # Use x or xb here 
    xn <- udpipe_annotate(udmodel_en, x = x$sentence[x$lemma == 'network'])
    xdf <- as.data.frame(xn)
    
    # Reduce noise (keep content word types) and group lemmas per sentence ~ doc_id
    df_view <- subset(xdf, upos %in% c('PRON', 'NOUN', 'VERB', 'PROPN'))
    library(dplyr)
    df_view %>%
      group_by(doc_id) %>%
      summarize(lemma = paste(sort(unique(lemma)), collapse = ", "))
    

    On a quick test, the prebuilt model treats "network" and "networking" as independent root lemmas, so a rough stemmer might work better. I did, however, verify that including "networks" in a sentence created a new match.

                        I found most of my clients through my network 
                                                                    1 
    I would like to get co-occurrence of network and contact (a verb) 
                                                                    1 
         In my network, I contact a lot of people through Facebook -> 
                                                                    1 
    A tibble: 3 × 2
    doc_id  lemma
    <chr>   <chr>
    doc1    contact, Facebook, I, lot, my, network, people
    doc2    co-occurrence, contact, get, I, like, network, verb
    doc3    client, find, I, my, network
    
    

    It is entirely possible to also extract the previous and following words as context by stepping up and down from the matching lemma indexes, but that felt close to what kwic was already doing. I did not include dynamic co-occurrence tabulation and ordering, but I imagine that should be a rather trivial step now that the contextual words are extracted. It may need some stop words etc., but those should become more apparent with bigger data.
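
    For the tabulation part, udpipe ships a `cooccurrence()` function that works directly on the annotated data frame, so the original question (co-occurrences limited to one lemma) can be answered by computing co-occurrences per sentence and then filtering the result. A minimal sketch, assuming `xdf` is the annotated data frame from the code above; for the Dutch data you would filter on the lemma "netwerk" instead of "network":

    ```r
    library(udpipe)

    # Co-occurrence of content-word lemmas within the same sentence
    cooc <- cooccurrence(
      x     = subset(xdf, upos %in% c("NOUN", "VERB", "PROPN")),
      term  = "lemma",
      group = c("doc_id", "sentence_id")
    )

    # Keep only the pairs that involve the lemma of interest
    subset(cooc, term1 == "network" | term2 == "network")
    ```

    The result has columns term1, term2, and cooc (the pair frequency), already sorted by frequency, so the filter gives exactly the "one term against everything else" view without computing anything extra by hand.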