nlp glove

When calculating the cooccurrence of two words, do we separate the sentences or link all sentences together?


For example, I have a document that contains 2 sentences: I am a person. He also likes apples. Do we need to count the cooccurrence of "person" and "He"?


Solution

  • Each document is separated with a line break. Context windows of cooccurrences are limited to each document.

    Based on the implementation here.

    A newline is taken as indicating a new document (contexts won't cross newline).

    So, depending on how you prepare sentences, you may get different results:

    Setting 1: both sentences on one line, so ('He', 'person') cooccur

    ...
    I am a person. He also likes apples.
    ...
    

    Setting 2: one sentence per line, so ('He', 'person') do not cooccur

    ...
    I am a person. 
    He also likes apples.
    ...
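
To make the difference between the two settings concrete, here is a minimal Python sketch of window-based cooccurrence counting in which every line is treated as its own document. It is not GloVe's actual code: the function name `count_cooccurrences`, the `window_size` default, and the punctuation stripping are assumptions made for illustration, and it ignores any distance weighting.

```python
from collections import Counter


def count_cooccurrences(text, window_size=5):
    """Count symmetric, unweighted cooccurrences; each line is one document."""
    counts = Counter()
    for line in text.splitlines():
        # Simplification (assumption): strip punctuation so "person." becomes "person".
        tokens = [t.strip(".,!?") for t in line.split()]
        tokens = [t for t in tokens if t]
        for i, word in enumerate(tokens):
            # Look back at most window_size tokens, never past the start of the line.
            for j in range(max(0, i - window_size), i):
                pair = tuple(sorted((tokens[j], word)))
                counts[pair] += 1
    return counts


setting_1 = "I am a person. He also likes apples."
setting_2 = "I am a person.\nHe also likes apples."

print(count_cooccurrences(setting_1)[("He", "person")])  # 1
print(count_cooccurrences(setting_2)[("He", "person")])  # 0
```

Because the context list is rebuilt for every line, the newline in Setting 2 is what keeps "person" and "He" apart. If you want sentence-level contexts, write one sentence per line; if you want contexts to span sentences, keep each document on a single line.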