[SOLVED] Why does text2vec show more files than actually exist?

I am testing text2vec. There are only two files in a directory (1.txt and 2.txt, both small, about 20 KB each). I want to measure their similarity, but I do not know why text2vec reports 54 documents.

> library(stringr)
> library(NLP)
> library(tm)
> library(text2vec)
> filedir="F:\\0 R\\similarity test\\corpus"
> prep_fun = function(x) {
+     x %>%
+     # make text lower case
+     str_to_lower %>%
+     # remove non-alphanumeric symbols
+     str_replace_all("[^[:alnum:]]", " ") %>%
+     # collapse multiple spaces
+     str_replace_all("\\s+", " ")
+ }
> allfile=idir(filedir)
> #files=list.files(path=filedir, full.names=T)
> #allfile=ifiles(files)
> it=itoken(allfile, preprocessor=prep_fun, progressbar=F)
> stopwrd=stopwords("en")
> v=create_vocabulary(it, stopwords=stopwrd)
> v
Number of docs: 54 
174 stopwords: i, me, my, myself, we, our ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
          term term_count doc_count
  1:     house          2         2
  2: 224161072          2         2
  3:  suggests          2         2
  4:   remains          2         2
  5: published          2         2
 ---                               
338:      year         14         6
339:       nep         16        12
340:      will         16        10
341:   chinese         20        12
342:     malay         20        10
>

When I export the data to CSV, the document names look like this:

1.txt_1
1.txt_2
1.txt_3
1.txt_4
...

If I use the two commented-out lines instead of idir,

files=list.files(path=filedir, full.names=T)
allfile=ifiles(files)

it still reports 54 documents.

The output also contains similarity measures between all of these documents, and most of them are 0.

Please let me know whether this is the expected behavior.

What I want is a single similarity measure between 1.txt and 2.txt, and an output matrix that contains the measure for only these two files.

Solution

text2vec treats each line of each file as a separate document, which is why the document names look like 1.txt_1, 1.txt_2, and so on. In your case I suggest passing a different reader function to idir/ifiles. The reader should read the whole file and collapse its lines into a single string, for example reader = function(x) paste(readLines(x), collapse=' ').
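
The suggestion above can be sketched roughly as follows. This is only an illustration under some assumptions: the temporary directory, the file contents, and the name read_whole_file are made up for the example and do not come from the original post. The text2vec part is guarded so that the script still runs where the package is not installed.

```r
# Hypothetical reader: returns the whole file as ONE string, named after the
# file so that the file name can serve as the document id.
read_whole_file <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  names(txt) <- basename(path)
  txt
}

# Stand-in corpus (assumption): two small multi-line files in a temp directory.
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(c("the house remains unsold", "a report was published"),
           file.path(corpus_dir, "1.txt"))
writeLines(c("the house was sold", "no report was published"),
           file.path(corpus_dir, "2.txt"))
files <- list.files(corpus_dir, full.names = TRUE)

# Reading line by line yields one element per line across both files ...
n_lines <- sum(lengths(lapply(files, readLines)))
# ... while the collapsing reader yields one element per file.
docs <- unlist(lapply(files, read_whole_file))

if (requireNamespace("text2vec", quietly = TRUE)) {
  library(text2vec)
  it <- itoken(ifiles(files, reader = read_whole_file),
               preprocessor = tolower, tokenizer = word_tokenizer,
               progressbar = FALSE)
  v <- create_vocabulary(it)
  dtm <- create_dtm(it, vocab_vectorizer(v))
  # 2 x 2 cosine similarity matrix; the off-diagonal entry is the
  # similarity between 1.txt and 2.txt.
  print(sim2(dtm, method = "cosine", norm = "l2"))
}
```

With your own data you would presumably keep your prep_fun as the preprocessor and pass stopwords=stopwrd to create_vocabulary, as in your code, and only swap in the reader.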