I am testing text2vec. There are only 2 files in a directory (1.txt and 2.txt, both very small, about 20 KB each). I want to measure their similarity, but I do not know why text2vec reports 54 documents.
> library(stringr)
> library(NLP)
> library(tm)
> library(text2vec)
> filedir="F:\\0 R\\similarity test\\corpus"
> prep_fun = function(x) {
+ x %>%
+ # make text lower case
+ str_to_lower %>%
+ # remove non-alphanumeric symbols
+ str_replace_all("[^[:alnum:]]", " ") %>%
+ # collapse multiple spaces
+ str_replace_all("\\s+", " ")
+ }
> allfile=idir(filedir)
> #files=list.files(path=filedir, full.names=T)
> #allfile=ifiles(files)
> it=itoken(allfile, preprocessor=prep_fun, progressbar=F)
> stopwrd=stopwords("en")
> v=create_vocabulary(it, stopwords=stopwrd)
> v
Number of docs: 54
174 stopwords: i, me, my, myself, we, our ...
ngram_min = 1; ngram_max = 1
Vocabulary:
term term_count doc_count
1: house 2 2
2: 224161072 2 2
3: suggests 2 2
4: remains 2 2
5: published 2 2
---
338: year 14 6
339: nep 16 12
340: will 16 10
341: chinese 20 12
342: malay 20 10
>
I exported the data to CSV and found that the documents are named:
1.txt_1
1.txt_2
1.txt_3
1.txt_4
...
...
If I instead build the iterator with ifiles (the lines commented out above),
files=list.files(path=filedir, full.names=T)
allfile=ifiles(files)
it still reports 54 documents.
The output also contains similarity measures between all of these documents, and most of them are 0.
Please let me know whether this is the expected behaviour.
What I want is a single similarity measure between 1.txt and 2.txt, and an output matrix that contains only the measure for these two files.
text2vec considers each line in each file as a separate document, which is why you get 54 documents with ids such as 1.txt_1 and 1.txt_2. In your case I suggest providing another reader function to idir/ifiles. The reader should just read the whole file and collapse its lines into a single string, for example reader = function(x) paste(readLines(x), collapse = ' ').
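A minimal sketch of that suggestion, end to end, might look like the following. The demo folder, the two tiny files, and the helper names read_whole_file and make_it are made up for illustration; point filedir at your own corpus instead. Naming the returned string after the file is an assumption about how text2vec picks document ids (it appears to use the names of the reader's result when they are present), and the iterator is rebuilt for each pass because some older text2vec versions exhaust it after one use. The prep_fun and stopwords from the question can be passed to itoken and create_vocabulary in the same way as before.

```r
library(text2vec)

# Hypothetical demo corpus so the sketch runs anywhere;
# replace filedir with the real folder that holds 1.txt and 2.txt.
filedir <- file.path(tempdir(), "corpus_demo")
dir.create(filedir, showWarnings = FALSE)
writeLines(c("the chinese and malay economy", "grew this year"),
           file.path(filedir, "1.txt"))
writeLines(c("the malay economy", "remains strong this year"),
           file.path(filedir, "2.txt"))

# Reader that returns ONE string per file instead of one per line.
# The name is meant to serve as the document id (assumption, see above).
read_whole_file <- function(path) {
  txt <- paste(readLines(path, warn = FALSE), collapse = " ")
  setNames(txt, basename(path))
}

files <- list.files(filedir, full.names = TRUE)

# Build a fresh token iterator for every pass over the files.
make_it <- function() {
  itoken(ifiles(files, reader = read_whole_file),
         preprocessor = tolower,
         progressbar = FALSE)
}

v   <- create_vocabulary(make_it())
dtm <- create_dtm(make_it(), vocab_vectorizer(v))

# Cosine similarity between the rows of the document-term matrix.
sim <- sim2(dtm, method = "cosine", norm = "l2")

print(dim(dtm))  # one row per file
print(sim)       # 2 x 2 matrix; the off-diagonal entry is the 1.txt vs 2.txt similarity
```

With one row per file, the similarity matrix has only two rows and two columns, so the single number asked for is the off-diagonal entry.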