
How to load multiple JSON files into a quanteda corpus using readtext?


I'm trying to load a large number of JSON files from a news website into a quanteda corpus using readtext. To simplify the process, the JSON files are all in the working directory, though I have also tried placing them in their own directory.

  1. When I use c() to create a variable that explicitly names a small subset of files, readtext works as hoped and corpus() creates a proper corpus.
  2. When I use list.files() to create a variable listing all 1,500+ JSON files, readtext returns errors and no corpus is created.

I tried to inspect the results of the two methods of defining the set of texts (i.e. c() and list.files()) as well as paste0().

# Load libraries
library(readtext)
library(quanteda)

# Define a set of texts explicitly
a <- c("border_2020_05_10__1589150513.json","border_2020_05_10__1589143358.json","border_2020_05_07__1589170960.json")

# This produces a corpus
extracted_texts <- readtext(a, text_field = "maintext")
my_corpus <- corpus(extracted_texts)

# Define a set of all texts in working directory
b <- list.files(pattern = "*.json", full.names = F)

# This, which I hope to use, produces an error
extracted_texts <- readtext(b, text_field = "maintext")
my_corpus <- corpus(extracted_texts)

The error produced by extracted_texts <- readtext(b, text_field = "maintext") is as follows:

File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.

This is perplexing because the same files called with a do not produce an error. I validated several of the JSON files, and in every case the validator returned VALID (RFC 8259, the IETF standard for JSON).

Inspecting the differences between a and b:

I'm really confused why a works and b does not.

Lastly, mimicking the procedure shown in the readtext documentation, I also tried the following:

# XXXX = my username
data_dir <- file.path("C:/Users/XXXX/Documents/R/")

d <- readtext(paste0(data_dir, "/corpus_linguistics/*.json"), text_field = "maintext")

This also returned the error

File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.

At this point I'm stumped. Thanks in advance for any insight on how to move forward.

Solution and Summary

  1. Unclean data: A few of the input JSON files have a null maintext field. These are not useful for analysis and should be removed. All of the files contain a JSON field called "title_rss" that is null. This can be eliminated with a directory-level find and replace in Notepad++, or probably in R or Python, though I still lack the skills for that. Additionally, the files were not UTF-8 encoded; that was resolved with Codepage Converter.
  2. Method to call directory string: The list.files() method is used in the readtext How to Use documentation and several third-party tutorials. It works with *.txt files, but for some reason it does not seem to work with these particular JSON files. Once the JSON files are properly cleaned and encoded, the method below works without errors. If data_dir is wrapped in list.files(), it produces the following error: Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist. I'm not sure why that is, but leaving it out works for these JSON files.

# Load libraries
library(readtext)
library(quanteda)

# Define a set of texts explicitly
data_dir <- "C:/Users/Nathan/Documents/R/corpus_linguistics/"
extracted_texts <- readtext(paste0(data_dir, "texts_unmodified/*.json"), text_field = "maintext", verbosity = 3)
my_corpus <- corpus(extracted_texts)
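
The cleaning described in item 1 could probably also be done in R rather than with Notepad++ and Codepage Converter. The following is only a rough sketch, not something that was run on the original files: the helper names has_text() and clean_json_file() are made up for illustration, it assumes each file holds a single JSON object, and it assumes the source encoding is Windows-1252 (passed to iconv() as "CP1252").

```r
library(jsonlite)

# Hypothetical helper: TRUE when the parsed record has non-empty text in `field`.
has_text <- function(rec, field = "maintext") {
  val <- rec[[field]]
  is.character(val) && length(val) == 1 && !is.na(val) && nzchar(trimws(val))
}

# Hypothetical helper: re-encode one file to UTF-8, drop "title_rss", and write
# the result into `outdir`. Returns FALSE (and writes nothing) when the text
# field is null or empty. Assumes each file holds a single JSON object.
clean_json_file <- function(infile, outdir, field = "maintext",
                            from_enc = "CP1252") {
  raw_txt <- paste(readLines(infile, warn = FALSE), collapse = "\n")
  txt <- iconv(raw_txt, from = from_enc, to = "UTF-8")
  rec <- fromJSON(txt, simplifyVector = FALSE)
  if (!has_text(rec, field)) {
    return(FALSE)
  }
  rec[["title_rss"]] <- NULL  # remove the always-null field
  out <- toJSON(rec, auto_unbox = TRUE, null = "null", digits = NA)
  writeLines(enc2utf8(as.character(out)),
    file.path(outdir, basename(infile)),
    useBytes = TRUE
  )
  TRUE
}

# Example usage (directory names are placeholders):
# in_files <- list.files("texts_unmodified", pattern = "\\.json$", full.names = TRUE)
# dir.create("texts_clean", showWarnings = FALSE)
# kept <- vapply(in_files, clean_json_file, logical(1), outdir = "texts_clean")
```

Files for which the helper returns FALSE are simply skipped, so the output directory only contains documents that readtext should be able to use.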

Test with unmodified files, one known to have empty fields

Input: 5 files, 4 with a non-empty maintext field and 1 with a null maintext field. All of the files use Western European (Windows) 1252 encoding.

Errors:

Reading texts from C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/*.json
, using glob pattern
 ... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_02_17__1589147645.json
File doesn't contain a single valid JSON object.
 ... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_03_13__1589150325.json
File doesn't contain a single valid JSON object.
Column 14 ['maintext'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
 ... read 5 documents.

Result: a properly formed corpus consisting of 5 documents. One document lacks either tokens or types. The corpus seems to build properly despite the errors. Perhaps some special characters don't display properly because of the encoding issue. I was not able to check this.
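
One way the encoding could be checked from R is with base R's validUTF8(). This is a minimal sketch, and the helper name is_utf8_file() is made up for illustration. Note that a file containing only ASCII characters passes the check regardless of how it was saved, since ASCII is a subset of UTF-8.

```r
# Hypothetical helper: TRUE when every line of the file is valid UTF-8.
is_utf8_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Example usage (directory name is a placeholder):
# files <- list.files("texts_unmodified", pattern = "\\.json$", full.names = TRUE)
# not_utf8 <- files[!vapply(files, is_utf8_file, logical(1))]
```

Any file listed in not_utf8 contains byte sequences that are not valid UTF-8 and is a candidate for conversion.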

Test with cleaned files known to have no empty fields

Input files: 4 files that have no empty or null JSON fields. In all cases, the maintext field contains text and the title_rss field was removed. Each file was converted from Western European (Windows) 1252 to Unicode (UTF-8, code page 65001).

Errors: NONE!

Result: A properly formed corpus.

Many thanks to the two developers for detailed feedback and useful leads. The assistance is deeply appreciated.


Solution

  • There are a few possibilities here, but the most likely are:

    1. One of your files has a malformed JSON structure from the point of view of readtext(). Even if the file passes a strict JSON validity check, an empty text field, for instance, will cause the error. (See below for a demonstration and a solution.)

    2. While readtext() can take a "glob" pattern match, list.files() takes a regular expression. It's possible (but unlikely) that list.files(pattern = "*.json", ...) is picking up something you don't want. But list.files() should not be necessary with readtext() -- see below.
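
To illustrate the glob versus regular expression distinction from point 2, here is a small sketch using made-up file names. It compares a loose, unanchored regular expression with an anchored one, and with the regular expression that utils::glob2rx() produces from the glob "*.json".

```r
# Made-up file names for illustration only.
files <- c(
  "border_2020_05_10__1589150513.json",
  "notes_json.txt",
  "old.json.bak"
)

# Loose regex: "." matches any character and nothing is anchored,
# so "_json" and ".json.bak" also match.
loose <- grepl(".json", files)

# Anchored regex: a literal dot followed by "json" at the end of the name.
strict <- grepl("\\.json$", files)

# glob2rx() translates a glob such as "*.json" into an equivalent regex.
from_glob <- grepl(utils::glob2rx("*.json"), files)

print(data.frame(files, loose, strict, from_glob))
```

With list.files(), passing pattern = "\\.json$" or pattern = glob2rx("*.json") restricts the listing to names that actually end in .json.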

    To demonstrate, let's write out each document in data_corpus_inaugural as a separate JSON file, and then read them in using readtext().

    library("quanteda", warn.conflicts = FALSE)
    ## Package version: 2.0.1
    ## Parallel computing: 2 of 8 threads used.
    ## See https://quanteda.io for tutorials and examples.
    
    tmpdir <- tempdir()
    corpdf <- convert(data_corpus_inaugural, to = "data.frame")
    for (d in corpdf$doc_id) {
      cat(jsonlite::toJSON(dplyr::filter(corpdf, doc_id == d)),
        file = paste0(tmpdir, "/", d, ".json")
      )
    }
    
    head(list.files(tmpdir))
    ## [1] "1789-Washington.json" "1793-Washington.json" "1797-Adams.json"     
    ## [4] "1801-Jefferson.json"  "1805-Jefferson.json"  "1809-Madison.json"
    

    To read them in, you can use the "glob" pattern match here and just read the JSON files.

    rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
      text_field = "text", docid_field = "doc_id"
    )
    summary(corpus(rt), n = 5)
    ## Corpus consisting of 58 documents, showing 5 documents:
    ## 
    ##                  Text Types Tokens Sentences Year  President FirstName
    ##  1789-Washington.json   625   1537        23 1789 Washington    George
    ##  1793-Washington.json    96    147         4 1793 Washington    George
    ##       1797-Adams.json   826   2577        37 1797      Adams      John
    ##   1801-Jefferson.json   717   1923        41 1801  Jefferson    Thomas
    ##   1805-Jefferson.json   804   2380        45 1805  Jefferson    Thomas
    ##                  Party
    ##                   none
    ##                   none
    ##             Federalist
    ##  Democratic-Republican
    ##  Democratic-Republican
    

    So that all worked fine.

    But if we add to this one file whose text field is empty, then this produces the error in question:

    cat('[ { "doc_id" : "d1", "text" : "this is a file" },
           { "doc_id" : "d2", "text" :  } ]',
      file = paste0(tmpdir, "/badfile.json")
    )
    rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
      text_field = "text", docid_field = "doc_id"
    )
    ## File doesn't contain a single valid JSON object.
    ## Error: This JSON file format is not supported.
    

    True, that was not a valid JSON file, since it contained a key with no value. But I suspect you have something like that in one of your files.

    Here's how you can identify the problem: loop through your b (from the question, not as I've specified it below).

    b <- tail(list.files(tmpdir, pattern = ".*\\.json", full.names = TRUE))
    for (f in b) {
      cat("Reading:", f, "\n")
      rt <- readtext::readtext(f, text_field = "text", docid_field = "doc_id")
    }
    ## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2001-Bush.json 
    ## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2005-Bush.json 
    ## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2009-Obama.json 
    ## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2013-Obama.json 
    ## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2017-Trump.json 
    ## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/badfile.json 
    ## File doesn't contain a single valid JSON object.
    ## Error: This JSON file format is not supported.
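
The loop above stops at the first file that raises an error. With many files, it may be more convenient to keep going and collect every failing file. The following is a sketch of that idea using tryCatch(); the function name find_failures() and its reader argument are made up for illustration, and it only catches errors, not messages or warnings.

```r
# Hypothetical helper: call `reader` on each file and return the files
# for which it raised an error, instead of stopping at the first one.
find_failures <- function(files, reader) {
  failed <- character(0)
  for (f in files) {
    ok <- tryCatch(
      {
        reader(f)
        TRUE
      },
      error = function(e) {
        message("Failed: ", f, " (", conditionMessage(e), ")")
        FALSE
      }
    )
    if (!ok) failed <- c(failed, f)
  }
  failed
}

# Example usage with readtext (assumes b holds full file paths):
# bad_files <- find_failures(b, function(f) {
#   readtext::readtext(f, text_field = "maintext")
# })
# good_files <- setdiff(b, bad_files)
```

The files left in good_files can then be passed to readtext() in one call, while bad_files can be inspected or cleaned separately.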