Tags: r, json, web-scraping, error-handling, jsonparser

How to fix "lexical error: invalid char in json text" when parsing JSON scraped from a page script


I have a problem with this raw JSON, which is generated by a script embedded in a page of an e-commerce site.

I need to parse it and extract information about the product.

The same code works for other pages of the site, but some products trigger the error.

This is one of the problematic JSONs:

"\n\tvar products = [];\n\n\tvar _getAvailabilityText = function(available) {\n\t\tif(available != 'outOfStock')\n\t\t\treturn 'si';\n\t\telse\n\t\t\treturn 'no';\n\t};\n\n\tvar _getAvailabilityBinary = function(available) {\n\t\tif(available != 'outOfStock')\n\t\t\treturn 1;\n\t\telse\n\t\t\treturn 0;\n\t};\n\n\tproducts.push({\n\t\tid \t\t\t: 'IME5401' || '',\n\t\tprice \t\t: '22.99' || '',\n\t\tcurrency \t: 'EUR',\n\t\tname \t\t: 'Imetec Living Air umidificatore Vapore 0,4 L 700 W Blu' || '',\n\t\tcategory \t: 'Riscaldamento' || '',\n\t\tcategoryId\t: 'C7301' || '',\n\t\tgroup\t\t: 'Riscaldamento' || '',\n\t\ttdId\t\t: '',\n\t\tweight\t\t: '',\n\t\tbrand \t\t: 'Imetec' || '',\n\t\tvariant \t: '',\n\t\tdimension55 : _getAvailabilityText('lowStock'),\n\t\tmetric5 \t: _getAvailabilityBinary('lowStock'),\n\t\tdimension63\t: '',\n\t\tmetric12\t: '',\n\t\tdimension10 : 'Piccoli e Grandi Elettrodomestici',\n\t\tdimension11 : 'Trattamento Aria',\n\t\tdimension66 : '',\n\t\tdimension62 : '' || 'no-promo'\n\t});\n\n\twindow.dataLayer.push({\n\t\t'products'\t\t: products\n\t});\n\n\t/*window.dataLayer.push({\n\t\t'event' : 'detail',\n\t\t'ecommerce' : {\n\t\t\t'currencyCode': 'EUR',\n\t\t\t'detail' : {\n\t\t\t\t'products' : products\n\t\t\t}\n\t\t}\n\t});*/\n\n\twindow.dataLayer.push({\n\t\t'event': 'productDetail',\n\t\t'ecommerce' : {\n\t\t\t'currencyCode': 'EUR',\n\t\t\t'detail': {\n\t\t\t\t'products' : [{\n\t\t\t\t\t'name': 'Imetec Living Air umidificatore Vapore 0,4 L 700 W Blu' || '',\n\t\t\t\t\t'id': 'IME5401' || '',\n\t\t\t\t\t'price': '22.99' || '',\n\t\t\t\t\t'brand': 'Imetec' || '',\n\t\t\t\t\t'category': 'Riscaldamento' || '',\n\t\t\t\t}]\n\t\t\t}\n\t\t}\n\t});\n"

Here is my code:

  library(stringr)  # str_* helpers; stringr also re-exports the %>% pipe

  # json_data holds the script text shown above
  raw_json_embed <- json_data %>%
    str_remove_all("\\n|\\t") %>%
    str_extract("(?<=products\\.push\\()(\\{.*?\\})(?=\\);)") %>%
    str_replace_all("'", '"') %>%
    str_replace_all(' : ', ':')
  ex_parsed_json <- jsonlite::parse_json(raw_json_embed)

At this point I get this error:

Error: lexical error: invalid char in json text.
                                      {id:"IME5401" || "",price:"22.99
                     (right here) ------^

I have tried other solutions such as these:

  raw_json_embed <- json_data %>%
    str_remove_all("\\n|\\t") %>%
    str_replace(".*(\\[\\{)", "\\1") %>%
    str_replace("(\\}\\]).*", "\\1")
  
  raw_json_embed <- gsub("'", '"', raw_json_embed)

But I still get the error.

If I copy the whole raw JSON into a JSON validator, it doesn't find any problem at all. I'm clueless.


Solution

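  • The string is not JSON but JavaScript source: the keys are unquoted, many values are expressions such as 'IME5401' || '', and some are function calls, so swapping quotes cannot turn it into something jsonlite can parse. With some trickery, this particular example can be evaluated as-is in the V8 JavaScript engine, and the resulting object can be passed through JavaScript's JSON.stringify() to get valid JSON. Keep in mind, though, that running arbitrary code scraped from a page is generally a BadIdea(tm), and this approach may not scale to your real task.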

    library(V8)
    #> Using V8 engine 9.1.269.38
    library(dplyr)
    
    ct <- v8()
    # V8 does not provide a browser window object, but the script only uses
    # window.dataLayer.push(), so we can mock it with our own object holding an array:
    ct$eval("var window = {dataLayer : []};")
    # evaluate the js string; the script pushes product details to window.dataLayer
    ct$eval(js)
    #> [1] "2"
    
    # serialize our fake window.dataLayer to a JSON string, then parse it in R
    products_json <- ct$eval("JSON.stringify(window.dataLayer)") %>% 
      jsonlite::parse_json()
    
    # the 2 objects that the js script pushed to window.dataLayer:
    products_json[[1]][["products"]][[1]] %>% 
      as_tibble() %>% 
      glimpse()
    #> Rows: 1
    #> Columns: 19
    #> $ id          <chr> "IME5401"
    #> $ price       <chr> "22.99"
    #> $ currency    <chr> "EUR"
    #> $ name        <chr> "Imetec Living Air umidificatore Vapore 0,4 L 700 W Blu"
    #> $ category    <chr> "Riscaldamento"
    #> $ categoryId  <chr> "C7301"
    #> $ group       <chr> "Riscaldamento"
    #> $ tdId        <chr> ""
    #> $ weight      <chr> ""
    #> $ brand       <chr> "Imetec"
    #> $ variant     <chr> ""
    #> $ dimension55 <chr> "si"
    #> $ metric5     <int> 1
    #> $ dimension63 <chr> ""
    #> $ metric12    <chr> ""
    #> $ dimension10 <chr> "Piccoli e Grandi Elettrodomestici"
    #> $ dimension11 <chr> "Trattamento Aria"
    #> $ dimension66 <chr> ""
    #> $ dimension62 <chr> "no-promo"
    
    products_json[[2]][["ecommerce"]][["detail"]][["products"]][[1]] %>% 
      as_tibble() %>% 
      glimpse()
    #> Rows: 1
    #> Columns: 5
    #> $ name     <chr> "Imetec Living Air umidificatore Vapore 0,4 L 700 W Blu"
    #> $ id       <chr> "IME5401"
    #> $ price    <chr> "22.99"
    #> $ brand    <chr> "Imetec"
    #> $ category <chr> "Riscaldamento"
    

    Input js string, formatted (the r"(...)" raw string literal requires R 4.0 or later):

    js <- r"(
    var products = [];
    
    var _getAvailabilityText = function (available) {
        if (available != 'outOfStock')
            return 'si';
        else
            return 'no';
    };
    
    var _getAvailabilityBinary = function (available) {
        if (available != 'outOfStock')
            return 1;
        else
            return 0;
    };
    
    products.push({
        id: 'IME5401' || '',
        price: '22.99' || '',
        currency: 'EUR',
        name: 'Imetec Living Air umidificatore Vapore 0,4 L 700 W Blu' || '',
        category: 'Riscaldamento' || '',
        categoryId: 'C7301' || '',
        group: 'Riscaldamento' || '',
        tdId: '',
        weight: '',
        brand: 'Imetec' || '',
        variant: '',
        dimension55: _getAvailabilityText('lowStock'),
        metric5: _getAvailabilityBinary('lowStock'),
        dimension63: '',
        metric12: '',
        dimension10: 'Piccoli e Grandi Elettrodomestici',
        dimension11: 'Trattamento Aria',
        dimension66: '',
        dimension62: '' || 'no-promo'
    });
    
    window.dataLayer.push({
        'products': products
    });
    
    
    window.dataLayer.push({
        'event': 'productDetail',
        'ecommerce': {
            'currencyCode': 'EUR',
            'detail': {
                'products': [{
                    'name': 'Imetec Living Air umidificatore Vapore 0,4 L 700 W Blu' || '',
                    'id': 'IME5401' || '',
                    'price': '22.99' || '',
                    'brand': 'Imetec' || '',
                    'category': 'Riscaldamento' || '',
                }]
            }
        }
    });
    
    )"
    

    Created on 2023-05-10 with reprex v2.0.2
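
    If evaluating page scripts is not an option, a rough alternative to the V8 approach above is to pull only the plain key : 'literal' pairs out of the first products.push({...}) block with a regex. This is a minimal sketch, not a JavaScript parser: extract_product_fields() and sample_js are hypothetical names made up for illustration, the sample is a shortened version of the script from the question, and the sketch assumes values contain no apostrophes. Nothing is evaluated, so fields set by function calls (dimension55, metric5) are skipped and a fallback like '' || 'no-promo' comes back as the empty first literal.

```r
library(stringr)

# Hypothetical helper: collect `key : 'literal'` pairs from the first
# products.push({...}) block. No JavaScript is evaluated.
extract_product_fields <- function(js) {
  # (?s) lets . match newlines; the lazy .*? stops at the first "});"
  block <- str_extract(js, "(?s)products\\.push\\(\\{.*?\\}\\);")
  # column 2 = key, column 3 = first single-quoted literal after the colon
  pairs <- str_match_all(block, "(\\w+)\\s*:\\s*'([^']*)'")[[1]]
  setNames(pairs[, 3], pairs[, 2])
}

# Shortened stand-in for the script text in the question (assumption)
sample_js <- paste0(
  "var products = [];\n",
  "products.push({\n",
  "\tid \t\t: 'IME5401' || '',\n",
  "\tprice \t: '22.99' || '',\n",
  "\tmetric5 : _getAvailabilityBinary('lowStock'),\n",
  "\tdimension62 : '' || 'no-promo'\n",
  "});\n"
)

fields <- extract_product_fields(sample_js)
fields[["id"]]
fields[["price"]]
```

    The result is a named character vector, so it can be turned into a one-row table with tibble::as_tibble_row() or similar. For anything beyond flat string fields, the V8 route is the more reliable one.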