When I try to read JSON content from a field I get:
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
Here are the details:
This is the (super simplified) CSV file I'm trying to read:
1,hello world, document number one,a:foo
22,hello again, document number two,foo:bar
23,hello now, This is some stuff,foo:{bar:baz}
24,hello cow, more test stuff and things,{foo:bar}
55,hello suess, box and sox and goats and moats,[a]
56,hello raven, nevermore said the thing,foo:bar
When I run the indexer this is the result I get:
../bin/indexer --config /home/ec2-user/sphinx/etc/sphinx.conf --all --rotate
Sphinx 3.3.1 (commit b72d67b)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/home/ec2-user/sphinx/etc/sphinx.conf'...
indexing index 'csvtest'...
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
WARNING: document 22, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
WARNING: document 23, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:{bar:baz}'
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
WARNING: document 56, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
collected 6 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 6 docs, 0.1 Kb
total 0.0 sec, 17.7 Kb/sec, 1709 docs/sec
rotating indices: successfully sent SIGHUP to searchd (pid=14393).
This is the entire config file:
source csvsrc
{
type = csvpipe
csvpipe_delimiter = ,
csvpipe_command = cat /home/ec2-user/sphinx/etc/example.csv
csvpipe_field_string =t
csvpipe_attr_string =c
csvpipe_attr_json =assorted
}
index csvtest
{
source = csvsrc
path = /var/data/test7
morphology = stem_en
rt_field = t
rt_field = c
rt_field = assorted
}
indexer
{
mem_limit = 128M
}
searchd
{
listen = 9312
listen = 9306:mysql41
log = /var/log/searchd.log
query_log = /var/log/query.log
pid_file = /var/log/searchd.pid
binlog_path = /var/data
}
And If I do log in and query, it's pretty obvious that the JSON was not, in fact, indexed (as expected from the warnings)
select * from csvtest;
+------+-------------+----------------------------------+----------+
| id | t | c | assorted |
+------+-------------+----------------------------------+----------+
| 1 | hello world | document number one | NULL |
| 22 | hello again | document number two | NULL |
| 23 | hello now | This is some stuff | NULL |
| 24 | hello cow | more test stuff and things | NULL |
| 55 | hello suess | box and sox and goats and moats | NULL |
| 56 | hello raven | nevermore said the thing | NULL |
+------+-------------+----------------------------------+----------+
6 rows in set (0.00 sec)
I have tried a few things, but I'm just groping in the dark. Some things I have tried:
{foo:bar}
and {[foo:bar]}
and [{foo,bar}]
based on some experiences with other JSON inputs where they want it to be either an array or dict at the top level. These actually generate slightly different errors:WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
$end
token that the parser is looking for. This generates an actual error ERROR: index 'csvtest': source 'csvsrc': not all columns found (found=5, total=4, line=1).
which prevents index generation. That makes sense to me2a) I tried adding a whole other column after the JSON so I could have the ending comma but not get an error that would prevent the index from generating. This did generate the index, but did not provide the $end
token that the JSON parser was looking for.
I'm totally stumped.
Well as such a:foo
isnt a valid JSON value AFAIK. LOoks like it meant to be object? So would need {...} surrounding it.
But even {foo:bar}
is not valid either. At the very least the 'value' shoud be quoted {foo:"bar"}
. But really the keys quoting too {"foo":"bar"}
Javascript Objects technically allow unquoted key names, but JSON requires the quoting.
... but also remember it CSV. Quotes are typically used for quoting (eg when columns contain commas), so the quotes need double encoding! Ends up a bit messy...
24,hello cow, more test stuff and things,"{""foo"":""bar""}"