When I recreate a table in Spark using the command displayed by show create table mydb.mytable, I can no longer use the table from Hive. This only happens for a few tables; the other tables I recreate can still be accessed from Hive.
I'm adding comment metadata to my Spark tables. I do it by recreating each loaded table. First I get the Spark command that creates the table:
spark.sql('show create table mydb.mytable')
It returns a CREATE TABLE command like:
"CREATE TABLE `raw_db`.`mytable` (`municipio` STRING )
USING orc
USING orc
OPTIONS (`serialization.format` '1',
path 'hdfs://xxxx/corporativo/raw_db/mytable')"
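For reference, this is roughly what my script does in PySpark. It's a sketch, not the exact code: the comment values are illustrative, the regex splice is just how I happen to insert the COMMENT clauses, and the createtab_stmt column name is what SHOW CREATE TABLE returns in my Spark version.

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Column comments I want to attach (illustrative values).
comments = {"municipio": "Nome do municipio"}

# SHOW CREATE TABLE returns one row with the DDL text; in my Spark
# version the column is called createtab_stmt.
ddl = spark.sql("SHOW CREATE TABLE mydb.mytable").collect()[0][0]

# Splice a COMMENT clause right after each column's type.
for col, text in comments.items():
    pattern = r"(`%s`\s+[A-Z]+(?:\([^)]*\))?)" % re.escape(col)
    escaped = text.replace("'", "\\'")
    ddl = re.sub(pattern, lambda m: "%s COMMENT '%s'" % (m.group(1), escaped),
                 ddl, count=1)

# Drop and recreate; the table points at an explicit HDFS path,
# so the ORC files themselves stay where they are.
spark.sql("DROP TABLE IF EXISTS mydb.mytable")
spark.sql(ddl)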
I parse the returned SQL, insert the comments into it, and recreate the table. Spark still sees the table correctly, but it is broken in Hive. Here is what Hive displays when I run describe raw_db.mytable:
col_name    data_type       comment
col         array<string>   from deserializer
Here is the complete schema of a table where the recreation fails:
CREATE TABLE `sbx_ppppp`.`mytable` (
`id_conct_cdstrl` DECIMAL(19,0),
`dt_cria` TIMESTAMP,
`id_usu_cria` STRING,
`dt_ult_atualz` TIMESTAMP,
`id_usu_ult_atualz` STRING,
`nu_vrs` DECIMAL(19,0),
`dt_ini_vignc` TIMESTAMP,
`dt_fim_vignc` TIMESTAMP,
`nu_dia_prz_vignc` DECIMAL(4,0),
`in_escp_cnslt_altrd` STRING,
`id_fnt_conct_cdstrl` DECIMAL(2,0),
`id_avalc_conct_cdstrl_dlibr` DECIMAL(2,0),
`id_avalc_conct_cdstrl_autmtc` DECIMAL(2,0),
`id_avalc_conct_cdstrl_prop` DECIMAL(2,0),
`nu_rlat` DECIMAL(19,0),
`id_sit_conct_cdstrl` DECIMAL(2,0),
`id_solic_anls_conct_cdstrl` DECIMAL(19,0),
`id_anlst_rspvl` DECIMAL(19,0),
`id_grnt_rspvl` DECIMAL(19,0),
`id_doc_lst` DECIMAL(19,0),
`id_dlibr` DECIMAL(19,0),
`id_pondrc_escp` DECIMAL(2,0),
`qt_entdd_nao_escp` DECIMAL(6,0),
`vl_pontc_calc_nao_escp` DECIMAL(9,0),
`ds_fnaldd` STRING,
`te_pre_anls_conct_cdstrl` STRING)
USING orc
OPTIONS (
`serialization.format` '1',
path 'hdfs://xxxx/corporativo/raw_db/mytable'
)
I couldn't spot any difference between a table where this works and one whose schema I can no longer see from Hive after recreating.
How do I fix this problem for this handful of tables?
The problem was that some tables had comments that were too long. Any table with a column comment longer than 1000 bytes would still work in Spark, but would have a broken schema when accessed from Hive.
I truncated the comments that were over 1000 bytes and everything worked fine.
Note that this limit applies to the number of bytes in the string, not the number of characters. So if your comments are UTF-8 encoded, any accented character uses at least 2 bytes.
The limit doesn't seem to apply to the table comment, only to the column comments.
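For reference, this is roughly the truncation I applied before splicing the comments back in. The helper name is mine, and the 1000-byte cutoff is just what I observed, not something I found documented.

def truncate_utf8(text, max_bytes=1000):
    # Trim so the UTF-8 encoding fits within max_bytes without cutting
    # a multi-byte character in half.
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops any trailing partial character left by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

# Reuses the comments dict from the earlier sketch.
comments = {col: truncate_utf8(text) for col, text in comments.items()}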
I still don't know where this comment size limit is configured. Everything worked fine in my development environment but failed in production.
BTW, in my version of Spark, the show create table command does not display the comments.
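If you need to read back the comments that are already on a table, the catalog API does expose them; in my PySpark version the comment comes back in the description field of listColumns:

# Existing column comments for mydb.mytable, keyed by column name.
existing_comments = {
    c.name: c.description
    for c in spark.catalog.listColumns("mytable", "mydb")
    if c.description
}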