regexhivehiveqlcreate-tablehive-serde

Creating table with Regex Serde in Hive returns error


I created a table using Regex Serde in Hive. In Hue it returns that the creation of the table was successful. However, when I try to return the table SELECT * FROM pricefile_edited or view the table in hue, it doesn't work and I get the Error .

The data is a 130 character (each line) with no delimiter.

Does anyone know what seems to be the problem, and help? Thanks

CREATE EXTERNAL TABLE pricefile_edited(
field1 STRING,
field2 STRING,
field3 STRING,
field4 STRING,
field5 STRING,
field6 STRING, 
field7 STRING,
field8 STRING,
field9 STRING,
field10 STRING,
field11 STRING,
field12 STRING,
field13 STRING,
field14 STRING,
field15 STRING,
field16 STRING,
field17 STRING,
field18 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' 
 WITH SERDEPROPERTIES ("input.regex" = 
"(\\.{12})(\\.{1})(\\.{1})(\\.{24})(\\.{6})(\\.{6})(\\.{13})(\\.{6})(\\.{1})(\\.{4})(\\.{1})(\\.{3})(\\.{17})(\\.{9})(\\.{12})(\\.{1})(\\.{1})(\\.
{12})")
LOCATION '/user/hive/warehouse';

I get this error:

Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='\xc3\xd7\x97\xd3coB\xa1\x90P\x9e\xab\x82\xa4\xf4A', guid='\x80\xa1\x93\xe2\x10\xefJ\xd9\xa3\xa3\xdb\x1f\x95\x85\x88\xb3')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/hive/warehouse/categories', sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/hive/warehouse/categories:25:24', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:463', 'org.apache.hive.service.cli.operation.OperationManager:getOperationNextRowSet:OperationManager.java:294', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:769', 'sun.reflect.GeneratedMethodAccessor20:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:498', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:422', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1917', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy21:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:462', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:694', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1553', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1538', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', 'java.lang.Thread:run:Thread.java:748', '*java.io.IOException:java.io.IOException: Not a file: hdfs://quickstart.cloudera:8020/user/hive/warehouse/categories:29:4', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:508', 'org.apache.hadoop.hive.ql.exec.FetchOperator:pushRow:FetchOperator.java:415', 'org.apache.hadoop.hive.ql.exec.FetchTask:fetch:FetchTask.java:140', 'org.apache.hadoop.hive.ql.Driver:getResults:Driver.java:2069', 'org.apache.hive.service.cli.operation.SQLOperation:getNextRowSet:SQLOperation.java:458', '*java.io.IOException:Not a file: hdfs://quickstart.cloudera:8020/user/hive/warehouse/categories:32:3', 'org.apache.hadoop.mapred.FileInputFormat:getSplits:FileInputFormat.java:322', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextSplits:FetchOperator.java:363', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getRecordReader:FetchOperator.java:295', 'org.apache.hadoop.hive.ql.exec.FetchOperator:getNextRow:FetchOperator.java:446'], statusCode=3), results=None, hasMoreRows=None)


Solution

  • Table location seems wrong: /user/hive/warehouse - this looks like default warehouse directory. There are some directories inside. It fails on /user/hive/warehouse/categories, saying that this is not a file. Looks like this is categories table directory.

    Create a folder inside /user/hive/warehouse directory and put files in it. Like this:

    /user/hive/warehouse/pricefiles/pricefile_edited.txt
    

    Change table location in the DDL:

    LOCATION '/user/hive/warehouse/pricefiles
    

    The regexp is not correct. Each column should have corresponding group in the regex (in parenthesis). Your regexp for the first column for example says it is 12 dots . because \\. means dot character literally. If you want any 12 characters it should be (.{12}) without two slashes. Also add delimiters between groups (space or tabs or what): (.{12})(.{1}) - this will take 12 characters from 140219078921B0 (140219078921) and B as second column. Fix your regexp accordingly and add spaces(delimiters) between groups if necessary. Also remove extra enter from regexp, write it as single line.

    You can test regex in a simple way using regexp_extract(string, regexp, group_number):

    hive> select regexp_extract('140219078921B0 A1DU1M 1223105DDB','(.{12})',1); --extract group number 1 (group 0 is the whole regexp)
    OK
    140219078921
    Time taken: 1.057 seconds, Fetched: 1 row(s)
    
    hive> select regexp_extract('140219078921B0 A1DU1M 1223105DDB','(.{12})(.{1})',2); --extract group number 2
    OK
    B
    Time taken: 0.441 seconds, Fetched: 1 row(s)
    

    And so on. Add more groups and carefully test