Tags: python, java, h2o

java.lang.AssertionError when trying to train certain datasets with h2o


I am getting an error when trying to use an isolation forest to detect anomalies in my dataset. The same code works fine on a completely different dataset, so what could be causing this issue?

isolationforest Model Build progress: | (failed) | 0%
Traceback (most recent call last):
  File "h2o_test.py", line 149, in <module>
    isoforest.train(x=iso_forest.col_names[0:65], training_frame=iso_forest)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/estimators/estimator_base.py", line 107, in train
    self._train(parms, verbose=verbose)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/estimators/estimator_base.py", line 199, in _train
    job.poll(poll_updates=self._print_model_scoring_history if verbose else None)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/h2o/job.py", line 89, in poll
    "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
OSError: Job with key $03017f00000132d4ffffffff$_92ee3e892f7bc86460e80153eaec4b70 failed with an exception: java.lang.AssertionError
stacktrace:
java.lang.AssertionError
    at hex.tree.DHistogram.init(DHistogram.java:350)
    at hex.tree.DHistogram.init(DHistogram.java:343)
    at hex.tree.ScoreBuildHistogram2$ComputeHistoThread.computeChunk(ScoreBuildHistogram2.java:427)
    at hex.tree.ScoreBuildHistogram2$ComputeHistoThread.map(ScoreBuildHistogram2.java:408)
    at water.LocalMR.compute2(LocalMR.java:89)
    at water.LocalMR.compute2(LocalMR.java:81)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1704)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.popAndExecAll(ForkJoinPool.java:906)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:979)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
import h2o
from h2o.estimators import H2OIsolationForestEstimator

# Write the incoming CSV data to a temporary file
# (text_stream is populated earlier in the application)
temp_name = "/tmp_rows/temp_file2.csv"
with open('/home/webapp/flask-api/tmp_rows/temp_file2.csv', 'w+') as tmp_file:
    tmp_file.write(text_stream.getvalue())
    # no explicit close() needed; the with-block closes the file

h2o.init()
print("TEMP_NAME", temp_name)
iso_forest = h2o.import_file('/home/webapp/flask-api/{0}'.format(temp_name))

seed = 12345
ntrees = 100
isoforest = H2OIsolationForestEstimator(ntrees=ntrees, seed=seed)
isoforest.train(x=iso_forest.col_names[0:65], training_frame=iso_forest)
predictions = isoforest.predict(iso_forest)
print(predictions)
h2o.cluster().shutdown()

The CSV itself is being created fine, so that doesn't seem to be the issue. What is causing this Java error? I even resized my EC2 instance to have more RAM, but that didn't solve it either.


Solution

  • I suspect this is getting close votes because the problem will be in the data, and the data is not given. But maybe your data cannot be shared, or there is simply too much of it.

    So I'd suggest trying just the first half of the data, then just the second half. If only one of them triggers the error, keep halving that part to see if you can narrow it down to a single row (see the sketch after this list).

    Do the same for columns, e.g. try 10-15 columns at a time, to see whether a single column, or perhaps a certain type of column, is triggering it.

    Of course, once you have that, you also have the solution: exclude the troublesome column/row. You will also have enough for a bug report to H2O (it looks like this can be done at https://github.com/h2oai/h2o-3/issues).
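
Here is a minimal sketch of the row-bisection idea, assuming the frame from the question (iso_forest) is already imported and that a failed train() surfaces as the OSError shown in the traceback above. The helpers trains_ok and narrow_rows are hypothetical, not part of the H2O API:

import h2o
from h2o.estimators import H2OIsolationForestEstimator

def trains_ok(frame, cols):
    """Return True if an isolation forest trains on this subset without failing."""
    try:
        model = H2OIsolationForestEstimator(ntrees=100, seed=12345)
        model.train(x=cols, training_frame=frame)
        return True
    except OSError:  # a failed H2O job is raised as OSError, per the traceback above
        return False

def narrow_rows(frame, cols):
    """Repeatedly halve the frame, keeping whichever half still reproduces the failure."""
    while frame.nrow > 1:
        mid = frame.nrow // 2
        first = frame[0:mid, :]
        second = frame[mid:frame.nrow, :]
        if not trains_ok(first, cols):
            frame = first
        elif not trains_ok(second, cols):
            frame = second
        else:
            break  # neither half fails on its own; the problem needs rows from both
    return frame   # smallest slice found that still triggers the error

suspect_rows = narrow_rows(iso_forest, iso_forest.col_names[0:65])
print(suspect_rows)

The same loop works for columns: instead of row slices, train on column subsets (e.g. frame[:, cols[0:15]]) and keep whichever subset still fails, until you are down to one column or one type of column.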