pyspark databricks xgboost

XGBoost model running out of memory in Databricks/PySpark


I am facing a problem for which I am unable to find a solution: whenever an XGBoost model is trained on a relatively small dataset inside a Databricks environment, using the PySpark integration via xgboost.spark.SparkXGBClassifier, the task fails due to insufficient memory. The exception is as follows:

Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(66,0) finished unsuccessfully.

After a bit of digging into what this means and why it occurs in this particular case, it crystallized as a memory issue (especially after checking live metrics, such as Ganglia, during the model's training stage, where memory was completely maxed out).

However, the dataset is nowhere near huge: only about 20k rows, where the feature column for the model is an IDF matrix built from count vectors created from an input text column (each row holds roughly 400-600 words of variable character length). The odd part is that even though the final pipeline is quite shallow and contains (for testing purposes) only the default stages typically used for a task of this kind, without any custom additions such as hand-written transformers, memory usage skyrockets anyway. There are additional transformers prepared to be put into the pipeline, but I have not used them yet, so a potential issue with their implementation can be ruled out.
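For scale, here is a rough, hypothetical way to estimate how wide those IDF vectors end up, using the same df and its Text column; the vocabulary size, not the row count, determines the width of every feature vector, and the simple whitespace split below only approximates what the Tokenizer stage does:

from pyspark.sql import functions as F

# Rough estimate of the vocabulary size, i.e. the width of each IDF feature
# vector; a whitespace split approximating the Tokenizer stage shown further down.
approx_vocab_size = (
    df.select(F.explode(F.split(F.lower(F.col("Text")), r"\s+")).alias("token"))
      .where(F.col("token") != "")
      .distinct()
      .count()
)
print(f"Approximate feature vector width: {approx_vocab_size}")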

I have tried a variety of things:

The only things that effectively reduced the memory usage were to drastically water down the dataset (to the extreme of dropping it to just 3k rows) or to make the XGBoost model very shallow (a maximum depth of 2). No other intervention was effective, or rather effective enough, to notably ease the memory pressure.
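For reference, the "shallow XGBoost" variant that did help looks roughly like this (a sketch only; max_depth and other native booster parameters can be passed directly to the Spark estimator, and the column names mirror the snippet further down):

from xgboost.spark import SparkXGBClassifier

# Very shallow booster: the only effective knob besides shrinking the data
xgboost_shallow = SparkXGBClassifier(
    features_col="vectorizedFeatures",
    label_col="label",
    num_workers=2,
    max_depth=2  # drastically limits tree depth, and with it memory usage
)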

Despite all these changes, everything led me to the conclusion that the issue most likely lies either in XGBoost itself (in the form of poor settings of the model or the cluster) or in a bad general approach to this use case (such as using inadequate stages in the pipeline, or simply making a plain mistake somewhere).

To make this clearer, below is the code snippet used for the problem described:


from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer, IDF
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from xgboost.spark import SparkXGBClassifier

xgboost = SparkXGBClassifier(
    features_col="vectorizedFeatures",
    label_col="label",
    num_workers=2  # set to 2 because of 2 worker nodes
)

train_data, validation_data = df.randomSplit([0.8, 0.2], seed=42)

# Tokenization into separate words
tokenizer = Tokenizer(inputCol="Text", outputCol="tokens")

# Term counts, label encoding and TF-IDF weighting
vectorizer = CountVectorizer(inputCol="tokens", outputCol="raw_features")
labelEncoder = StringIndexer(inputCol="category", outputCol="label", handleInvalid="keep")
idf = IDF(inputCol="raw_features", outputCol="vectorizedFeatures")

# Pipeline defining the order of the dataflow through the model
pipeline = Pipeline().setStages([tokenizer, vectorizer, labelEncoder, idf, xgboost])

# Definition of an evaluator
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

# Fitting the model
model = pipeline.fit(train_data)
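For completeness, the evaluator is defined but never applied in the snippet above; if the fit succeeded, it would typically be used roughly like this:

# Score the held-out split with the fitted pipeline, then evaluate accuracy
predictions = model.transform(validation_data)
accuracy = evaluator.evaluate(predictions)
print(f"Validation accuracy: {accuracy:.3f}")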

Also, a side note: when checking resource usage, I noticed that in both cases (the reduced one with few rows, run to see if it at least works, and the full one that fails with the OOM error) the CPU usage was quite low, averaging only around 15% on each worker node.

Any ideas what might have gone wrong? Why is memory thrashed so heavily while the CPU is not even a quarter utilized?


Solution

  • I also faced the same issue. A little bit of digging led me to this: https://github.com/dmlc/xgboost/issues/4826

    This indicates that XGBoost kills the SparkContext in case of a failure, which might be why your program exits. I faced this issue while tuning the parameters. The solution mentioned there is to set the parameter kill_spark_context_on_worker_failure to False. Hope this helps with your issue.
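    A minimal sketch of how that might look, assuming your XGBoost version exposes the parameter on the estimator (the linked issue concerns the xgboost4j-spark API, so check whether and under what name the Python xgboost.spark estimator accepts it in your release):

    from xgboost.spark import SparkXGBClassifier

    # Assumption: kill_spark_context_on_worker_failure is accepted by this
    # estimator in your XGBoost version; the name comes from the linked issue.
    xgboost = SparkXGBClassifier(
        features_col="vectorizedFeatures",
        label_col="label",
        num_workers=2,
        kill_spark_context_on_worker_failure=False  # keep the SparkContext alive if a worker fails
    )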