I am currently studying the use of Apache Spark 3.0 with RAPIDS GPU acceleration. In the official spark-rapids
docs I came across this page, which states:
There are cases where you may want to get access to the raw data on the GPU, preferably without copying it. One use case for this is exporting the data to an ML framework after doing feature extraction.
To me this sounds as if one could make data that is already on the GPU from some upstream Spark ETL process directly available to a framework such as TensorFlow or PyTorch. If this is the case, how can I access the data from within any of these frameworks? If I am misunderstanding something here, what exactly is the quote referring to?
The page you reference really only lets you get access to the data while it is still sitting on the GPU, but using that data in another framework, like TensorFlow or PyTorch, is not that simple.
TL;DR: Unless you have a library explicitly set up to work with the RAPIDS Accelerator, you probably want to run your ETL with RAPIDS, save the result, and launch a new job to train your models using that data, as sketched below.
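A minimal sketch of that two-job workflow, assuming the RAPIDS Accelerator plugin is enabled on the cluster; the file paths, column names, and feature-extraction step here are all hypothetical:

```python
# Job 1: GPU-accelerated ETL (runs with spark.plugins=com.nvidia.spark.SQLPlugin),
# writing the extracted features to Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rapids-etl").getOrCreate()

raw = spark.read.parquet("/data/raw")  # hypothetical input path
features = (
    raw.withColumn("ratio", F.col("a") / F.col("b"))  # hypothetical feature extraction
       .select("label", "ratio", "a", "b")
)
features.write.mode("overwrite").parquet("/data/features")
```

```python
# Job 2: a separate training process reads the saved features back in.
# This toy version collects everything into host memory; a real pipeline
# would stream the Parquet files through a Dataset/DataLoader.
import pandas as pd
import torch

df = pd.read_parquet("/data/features")  # hypothetical path from Job 1
X = torch.tensor(df[["ratio", "a", "b"]].values, dtype=torch.float32)
y = torch.tensor(df["label"].values, dtype=torch.float32)
# ...build a DataLoader and train as usual...
```

The save/reload step costs an extra round trip through storage, but it decouples the GPU ETL job from the training job, so each can be configured and scaled independently.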
There are still a number of issues that you would need to solve. We have worked through these for XGBoost, but we have not yet tried to tackle them for TensorFlow or PyTorch.
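For reference, the XGBoost path looks roughly like this. This is a sketch assuming the PySpark estimator shipped with xgboost (2.0 or later, built with GPU support); the paths and column names are hypothetical:

```python
# Sketch of a library that is already set up to train on GPU data in Spark.
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("xgb-gpu-train").getOrCreate()
train = spark.read.parquet("/data/features")  # hypothetical path from the ETL job

classifier = SparkXGBClassifier(
    features_col=["ratio", "a", "b"],  # passing a list of columns avoids VectorAssembler
    label_col="label",
    device="cuda",                     # train on the GPU
)
model = classifier.fit(train)
```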
The big issues are: