tensorflow-transform

Is it possible to run tf.Transform on Spark?


tf.Transform is handy for feature processing, but it is not efficient on large datasets without distributed computation. tf.Transform runs on Beam, which to my understanding can use multiple runners such as Dataflow, the Spark runner, etc., but I can't find any example of running tf.Transform on Spark. Is this supported at the moment?


Solution

  • I don't think you can run tf.Transform on Spark yet.

    tf.Transform is written in Python, and Beam's Spark runner only supports Java. AFAIK only Google's Cloud Dataflow runner works with Python and tf.Transform. There is one article that mentions PySpark, but I'm not sure how that fits in.

    There is ongoing Beam runner development, and the one that is furthest along is probably the Flink runner, which supports the Python SDK, but it is still under development, and support and examples are very sparse. Here is a Stack Overflow post about setting it up.
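To make the runner point concrete: in Beam's Python SDK the runner is normally chosen with a `--runner` pipeline option, so the tf.Transform code itself should not need to change when the runner does. Below is a minimal sketch of assembling those options. The helper `beam_runner_flags` is hypothetical (it is not part of Beam or tf.Transform), and the project, region, and bucket values are placeholders. It only builds the flag list, so it runs without Beam installed.

```python
def beam_runner_flags(runner, **options):
    """Build Beam-style command-line flags for a given runner.

    Hypothetical helper for illustration only; not a Beam API.
    """
    flags = [f"--runner={runner}"]
    # Sort the remaining options so the output order is deterministic.
    for name, value in sorted(options.items()):
        flags.append(f"--{name}={value}")
    return flags


# Local execution on a single machine (fine for small datasets).
local_flags = beam_runner_flags("DirectRunner")

# Distributed execution on Cloud Dataflow. All values are placeholders.
dataflow_flags = beam_runner_flags(
    "DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

print(local_flags)
print(dataflow_flags)
```

With Beam installed, a list like this would typically be passed to `PipelineOptions(flags)` from `apache_beam.options.pipeline_options`, and the resulting options handed to the `beam.Pipeline` that runs the tf.Transform analysis. Whether a given runner can actually execute a Python pipeline is a separate question, which is the limitation described above.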