I'm trying to understand what would be a good framework that integrates easily with existing Python code and allows distributing a huge dataset across multiple worker nodes to perform some transformation or operation on it.
The expectation is that each worker node is assigned data based on a specific key (here, the country field in the transaction data below), performs the required transformation, and returns its results to the leader node.
Finally, the leader node should aggregate the results obtained from the worker nodes and return one final result.
transactions = [
{'name': 'A', 'amount': 100, 'country': 'C1'},
{'name': 'B', 'amount': 200, 'country': 'C2'},
{'name': 'C', 'amount': 10, 'country': 'C1'},
{'name': 'D', 'amount': 500, 'country': 'C2'},
{'name': 'E', 'amount': 400, 'country': 'C3'},
]
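For concreteness, the kind of per-key result I'm after looks like the following single-machine version, where the simple sum is just a stand-in for the real transformation:

from collections import defaultdict

# per-country totals; the sum stands in for the real per-worker transformation
totals = defaultdict(int)
for t in transactions:
    totals[t['country']] += t['amount']
# totals == {'C1': 110, 'C2': 700, 'C3': 400}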
I came across a similar question where Ray is suggested as an option, but does Ray allow defining specifically which worker gets the data based on a key?
Another question talks about using PySpark for this, but then how do you make the existing Python code work with PySpark with minimal code change, since PySpark has its own APIs?
Based on your question and the posts you cited, your post actually covers a few questions:

Ray

Does it allow defining specifically which worker gets the data based on a key: To be honest, I haven't used Ray before, so I can't answer this definitively at the moment. I did take a look at their whitepaper, and the architecture looks similar to other modern distributed processing frameworks, which use a driver / head / coordinator to dispatch tasks to different nodes. I suspect it can achieve the key-based assignment you describe, but again, I'm not sure.
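That said, Ray's task API suggests a pattern like the following: group the records by key on the driver and hand each group to a remote task. An untested sketch, where the function name process_group and the per-group sum are my own stand-ins, and where which physical node runs each task is still left to Ray's scheduler:

import ray
from collections import defaultdict

ray.init()

@ray.remote
def process_group(country, records):
    # stand-in per-worker transformation: sum the amounts for one country
    return country, sum(r['amount'] for r in records)

# group the records by key on the driver
groups = defaultdict(list)
for t in transactions:
    groups[t['country']].append(t)

# one remote task per key; the driver aggregates the returned results
futures = [process_group.remote(country, records) for country, records in groups.items()]
results = dict(ray.get(futures))  # {'C1': 110, 'C2': 700, 'C3': 400}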
PySpark

Achieve the aggregation with minimal code change: I'm not sure what your current pipeline looks like in Python code. Assuming you're using the pandas library, PySpark actually has a pandas-on-Spark API (https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html), so it should only require minimal code change. Even if you're not using pandas but pure Python logic, thanks to the Spark contributors there is a very convenient DataFrame API. For example, if you want to perform the aggregation with the Spark SQL functions API:

df.groupBy('country').agg(func.count('name').alias('name_count'), func.sum('amount').alias('amount_sum'))

The coding is the simple part; Spark performance tuning is the critical part of utilizing and optimizing your resources.
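To make that concrete, here is a short, self-contained sketch against the sample transactions from the question (the app name and the count/sum aggregation are just illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.appName('transactions_demo').getOrCreate()

# Spark infers the schema from the list of dicts shown in the question
df = spark.createDataFrame(transactions)

result = df.groupBy('country').agg(
    func.count('name').alias('name_count'),
    func.sum('amount').alias('amount_sum'),
)
result.show()
# expected rows: C1 -> (2, 110), C2 -> (2, 700), C3 -> (1, 400)

Note that Spark handles the partitioning by key itself: groupBy shuffles rows with the same country to the same partition, so you don't assign data to workers manually.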