python, pyspark, distributed-computing, ray

Distributing a Python function across multiple worker nodes


I'm trying to understand what would be a good framework that integrates easily with existing Python code and allows distributing a huge dataset across multiple worker nodes to perform some transformation or operation on it.

The expectation is that each worker node should be assigned data based on a specific key (here, country, as given in the transaction data below), where the worker performs the required transformation and returns the results to the leader node.

Finally, the leader node should perform an aggregation of the results obtained from the worker nodes and return one final result.

transactions = [
    {'name': 'A', 'amount': 100, 'country': 'C1'},
    {'name': 'B', 'amount': 200, 'country': 'C2'},
    {'name': 'C', 'amount': 10, 'country': 'C1'},
    {'name': 'D', 'amount': 500, 'country': 'C2'},
    {'name': 'E', 'amount': 400, 'country': 'C3'},
]

I came across a similar question where Ray is suggested as an option, but does Ray allow specifying which worker gets the data based on a key?
Another question talks about using PySpark for this, but then how do you make the existing Python code work with PySpark with minimal code change, since PySpark has its own APIs?


Solution

  • Based on your question and the posts you cited, your post actually covers three questions:

    1. Data assignment to different nodes based on a specific key: Since you have multiple worker nodes, they form a cluster for parallel computing. In most distributed data processing / query engines, such as Spark and Trino, you can't pin a specific key's data to a dedicated worker node, but you can partition the data evenly so that each worker node takes some partitions and processes them in parallel to increase speed. Taking Spark as an example, it can repartition data based on your partition key and the number of partitions (see the PySpark repartition sketch after this list). But the real question is: how does your data partitioning strategy help optimize and utilize your cluster resources and computation speed?
    2. Does Ray allow defining specifically which worker gets the data based on a key: To be honest, I haven't used Ray before, so I can't answer this with certainty, but I took a look at its whitepaper and the architecture looks similar to modern distributed processing frameworks, which use a driver / head / coordinator to dispatch tasks to different nodes. I suspect it can achieve this, but again, I'm not sure (a hedged Ray sketch follows this list).
    3. How PySpark achieves aggregation with minimal code change: I'm not sure what your current Python pipeline looks like. Assuming you're using the Pandas library, PySpark has a pandas-on-Spark API (https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html), so the code change should be minimal. Even if you're not using Pandas but pure Python logic, thanks to the Spark contributors there is a very convenient Spark API. For example, if you want to perform the aggregation with the Spark SQL DataFrame API: df.groupBy('country').agg(func.count('name').alias('name_count'), func.sum('amount').alias('amount_sum')) (a runnable version follows this list). The coding is simple, but again, Spark performance tuning is the critical part of utilizing and optimizing your resources.
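
For point 1, here is a minimal PySpark sketch of key-based partitioning; the session name is illustrative, and repartition('country') only guarantees that rows with the same country end up in the same partition, not on a hand-picked worker:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('partition-by-country').getOrCreate()

transactions = [
    {'name': 'A', 'amount': 100, 'country': 'C1'},
    {'name': 'B', 'amount': 200, 'country': 'C2'},
    {'name': 'C', 'amount': 10, 'country': 'C1'},
    {'name': 'D', 'amount': 500, 'country': 'C2'},
    {'name': 'E', 'amount': 400, 'country': 'C3'},
]

df = spark.createDataFrame(transactions)

# Hash-partition by 'country' so rows sharing a country land in the same
# partition; each partition is then processed by one task in parallel.
partitioned = df.repartition('country')
print(partitioned.rdd.getNumPartitions())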
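
For point 2, I can't confirm whether Ray lets you pin a key to a specific worker, but a minimal sketch of per-key parallelism using Ray's core task API (grouping on the driver and the sum_amount function are purely illustrative) could look like this:

import ray
from collections import defaultdict

ray.init()

transactions = [
    {'name': 'A', 'amount': 100, 'country': 'C1'},
    {'name': 'B', 'amount': 200, 'country': 'C2'},
    {'name': 'C', 'amount': 10, 'country': 'C1'},
    {'name': 'D', 'amount': 500, 'country': 'C2'},
    {'name': 'E', 'amount': 400, 'country': 'C3'},
]

# Group records by key on the driver, then submit one remote task per key.
# Ray's scheduler decides which worker runs each task, so this gives
# per-key parallelism rather than a guaranteed key-to-worker mapping.
groups = defaultdict(list)
for t in transactions:
    groups[t['country']].append(t)

@ray.remote
def sum_amount(records):
    return records[0]['country'], sum(r['amount'] for r in records)

futures = [sum_amount.remote(recs) for recs in groups.values()]
result = dict(ray.get(futures))  # the driver aggregates the per-key results
print(result)                    # {'C1': 110, 'C2': 700, 'C3': 400}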
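
And for point 3, the same aggregation as a self-contained script (the session setup is illustrative):

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.appName('aggregate-by-country').getOrCreate()

transactions = [
    {'name': 'A', 'amount': 100, 'country': 'C1'},
    {'name': 'B', 'amount': 200, 'country': 'C2'},
    {'name': 'C', 'amount': 10, 'country': 'C1'},
    {'name': 'D', 'amount': 500, 'country': 'C2'},
    {'name': 'E', 'amount': 400, 'country': 'C3'},
]
df = spark.createDataFrame(transactions)

# Count of names and sum of amounts per country.
result = df.groupBy('country').agg(
    func.count('name').alias('name_count'),
    func.sum('amount').alias('amount_sum'),
)
result.show()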