I'm writing a script that calls the same function several times with different parameters, then saves the results to an Excel spreadsheet. To do this I'm using multiprocessing.Pool, and I need to store the parameters in a list. The problem is that when I pass the parameters to the Pool, each worker gets a copy of the object. I assume the issue is with map(), but I'm not sure how to fix it.
Here is a working example to test the issue:
import sys
import multiprocessing
import xlwt
from xlwt import Workbook

def get_prices_in_range(varss):
    row = varss[0]
    col = varss[2]
    start_day = varss[1]
    sheet = varss[3]
    print(sheet)
    try:
        sheet.write(row, col, start_day)
    except:
        pass
if __name__ == '__main__':
    # multiprocessing pool object
    pool = multiprocessing.Pool()
    # pool object with a fixed number of processes
    pool = multiprocessing.Pool(processes=2)

    # create sheet in excel
    wb = Workbook()
    sheet1 = wb.add_sheet("Sheet 1")
    print(sheet1)

    # input list: extend will copy sheet1! not a reference
    inputs = [[x] for x in range(2)]
    for x in inputs:
        x.extend([1, 2, sheet1])
    # they've been added by reference:
    print(inputs)

    # this doesn't work: sheet1 never gets written.
    pool.map(get_prices_in_range, inputs)

    # this works:
    #get_prices_in_range([0, 1, 2, sheet1])
    #get_prices_in_range([1, 1, 2, sheet1])

    wb.save("test.xls")
You can see that every print(sheet) shows a different object from print(inputs) when the code runs as-is, but the same object if you comment out pool.map(get_prices_in_range, inputs) and uncomment the next two lines.
Any idea how to fix this issue? And happy new year!
The issue arises because multiprocessing.Pool creates separate processes, and objects passed between processes are serialized (pickled) and deserialized. During that round trip the reference to the original object (sheet1) is lost, so each worker process gets its own copy of the object.
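You can reproduce that copy behavior without a Pool at all: pool.map does roughly the following pickle round trip with every argument it sends to a worker. A minimal sketch (Sheet is just a stand-in class for illustration, not part of xlwt):

```python
import pickle

class Sheet:
    """Stand-in for an xlwt sheet, just for illustration."""
    pass

original = Sheet()
# pool.map effectively does this with every argument sent to a worker:
clone = pickle.loads(pickle.dumps(original))
print(original is clone)  # False: the worker sees a distinct copy
```

Any write the worker performs lands on that distinct copy, which is discarded when the worker finishes, so the original sheet1 in the parent process never changes.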
To solve this problem, collect the results in the main process instead of trying to write directly to sheet1 in the worker processes: have each worker return the data it processes, and let the main process handle writing to sheet1. (pool.map already gathers those return values for you, so you don't even need an explicit multiprocessing.Queue.)
Here is the updated code that should work. I've added comments where I have made the changes:
import multiprocessing
import xlwt
from xlwt import Workbook

def get_prices_in_range(varss):
    # Worker function: return the data instead of modifying the sheet
    row, col, start_day = varss[0], varss[1], varss[2]
    return (row, col, start_day)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)

    # Create workbook and sheet in the main process
    wb = Workbook()
    sheet1 = wb.add_sheet("Sheet 1")

    # Prepare inputs; removed `sheet1` since workers no longer modify it directly
    inputs = [[x, 1, 2] for x in range(2)]

    # Collect results from workers instead of sharing the sheet
    results = pool.map(get_prices_in_range, inputs)

    # Write results to the sheet in the main process
    for row, col, start_day in results:
        sheet1.write(row, col, start_day)

    wb.save("test.xls")
Let me know if this works for you.