I'm trying to create a function that will generate a hash using sha1 algorithm with 9 leading zeroes. The hash is based on some random data and, like in concurrency mining, I just want to add 1 to the string that is used in the hash function.
For this to be faster I used map() from the Pool class to make it run on all my cores, but I have an issue if I pass a chunk larger than range(99999999)
def computesha(counter):
hash = 'somedata'+'otherdata'+str(counter)
newHash = hashlib.sha1(hash.encode()).hexdigest()
if newHash[:9] == '000000000':
print(str(newHash))
print(str(counter))
return str(newHash), str(counter)
if __name__ == '__main__':
d1 = datetime.datetime.now()
print("Start timestamp" + str(d1))
manager = multiprocessing.Manager()
return_dict = manager.dict()
p = Pool()
p.map(computesha, range(sys.maxsize) )
print(return_dict)
p.close()
p.join()
d2 = datetime.datetime.now()
print("End timestamp " + str(d2))
print("Elapsed time: " + str((d2-d1)))
I want to create something similar to a global counter to feed it into the function while it is running multi-threaded, but if I try range(sys.maxsize) I get a MemoryError (I know, because i don't have enough RAM, few have), but I want to split the list generated by range() into chunks. Is this possible or should I try a different approach?
Hi Alin and welcome to stackoverflow.
Firstly, yes, a global counter is possible. E.g with a multiprocessing.Queue or a multiprocessing.Value which is passed to the workers. However, fetching a new number from the global counter would result in locking (and possibly waiting for) the counter. This can and should be avoided, as you need to make A LOT of counter queries. My proposed solution below avoids the global counter by installing several local counters which work together as if they were a single global counter.
Regarding the RAM consumption of your code, I see two problems:
computesha
returns a None
value most of the time. This goes into the iterator which is created by map
(even though you do not assign the return value of map
). This means, that the iterator is a lot bigger than necessary.maxtasksperchild
option (see the documentation of multiprocessing.pool.Pool). When you set this option to 1000, it closes the process after 1000 task and creates a new one, which frees the memory.However, i'd like to propose a different solution which solves both problems, is very memory-friendly and runs faster (as it seems to me after N<10 tests) as the solution with the maxtasksperchild
option:
#!/usr/bin/env python3
import datetime
import multiprocessing
import hashlib
import sys
def computesha(process_number, number_of_processes, max_counter, results):
counter = process_number # every process starts with a different counter
data = 'somedata' + 'otherdata'
while counter < max_counter: #stop after max_counter jobs have been started
hash = "".join((data,str(counter)))
newHash = hashlib.sha1(hash.encode()).hexdigest()
if newHash[:9] == '000000000':
print(str(newHash))
print(str(counter))
# return the results through a queue
results.put((str(newHash), str(counter)))
counter += number_of_processes # 'jump' to the next chunk
if __name__ == '__main__':
# execute this file with two command line arguments:
number_of_processes = int(sys.argv[1])
max_counter = int(sys.argv[2])
# this queue will be used to collect the results after the jobs finished
results = multiprocessing.Queue()
processes = []
# start a number of processes...
for i in range(number_of_processes):
p = multiprocessing.Process(target=computesha, args=(i,
number_of_processes,
max_counter,
results))
p.start()
processes.append(p)
# ... then wait for all processes to end
for p in processes:
p.join()
# collect results
while not results.empty():
print(results.get())
results.close()
This code spawns the desired number_of_processes
which then call the computesha
function. If number_of_processes=8
then the first process calculates the hash for the counter values [0,8,16,24,...]
, the second process for [1,9,17,25]
and so on.
The advantages of this approach: In each iteration of the while loop the memory of hash
, and newHash
can be reused, loops are cheaper than functions and only number_of_processes
function calls have to be made, and the uninteresting results are simply forgotten.
A possible disadvantage is, that the counters are completely independent and every process will do exactly 1/number_of_processes
of the overall work, even if the some are faster than others. Eventually, the program is as fast as the slowest process. I did't measure it, but I guess it is a rather theoretical problem here.
Hope that helps!