How to generate a counter for finding a hash with 9 leading zeroes

I'm trying to create a function that will generate a hash using sha1 algorithm with 9 leading zeroes. The hash is based on some random data and, like in concurrency mining, I just want to add 1 to the string that is used in the hash function.

For this to be faster I used map() from the Pool class to make it run on all my cores, but I have an issue if I pass a chunk larger than range(99999999)

def computesha(counter):
        hash = 'somedata'+'otherdata'+str(counter)
        newHash = hashlib.sha1(hash.encode()).hexdigest()     
        if newHash[:9] == '000000000':       
            print(str(newHash))
            print(str(counter))
            return str(newHash), str(counter)   

if __name__ == '__main__':

    d1 = datetime.datetime.now()
    print("Start timestamp" + str(d1))

    manager = multiprocessing.Manager()
    return_dict = manager.dict()

    p = Pool()
    p.map(computesha, range(sys.maxsize) )
    print(return_dict)
    p.close()
    p.join()

    d2 = datetime.datetime.now()  
    print("End timestamp " + str(d2))
    print("Elapsed time: " + str((d2-d1)))

I want to create something similar to a global counter to feed it into the function while it is running multi-threaded, but if I try range(sys.maxsize) I get a MemoryError (I know, because i don't have enough RAM, few have), but I want to split the list generated by range() into chunks. Is this possible or should I try a different approach?

Solution

Hi Alin and welcome to stackoverflow.

Firstly, yes, a global counter is possible. E.g with a multiprocessing.Queue or a multiprocessing.Value which is passed to the workers. However, fetching a new number from the global counter would result in locking (and possibly waiting for) the counter. This can and should be avoided, as you need to make A LOT of counter queries. My proposed solution below avoids the global counter by installing several local counters which work together as if they were a single global counter.

Regarding the RAM consumption of your code, I see two problems:

computesha returns a None value most of the time. This goes into the iterator which is created by map (even though you do not assign the return value of map). This means, that the iterator is a lot bigger than necessary.
Generally speaking, the RAM of a process is freed, after the process finishes. Your processes start A LOT of tasks which all reserve their own memory. A possible solution is the maxtasksperchild option (see the documentation of multiprocessing.pool.Pool). When you set this option to 1000, it closes the process after 1000 task and creates a new one, which frees the memory.

However, i'd like to propose a different solution which solves both problems, is very memory-friendly and runs faster (as it seems to me after N<10 tests) as the solution with the maxtasksperchild option:

#!/usr/bin/env python3
import datetime
import multiprocessing
import hashlib
import sys

def computesha(process_number, number_of_processes, max_counter, results):
    counter = process_number # every process starts with a different counter
    data = 'somedata' + 'otherdata'

    while counter < max_counter: #stop after max_counter jobs have been started
        hash = "".join((data,str(counter)))
        newHash = hashlib.sha1(hash.encode()).hexdigest()
        if newHash[:9] == '000000000':
            print(str(newHash))
            print(str(counter))

            # return the results through a queue
            results.put((str(newHash), str(counter)))
        counter += number_of_processes # 'jump' to the next chunk

if __name__ == '__main__':
    # execute this file with two command line arguments:
    number_of_processes = int(sys.argv[1])
    max_counter = int(sys.argv[2])

    # this queue will be used to collect the results after the jobs finished
    results = multiprocessing.Queue()

    processes = []
    # start a number of processes...
    for i in range(number_of_processes):
        p = multiprocessing.Process(target=computesha, args=(i,
                                                             number_of_processes,
                                                             max_counter,
                                                             results))
        p.start()
        processes.append(p)

    # ... then wait for all processes to end
    for p in processes:
        p.join()

    # collect results
    while not results.empty():
        print(results.get())
    results.close()

This code spawns the desired number_of_processes which then call the computesha function. If number_of_processes=8 then the first process calculates the hash for the counter values [0,8,16,24,...], the second process for [1,9,17,25] and so on.

The advantages of this approach: In each iteration of the while loop the memory of hash, and newHash can be reused, loops are cheaper than functions and only number_of_processes function calls have to be made, and the uninteresting results are simply forgotten.

A possible disadvantage is, that the counters are completely independent and every process will do exactly 1/number_of_processes of the overall work, even if the some are faster than others. Eventually, the program is as fast as the slowest process. I did't measure it, but I guess it is a rather theoretical problem here.

Hope that helps!