I have a hashes.txt file that stores strings and their compressed SHA-256 hash values. Each line in the file is formatted as follows:

<compressed_hash>:<original_string>

The compressed_hash is created by taking the 6th, 13th, 20th, and 27th characters of the full SHA-256 hash. For example, the string alon, whose full hash is:

5a24f03a01d5b10cab6124f3c0e7086994ac9c869fc8e76e1463458f829fc864

would be stored as:

0db3:alon

I have a search.py script that works like this: if the user inputs 5a24f03a01d5b10cab6124f3c0e7086994ac9c869fc8e76e1463458f829fc864, the script searches hashes.txt for its shortened form, 0db3. If multiple matches are found, like:

0db3:alon
0db3:apple

the script rehashes each match (alon, apple) to get its full SHA-256 hash, and if one of them (e.g. alon) hashes to the user's input (5a24f03a01d5b10cab6124f3c0e7086994ac9c869fc8e76e1463458f829fc864), the script prints that string (alon).
The problem with this script is that a search usually takes around 1 hour, and my hashes.txt is 54 GB. Here is search.py:
import hashlib
import mmap

def compress_hash(hash_value):
    return hash_value[6] + hash_value[13] + hash_value[20] + hash_value[27]

def search_compressed_hash(hash_input, compressed_file):
    compressed_input = compress_hash(hash_input)
    potential_matches = []
    with open(compressed_file, "r+b") as file:
        # Memory-map the file, size 0 means the whole file
        mmapped_file = mmap.mmap(file.fileno(), 0)
        # Read through the memory-mapped file line by line
        for line in iter(mmapped_file.readline, b""):
            line = line.decode().strip()
            parts = line.split(":", 1)  # Split only on the first colon
            if len(parts) == 2:  # Ensure there are exactly two parts
                compressed_hash, string = parts
                if compressed_hash == compressed_input:
                    potential_matches.append(string)
        mmapped_file.close()
    return potential_matches

def verify_full_hash(potential_matches, hash_input):
    for string in potential_matches:
        if hashlib.sha256(string.encode()).hexdigest() == hash_input:
            return string
    return None

if __name__ == "__main__":
    while True:
        hash_input = input("Enter the hash (or type 'exit' to quit): ")
        if hash_input.lower() == 'exit':
            break
        potential_matches = search_compressed_hash(hash_input, "hashes.txt")
        found_string = verify_full_hash(potential_matches, hash_input)
        if found_string:
            print(f"Corresponding string: {found_string}")
        else:
            print("String not found for the given hash.")
And, if it helps, here's the hash.py script that actually generates the strings and hashes and puts them in hashes.txt:
import hashlib
import sys
import time

# Set the interval for saving progress (in seconds)
SAVE_INTERVAL = 60  # Save progress every minute
BUFFER_SIZE = 1000000  # Number of hashes to buffer before writing to file

def generate_hash(string):
    return hashlib.sha256(string.encode()).hexdigest()

def compress_hash(hash_value):
    return hash_value[6] + hash_value[13] + hash_value[20] + hash_value[27]

def write_hashes_to_file(start_length):
    buffer = []  # Buffer to store generated hashes
    last_save_time = time.time()  # Store the last save time
    for generated_string in generate_strings_and_hashes(start_length):
        full_hash = generate_hash(generated_string)
        compressed_hash = compress_hash(full_hash)
        buffer.append((compressed_hash, generated_string))
        if len(buffer) >= BUFFER_SIZE:
            save_buffer_to_file(buffer)
            buffer = []  # Clear the buffer after writing to file
        # Check if it's time to save progress
        if time.time() - last_save_time >= SAVE_INTERVAL:
            print("Saving progress...")
            save_buffer_to_file(buffer)  # Save any remaining hashes in buffer
            buffer = []  # Clear buffer after saving
            last_save_time = time.time()
    # Save any remaining hashes in buffer
    if buffer:
        save_buffer_to_file(buffer)

def save_buffer_to_file(buffer):
    with open("hashes.txt", "a") as file_hashes:
        file_hashes.writelines(f"{compressed_hash}:{generated_string}\n" for compressed_hash, generated_string in buffer)

def generate_strings_and_hashes(start_length):
    for length in range(start_length, sys.maxsize):  # Use sys.maxsize to simulate infinity
        current_string = [' '] * length  # Initialize with spaces
        while True:
            yield ''.join(current_string)
            if current_string == ['z'] * length:  # Stop when all characters reach 'z'
                break
            current_string = increment_string(current_string)

def increment_string(string_list):
    index = len(string_list) - 1
    while index >= 0:
        if string_list[index] == 'z':
            string_list[index] = ' '
            index -= 1
        else:
            string_list[index] = chr(ord(string_list[index]) + 1)
            break
    if index < 0:
        string_list.insert(0, ' ')
    return string_list

def load_progress():
    # You may not need this function anymore
    return 1  # Just return a default value

if __name__ == "__main__":
    write_hashes_to_file(load_progress())
My OS is Windows 10.
Your compressed key is 4 hex characters, so you only have 65536 (16^4) unique search keys. Why not just make 65536 files? You could use 256 directories with 256 files each to keep it manageable.
Then you can entirely eliminate all of the lookup process. Bonus: your files are smaller because you don’t need to store the key at all.
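For illustration, a lookup then boils down to opening one small file and rehashing its lines. Here is a minimal sketch, assuming a hashes/ root directory with the first two characters of the 4-character key as the subdirectory name and the last two as the file name (that layout, and the lookup() helper, are my own choices for the example, not something your scripts already define):

import hashlib
import os

def compress_hash(hash_value):
    # Same 4-character key your scripts already use
    return hash_value[6] + hash_value[13] + hash_value[20] + hash_value[27]

def lookup(hash_input, root="hashes"):
    key = compress_hash(hash_input)
    # e.g. key "0db3" -> hashes/0d/b3.txt (assumed layout)
    bucket = os.path.join(root, key[:2], key[2:] + ".txt")
    if not os.path.exists(bucket):
        return None
    with open(bucket, "r") as f:
        for line in f:
            candidate = line.rstrip("\n")
            # Verify against the full hash, exactly as search.py does now
            if hashlib.sha256(candidate.encode()).hexdigest() == hash_input:
                return candidate
    return None

Each bucket holds only the strings whose key matches the file name, so nothing scans a 54 GB file anymore; the cost per query is one open() plus rehashing a handful of candidates.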
You’ll have to do some one-time processing to split your big hash file into smaller files, of course, but your lookups should be effectively instant.
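The one-time split could look roughly like this (again just a sketch under the same assumed hashes/<dir>/<file>.txt layout; it streams the existing hashes.txt once, buffers lines in memory, and appends each string, without its key, to the matching bucket file):

import os
from collections import defaultdict

SRC = "hashes.txt"
DEST = "hashes"          # root directory for the 256 x 256 bucket files
FLUSH_EVERY = 1_000_000  # lines to buffer in memory before writing out

def bucket_path(key):
    # key is the existing 4-character compressed hash, e.g. "0db3"
    return os.path.join(DEST, key[:2], key[2:] + ".txt")

def flush(buffers):
    for key, strings in buffers.items():
        path = bucket_path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "a") as dst:
            dst.write("\n".join(strings) + "\n")

def split_hashes():
    buffers = defaultdict(list)
    pending = 0
    with open(SRC, "r") as src:
        for line in src:
            key, _, string = line.rstrip("\n").partition(":")
            if len(key) != 4 or not string:
                continue  # skip malformed lines
            buffers[key].append(string)  # the key itself is not written to the bucket
            pending += 1
            if pending >= FLUSH_EVERY:
                flush(buffers)
                buffers.clear()
                pending = 0
    flush(buffers)  # write out whatever is left in the buffers

if __name__ == "__main__":
    split_hashes()

Run it once, point the lookup at the hashes/ directory, and the per-query work no longer depends on the size of the original file.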