python optimization binary-data hard-drive raid

What's the most efficient way to process massive amounts of data from a disk using Python?


I am writing a simple Python script to read and reconstruct data from a failed RAID5 array that I have been unable to rebuild any other way. The script works, but slowly. My original version ran at about 80MB/min; I have since improved it to 550MB/min, but that still seems a bit low. The Python process sits at 100% CPU, so it appears to be CPU-limited rather than disk-limited, which means there is room for optimization. Because the script is so short, I am unable to profile it effectively, so I don't know what is eating the time. Here is the script as it stands right now (or at least the important bits):

disk0chunk = disk0.read(chunkSize)
#disk1 is missing, bad firmware
disk2chunk = disk2.read(chunkSize)
disk3chunk = disk3.read(chunkSize)
if (parityDisk % 4 == 1): #if the parity stripe is on the missing drive
  output.write(disk0chunk + disk2chunk + disk3chunk)
else: #we need to rebuild the data in disk1
  # disk0num = map(ord, disk0chunk) #inefficient, old code
  # disk2num = map(ord, disk2chunk) #inefficient, old code
  # disk3num = map(ord, disk3chunk) #inefficient, old code
  disk0num = struct.unpack("16384l", disk0chunk) #more efficient new code
  disk2num = struct.unpack("16384l", disk2chunk) #more efficient new code
  disk3num = struct.unpack("16384l", disk3chunk) #more efficient new code
  magicpotato = zip(disk0num,disk2num,disk3num)
  disk1num = map(takexor, magicpotato)
  # disk1bytes = map(chr, disk1num) #inefficient, old code
  # disk1chunk = ''.join(disk1bytes) #inefficient, old code
  disk1chunk = struct.pack("16384l", *disk1num) #more efficient new code

  #output the non-parity chunks based on parityDisk

def takexor(magicpotato):
  return magicpotato[0]^magicpotato[1]^magicpotato[2]

Bolding to denote the actual questions inside this giant block of text:

Is there anything I can do to make this faster or better? If nothing comes to mind, how can I better research what is making it slow? (Is there even a way to profile Python at a per-line level?) Am I even handling this the right way, or is there a better way to handle massive amounts of binary data?

The reason I ask is I have a 3TB drive rebuilding and even though it's working correctly (I can mount the image ro,loop and browse files fine) it's taking a long time. I measured it as taking until mid-January with the old code, now it's going to take until Christmas (so it's way better but it's still slower than I expected it to be.)

Before you ask, this is an mdadm RAID5 (64KB block size, left symmetric), but the mdadm metadata is missing somehow. mdadm does not allow you to reconfigure a RAID5 without rewriting the metadata to the disks, which I am trying to avoid at all costs: I don't want to risk screwing something up and losing data, however remote the possibility may be.
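For reference, the left-symmetric layout mentioned above can be sketched as follows. This is a minimal illustration, not part of the script: the function name `left_symmetric_layout` is hypothetical, it assumes four member devices indexed 0 to 3 in their original array order (which may not match the `disk0`..`disk3` numbering used in the script), and `stripe` counts 64KB stripes from the start of the data area.

```python
def left_symmetric_layout(stripe, n_disks=4):
    """Return (parity_disk, data_disks) for one stripe of a left-symmetric RAID5.

    data_disks lists the device indices in logical order, i.e. the order in
    which their chunks must be concatenated to reproduce the original data.
    """
    # Parity starts on the last device and moves one device to the left
    # with every stripe, wrapping around after n_disks stripes.
    parity_disk = (n_disks - 1) - (stripe % n_disks)
    # Data chunks start on the device right after the parity device and wrap.
    data_disks = [(parity_disk + 1 + k) % n_disks for k in range(n_disks - 1)]
    return parity_disk, data_disks


if __name__ == "__main__":
    for stripe in range(4):
        print(stripe, left_symmetric_layout(stripe))
```

If the missing device is the parity device for a stripe, the three surviving chunks only need to be written out in `data_disks` order; otherwise the missing chunk is the XOR of the three surviving chunks and takes its place in that order.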


Solution

    1. map(takexor, magicpotato) - This is probably better done with direct iteration. As far as I know, map isn't efficient when it has to call other Python code: it needs to construct and destroy 16384 frame objects to perform the calls, etc.

    2. Use the array module instead of struct

    3. If it's still too slow, compile it with Cython and add some static types (that will probably make it 2-3 orders of magnitude faster)
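Points 1 and 2 could be combined roughly as below. This is a minimal sketch rather than a drop-in replacement: it is written for Python 3 (the script in the question is Python 2), the function name `xor_chunks` is hypothetical, and it assumes all three chunks have the same length and that this length is a multiple of the array item size (true for a 64KB chunk with the `"I"` typecode on common platforms).

```python
from array import array

WORD = "I"  # unsigned int typecode; 4 bytes on common platforms


def xor_chunks(chunk_a, chunk_b, chunk_c):
    """XOR three equal-length byte strings together, one machine word at a time."""
    # array() parses the raw bytes directly, so no struct format string is needed.
    a = array(WORD, chunk_a)
    b = array(WORD, chunk_b)
    c = array(WORD, chunk_c)
    # Direct iteration: the XOR is written inline in the comprehension,
    # so there is no Python-level function call per word.
    rebuilt = array(WORD, [x ^ y ^ z for x, y, z in zip(a, b, c)])
    return rebuilt.tobytes()


if __name__ == "__main__":
    d0 = bytes(range(16))
    d2 = bytes(range(16, 32))
    d1 = bytes(range(32, 48))            # pretend this is the lost chunk
    parity = xor_chunks(d0, d1, d2)      # what the array would have stored
    assert xor_chunks(d0, d2, parity) == d1
    print("rebuilt chunk matches")
```

In the script, the rebuilt chunk would then be `xor_chunks(disk0chunk, disk2chunk, disk3chunk)`. Whether this is fast enough would still need to be measured; in Python 3, converting each chunk with `int.from_bytes` and XORing the resulting big integers is another option worth timing.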