
How do I parse this IDA-generated asm file to get a list of mnemonics for each function?


I have an asm file that was produced with IDA Pro. All of its functions look roughly like this:

; =============== S U B R O U T I N E =======================================


release                                 ; DATA XREF: attribute_manager_create+78↓o
                                        ; attribute_manager_create+7C↓o ...

var_30          = -0x30
var_24          = -0x24
arg_0           =  0
arg_4           =  4

                PUSH    {R4-R9,LR}
                MOV     R7, R0
                LDR     R0, [R0,#0x34]
                SUB     SP, SP, #0x14
                MOV     R9, R3
                LDR     R3, [R0]
                MOV     R5, R1
                MOV     R8, R2
                BLX     R3
                LDR     R0, [R7,#0x30]
                ADD     R6, SP, #0x30+var_24
                LDR     R3, [R0,#4]
                BLX     R3
                MOV     R4, R0
                B       loc_7A7C
; ---------------------------------------------------------------------------

loc_7A70                                ; CODE XREF: release+5C↓j
                LDR     R3, [SP,#0x30+var_24]
                CMP     R3, R5
                BEQ     loc_7AB4

loc_7A7C                                ; CODE XREF: release+38↑j
                LDR     R3, [R4]
                MOV     R1, R6
                MOV     R0, R4
                BLX     R3
                CMP     R0, #0
                BNE     loc_7A70

loc_7A94                                ; CODE XREF: release+A0↓j
                LDR     R3, [R4,#8]
                MOV     R0, R4
                BLX     R3
                LDR     R0, [R7,#0x34]
                LDR     R3, [R0,#0xC]
                BLX     R3
                ADD     SP, SP, #0x14
                POP     {R4-R9,PC}
; ---------------------------------------------------------------------------

loc_7AB4                                ; CODE XREF: release+44↑j
                LDR     R3, [SP,#0x30+arg_4]
                STR     R3, [SP,#0x30+var_30]
                MOV     R2, R9
                LDR     R3, [SP,#0x30+arg_0]
                LDR     R6, [R5,#4]
                MOV     R1, R8
                MOV     R0, R5
                BLX     R6
                B       loc_7A94
; End of function release

I want to parse this file and get a dictionary where each key is a function name and each value is a string built from that function's instructions. I will explain in more detail.

I have a dictionary in which each Arm instruction corresponds to a specific letter.

arm_dict = {"MOV": "a","MVN": "b","ADD": "c","SUB": "d","MUL": "e","LSL": "f","LSR": "g","ASR": "h","ROR": "i","CMP": "j","AND": "k","ORR": "l","EOR": "m","LDR": "n","STR": "o","LDM": "p","STM": "q","PUSH": "r","POP": "s","B": "t","BL": "u","BLX": "v","BEQ": "w","SWI": "x","SVC": "y","NOP": "z"}

When parsing, each instruction should be replaced by its letter. For example, the function above should end up in the dictionary like this:

{'release': 'randanaavncnvat...'}

If the code contains an instruction that is not in arm_dict, then that instruction is skipped.

I've tried to parse linearly using strings containing "S U B R O U T I N E" and "End of function", but I can't get rid of the instruction operands. I would be glad if someone can provide some sample code or advice.


Solution

  • arm_dict = {"MOV": "a","MVN": "b","ADD": "c","SUB": "d","MUL": "e","LSL": "f","LSR": "g","ASR": "h","ROR": "i","CMP": "j","AND": "k","ORR": "l","EOR": "m","LDR": "n","STR": "o","LDM": "p","STM": "q","PUSH": "r","POP": "s","B": "t","BL": "u","BLX": "v","BEQ": "w","SWI": "x","SVC": "y","NOP": "z"}
    FILE_NAME = "ida_output.asm"
    result = ""
    
    with open(FILE_NAME) as f:
        lines = f.readlines()
        for line in lines:
            words = line.split()
            # if the line is empty, skip it
            if not words:
                continue
            if words[0] in arm_dict:
                result += arm_dict[words[0]]
    
    print(result)
    

    Here's a messy edited version after peter's suggestion:

    arm_dict = {"MOV": "a","MVN": "b","ADD": "c","SUB": "d","MUL": "e","LSL": "f","LSR": "g","ASR": "h","ROR": "i","CMP": "j","AND": "k","ORR": "l","EOR": "m","LDR": "n","STR": "o","LDM": "p","STM": "q","PUSH": "r","POP": "s","B": "t","BL": "u","BLX": "v","BEQ": "w","SWI": "x","SVC": "y","NOP": "z"}
    FILE_NAME = "ida_output.asm"
    
    def trim(line):
        if ";" in line:
            return line.split(";")[0]
        return line
    
    functions = {}
    with open(FILE_NAME) as f:
        label = None
        lines = f.readlines()
        for line in lines:
            words = line.split()
            # if the line is empty, skip it
            if not words:
                continue
            first = words[0]
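            if label and first in arm_dict:
                # known mnemonic inside a function: append its letter
                functions[label] += arm_dict[first]
            # Heuristic for a new function name: not a comment, starts in
            # column 0, is not a "var_30 = -0x30" style definition, and the
            # line does not mention the current function (local loc_ labels
            # usually do, in their CODE XREF comment).
            elif first[0] != ";" and (not label or (not line[0].isspace() and "=" not in trim(line) and label not in line)):
                label = first
                functions[label] = ""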
    
    
    
    print(functions)
    

    There are lots of potential edge cases where it could fail, but it should do pretty well.
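
One way to reduce those edge cases is to rely on the two markers mentioned in the question, "S U B R O U T I N E" and "End of function", instead of guessing where a function starts. The following is only a minimal sketch under a few assumptions: every function is wrapped in both markers, the function name is the first non-indented token after the header, and instructions are always indented. The name `encode_functions` and the shortened `SAMPLE` listing are made up for illustration, and `ARM_DICT` is trimmed to the mnemonics the sample uses.

```python
# Trimmed copy of arm_dict, just enough for the sample below.
ARM_DICT = {"MOV": "a", "LDR": "n", "PUSH": "r", "POP": "s", "B": "t", "BLX": "v"}

# Shortened, hypothetical listing in the same layout as the question's file.
SAMPLE = """\
; =============== S U B R O U T I N E =======================================


release                                 ; DATA XREF: attribute_manager_create+78

var_30          = -0x30

                PUSH    {R4-R9,LR}
                MOV     R7, R0
                B       loc_7A7C
; ---------------------------------------------------------------------------

loc_7A7C                                ; CODE XREF: release+38
                LDR     R3, [R4]
                BNE     loc_7A70
                POP     {R4-R9,PC}
; End of function release
"""


def encode_functions(lines, mapping):
    """Return {function name: letters for its known mnemonics}."""
    functions = {}
    name = None            # function currently being read, if any
    awaiting_name = False  # True between the header and the name line
    letters = []
    for raw in lines:
        if "S U B R O U T I N E" in raw:
            awaiting_name, name, letters = True, None, []
            continue
        if raw.lstrip().startswith("; End of function"):
            if name is not None:
                functions[name] = "".join(letters)
            name, awaiting_name = None, False
            continue
        code = raw.split(";", 1)[0]  # drop any trailing comment
        tokens = code.split()
        if not tokens:
            continue  # blank line or comment-only line
        indented = code[0].isspace()
        if awaiting_name and not indented:
            # first non-indented token after the header is the function name
            name, awaiting_name = tokens[0], False
        elif name is not None and indented and tokens[0] in mapping:
            # indented line whose first token is a known mnemonic;
            # unknown mnemonics (BNE here) are skipped
            letters.append(mapping[tokens[0]])
    return functions


print(encode_functions(SAMPLE.splitlines(), ARM_DICT))
```

To run it on a real file, pass the open file object instead of `SAMPLE.splitlines()`. Local labels such as `loc_7A7C` and the `var_30 = ...` definitions are ignored simply because they are not indented, so the result does not depend on what the XREF comments contain.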