First of all, I'm new to working with binaries, so I hope this is not a stupid question.
I have generated tables of instruction sequences from the .text section of a binary. A table of 2-instruction sequences looks like this:
sequence | total | relative
------------------------------------
e3a0b000e3a0e000 | 2437 | 0.0469
...
The sequences were extracted using IDAPython; the generated text files look like this:
9c54 SUBROUTINE
9c54 e3a0b000 MOV R11, #0
9c58 e3a0e000 MOV LR, #0
...
UPDATED
Now I'm using the Aho-Corasick algorithm to match these sequences in the same binary from which I extracted them. I just add all sequences from the table to the Aho automaton:
import binascii
import ahocorasick
from connect_db import DB
from get_metadata import get_meta

a = ahocorasick.Automaton()
meta = get_meta()

with DB('test.db') as db:
    for idx, key in enumerate(list(db.select_query(meta['select_queries']['select_all'].format('sequence_two')))):
        a.add_word(key[0], (idx, key[0]))
a.make_automaton()

with open('../test/test_binary', 'rb') as f:
    for sub in a.iter(f.read().hex()):
        print('file offset: %s; length: %d; sequence: %s' % (hex(sub[0]), len(sub[1][1]), sub[1][1]))
Then I get the following output:
file offset: 0x38b7; length: 16; sequence: e3a0b000e3a0e000
...
My problem is that Aho-Corasick returns 0x38b7, but when I opened the binary again in ghex on Ubuntu, I found the two instructions at the offset I expected:
offset: bytes:
00001C54 E3A0B000 E3A0E000 ...
This means I should find them in the range 0x1c54 to 0x1c5c, where 0x1c54 is the raw offset (0x9c54 - 0x8000).
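As a sanity check, the address arithmetic can be written out. This is only a sketch: `va_to_raw` and its parameters are names made up for illustration, and it assumes the segment holding the code is loaded at 0x8000 and starts at file offset 0, which is what the numbers above imply for this binary.

```python
def va_to_raw(va, segment_va=0x8000, segment_file_offset=0):
    """Translate a virtual address to a file offset (hypothetical helper)."""
    if va < segment_va:
        raise ValueError("address lies below the segment start")
    return va - segment_va + segment_file_offset


start = va_to_raw(0x9c54)  # address of MOV R11, #0
end = va_to_raw(0x9c5c)    # first address after MOV LR, #0
print(hex(start), hex(end))
```

For a binary whose code segment does not start at file offset 0, `segment_file_offset` would have to be taken from the section or program headers.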
I don't yet understand how to get from the returned value to that offset, but I'd like to obtain the raw offset using Aho-Corasick. I know that Aho-Corasick returns the offset of the end of the keyword.
I was able to fix the problem once I realized that converting the bytes to a hex ASCII string doubles the length, because every byte becomes two characters. I had to halve the offset returned by Aho-Corasick to get the real raw offset:
BEFORE
with open('../test/test_binary', 'rb') as f:
    for sub in a.iter(f.read().hex()):
        print('file offset: %s; length: %d; sequence: %s' % (hex(sub[0]), len(sub[1][1]), sub[1][1]))
AFTER
with open('../test/test_binary', 'rb') as f:
    for sub in a.iter(f.read().hex()):
        print('file offset: %s; length: %d; sequence: %s' % (hex(int(sub[0] / 2)), len(sub[1][1]), sub[1][1]))
The new output is almost as expected:
file offset: 0x1c5b; length: 16; sequence: e3a0b000e3a0e000
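The value 0x1c5b is the offset of the last byte of the match rather than the first. Below is a rough sketch of how the start offset could be recovered, assuming the index reported by `iter()` is the position of the last matched character in the hex string. `hex_end_to_raw_start` is a hypothetical helper written for this post, not part of pyahocorasick.

```python
def hex_end_to_raw_start(end_index, key):
    """Map the index of a match's last hex character to the byte offset
    of its first byte.

    Returns None when the match starts at an odd character index, i.e.
    in the middle of a byte, so it is not a real byte-level match.
    """
    start_char = end_index - len(key) + 1  # index of the first matched hex character
    if start_char % 2:
        return None                        # match straddles a byte boundary
    return start_char // 2                 # two hex characters per byte


print(hex(hex_end_to_raw_start(0x38b7, 'e3a0b000e3a0e000')))  # 0x1c54
```

The odd-index check is there because a search over hex text can also report a pattern that begins at the second hex digit of a byte, which does not correspond to any byte sequence in the file.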
NOTE
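When dividing the offset by 2 with `/`, Python 3 turns the integer into a float. Converting that float back with `int()` does not round to the nearest value; it truncates toward zero, so 0x38b7 / 2 = 7259.5 becomes 7259 (0x1c5b). Integer division (`sub[0] // 2`) gives the same result without going through a float.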
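To see what the division actually does here, a short check with the index from the output above may help. It relies only on plain Python 3 behavior, nothing specific to pyahocorasick.

```python
end_index = 0x38b7          # odd index reported for the match above

as_float = end_index / 2    # true division: always a float in Python 3
truncated = int(as_float)   # int() drops the fractional part
floored = end_index // 2    # floor division: stays an int throughout

print(as_float, hex(truncated), hex(floored))
```

For non-negative offsets both approaches give the same integer, so `//` is simply the more direct way to write it.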