Would it be a good idea to convert a text file to a triple-quoted string literal (the same syntax as a docstring) so that regular expressions work on it? I've tried converting it to a string with str() and using multiline mode in re.
I've created a rudimentary Python script to parse an EnCase export file. It works, but for some reason I can't get re.findall to search the file unless I store the contents of the file as a triple-quoted string in a variable, like this:
file = '''
'''
It seems that this code could be reused for different files, but copying and pasting the contents of every file is cumbersome. Any other suggestions?
The EnCase file export is essentially tab delimited and the following has information as to the format of the file.
Also see: Exporting Files and Folder from EnCase
Just read the file. This will give you a string:
In [2]: with open('encase_example.md') as cf:
   ...:     data = cf.read()
   ...:
In [3]: data[:41]
Out[3]: '\n1)\nName\tfile.doc\nFile Category\tDocument\n'
(Just showing part of the string as an example.)
Note in the data that there are newlines between fields of each record, but tabs between the key and value of each field. We will use this later.
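To avoid pasting each file's contents into the script, the read step can be wrapped in a small function that takes a path. This is only a minimal sketch: the name `read_export` and the default `utf-8` encoding are assumptions, not anything EnCase prescribes, so pass a different encoding if your export uses one.

```python
from pathlib import Path


def read_export(path, encoding="utf-8"):
    """Return the whole export file as a single string.

    `read_export` is a hypothetical helper name; the default encoding
    is an assumption and may need to be changed for a given export.
    """
    return Path(path).read_text(encoding=encoding)


# Example use (file name taken from the session above):
# data = read_export('encase_example.md')
```

The returned value is an ordinary str, exactly like a triple-quoted literal, so the same parsing code can be run on any number of files.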
A string read from the file this way works with regexes:
In [14]: re.findall('Full Path.*', data)
Out[14]:
['Full Path\tproject\\D\\analysis\\system\\folder\\file.doc',
'Full Path\tproject\\D\\analysis\\system\\folder\\file2.doc']
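Since the question mentions multiline mode, here is a hedged sketch of how re.MULTILINE could be used to pull out only the value of a field rather than the whole line. The helper name `field_values` and the `sample` string are made up for illustration; the sample just imitates the tab-and-newline layout shown above.

```python
import re

# Hypothetical sample in the same layout as the export shown above:
# a tab between key and value, a blank line between records.
sample = (
    "\n1)\nName\tfile.doc\nFull Path\tproject\\D\\folder\\file.doc\n"
    "\n2)\nName\tfile2.doc\nFull Path\tproject\\D\\folder\\file2.doc\n"
)


def field_values(text, field):
    """Return the value of every line that starts with `field` plus a tab."""
    # With re.MULTILINE, ^ and $ match at the start and end of each line,
    # not just at the start and end of the whole string.
    pattern = rf"^{re.escape(field)}\t(.*)$"
    return re.findall(pattern, text, flags=re.MULTILINE)


paths = field_values(sample, "Full Path")
print(paths)
```

Anchoring the pattern with ^ keeps it from matching a field name that happens to appear inside another field's value.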
If you want to separate the records, just split on '\n\n':
In [18]: records = data.split('\n\n')
In [19]: len(records)
Out[19]: 2
In [20]: records[0][:50]
Out[20]: '\n1)\nName\tfile.doc\nFile Category\tDocument\nFile Type'
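You can also make a record into a dictionary. The [2:] slice skips the leading blank line and the record number ('1)'), and each remaining line is split on the tab into a key and a value: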
In [35]: dict([ln.split('\t') for ln in records[0].splitlines()][2:])
Out[35]:
{'Entry Modified': '12/18/14 11:18:53AM',
'File Acquired': '04/28/15 01:54:45PM',
'File Category': 'Document',
'File Created': '03/29/14 03:22:59PM',
'File Deleted': '',
'File Type': 'Word Document',
'Full Path': 'project\\D\\analysis\\system\\folder\\file.doc',
'Is Deleted': '',
'Last Written': '08/18/08 01:20:48PM',
'Name': 'file.doc',
'Physical Location': '546,930,589,696',
'Physical Size': '32,768'}
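The same idea can be extended to every record in the file. The following is a sketch under the assumption that records are separated by a blank line and that every field line contains a tab, as in the example data; `parse_records` and `sample` are hypothetical names. Instead of slicing off a fixed number of lines, it keeps only lines that contain a tab, so the record number and blank lines are skipped wherever they appear.

```python
# Hypothetical sample imitating the export layout, including an empty value.
sample = (
    "\n1)\nName\tfile.doc\nFile Category\tDocument\nIs Deleted\t\n"
    "\n2)\nName\tfile2.doc\nFile Category\tDocument\nIs Deleted\t\n"
)


def parse_records(text):
    """Return a list with one {field name: value} dict per record."""
    records = []
    for chunk in text.split("\n\n"):
        fields = {}
        for line in chunk.splitlines():
            # partition splits on the first tab only, so a value that
            # itself contains a tab is kept whole.
            key, sep, value = line.partition("\t")
            if sep:  # lines without a tab ('1)', blank lines) are skipped
                fields[key] = value
        if fields:
            records.append(fields)
    return records


for record in parse_records(sample):
    print(record["Name"], record["File Category"])
```

A list of dictionaries is convenient here because each field can then be looked up by name, without writing a separate regular expression per field.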