I have a text file containing numbers that looks as follows:
[mpz(0), mpz(0), mpz(0), mpz(0), mpz(4), mpz(54357303843626),...]
Is there a simple way to parse it directly into an integer list? It doesn't matter whether the target data type is an mpz integer or a plain Python integer.
What I have tried so far, and what works, is plain parsing (note: the target array y_val3 needs to be initialized with zeros in advance, since it may be larger than the list in the text file):
import re

text_file = open("../prod_sum_copy.txt", "r")
content = text_file.read()[1:-1]
text_file.close()
content_list = content.split(",")
y_val3 = [0] * 10000
print(content_list)
for idx, element in enumerate(content_list):
    m = re.search(r'mpz\(([0-9]+)\)', element)
    y_val3[idx] = int(m.group(1))
print(y_val3)
Although this approach works, I am not sure whether it is best practice or whether there exists a more elegant way than plain parsing.
To facilitate things, here is the original text file on GitHub. Note: this text file might grow in the future, which brings aspects such as performance and scalability into play.
I looked at a more elegant solution from both the human-readability perspective and the performance perspective.
Caveats:
The breakouts and timings below seem to show an order of magnitude difference between several of the approaches, so they may still be of use in gauging the level of computational effort.
My first approach was to try to measure the overhead that the file read added to the process, so that we could see how much computational effort was spent on just the data processing step.
To do this, I wrote a function that included the file read and measured the whole process, end to end, to see how long it took with my mini example file. I did this using %timeit in a Jupyter notebook.
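(A side note: if you are not working in a notebook, the standard library's timeit module gives comparable numbers. A minimal sketch, assuming the original() function defined below is already in scope:)

import timeit

# Rough stand-in for %timeit outside a notebook; original() is defined below.
n = 10_000
total = timeit.timeit("original()", globals=globals(), number=n)
print(f"{total / n * 1e6:.1f} µs per loop")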
I then broke the file reading step out into its own function and used %timeit on just the data processing step to show us:
import re

def original():
    text_file = open("../prod_sum_copy.txt", "r")
    content = text_file.read()[1:-1]
    text_file.close()
    content_list = content.split(",")
    y_val3 = [0] * 10000
    for idx, element in enumerate(content_list):
        m = re.search(r'mpz\(([0-9]+)\)', element)
        y_val3[idx] = int(m.group(1))
    return y_val3
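(As an aside: if you don't have the original file handy, something like the following can generate a stand-in in the same [mpz(0), mpz(4), ...] format. The file name, element count, and value range here are all made up; point the read path above at whatever file you create:)

import random

# Hypothetical helper: write a stand-in data file in the same
# "[mpz(0), mpz(4), ...]" format the parsing code expects.
values = [random.randrange(10**15) for _ in range(1000)]
with open("sample_prod_sum.txt", "w") as f:
    f.write("[" + ", ".join(f"mpz({v})" for v in values) + "]")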
I presume that a significant portion of the processing time for my really short example data is simply the time needed to open the file on disk, read the data into memory, close the file, and so on.
%timeit original()
140 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
This approach includes a minor improvement to the file reading process. The timing test does not include the file reading process, so we won't know how much that minor change affects the overall process. For the record, I eliminated the manual call to the .close() method by encapsulating the reading process in a with context manager (which handles closing in the background), as this is a Python best practice for reading files.
import re

def read_filea():
    with open("../prod_sum_copy.txt", "r") as text_file:
        content = text_file.read()[1:-1]
    return content
content = read_filea()
print(content)
def a():
    y_val3 = [0] * 10000
    content_list = content.split(",")
    for idx, element in enumerate(content_list):
        m = re.search(r'mpz\(([0-9]+)\)', element)
        y_val3[idx] = int(m.group(1))
    return y_val3
By timing just the data processing portion, we see that our prediction appears correct: file read (I/O) is a big component of this simple test case, since data processing accounts for only about 21.5 µs of the roughly 140 µs end to end. It also gives us an idea of how much time to expect for just the data processing portion. Let's look at another approach to see if we can trim that time down a bit.
%timeit a()
21.5 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Here we will try to use some Python best practices and tools to cut down on the overall time, in particular the re.findall() function, which eliminates the direct and repeated calls to the re.search() function and the direct and repeated calls to the m.group() method (note: findall() is likely doing some of that work in the background, and I honestly don't know whether avoiding those calls will have a benefit). But I find the readability of this approach to be higher than that of the original approach. Let's look at the code:
import re

def read_fileb():
    with open("../prod_sum_copy.txt", "r") as text_file:
        content = text_file.read()[1:-1]
    return content

content = read_fileb()

def b():
    y_val3 = [int(element) for element in re.findall(r'mpz\(([0-9]+)\)', content)]
    return y_val3
The data processing portion of this approach is about 10 times faster than the data processing steps in the original approach.
%timeit b()
2.89 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
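Finally, since the file might keep growing, one more micro-optimization worth trying (not timed here, so treat this as a sketch) is to precompile the pattern with re.compile() and run it over the raw file content, which also removes the need for the [1:-1] slice and the split. And since you mentioned that mpz values are an acceptable target type: gmpy2's mpz() accepts the matched digit strings directly.

import re
import gmpy2  # only needed if you want mpz values rather than plain ints

MPZ_RE = re.compile(r'mpz\((\d+)\)')  # compiled once, reused across calls

def c(content):
    # Scan the raw file content directly; no slicing or splitting required.
    return [int(s) for s in MPZ_RE.findall(content)]

def c_mpz(content):
    # Same idea, but yielding gmpy2.mpz values instead of plain ints.
    return [gmpy2.mpz(s) for s in MPZ_RE.findall(content)]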