My environment:
OS: Windows 11
Python version: 3.13.2
NumPy version: 2.1.3
According to the NumPy Fundamentals guide describing how to use the numpy.genfromtxt function:
The optional argument comments is used to define a character string that marks the beginning of a comment. By default, genfromtxt assumes comments='#'. The comment marker may occur anywhere on the line. Any character present after the comment marker(s) is simply ignored.
Note: There is one notable exception to this behavior: if the optional argument names=True, the first commented line will be examined for names.
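The default behavior described in the first part of that excerpt can be checked with a minimal sketch. The in-memory sample below is hypothetical (it is not my real data file) and is fed through io.StringIO so that nothing needs to exist on disk:

```python
import io

import numpy as np

# Hypothetical in-memory data standing in for a file on disk.
sample = io.StringIO(
    "1,2 # trailing comment\n"
    "# whole-line comment\n"
    "3,4\n"
)

# Everything after '#' is dropped; a line that is only a comment is skipped.
data = np.genfromtxt(sample, delimiter=",", comments="#")
print(data)
```

Only the two data rows should survive, parsed as floats, which is the "simply ignored" behavior the guide describes.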
To test the note quoted above, I created the following data file, with the header line written as a commented line:
C:\tmp\data.txt
#firstName|LastName
Anthony|Quinn
Harry|POTTER
George|WASHINGTON
And I wrote the following program to read and print the contents of the file:
import numpy as np

with open("C:/tmp/data.txt", "r", encoding="UTF-8") as fd:
    result = np.genfromtxt(fd,
                           comments="#",
                           delimiter="|",
                           dtype=str,
                           names=True,
                           skip_header=0)
print(f"result = {result}")
But the result is not what I expected:
result = [('', '') ('', '') ('', '')]
I cannot figure out where the error in my code is, and I don't understand why the content of my data file, in particular the header line after the comment indicator #, is not interpreted correctly.
I'd appreciate it if you could clarify this.
Thanks to what @mehdi-sahraei suggested, I changed the dtype to None, which allowed the other rows (every row after the header line) to be parsed correctly. In the end, it seems there is no bug in how the header line is treated, but rather a lack of clarity in the documentation. As indicated in my original post, the documentation says:
... if the optional argument names=True, the first commented line will be examined for names ...
But what the documentation doesn't tell you is that, in that case, the detected header is stored in dtype.names and not alongside the rows that follow the header in the file. So the header line is actually there, but it is not directly accessible like the other rows in the file. Here is a working test case for those who might be interested in checking how this works in practice:
C:\tmp\data.txt
#firstName|LastName
Anthony|Quinn
Harry|POTTER
George|WASHINGTON
And the program:
import numpy as np

with open("C:/tmp/data.txt", "r", encoding="UTF-8") as fd:
    result = np.genfromtxt(
        fd,
        delimiter="|",
        comments="#",
        dtype=None,
        names=True,
        skip_header=0,
        autostrip=True,
    )
print(f"result = {result}\n\n")
print("".join([
    "After parsing the file entirely, the detected ", "header line is: ",
    f"{result.dtype.names}"
]))
Which gives the expected result:
result = [('Anthony', 'Quinn') ('Harry', 'POTTER') ('George', 'WASHINGTON')]
After parsing the file entirely, the detected header line is: ('firstName', 'LastName')
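Because the header ends up as the field names of a structured array, each column can also be selected by name. The following is a minimal sketch rather than part of my original test: it assumes the same data is supplied through io.StringIO instead of C:/tmp/data.txt, the variable names text and people are only illustrative, and encoding="utf-8" is passed explicitly (my addition) so that the strings come back as str regardless of the NumPy version:

```python
import io

import numpy as np

# Same content as data.txt, held in memory (assumption: no file on disk).
text = (
    "#firstName|LastName\n"
    "Anthony|Quinn\n"
    "Harry|POTTER\n"
    "George|WASHINGTON\n"
)

people = np.genfromtxt(
    io.StringIO(text),
    delimiter="|",
    comments="#",
    dtype=None,        # let genfromtxt infer a type for each column
    names=True,        # take field names from the commented first line
    autostrip=True,
    encoding="utf-8",  # explicit, so strings are decoded to str
)

print(people.dtype.names)      # the header, stored as field names
print(people["LastName"])      # one whole column, selected by field name
print(people[0]["firstName"])  # one field of the first data row
```

Indexing with a field name returns the whole column, while indexing with an integer returns one row whose fields can then be looked up by name.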
Thanks everyone for your time and your help and I hope this might clarify the issue for those who have encountered the same problem.