Read Contents Tarfile into Python - "seeking backwards is not allowed"

I am new to python. I am having trouble reading the contents of a tarfile into python.

The data are the contents of a journal article (hosted at pubmed central). See info below. And link to tarfile which I want to read into Python.

http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901 ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz

I have a list of similar .tar.gz file I will eventually want to read in as well. I think (know) all of the tarfiles have a .nxml file associated with them. It is the content of the .nxml files I am actually interested in extracting/reading. Open to any suggestions on the best way to do this...

Here is what I have if I save the tarfile to my PC. All runs as expected.

tarfile_name = "F:/PMC_OA_TextMining/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
tfile = tarfile.open(tarfile_name)

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
tfile_members_name = tfile_members[i].name
tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
if tfile_members1[i].endswith('.nxml'):
    tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

I learned today that to in order to access the tarfile directly from the pubmed centrals FTP site I have to set up a network request using urllib. Below is the revised code (and link to stackoverflow answer I received):

Read contents of .tar.gz file from website into a python 3.x object

tarfile_name = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_name)
tfile = tarfile.open(fileobj=ftpstream, mode="r|gz")

However, when I run the remaining piece of the code (below) I get an error message ("seeking backwards is not allowed"). How come?

tfile_members = tfile.getmembers()

tfile_members1 = []
for i in range(len(tfile_members)):
tfile_members_name = tfile_members[i].name
tfile_members1.append(tfile_members_name)

tfile_members2 = []
for i in range(len(tfile_members1)):
if tfile_members1[i].endswith('.nxml'):
    tfile_members2.append(tfile_members1[i])

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

The code fails on the last line, where I try to read the .nxml content associated with my tarfile. Below is the actual error message I receive. What does it mean? What is my best workaround for reading/accessing the content of these .nxml files which are all embedded in tarfiles?

Traceback (most recent call last):
File "F:\PMC_OA_TextMining\test2.py", line 135, in <module>
tfile_extract1_text = tfile_extract1.read()
File "C:\Python30\lib\tarfile.py", line 804, in read
buf += self.fileobj.read()
File "C:\Python30\lib\tarfile.py", line 715, in read
return self.readnormal(size)
File "C:\Python30\lib\tarfile.py", line 722, in readnormal
self.fileobj.seek(self.offset + self.position)
File "C:\Python30\lib\tarfile.py", line 531, in seek
raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

Thanks in advance for your help. Chris

Solution

What's going wrong: Tar files are stored interleaved. They come in the order header, data, header, data, header, data, etc. When you enumerated the files with getmembers(), you've already read through the entire file to get the headers. Then when you asked the tarfile object to read the data, it tried to seek backward from the last header to the first data. But you can't seek backward in a network stream without closing and reopening the urllib request.

How to work around it: You'll need to download the file, save a temporary copy to disk or to a StringIO, enumerate the files in this temporary copy, and then extract the files you want.

#!/usr/bin/env python3
from io import BytesIO
import urllib.request
import tarfile

tarfile_url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(tarfile_url)

# BytesIO creates an in-memory temporary file.
# See the Python manual: http://docs.python.org/3/library/io.html
tmpfile = BytesIO()
while True:
    # Download a piece of the file from the connection
    s = ftpstream.read(16384)

    # Once the entire file has been downloaded, tarfile returns b''
    # (the empty bytes) which is a falsey value
    if not s:  
        break

    # Otherwise, write the piece of the file to the temporary file.
    tmpfile.write(s)
ftpstream.close()

# Now that the FTP stream has been downloaded to the temporary file,
# we can ditch the FTP stream and have the tarfile module work with
# the temporary file.  Begin by seeking back to the beginning of the
# temporary file.
tmpfile.seek(0)

# Now tell the tarfile module that you're using a file object
# that supports seeking backward.
# r|gz forbids seeking backward; r:gz allows seeking backward
tfile = tarfile.open(fileobj=tmpfile, mode="r:gz")

# You want to limit it to the .nxml files
tfile_members2 = [filename
                  for filename in tfile.getnames()
                  if filename.endswith('.nxml')]

tfile_extract1 = tfile.extractfile(tfile_members2[0])
tfile_extract1_text = tfile_extract1.read()

# And when you're done extracting members:
tfile.close()
tmpfile.close()