python file text tell

Text file incorrect tell() positions


I am trying to index a text file (formatted as one sentence per line), but the character positions get skewed at some point. In the example below this happens at line/sentence 19.

First I record the start and end positions with tell(), then I seek back and read the same range, but the calculated range (279) is 4 characters too long!

Any idea what the reason could be?


    # Runs once per line; fh is the open text file, i is the line number.
    start = fh.tell()
    line = fh.readline()
    end = fh.tell()
    print(f"line:{i},pos:{start}-{end} : {end-start}\n{line}")

    line:19,pos:2703-2982 : 279
    Many revolutionaries of the 19th century such as William Godwin ( 1756 – 1836 ) and Wilhelm Weitling ( 1808 – 1871 ) would contribute to the anarchist doctrines of the next generation but did not use " anarchist " or " anarchism " in describing themselves or their beliefs .

    fh.seek(2703)
    print(fh.read(279))


    Many revolutionaries of the 19th century such as William Godwin ( 1756 – 1836 ) and Wilhelm Weitling ( 1808 – 1871 ) would contribute to the anarchist doctrines of the next generation but did not use " anarchist " or " anarchism " in describing themselves or their beliefs .
    The 



Solution

  • The value returned by tell() on a text-mode file has no documented meaning. It is officially an opaque cookie that can be passed back to seek(), and it cannot otherwise be counted on to mean anything. That opacity leaves room to encode additional state required by the encoding in use into the cookie. Even when the cookie happens to be a byte offset into a file with a one-byte-per-character encoding, it can differ from the number of characters read because of line-ending translation. Similarly, in variable-length encodings, a byte position does not correspond to a character count. Per the docs:

    tell()

    Return the current stream position as an opaque number. The number does not usually represent a number of bytes in the underlying binary storage.

    In short, you cannot and should not rely on the number produced to have any meaning that is useful to you aside from passing it back into a seek call. The value is an implementation detail.
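
As a rough illustration of the byte-versus-character point: the line in the question contains two en dashes, and if the file is UTF-8 (an assumption, the question does not say), each of them is one character but three bytes. If tell() happened to behave like a byte offset here, that alone would account for a difference of 4. The sketch below only compares lengths; `char_and_byte_lengths` is a hypothetical helper name, not part of any library.

```python
def char_and_byte_lengths(text, encoding="utf-8"):
    """Return (number of characters, number of encoded bytes) for text."""
    return len(text), len(text.encode(encoding))


# Fragment of the sentence from the question; \u2013 is the en dash.
fragment = "( 1756 \u2013 1836 ) and ( 1808 \u2013 1871 )"

chars, nbytes = char_and_byte_lengths(fragment)

# Each en dash is 1 character but 3 bytes in UTF-8, so two of them
# add 4 bytes on top of the character count.
print(f"characters: {chars}, bytes: {nbytes}, extra bytes: {nbytes - chars}")
```

This would explain why read(279), which counts characters, ran 4 characters into the next sentence.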


    sten: This seems to be working. I will test more to be sure. Thanks, this is a lifesaver; I will accept the answer in a jiff ;)

    A simpler way would be to call file.seek(ix[pos]) and then file.readline().
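
A minimal sketch of that seek-then-readline idea, treating each tell() value purely as a cookie and never doing arithmetic on it. The helper names (`build_line_index`, `read_line`, `write_sample`) and the sample text are made up for illustration, and UTF-8 is assumed as the file encoding.

```python
import os
import tempfile


def build_line_index(path, encoding="utf-8"):
    """Collect one tell() cookie per line start (hypothetical helper)."""
    cookies = []
    with open(path, "r", encoding=encoding, newline="") as fh:
        while True:
            cookie = fh.tell()      # opaque value, only meaningful to seek()
            if not fh.readline():   # empty string means end of file
                break
            cookies.append(cookie)
    return cookies


def read_line(path, cookie, encoding="utf-8"):
    """Seek to a stored cookie and read exactly one line."""
    with open(path, "r", encoding=encoding, newline="") as fh:
        fh.seek(cookie)
        return fh.readline()


def write_sample(lines, encoding="utf-8"):
    """Write sample lines to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding=encoding, newline="") as fh:
        fh.writelines(lines)
    return path


if __name__ == "__main__":
    sample = [
        "Godwin ( 1756 \u2013 1836 )\n",
        "plain ASCII line\n",
        "Weitling ( 1808 \u2013 1871 )\n",
    ]
    path = write_sample(sample)
    try:
        index = build_line_index(path)
        for i, cookie in enumerate(index):
            print(i, repr(read_line(path, cookie)))
    finally:
        os.remove(path)
```

Note that readline() is used instead of iterating over the file object, since calling tell() during `for line in fh` raises an OSError in text mode. If true byte offsets are needed, opening the file in binary mode ("rb") and decoding each line afterwards is the usual alternative.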