tcl channel

Is there a way to chan seek or chan tell based on characters rather than bytes?


According to the Tcl/Tk man page for channels, chan read reads characters rather than bytes, unless the channel is configured as binary, in which case it reads bytes. That is, on a binary channel chan read numChars will read numChars bytes.

Both chan seek and chan tell refer to bytes. Is there any equivalent of the chan read change from chars to bytes for these two commands, such that they will refer to character positions in the channel rather than bytes?

The context is this. I have to write to a channel in binary, and the data includes multi-byte characters. There is a set of pointers that track the start byte and the character length of every segment of text written to the channel. The byte length could be tracked instead, or in addition. There is no problem later reading the exact same segments back from the channel. The issue is that the pointers often and rapidly need to be split into two parts (a front and an end), and only the character position of the split is known.

For example, I'll know that the pointer starts at byte b, has a character length of 50, and is at character position 484 in a larger string. I'll also know that, starting at character 500, 20 characters are to be deleted. That means two pointers are now needed: one for characters 484 to 499 and another for 520 to 533. The front pointer will have start byte b and char length 500-484 = 16. The end pointer will have char length 534-520 = 14, but I see no way of knowing its start byte, because it depends on the number of bytes that precede it, and that is not knowable without first reading the bytes. This is to be done by math only; there's no time to retrieve text and count it.

So, I need a way to be on a character basis for the split. Start byte and character length are okay for reading the data later; but splitting the segments in between those reads has me confused.

I should add that I think I could track everything in characters and just read the binary data in as a string. But that would require reading all the data in at once, then splitting the string into segments, and joining. I thought it might be easier to read the segments by seeking to the correct start position and reading so many characters. But, perhaps, all that seeking is a lot of work also, such that reading all the data into memory, including the unwanted segments, would be more efficient.

Thank you for any guidance you may be able to provide.


Solution

  • The OS API for seeking/file positioning works with bytes and always has done; Tcl's API reflects that pretty much directly. How would you compute the actual byte index of the n'th character in a variable width encoding like UTF-8? It's going to be at an arbitrary byte index between n and 4×n (with a probability distribution towards the lower end of the range, depending on the language in use).

    The only known reliable method for computing the index is to read characters from the file until you have reached the character count you want, and then ask for the position (with chan tell). Tcl tracks that position on a per-character basis despite doing buffering behind the scenes. You can then chan seek back to it any time you want (provided that part of the file isn't rewritten).

    If you know that a file is written with a constant-width encoding (or is just binary) then computing the position is easy, but knowing when that is true is up to the script level. It's a level of understanding that the channel system in Tcl doesn't have and realistically isn't ever going to gain.
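The read-then-tell approach described in the answer can be sketched roughly as follows. This is only an illustration under assumptions: byteOffsetOfChar is a hypothetical helper name (not a Tcl built-in), the data is assumed to be UTF-8, and file tempfile requires Tcl 8.6 or later.

```tcl
# Hypothetical helper (not a built-in): return the byte offset of the
# character that lies $charCount characters past byte offset $startByte.
proc byteOffsetOfChar {chan startByte charCount} {
    chan seek $chan $startByte start
    # Decode and discard the leading characters; only the position matters.
    chan read $chan $charCount
    return [chan tell $chan]
}

# Demo data: a (1 byte), e-acute (2 bytes), euro sign (3 bytes), b (1 byte).
set f [file tempfile path]
chan configure $f -encoding utf-8 -translation lf
chan puts -nonewline $f "a\u00e9\u20acb"
chan flush $f

# Find where character 3 (the "b") starts, then seek straight to it.
set off [byteOffsetOfChar $f 0 3]
chan seek $f $off start
set tail [chan read $f]   ;# everything from that character onward

close $f
file delete $path
puts "byte offset: $off, text from there: $tail"
```

For the split in the question, one such call at the moment of the split would give the start byte of the end pointer, at the cost of decoding the characters in front of it. With a constant-width encoding the same offset could instead be computed as the start byte plus the character count times the character width, with no read at all.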