Seek time vs Sequential read

Let's assume that on a hard drive I have some very large data file of a sequence of characters:

ABRDZ....

My question is as follows, if the head is positioned at the beginning of the file, and I need 5 characters every 1000 positions interval, would it better be to do a Seek (since I know where to look) or simply have a large buffer that just reads sequentially then do job in memory.

Naively I'd have answered that reading 'A' then seek to read 'V' is faster than >> reading all the file until ,say, position 200 (the position of 'V'). Ok, this is just an example, since smallest I/O is 512bytes.

Edit: my previous self-naive-answer is partly justified by the following case: given a 100Gb file I need the first and the last charcters; Here I obviously would do a seek .... right?

Maybe there is a trade off between how "long" is the seek vs how much data to retrieve?

Can someone clarify this to me?

Solution

[UPDATE] Generally, from your original numbers 5 out of every 1000, (Ill assume that the 5 bytes is part of the 1000, thus making your step count 1000), if your step count is less 2x your block size, than my original answer is a pretty good explanation. It does get a bit more tricky once you get past 2x your HD block size, because at that point, you would easily be wasting read times, when you could be speeding up by seeking past un-used (or for that matter un-necessary) HD blocks.

[ORIGINAL] Well, this is an extremely interesting question, with what I beleive to be an equally interesting answer (also somewhat complex). I think that actually this comes down to a couple of other questions, like how big is the block size you have implemented on your drive (or the drive your software is going to run on). If your block size is 4KB, then the (true) minimum your hard drive will get for you at a time is 4096 bytes. In your case if you truly need 5 chars every 1000, then if you did this with ALL disk IO, then you would be essentially re-reading the same block 4 times, and doing 3 seeks in between (REALLY NOT EFFICIENT).

My personal belief is that you could (if you wanted to be drive efficient) in your code, try to understand what the block size of the drive that you are using is, then use that size number to know how many bytes at a time you should bring into RAM. This way you wouldn't have to have a HUGE RAM buffer, but at the same time not really have to SEEK, nor would you be wasting (or performing) any extra reads.

IS THIS THE MOST EFFICIENT. I dont think it is the most efficient, but it may be good enough for the performance you need, who knows. I do think that even if the read head is where you want it to be, that if you perform algorithmic work in the middle of each block read, rather than reading the whole file all at once, that you will lose time in waiting for the next rotation of the drive platters. Whereas, if you were to read it all at once, the drive should be able to perform a sequential read of all parts of the file at once. Again not as simple though, as if your file is truly more than 1 block, on a rotational drive, you may suffer IF your drive has not been defragmented as it may have to perform random seeks just to get to the next block.

Sorry, for the long winded answer, but par usual, there is no simple answer in your case.

I do think that overall performance would PROBABLY be better if you simply read the whole file at once. There is no way to assure this, as each system is going to have inherently different parameters of their drive setup, etc...