
How do I limit (or truncate) a text file by number of lines?


I would like to use a terminal/shell to truncate or otherwise limit a text file to a certain number of lines.

I have a whole directory of text files, for each of which only the first ~50k lines are useful.

How do I delete all lines over 50000?


Solution

  • In-place truncation

    To truncate the file in-place with sed, you can do the following:

    sed -i '50001,$ d' filename
    

    You can make a backup of the file by adding an extension argument to -i, for example, .backup or .bak:

    sed -i.backup '50001,$ d' filename
    

    On macOS (OS X) or FreeBSD, -i requires an argument, so to truncate in place without making a backup, pass an empty string:

    sed -i '' '50001,$ d' filename
    

    The long option name (GNU sed only) is as follows, without and with the backup argument:

    sed --in-place '50001,$ d' filename
    sed --in-place=.backup '50001,$ d' filename
    

    New File

    To create a new truncated file, just redirect from head to the new file:

    head -n50000 oldfilename > newfilename
    

    Unfortunately, you cannot redirect into the same file: the shell empties the output file before head gets to read it. That is why sed is recommended for in-place truncation.
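If a copy-then-swap is acceptable, the head approach can still end up replacing the original. Here is a minimal sketch in Python, not a definitive implementation; truncate_via_copy is a made-up name, and the code assumes the directory is writable and that losing the original file's permissions and ownership is fine.

```python
import os
import tempfile
from itertools import islice

def truncate_via_copy(filename, lines):
    """Keep only the first `lines` lines by writing a copy and swapping it in."""
    directory = os.path.dirname(os.path.abspath(filename))
    # Create the temporary file next to the original so that os.replace
    # stays on one filesystem. Note: mkstemp uses restrictive permissions,
    # so the replaced file does not keep the original's mode.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as dst, open(filename, 'rb') as src:
            # Binary mode copies the bytes of each line unchanged.
            dst.writelines(islice(src, lines))
        # Swap the truncated copy into place.
        os.replace(tmp_path, filename)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        os.unlink(tmp_path)
        raise
```

Unlike the sed one-liner, this needs enough free space for the truncated copy while both files exist.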

    No sed? Try Python!

    This is a bit more typing than sed. Sed is short for "Stream Editor", after all, which is another reason to use it: this is exactly the job the tool is suited for.

    This was tested on Linux and Windows with Python 3:

    from collections import deque
    from itertools import islice
    
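    def truncate(filename, lines):
        # 'r+' opens for reading and writing without truncating on open.
        with open(filename, 'r+') as f:
            # A bound extend on a zero-length deque discards everything fed to it.
            blackhole = deque((), 0).extend
            # Call f.readline until it returns '' (EOF); unlike `for line in f`,
            # this keeps f.tell() usable.
            file_iterator = iter(f.readline, '')
            # Read and discard the first `lines` lines.
            blackhole(islice(file_iterator, lines))
            # Cut the file off at the current position.
            f.truncate(f.tell())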
    

    To explain the Python:

    The blackhole works like /dev/null. It's a bound extend method on a deque with maxlen=0, which is the fastest way to exhaust an iterator in Python (that I'm aware of).
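To see the blackhole in isolation, here is a small sketch on a plain range instead of a file. The names consume, numbers and remaining are made up for the illustration.

```python
from collections import deque
from itertools import islice

# extend on a deque with maxlen=0 iterates its argument and keeps nothing.
consume = deque((), 0).extend

numbers = iter(range(10))
# Advance the iterator past its first four items (0, 1, 2, 3).
consume(islice(numbers, 4))

# Whatever is left starts at 4.
remaining = list(numbers)
print(remaining)  # [4, 5, 6, 7, 8, 9]
```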

    We can't simply loop over the file object, because iterating over a text file (with for or next) disables its tell method. Calling readline through iter(f.readline, '') avoids that.
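The difference can be checked with a throwaway file. This is only a sketch on Python 3 text-mode files; the temporary file and the names blocked, first and position are made up for the demonstration.

```python
import os
import tempfile

# Create a small scratch file to experiment on.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, 'w') as f:
    f.write('a\nb\nc\n')

# Advancing the file object with next() disables tell().
with open(path) as f:
    next(f)
    try:
        f.tell()
        blocked = False
    except OSError:
        blocked = True

# Going through readline instead leaves tell() usable.
with open(path) as f:
    first = next(iter(f.readline, ''))
    position = f.tell()

os.remove(path)
print(blocked, repr(first), position)
```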

    The function uses a context manager (the with block), which is arguably superfluous here, since CPython would close the file when the function returns anyway. Usage is simply:

    >>> truncate('filename', 50000)
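Since the question is about a whole directory, the same idea can be wrapped in a loop. The following is a rough sketch under some assumptions: truncate_directory and the '*.txt' pattern are made up for illustration, and truncate here is a compact binary-mode variant of the function above, not the original.

```python
from itertools import islice
from pathlib import Path

def truncate(filename, lines):
    """Binary-mode variant: keep only the first `lines` lines, in place."""
    with open(filename, 'rb+') as f:
        # Read and discard up to `lines` lines via readline so tell() stays accurate.
        for _ in islice(iter(f.readline, b''), lines):
            pass
        f.truncate(f.tell())

def truncate_directory(directory, lines, pattern='*.txt'):
    """Truncate every regular file in `directory` matching `pattern`."""
    truncated = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            truncate(path, lines)
            truncated.append(path.name)
    return truncated
```

Files shorter than the limit are left unchanged, because readline reaches the end of the file before the limit and the truncation point is then the existing end. A call might look like truncate_directory('some_directory', 50000), where the directory name is a placeholder.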