Code Golf 4th of July Edition: Counting Top Ten Occurring Words


Given the following list of presidents, do a top-ten word count in the smallest program possible:

INPUT FILE

    Washington
    Washington
    Adams
    Jefferson
    Jefferson
    Madison
    Madison
    Monroe
    Monroe
    John Quincy Adams
    Jackson
    Jackson
    Van Buren
    Harrison 
    DIES
    Tyler
    Polk
    Taylor 
    DIES
    Fillmore
    Pierce
    Buchanan
    Lincoln
    Lincoln 
    DIES
    Johnson
    Grant
    Grant
    Hayes
    Garfield 
    DIES
    Arthur
    Cleveland
    Harrison
    Cleveland
    McKinley
    McKinley
    DIES
    Teddy Roosevelt
    Teddy Roosevelt
    Taft
    Wilson
    Wilson
    Harding
    Coolidge
    Hoover
    FDR
    FDR
    FDR
    FDR
    Dies
    Truman
    Truman
    Eisenhower
    Eisenhower
    Kennedy 
    DIES
    Johnson
    Johnson
    Nixon
    Nixon 
    ABDICATES
    Ford
    Carter
    Reagan
    Reagan
    Bush
    Clinton
    Clinton
    Bush
    Bush
    Obama

To start it off, in bash, at 97 characters:

cat input.txt | tr " " "\n" | tr -d "\t " | sed 's/^$//g' | sort | uniq -c | sort -n | tail -n 10

Output:

      2 Nixon
      2 Reagan
      2 Roosevelt
      2 Truman
      2 Washington
      2 Wilson
      3 Bush
      3 Johnson
      4 FDR
      7 DIES

Break ties as you see fit! Happy fourth!

For those of you who care, more information on presidents can be found here.


Solution

  • A shorter shell version:

    xargs -n1 < input.txt | sort | uniq -c | sort -nr | head
    

    If you want case-insensitive ranking, change uniq -c into uniq -ci.
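
    One thing to check if you go that route: uniq only merges adjacent lines, so the preceding sort has to put case variants next to each other. A minimal sketch, assuming a sort that supports -f (fold case) and using three made-up words rather than the real input; sample and distinct are just helper names for this example:

    ```shell
    # Made-up sample: "DIES" and "Dies" differ only in case.
    sample() { printf '%s\n' DIES Dallas Dies; }

    # Count how many distinct entries uniq -ci reports.
    distinct() { uniq -ci | wc -l | tr -d ' '; }

    # Byte-order sort leaves "Dallas" between the two variants.
    plain=$(sample | LC_ALL=C sort | distinct)

    # sort -f folds case while sorting, so the variants end up adjacent.
    folded=$(sample | LC_ALL=C sort -f | distinct)

    echo "distinct without -f: $plain, with -f: $folded"
    ```

    In the presidents file the plain sort may well be enough; sort -f is just a cheap way to avoid depending on the locale and on which words happen to be present.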

    Slightly shorter still, if you don't mind the ranking being reversed and readability suffering from the lack of spaces, is the following, which clocks in at 46 characters (counting the trailing newline):

    xargs -n1<input.txt|sort|uniq -c|sort -n|tail
    

    (You could strip this down to 38 if you were allowed to rename the input file to simply "i" first.)

    Observing that, in this special case, no word occurs more than 9 times, we can shave off 3 more characters by dropping the '-n' option from the final sort:

    xargs -n1<input.txt|sort|uniq -c|sort|tail
    

    That takes this solution down to 43 characters without renaming the input file. (Or 35, if you do.)
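
    Dropping -n is only safe here because every count is a single digit. Whether multi-digit counts would actually break depends on how uniq pads its counts and on the locale's collation, so the sketch below uses made-up, unpadded "count word" lines purely to show the difference between text and numeric ordering:

    ```shell
    # Made-up "count word" lines with the padding removed (not real output).
    sample() { printf '%s\n' '9 DIES' '10 FDR' '2 Bush'; }

    # Plain sort compares text, so "10" sorts before "2" and "9".
    lexical_top=$(sample | LC_ALL=C sort | tail -n 1)

    # sort -n compares the leading numbers, so 10 is the largest.
    numeric_top=$(sample | LC_ALL=C sort -n | tail -n 1)

    echo "without -n: $lexical_top"
    echo "with -n:    $numeric_top"
    ```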

    Using xargs -n1 to split the file into one word per line is preferable to the tr \ \\n solution, which creates lots of blank lines. That makes the tr solution incorrect: it leaves Nixon out of the top ten and instead reports a blank string occurring 256 times, and a blank string is not a "word".
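
To see the difference between the two ways of splitting, here is a small sketch. The two sample lines are made up (with leading and trailing spaces of the sort tr trips over), and sample and count_blank are just helper names for this example:

```shell
# Made-up two-line sample with leading indentation and a trailing space.
sample() { printf '    Teddy Roosevelt\n    Nixon \n'; }

# Helper: count empty lines on stdin (prints 0 when there are none).
count_blank() { awk '/^$/ { n++ } END { print n + 0 }'; }

# tr turns every single space into a newline, so runs of spaces
# become blank lines that uniq -c would later count as a "word".
tr_blank=$(sample | tr ' ' '\n' | count_blank)

# xargs -n1 splits on any run of whitespace and echoes one word per line.
xargs_blank=$(sample | xargs -n1 | count_blank)
xargs_words=$(sample | xargs -n1 | wc -l | tr -d ' ')

echo "blank lines from tr: $tr_blank"
echo "blank lines from xargs: $xargs_blank (words: $xargs_words)"
```

Note that xargs also interprets quotes and backslashes, so this shortcut assumes the input contains plain words only, as the presidents list does.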