
Word frequency elegantly in PowerShell


Donald Knuth was once asked to write a literate program that computes the word frequencies of a file. The task:

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

Doug McIlroy famously rewrote the 10 pages of Pascal in a few lines of sh:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
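
To make each stage easier to follow, here is a minimal annotated sketch of the same pipeline. The function name wordfreq and the sample sentence are my own illustrative assumptions, not part of McIlroy's script; the only other change is that ${1} is quoted.

```shell
# wordfreq N: print the N most frequent words read from standard input.
# (wordfreq is a hypothetical wrapper name used only for this sketch.)
wordfreq() {
  tr -cs A-Za-z '\n' |   # replace each run of non-letters with a single newline
  tr A-Z a-z |           # fold everything to lower case
  sort |                 # bring identical words next to each other
  uniq -c |              # collapse duplicates, prefixing each word with its count
  sort -rn |             # order by count, highest first
  sed "${1}q"            # print the first N lines, then quit
}

# Made-up sample input: "the" occurs three times, "and" twice.
printf 'the cat and the dog and the bird\n' | wordfreq 2
```

The width of the count column printed by uniq -c varies between implementations, so only the count/word pairs should be relied on, not the exact spacing.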

As a little exercise, I converted this to PowerShell:

(-split ((Get-Content -Raw test.txt).ToLower() -replace '[^a-zA-Z]',' ')) |
  Group-Object |
  Sort-Object -Property count -Descending |
  Select-Object -First $Args[0] |
  Format-Table count, name

I like that PowerShell combines sort | uniq -c into a single Group-Object.

The first line looks ugly, so I wonder whether it can be written more elegantly. Maybe there is a way to load the file with a regex delimiter?

One obvious way to shorten the code would be to use aliases, but that would not help readability.


Solution

  • Thanks to js2010 and LotPings for the important hints. For the record, this is probably the best solution:

    $Input -split '\W+' |
      Group-Object -NoElement |
      Sort-Object count -Descending |
      Select-Object -First $Args[0]
    

    Things I learned: