bash shell paragraph

Paragraph counter


I have a text file, and I wrote a script for Linux that counts its characters (including spaces), lines, and words. I also need it to count the number of paragraphs, but I don't know how. If anyone could help, I would really appreciate it.

This is my script:

#!/bin/bash

is_text_file() {
  if [[ $(grep -c -P '[\x01-\x7F]' "$1") -gt 0 ]]; then
    return 0
  else
    return 1
  fi
}

if [ $# -lt 2 ]; then
  echo "Usage: $0 -file FILE_PATH [-occurrences NUMBER]"
  exit 1
fi

while [ "$#" -gt 0 ]; do
  case "$1" in
    -file)
      file="$2"
      shift 2
      ;;
    -occurrences)
      occurrences="$2"
      shift 2
      ;;
    *)
      echo "Invalid flag: $1"
      exit 1
      ;;
  esac
done

if ! is_text_file "$file"; then
  echo "The specified file is not a text file."
  exit 1
fi

word_count=$(cat "$file" | tr -s '[:space:]' '\n' | wc -w)
line_count=$(cat "$file" | grep -c '^')
character_count=$(cat "$file" | wc -c)
paragraph_count=$(awk 'BEGIN { RS = "" } { print NF }' "$file")

echo "Word count: $word_count"
echo "Line count: $line_count"
echo "Character count: $character_count"
echo "Paragraph count: $paragraph_count"

if [ -n "$occurrences" ]; then
  echo "Most frequent words:"
  cat "$file" | tr -s '[:space:]' '\n' | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n "$occurrences"
fi

This is my text file:

This is a sample text file.  
It contains some words and paragraphs.

Feel free to add more, content for testing.

It should say that there are 2 paragraphs, but it says 1.


Solution

  • It appears you may have DOS line endings (\r\n). These interfere with awk's paragraph mode, which is the easiest way to solve your issue.
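
    Before changing the script, it may be worth confirming that diagnosis. A minimal sketch, assuming a POSIX shell and grep; the helper name has_crlf is hypothetical and not part of your script:

    ```shell
    # Hypothetical helper: succeeds if any line of the file contains a carriage return.
    has_crlf() {
      cr=$(printf '\r')        # a literal CR, built portably
      grep -q "$cr" "$1"       # exit status 0 when a CR is found
    }

    # Demo on two throwaway files (paths come from mktemp).
    dos=$(mktemp) && unix=$(mktemp)
    printf 'one\r\ntwo\r\n' > "$dos"
    printf 'one\ntwo\n' > "$unix"
    has_crlf "$dos"  && echo "DOS line endings found"
    has_crlf "$unix" || echo "no carriage returns"
    rm -f "$dos" "$unix"
    ```

    On your own file you would call has_crlf "$file". The file utility usually gives the same hint by reporting "with CRLF line terminators".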

    Given:

    $ printf "This is a sample text file.\r\nIt contains some words and paragraphs.\r\n\r\nFeel free to add more, content for testing.">file
    

    You can use GNU awk, or another recent awk that accepts a regular expression (more than one character) as the record separator:

    $ awk 'BEGIN{RS="\r?\n(\r?\n)+"} END{print NR}' file
    

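    (Note: Setting RS="\r?\n(\r?\n)+" is not the same as RS="". In paragraph mode, awk also ignores leading blank lines in the input and always treats a newline as a field separator; a regex RS gets neither of those special cases.)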
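
    That difference is easiest to see on a file that starts with blank lines. A rough sketch, assuming an awk that accepts a regex RS (gawk does); the function names are hypothetical:

    ```shell
    # Hypothetical helpers for comparing the two record-separator settings.
    count_paragraph_mode() {
      # RS="" is POSIX paragraph mode: leading blank lines are ignored.
      awk -v RS= 'END { print NR }' "$1"
    }

    count_regex_rs() {
      # A regex RS needs gawk or another awk that accepts a multi-character RS.
      awk 'BEGIN { RS = "\r?\n(\r?\n)+" } END { print NR }' "$1"
    }

    # Demo file: two blank lines, then two paragraphs.
    sample=$(mktemp)
    printf '\n\nfirst paragraph\n\nsecond paragraph\n' > "$sample"
    echo "paragraph mode: $(count_paragraph_mode "$sample")"
    echo "regex RS:       $(count_regex_rs "$sample")"
    rm -f "$sample"
    ```

    Paragraph mode reports 2 here. With gawk I would expect the regex version to report 3, because the leading blank lines produce an empty first record, but other awk implementations may behave differently.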

    If you want to use classic Paragraph Mode in awk (with RS=""), strip the \r from the line endings first. The simplest way is to pipe the file through sed or another awk (the sed form assumes a sed that understands \r, such as GNU sed):

    $ sed 's/\r$//' file | awk -v RS= 'END{print NR}'
    $ awk '{sub(/\r$/,"")}1' file | awk -v RS= 'END{print NR}'
    

    Or Ruby:

    $ ruby -e 'puts $<.read.split(/\R\R+/).length' file
    

    Any of those should print 2 regardless of DOS or Unix line endings.

    If you want total paragraphs, words, and punctuation marks, I would do that in Ruby:

    ruby -e '
    inp=$<.read
    para=inp.split(/\R\R+/)
    words=para.map{|p| p.split}
    puts "Paragraphs=#{para.length}"
    puts "words=#{words.map(&:length).sum}"
    puts "Punctuation marks=#{inp.scan(/[[:punct:]]/).length}"
    ' file
    

    Or in awk:

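    awk '
    BEGIN{RS="\r?\n(\r?\n)+"}   # regex RS: one or more blank lines, CRLF or LF
    {
        p+=gsub(/[[:punct:]]/, "&")   # gsub returns the number of matches
        w+=split($0, words)           # split returns the number of fields
    }
    END{printf("Paragraphs=%s\nwords=%s\nPunctuation marks=%s\n", NR, w, p)}' file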
    

    Either prints:

    Paragraphs=2
    words=20
    Punctuation marks=4
    

    Either skeleton is easily modified to count whatever else you want to count in the text.
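
To tie this back to the script in the question, one possible way to compute paragraph_count is sketched below. It reuses the strip-the-\r-then-paragraph-mode pipeline shown earlier in this answer; the function name count_paragraphs is hypothetical, and the sample file is created only for the demo.

```shell
# Hypothetical helper: count blank-line-separated paragraphs in a file,
# tolerating both DOS (\r\n) and Unix (\n) line endings.
count_paragraphs() {
  awk '{ sub(/\r$/, "") } 1' "$1" |   # strip a trailing CR from every line
    awk -v RS= 'END { print NR }'     # paragraph mode: NR is the paragraph count
}

# Demo with the sample text from the question, written with DOS line endings.
sample=$(mktemp)
printf 'This is a sample text file.\r\nIt contains some words and paragraphs.\r\n\r\nFeel free to add more, content for testing.' > "$sample"
paragraph_count=$(count_paragraphs "$sample")
echo "Paragraph count: $paragraph_count"
rm -f "$sample"
```

This should print "Paragraph count: 2". In the question's script, the assignment would then read paragraph_count=$(count_paragraphs "$file"), which also avoids printing NF once per paragraph as the original awk call does.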