Paragraph counter

I have a text file, and I wrote a script for Linux that counts its characters (including spaces), lines, and words. I also have to count the number of paragraphs, but I don't know how. If anyone could help me, I would really appreciate it.

This is my script:

#!/bin/bash

is_text_file() {
  if [[ $(grep -c -P '[\x01-\x7F]' "$1") -gt 0 ]]; then
    return 0
  else
    return 1
  fi
}

if [ $# -lt 2 ]; then
  echo "Usage: $0 -file FILE_PATH [-occurrences NUMBER]"
  exit 1
fi

while [ "$#" -gt 0 ]; do
  case "$1" in
    -file)
      file="$2"
      shift 2
      ;;
    -occurrences)
      occurrences="$2"
      shift 2
      ;;
    *)
      echo "Invalid flag: $1"
      exit 1
      ;;
  esac
done

if ! is_text_file "$file"; then
  echo "The specified file is not a text file."
  exit 1
fi

word_count=$(cat "$file" | tr -s '[:space:]' '\n' | wc -w)
line_count=$(cat "$file" | grep -c '^')
character_count=$(cat "$file" | wc -c)
paragraph_count=$(awk 'BEGIN { RS = "" } { print NF }' "$file")

echo "Word count: $word_count"
echo "Line count: $line_count"
echo "Character count: $character_count"
echo "Paragraph count: $paragraph_count"

if [ -n "$occurrences" ]; then
  echo "Most frequent words:"
  cat "$file" | tr -s '[:space:]' '\n' | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n "$occurrences"
fi

This is my text file:

This is a sample text file.  
It contains some words and paragraphs.

Feel free to add more, content for testing.

It should say that there are 2 paragraphs, but it says 1.

Solution

It appears you may have DOS line endings. They interfere with awk's paragraph mode, which is the easiest way to solve your issue: a line that holds only \r is not empty, so awk never sees a paragraph break.
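Before changing the script, it may be worth confirming that the file really contains carriage returns. The following is a minimal sketch of one way to check; `has_cr` is a hypothetical helper name, and the sample file is made up for illustration.

```shell
# has_cr is a hypothetical helper name, not part of the original script.
# It succeeds (exit status 0) when the file contains at least one carriage return.
has_cr() {
  cr=$(printf '\r')   # a literal CR, built in a way that works in plain sh too
  grep -q "$cr" "$1"
}

# Illustrative sample with DOS (CRLF) line endings.
sample=$(mktemp)
printf 'one\r\ntwo\r\n\r\nthree\r\n' > "$sample"
if has_cr "$sample"; then
  echo "carriage returns found: probably DOS line endings"
fi
rm -f "$sample"
```

If the check succeeds on your real file, the approaches below should apply.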

Given:

$ printf "This is a sample text file.\r\nIt contains some words and paragraphs.\r\n\r\nFeel free to add more, content for testing.">file

You can use GNU awk or a recent awk that supports more than one character for a record separator:

$ awk 'BEGIN{RS="\r?\n(\r?\n)+"} END{print NR}' file

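(Note: Setting RS="\r?\n(\r?\n)+" is not the same as RS="". In paragraph mode, leading blank lines are ignored and a newline always separates fields; a regex RS gets neither treatment. See here and here.)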
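One visible difference between the two settings is how blank lines at the start of the input are handled. The sketch below compares them on a made-up input with Unix line endings; both function names are hypothetical, and the second one assumes an awk that accepts a regular expression as RS (GNU awk does).

```shell
# Both function names are hypothetical; each reads text on standard input.
count_paragraph_mode() { awk -v RS= 'END { print NR }'; }
count_regex_mode() { awk -v RS='\n\n+' 'END { print NR }'; }

# Input: two leading blank lines, then two paragraphs.
# Paragraph mode skips the leading blank lines, so this prints 2.
printf '\n\nfoo\n\nbar\n' | count_paragraph_mode

# With a regex RS, GNU awk also counts the empty record that precedes
# the first separator, so the result is larger than the paragraph count.
printf '\n\nfoo\n\nbar\n' | count_regex_mode
```

If your files can start with blank lines, paragraph mode is probably the safer way to count.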

If you want to use classic paragraph mode in awk (with RS=""), first strip the \r from the line endings. The simplest way is to pipe the file through sed or another awk:

$ sed 's/\r$//' file | awk -v RS= 'END{print NR}'
$ awk '{sub(/\r$/,"")}1' file | awk -v RS= 'END{print NR}'

Or Ruby:

$ ruby -e 'puts $<.read.split(/\R\R+/).length' file

Any of those should print 2 regardless of DOS or Unix line endings.
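To fold this back into the script from the question, one option is a small function that strips carriage returns and then counts records in paragraph mode. This is only a sketch: `count_paragraphs` is a hypothetical name, and it assumes that deleting every \r in the file (not just those at line ends) is acceptable.

```shell
# count_paragraphs is a hypothetical helper name.
count_paragraphs() {
  # Delete every carriage return, then let paragraph mode (RS="")
  # count the blank-line-separated records.
  tr -d '\r' < "$1" | awk -v RS= 'END { print NR }'
}

# Same sample as above, written to a temporary file.
sample=$(mktemp)
printf 'This is a sample text file.\r\nIt contains some words and paragraphs.\r\n\r\nFeel free to add more, content for testing.' > "$sample"
count_paragraphs "$sample"   # prints 2
rm -f "$sample"
```

In the original script, the assignment would then become paragraph_count=$(count_paragraphs "$file").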

If you want totals for paragraphs, words, and punctuation marks, I would do that in Ruby:

ruby -e '
inp=$<.read
para=inp.split(/\R\R+/)
words=para.map{|p| p.split}
puts "Paragraphs=#{para.length}"
puts "words=#{words.map(&:length).sum}"
puts "Punctuation marks=#{inp.scan(/[[:punct:]]/).length}"
' file

Or in awk:

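awk '
BEGIN{RS="\r?\n(\r?\n)+"}   # one or more blank lines (LF or CRLF) end a paragraph
{
    p+=gsub(/[[:punct:]]/, "&")   # gsub returns the number of replacements made
    w+=split($0, words)           # split returns the number of pieces it produced
}
END{printf("Paragraphs=%s\nwords=%s\nPunctuation marks=%s\n", NR, w, p)}' file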

Either prints:

Paragraphs=2
words=20
Punctuation marks=4

Either of those is easily modified (based on that skeleton) to count whatever you want to count in the text.
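As one illustration of such a modification, the sketch below reports the word count of each paragraph instead of a grand total. It uses paragraph mode after stripping carriage returns, so it should not need a regex RS; `words_per_paragraph` is a hypothetical name.

```shell
# words_per_paragraph is a hypothetical helper name.
words_per_paragraph() {
  # In paragraph mode a newline also separates fields,
  # so NF is the number of words in the current paragraph.
  tr -d '\r' < "$1" |
    awk -v RS= '{ printf "Paragraph %d: %d words\n", NR, NF }'
}

sample=$(mktemp)
printf 'This is a sample text file.\r\nIt contains some words and paragraphs.\r\n\r\nFeel free to add more, content for testing.' > "$sample"
words_per_paragraph "$sample"
rm -f "$sample"
```

For the sample text this prints "Paragraph 1: 12 words" and "Paragraph 2: 8 words", which matches the total of 20 above.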