I have a text file, and I wrote a script for Linux that counts its characters (including spaces), lines, and words. I also need to count the number of paragraphs, but I don't know how. If anyone could help me, I would really appreciate it.
This is my script:
#!/bin/bash
is_text_file() {
    if [[ $(grep -c -P '[\x01-\x7F]' "$1") -gt 0 ]]; then
        return 0
    else
        return 1
    fi
}
if [ $# -lt 2 ]; then
    echo "Usage: $0 -file FILE_PATH [-occurrences NUMBER]"
    exit 1
fi
while [ "$#" -gt 0 ]; do
    case "$1" in
        -file)
            file="$2"
            shift 2
            ;;
        -occurrences)
            occurrences="$2"
            shift 2
            ;;
        *)
            echo "Invalid flag: $1"
            exit 1
            ;;
    esac
done
if ! is_text_file "$file"; then
    echo "The specified file is not a text file."
    exit 1
fi
word_count=$(cat "$file" | tr -s '[:space:]' '\n' | wc -w)
line_count=$(cat "$file" | grep -c '^')
character_count=$(cat "$file" | wc -c)
paragraph_count=$(awk 'BEGIN { RS = "" } { print NF }' "$file")
echo "Word count: $word_count"
echo "Line count: $line_count"
echo "Character count: $character_count"
echo "Paragraph count: $paragraph_count"
if [ -n "$occurrences" ]; then
    echo "Most frequent words:"
    cat "$file" | tr -s '[:space:]' '\n' | tr -d '[:punct:]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n "$occurrences"
fi
This is my text file:
This is a sample text file.
It contains some words and paragraphs.

Feel free to add more, content for testing.
It should say that there are 2 paragraphs, but it says 1.
It appears you may have DOS line endings. Those interfere with using awk in paragraph mode, which is the easiest way to solve your issue.
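One way to confirm that diagnosis before changing anything is to check whether any line in the file ends in a carriage return. This is only a minimal sketch, assuming a POSIX shell and a standard grep; `has_crlf` is a name made up for illustration and is not part of your script:

```shell
# has_crlf FILE  (hypothetical helper name)
# Exit status 0 if at least one line of FILE ends in a carriage return.
has_crlf() {
    # Command substitution strips trailing newlines only, so cr holds a bare CR.
    cr=$(printf '\r')
    # Match a CR immediately before the end of a line; -q suppresses output.
    grep -q "${cr}\$" "$1"
}
```

Used as `has_crlf file && echo "DOS line endings"`, it prints the message only for files that contain CRLF line endings.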
Given:
$ printf "This is a sample text file.\r\nIt contains some words and paragraphs.\r\n\r\nFeel free to add more, content for testing.">file
You can use GNU awk, or any recent awk that accepts a multi-character (regex) record separator:
$ awk 'BEGIN{RS="\r?\n(\r?\n)+"} END{print NR}' file
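(Note: Setting RS="\r?\n(\r?\n)+" is not the same as RS="". In paragraph mode, awk ignores leading blank lines in the input and treats a newline as a field separator in addition to whatever FS is set to; a regex RS gets neither special treatment.)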
If you want to use classic paragraph mode in awk (with RS=""), first remove the \r from the line endings. The simplest way is to pipe the file through sed or a second awk:
$ sed 's/\r$//' file | awk -v RS= 'END{print NR}'
$ awk '{sub(/\r$/,"")}1' file | awk -v RS= 'END{print NR}'
Or Ruby:
$ ruby -e 'puts $<.read.split(/\R\R+/).length' file
Any of those should print 2 regardless of DOS or Unix line endings.
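To drop this into the script from the question, something like the following could replace the `paragraph_count=` line. It is a sketch rather than the definitive fix: `count_paragraphs` is a hypothetical helper name, and it swaps in `tr -d '\r'` for the sed command because, as far as I know, the `\r` escape in sed depends on GNU sed. Note also that the original line prints NF (the number of fields in each record), whereas the paragraph count is NR at the end of input.

```shell
# count_paragraphs FILE  (hypothetical helper name)
# Prints the number of blank-line-separated paragraphs in FILE,
# whether the file uses DOS (CRLF) or Unix (LF) line endings.
count_paragraphs() {
    # tr deletes every carriage return; awk in paragraph mode (RS="")
    # then sees one record per paragraph, so NR in END is the count.
    tr -d '\r' < "$1" | awk -v RS= 'END { print NR }'
}

# Example use inside the script:
# paragraph_count=$(count_paragraphs "$file")
```

A line that contains only spaces or tabs is not a blank line to awk, so it will not separate paragraphs with this approach.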
If you want total paragraphs, words, and punctuation marks, I would do that in Ruby:
ruby -e '
inp=$<.read
para=inp.split(/\R\R+/)
words=para.map{|p| p.split}
puts "Paragraphs=#{para.length}"
puts "words=#{words.map(&:length).sum}"
puts "Punctuation marks=#{inp.scan(/[[:punct:]]/).length}"
' file
Or in awk:
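awk '
BEGIN { RS = "\r?\n(\r?\n)+" }      # one record per paragraph (needs regex RS support)
{
    p += gsub(/[[:punct:]]/, "&")   # gsub returns the match count; "&" leaves the text unchanged
    split($0, words)                # split the paragraph on whitespace
    w += length(words)              # number of words in this paragraph
}
END { printf("Paragraphs=%s\nwords=%s\nPunctuation marks=%s\n", NR, w, p) }' file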
Either prints:
Paragraphs=2
words=20
Punctuation marks=4
Either skeleton is easily modified to count whatever else you want to count in the text.
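As one illustration of such a modification, here is a sketch that reports the word count of each paragraph instead of the totals. It assumes the same CR-stripping approach with classic paragraph mode, and `paragraph_word_counts` is a hypothetical name chosen for this example:

```shell
# paragraph_word_counts FILE  (hypothetical helper name)
# Prints one line per paragraph with that paragraph's word count.
paragraph_word_counts() {
    # Strip carriage returns, then let awk read one paragraph per record.
    # In paragraph mode a newline also separates fields, so NF is the
    # number of whitespace-separated words in the whole paragraph.
    tr -d '\r' < "$1" |
        awk -v RS= '{ printf "Paragraph %d: %d words\n", NR, NF }'
}
```

For the sample file above, this prints `Paragraph 1: 12 words` followed by `Paragraph 2: 8 words`.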