Here is an example text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.
I need to process this text using Awk or maybe Perl so that
Rule 1: A single-letter word at the end of a line is moved to the next line, provided that line is not the last line of a paragraph.
Rule 2: Otherwise (the single-letter word ends the last line of a paragraph), it is moved to a new line together with the nearest preceding word of at least two letters.
Rule 3: Three hyphens at the beginning of a line that is not the first line of a paragraph are treated the same way as the single-letter words in Rule 2.
That is, the text above should be re-formatted as follows:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.
I understand that probably nobody will spend their time writing the whole script for me, but I need at least some entry points to start from, maybe a solution that is 50% or 25% workable.
To ease the task, we can assume the paragraphs are separated using a blank line instead of first-line indent:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do a
eiusmod tempor incididunt ut labore et dolore magna aliqua. Do b c
quis autem vel eum iure reprehenderit qui in ea voluptate velit esse
---minim a b veniam, quis nostrud exercitation ullamco laboris d.
With Perl, you can read the data one paragraph at a time. For your sample text, I use a newline followed by the first-line indent as the record separator ($/), so that each record is one paragraph. A few s/pattern/replacement/ substitutions applied to each paragraph then implement your rules:
perl -C -lpe '
BEGIN{ $/="\n " } # record separator: a newline followed by the paragraph indent
# rule 1: single-letter words at the end of a line move to the start of the next line
s/ (\w(?: \w)*)\r?\n/\n$1 /g;
# rule 2: single-letter words ending a paragraph (before `.`, `?` or `!`) move to a new line
# together with the preceding word of two or more characters; trailing whitespace is dropped
s/ (\w{2,}(?: \w)+[?.!])\s*$/\n$1/;
# rule 3: the word before a line starting with `---` joins that line
s/ (\w+)\r?\n(?=---)/\n$1/g;
s/^ */ / # restore the leading indent that the record separator consumed
' file
For your sample text, this yields:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, satoru do
a eiusmod tempor incididunt ut labore et dolore magna aliqua. Do
b c quis autem vel eum iure reprehenderit qui in ea voluptate velit
esse---minim a b veniam, quis nostrud exercitation ullamco
laboris d.