I have an rtf file that, ultimately, I want to convert into a chunked html, splitting on the level 1 headings.
My first step is to convert the rtf to one html file, which is straightforward with:
pandoc -f rtf -t html -o inputfile.html inputfile.rtf
The resulting html file has headings marked up with <strong></strong> rather than <h1></h1>, so I have to edit the file in a text editor to change them all. Here is a sample from the file:
<p><strong>George Stewart</strong></p>
<p>Title: George Stewart</p>
<p>Type: Task</p>
<p>Date:1734</p>
<p>Description: Christening</p>
<p>Status: +open</p>
<p>Repository: LDS Library</p>
<p>Last action: 8 May 2024</p>
<p><strong>Ann Hill</strong></p>
<p>Title: Ann Hill</p>
<p>Type: Task</p>
<p>Date: 1799</p>
<p>Description: Family</p>
<p>Status: +ToDo</p>
<p>Repository: LDS Library</p>
which has to be edited to:
<p><h1>George Stewart</h1></p>
<p>Title: George Stewart</p>
<p>Type: Task</p>
<p>Date:1734</p>
<p>Description: Christening</p>
<p>Status: +open</p>
<p>Repository: LDS Library</p>
<p>Last action: 8 May 2024</p>
<p><h1>Ann Hill</h1></p>
<p>Title: Ann Hill</p>
<p>Type: Task</p>
<p>Date: 1799</p>
<p>Description: Family</p>
<p>Status: +ToDo</p>
<p>Repository: LDS Library</p>
Then I can run the next step, which is to chunk the html into many files, splitting at the h1 level, with another Pandoc command:
pandoc -t chunkedhtml --split-level=1 -o RN_File inputfile.html
I would like to be able to do that heading conversion inline as part of the Pandoc command. It may be possible with a filter (json/lua?) but I cannot work out the syntax.
Ideally, I would also like to merge the two Pandoc steps, but do not know if this is possible. It seems there might be a method of doing this with a pipe function, but perhaps someone could confirm with an example.
The Pandoc Lua filters guide suggests I need a code block like:
function Strong(elem)
return pandoc.SmallCaps(elem.content)
end
but I need to capture <p><strong> and replace it with <h1>. The following does not work, but may give a clue of what I am trying to achieve:
function Para+Strong(elem)
return pandoc.Header(1)
end
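For reference, a minimal sketch of the filter that attempt is reaching for. Pandoc Lua filters match on one element type at a time, so the function is named Para and inspects its own content rather than combining Para and Strong in the name. It assumes the RTF reader yields each heading as a paragraph whose only child is a Strong element, which has not been checked against the real file. The file name strong-to-h1.lua is a placeholder. The shell snippet writes the filter and, if pandoc and the input file are present, runs the conversion and the chunking as one command.

```shell
#!/bin/sh
# Write a Lua filter (hypothetical file name) that promotes a paragraph
# containing nothing but bold text to a level 1 header.
cat > strong-to-h1.lua <<'EOF'
function Para(el)
  -- Match only paragraphs whose single inline element is Strong.
  if #el.content == 1 and el.content[1].t == "Strong" then
    return pandoc.Header(1, el.content[1].content)
  end
  -- Returning nothing leaves every other paragraph unchanged.
end
EOF

# Assumption: inputfile.rtf and RN_File are the names used in the question.
# The guard skips the conversion when pandoc or the input file is missing.
if command -v pandoc >/dev/null 2>&1 && [ -f inputfile.rtf ]; then
  pandoc -f rtf -t chunkedhtml --split-level=1 \
    --lua-filter=strong-to-h1.lua -o RN_File inputfile.rtf
fi
```

Because the filter works on pandoc's internal document rather than on HTML text, no intermediate html file is needed with this approach.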
You could use sed on inputfile.html between the two pandoc commands.
#!/bin/bash
pandoc -f rtf -t html -o inputfile.html inputfile.rtf
cat inputfile.html | sed 's/<p><strong>\(.*\)<\/strong><\/p>/<h1>\1<\/h1>/g' > inputfile-fixed.html && rm inputfile.html
pandoc -t chunkedhtml --split-level=1 -o RN_File inputfile-fixed.html
Save as: fix_heading.sh
Change mode executable: chmod +x fix_heading.sh
Usage: ./fix_heading.sh
I used cat as a precaution, so the original html is only removed once the fixed copy has been written. If you want to edit the file in place instead, replace the cat line with:
sed -i 's/<p><strong>\(.*\)<\/strong><\/p>/<h1>\1<\/h1>/g' inputfile.html
That eliminates the need for the intermediate file. The final pandoc command must then read inputfile.html instead of inputfile-fixed.html.
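The two pandoc steps can also be joined with a pipe, so that no html file is written to disk at all. This is a sketch under the same file-name assumptions as above. The helper name promote_headings is made up for illustration. The --wrap=none option is added so that pandoc keeps each paragraph on one line, which the line-based sed relies on.

```shell
#!/bin/sh
# Rewrite <p><strong>...</strong></p> lines as <h1>...</h1>, reading stdin.
# [^<]* stops at the next tag, so a paragraph that only contains some bold
# text among other content is left alone.
promote_headings() {
  sed 's|<p><strong>\([^<]*\)</strong></p>|<h1>\1</h1>|'
}

# Assumption: inputfile.rtf and RN_File are the names used in the question.
# The guard skips the conversion when pandoc or the input file is missing.
if command -v pandoc >/dev/null 2>&1 && [ -f inputfile.rtf ]; then
  pandoc -f rtf -t html --wrap=none inputfile.rtf \
    | promote_headings \
    | pandoc -f html -t chunkedhtml --split-level=1 -o RN_File
fi
```

The second pandoc call needs -f html because it reads from standard input and has no file extension to infer the format from. The tighter [^<]* pattern is a trade-off: a heading that carries nested markup inside the bold text would not be converted.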