
Split Markdown text file by regular expression that defines headings


I am trying to use a command-line program to split a larger text file into chunks.

The text file is of the format:

# Title

# 2020-01-01

Multi-line content
goes here

# 2020-01-02

Other multi-line content
goes here

Output should be these two files with the following filenames and contents:

2020-01-01.md ↓

# 2020-01-01

Multi-line content
goes here

2020-01-02.md ↓

# 2020-01-02

Other multi-line content
goes here

I can't seem to get all the criteria right.

The regex pattern to split on (separator) is simple enough, something along the lines of ^# (2020-.*)$

Either I can't set up a multi-line regex pattern that spans \n newlines and stops at the next occurrence of the separator pattern,

or I can split with csplit on the regex pattern but can't name the files after what is captured in (2020-.*).

The same goes for awk's split() or match(): I can't get either to work entirely.

I'm looking for a general solution, with the parameters being the regex patterns that define the chunk beginnings (e.g. # 2020-01-01) and endings (e.g. the next date heading # 2020-01-02 or EOF).


Solution

  • Using any awk in any shell on every Unix box:

    $ awk '/^# [0-9]/{ close(out); out=$2".md" } out!=""{print > out}' file
    
    $ head *.md
    ==> 2020-01-01.md <==
    # 2020-01-01
    
    Multi-line content
    goes here
    
    
    ==> 2020-01-02.md <==
    # 2020-01-02
    
    Other multi-line content
    goes here
    

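    If /^# [0-9]/ isn't an adequate regexp then change it to whatever you like, e.g. /^# [0-9]{4}(-[0-9]{2}){2}$/ would be more restrictive (given an awk that supports regexp intervals). FWIW though I wouldn't have used a regexp at all for this if you hadn't asked for one. I'd have used the following, which treats every line whose first field is # as a heading and, via c++, skips the first such line (the title):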

    awk '($1=="#") && (c++){ close(out); out=$2".md" } out!=""{print > out}' file
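Since the question asks for the heading pattern to be a parameter, here is a minimal sketch of one way to do that, assuming the output file name is always the second whitespace-separated field of the heading line. The function name `split_md` and the awk variable `re` are made up for illustration; they are not part of the commands above.

```shell
# split_md: hypothetical helper, not from the original answer.
# Usage: split_md 'heading-regexp' input-file
# Assumption: the chunk's file name is field 2 of the matching heading line.
# Note: awk processes escape sequences in -v values, so a backslash in the
# regexp would need to be doubled.
split_md() {
    awk -v re="$1" '
        $0 ~ re   { close(out); out = $2 ".md" }  # heading: close the previous chunk, start a new one
        out != "" { print > out }                 # skip every line before the first matching heading
    ' "$2"
}
```

Called as `split_md '^# [0-9]' file`, this should behave like the first command above. A chunk ends at the next line matching the regexp or at end of file; a separate end pattern is not handled in this sketch.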