bash unix awk text-processing unix-text-processing

Split Markdown text file by regular expression that defines headings

I am trying to use a command-line program to split a large text file into chunks, with:

the split points defined by a regex pattern
the filenames defined by a capturing group in that regex pattern

The text file is of the format:

# Title

# 2020-01-01

Multi-line content
goes here

# 2020-01-02

Other multi-line content
goes here

Output should be these two files with the following filenames and contents:

2020-01-01.md ↓

# 2020-01-01

Multi-line content
goes here

2020-01-02.md ↓

# 2020-01-02

Other multi-line content
goes here

I can't seem to get all the criteria right.

The regex pattern to split on (separator) is simple enough, something along the lines of ^# (2020-.*)$

Either I can't set up a multi-line regex pattern that spans \n newlines and stops at the next occurrence of the separator pattern, or I can split with csplit on the regex pattern but can't name the files after what is captured in (2020-.*).

The same goes for awk's split() or match(): I can't get either to work entirely.

I'm looking for a general solution, with the parameters being the regex patterns that define the chunk beginnings (e.g. # 2020-01-01) and endings (e.g. the next date heading # 2020-01-02 or EOF).

Solution

Using any awk in any shell on every Unix box:

$ awk '/^# [0-9]/{ close(out); out=$2".md" } out!=""{print > out}' file

$ head *.md
==> 2020-01-01.md <==
# 2020-01-01

Multi-line content
goes here


==> 2020-01-02.md <==
# 2020-01-02

Other multi-line content
goes here
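For anyone who wants to see why that works, here is the same script spelled out with comments. It's only a sketch: it recreates the sample input from the question under the name file (so it will overwrite anything already called that) and writes the .md files into the current directory.

```shell
# Recreate the sample input from the question; the name "file" matches the command above.
cat > file <<'EOF'
# Title

# 2020-01-01

Multi-line content
goes here

# 2020-01-02

Other multi-line content
goes here
EOF

awk '
    # A heading whose text starts with a digit opens a new chunk.
    /^# [0-9]/ {
        close(out)        # release the previous output file, if any
        out = $2 ".md"    # the second field is the date, e.g. 2020-01-01
    }
    # Until the first matching heading, out is empty and lines are skipped,
    # which is how "# Title" gets dropped.
    out != "" { print > out }
' file
```

The close(out) matters once there are many headings: without it, some awks run out of open file descriptors.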

If /^# [0-9]/ isn't an adequate regexp then change it to whatever you like, e.g. /^# [0-9]{4}(-[0-9]{2}){2}$/ would be more restrictive (though the {n} interval syntax needs a POSIX awk; some older awks don't support it). FWIW, though, I wouldn't have used a regexp at all for this if you hadn't asked for one. I'd have used:

awk '($1=="#") && (c++){ close(out); out=$2".md" } out!=""{print > out}' file
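If you do want the heading pattern as a parameter, as the question asks, something like the following might do. Treat it as a sketch under assumptions: split_md is a made-up function name, the headings are assumed to start with one or more # characters, and each file is named after the whole heading text with that marker stripped, because POSIX awk has no capture groups to pull the name out of the regexp itself.

```shell
# Hypothetical helper: split_md PATTERN FILE
# Starts a new output file at every line matching PATTERN (an awk ERE).
split_md() {
    # Note: awk -v interprets backslash escapes in the value, so a pattern
    # containing backslashes would need them doubled.
    awk -v re="$1" '
        $0 ~ re {
            if (out != "") close(out)      # finish the previous chunk
            name = $0
            sub(/^#+[ \t]*/, "", name)     # drop the leading heading marker
            out = name ".md"
        }
        out != "" { print > out }          # lines before the first match are skipped
    ' "$2"
}
```

Calling split_md '^# [0-9]' file on the sample input should produce the same two files as above. A chunk ends at the next matching heading or at EOF, so no separate end pattern is needed. Headings containing slashes or trailing whitespace would need extra cleanup before being used as filenames.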