Bash Script to Convert XML to CSV Not Fully Dynamic and Outputs Empty Content

I have a Bash script that is supposed to convert an XML file into CSV format. However, the script is not fully universal, and when I check the content of the output file, it appears to be empty.

Expected Output

The script should generate two CSV files (faculty.csv and students.csv) with the following headers and values:

faculty.csv:

name,building,room,capacity
"Faculty of Art","B59","R1000","240"
"Faculty of Art","B59","R1001","360"
"Faculty of Art","B59","R1002","210"
...

students.csv:

student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,name,building,room,exam_mark,coursework1,coursework2,coursework3
"Michael Jennings","273644","MJ1Y24@UoNEHT.ac.uk","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
...

The Script

#!/bin/bash

# Check for input file
if [[ -z "$1" ]]; then
    echo "Usage: $0 <input.xml> [output.csv]"
    exit 1
fi

input_xml="$1"
output_csv="${2:-${input_xml%.xml}.csv}"

# Extract unique tags (excluding XML declaration and root tags)
tags=$(grep -oP '<\K[^/>]+(?=>[^<]+</[^>]+>)' "$input_xml" | grep -vP '^(faculties|students|\?xml)' | sort | uniq | tr '\n' ',' | sed 's/,$//')

# Create CSV header
echo "$tags" > "$output_csv"

# Process data records
awk -v tags="$tags" '
    BEGIN {
        FS="[<>]"
        OFS=","
        split(tags, header, /,/)
        gsub(/ /, "", header[1])  # Remove any accidental spaces
    }
    /<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/ {
        if ($2 in header) values[$2] = $3
        if ($0 ~ /<\/faculty>|<\/student>/) {
            for (i=1; i<=length(header); i++) {
                printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
            }
            delete values
        }
    }
' "$input_xml" >> "$output_csv"

echo "CSV generated: $output_csv"

XML File Structure

faculty.xml:

<?xml version="1.0" ?>
<faculties>
  <faculty>
    <faculty>Faculty of Art</faculty>
    <building>B59</building>
    <room>R1000</room>
    <capacity>240</capacity>
  </faculty>
  ...
</faculties>

students.xml:

<?xml version="1.0" ?>
<students>
  <student>
    <student_name>Michael Jennings</student_name>
    <student_id>273644</student_id>
    <student_email>MJ1Y24@UoNEHT.ac.uk</student_email>
    <programme>Fine Art</programme>
    <year>4</year>
    <address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
    <contact>3976680297</contact>
    <module_id>FINE4401</module_id>
    <module_name>Theoretical Sketching</module_name>
    <module_leader>Maria Tyler</module_leader>
    <lecturer1/>
    <lecturer2/>
    <faculty>Faculty of Art</faculty>
    <building>B81</building>
    <room>R1002</room>
    <exam_mark/>
    <coursework1>33</coursework1>
    <coursework2>7</coursework2>
    <coursework3>26</coursework3>
  </student>
  ...
</students>

Problem

The script is not fully dynamic. It should handle different XML structures without hardcoding tags.
The output CSV files are empty or not in the expected format.

Additional Information

I am trying to avoid using external tools like xmllint. The script should ideally be universal, but a semi-generic solution would also be acceptable.

Solution

It's not possible to create a "universal" solution without external tools such as xmllint, and you'd have to define your requirements for "semi-generic" to be able to get a solution that can handle whatever subset of XML you find acceptable. A fixed hierarchy of non-overlapping, non-nested tags without comments or strings containing tag-like substrings such as you show would be possible to parse with an awk script or similar, but that's a long way from the "universal" case you want and may not cover the "semi-generic" cases you have in mind.

Having said that, try this using GNU awk for various extensions. It's far from a "universal" solution for parsing XML but it might be adequate for your needs if all of your XML files look like the 2 examples you provided:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN { OFS="," }
    match($0, /^\s*<([^<>]+)(\/|>(.*)<\/[^<>]+)>\s*$/, a) {
        # The above regexp matches either of these formats of lines:
        #    <tag>value</tag>
        #    <tag/>
        numFlds++
        tags[numFlds] = a[1]
        vals[numFlds] = a[3]
    }
    /^\s*<\/[^<>]+>\s*$/ {
        # The above regexp matches this format of lines:
        #    </tag>
        if ( !doneHdr++ ) {
            prtCsv(tags)
        }
        prtCsv(vals)
        numFlds = 0
    }
    function prtCsv(arr,    i, out) {
        for ( i=1; i<=numFlds; i++ ) {
            out = arr[i]
            gsub(/"/, "\"\"", out)
            printf "\"%s\"%s", out, (i<numFlds ? OFS : ORS)
        }
    }
' "${@:--}"

$ ./tst.sh faculty.xml
"faculty","building","room","capacity"
"Faculty of Art","B59","R1000","240"

$ ./tst.sh students.xml
"student_name","student_id","student_email","programme","year","address","contact","module_id","module_name","module_leader","lecturer1","lecturer2","faculty","building","room","exam_mark","coursework1","coursework2","coursework3"
"Michael Jennings","273644","MJ1Y24@UoNEHT.ac.uk","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"