xmldatabasebashshellcsv

Bash Script to Convert XML to CSV Not Fully Dynamic and Outputs Empty Content


I have a Bash script that is supposed to convert an XML file into CSV format. However, the script is not fully universal, and when I check the content of the output file, it appears to be empty.

Expected Output

The script should generate two CSV files (faculty.csv and students.csv) with the following headers and values:

faculty.csv:

name,building,room,capacity
"Faculty of Art","B59","R1000","240"
"Faculty of Art","B59","R1001","360"
"Faculty of Art","B59","R1002","210"
...

students.csv:

student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,name,building,room,exam_mark,coursework1,coursework2,coursework3
"Michael Jennings","273644","MJ1Y24@UoNEHT.ac.uk","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
...

The Script

#!/bin/bash

# Check for input file
if [[ -z "$1" ]]; then
    echo "Usage: $0 <input.xml> [output.csv]"
    exit 1
fi

input_xml="$1"
output_csv="${2:-${input_xml%.xml}.csv}"

# Extract unique tags (excluding XML declaration and root tags)
tags=$(grep -oP '<\K[^/>]+(?=>[^<]+</[^>]+>)' "$input_xml" | grep -vP '^(faculties|students|\?xml)' | sort | uniq | tr '\n' ',' | sed 's/,$//')

# Create CSV header
echo "$tags" > "$output_csv"

# Process data records
awk -v tags="$tags" '
    BEGIN {
        FS="[<>]"
        OFS=","
        split(tags, header, /,/)
        gsub(/ /, "", header[1])  # Remove any accidental spaces
    }
    /<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/ {
        if ($2 in header) values[$2] = $3
        if ($0 ~ /<\/faculty>|<\/student>/) {
            for (i=1; i<=length(header); i++) {
                printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
            }
            delete values
        }
    }
' "$input_xml" >> "$output_csv"

echo "CSV generated: $output_csv"

XML File Structure

faculty.xml:

<?xml version="1.0" ?>
<faculties>
  <faculty>
    <faculty>Faculty of Art</faculty>
    <building>B59</building>
    <room>R1000</room>
    <capacity>240</capacity>
  </faculty>
  ...
</faculties>

students.xml:

<?xml version="1.0" ?>
<students>
  <student>
    <student_name>Michael Jennings</student_name>
    <student_id>273644</student_id>
    <student_email>MJ1Y24@UoNEHT.ac.uk</student_email>
    <programme>Fine Art</programme>
    <year>4</year>
    <address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
    <contact>3976680297</contact>
    <module_id>FINE4401</module_id>
    <module_name>Theoretical Sketching</module_name>
    <module_leader>Maria Tyler</module_leader>
    <lecturer1/>
    <lecturer2/>
    <faculty>Faculty of Art</faculty>
    <building>B81</building>
    <room>R1002</room>
    <exam_mark/>
    <coursework1>33</coursework1>
    <coursework2>7</coursework2>
    <coursework3>26</coursework3>
  </student>
  ...
</students>

Problem

  1. The script is not fully dynamic. It should handle different XML structures without hardcoding tags.
  2. The output CSV files are empty or not in the expected format.

Additional Information

I am trying to avoid using external tools like xmllint. The script should ideally be universal, but a semi-generic solution would also be acceptable.


Solution

  • It's not possible to create a "universal" solution without external tools such as xmllint, and you'd have to define your requirements for "semi-generic" to be able to get a solution that can handle whatever subset of XML you find acceptable. A fixed hierarchy of non-overlapping, non-nested tags without comments or strings containing tag-like substrings such as you show would be possible to parse with an awk script or similar, but that's a long way from the "universal" case you want and may not cover the "semi-generic" cases you have in mind.

    Having said that, try this using GNU awk for various extensions. It's far from a "universal" solution for parsing XML but it might be adequate for your needs if all of your XML files look like the 2 examples you provided:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk '
        BEGIN { OFS="," }
        match($0, /^\s*<([^<>]+)(\/|>(.*)<\/[^<>]+)>\s*$/, a) {
            # The above regexp matches either of these formats of lines:
            #    <tag>value</tag>
            #    <tag/>
            numFlds++
            tags[numFlds] = a[1]
            vals[numFlds] = a[3]
        }
        /^\s*<\/[^<>]+>\s*$/ {
            # The above regexp matches this format of lines:
            #    </tag>
            if ( !doneHdr++ ) {
                prtCsv(tags)
            }
            prtCsv(vals)
            numFlds = 0
        }
        function prtCsv(arr,    i, out) {
            for ( i=1; i<=numFlds; i++ ) {
                out = arr[i]
                gsub(/"/, "\"\"", out)
                printf "\"%s\"%s", out, (i<numFlds ? OFS : ORS)
            }
        }
    ' "${@:--}"
    
    $ ./tst.sh faculty.xml
    "faculty","building","room","capacity"
    "Faculty of Art","B59","R1000","240"
    
    $ ./tst.sh students.xml
    "student_name","student_id","student_email","programme","year","address","contact","module_id","module_name","module_leader","lecturer1","lecturer2","faculty","building","room","exam_mark","coursework1","coursework2","coursework3"
    "Michael Jennings","273644","MJ1Y24@UoNEHT.ac.uk","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"