I have a Bash script that is supposed to convert an XML file into CSV format. However, the script is not fully universal, and when I check the content of the output file, it appears to be empty.
The script should generate two CSV files (faculty.csv
and students.csv
) with the following headers and values:
faculty.csv:
name,building,room,capacity
"Faculty of Art","B59","R1000","240"
"Faculty of Art","B59","R1001","360"
"Faculty of Art","B59","R1002","210"
...
students.csv:
student_name,student_id,student_email,programme,year,address,contact,module_id,module_name,module_leader,lecturer1,lecturer2,name,building,room,exam_mark,coursework1,coursework2,coursework3
"Michael Jennings","273644","MJ1Y24@UoNEHT.ac.uk","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"
...
#!/bin/bash
# Check for input file
if [[ -z "$1" ]]; then
echo "Usage: $0 <input.xml> [output.csv]"
exit 1
fi
input_xml="$1"
output_csv="${2:-${input_xml%.xml}.csv}"
# Extract unique tags (excluding XML declaration and root tags)
tags=$(grep -oP '<\K[^/>]+(?=>[^<]+</[^>]+>)' "$input_xml" | grep -vP '^(faculties|students|\?xml)' | sort | uniq | tr '\n' ',' | sed 's/,$//')
# Create CSV header
echo "$tags" > "$output_csv"
# Process data records
awk -v tags="$tags" '
BEGIN {
FS="[<>]"
OFS=","
split(tags, header, /,/)
gsub(/ /, "", header[1]) # Remove any accidental spaces
}
/<faculty>/,/<\/faculty>/ || /<student>/,/<\/student>/ {
if ($2 in header) values[$2] = $3
if ($0 ~ /<\/faculty>|<\/student>/) {
for (i=1; i<=length(header); i++) {
printf "%s%s", (header[i] in values ? values[header[i]] : ""), (i<length(header) ? OFS : ORS)
}
delete values
}
}
' "$input_xml" >> "$output_csv"
echo "CSV generated: $output_csv"
faculty.xml:
<?xml version="1.0" ?>
<faculties>
<faculty>
<faculty>Faculty of Art</faculty>
<building>B59</building>
<room>R1000</room>
<capacity>240</capacity>
</faculty>
...
</faculties>
students.xml:
<?xml version="1.0" ?>
<students>
<student>
<student_name>Michael Jennings</student_name>
<student_id>273644</student_id>
<student_email>MJ1Y24@UoNEHT.ac.uk</student_email>
<programme>Fine Art</programme>
<year>4</year>
<address>08832 Kim Groves Suite 335 North Mathewview, RI 43333</address>
<contact>3976680297</contact>
<module_id>FINE4401</module_id>
<module_name>Theoretical Sketching</module_name>
<module_leader>Maria Tyler</module_leader>
<lecturer1/>
<lecturer2/>
<faculty>Faculty of Art</faculty>
<building>B81</building>
<room>R1002</room>
<exam_mark/>
<coursework1>33</coursework1>
<coursework2>7</coursework2>
<coursework3>26</coursework3>
</student>
...
</students>
I am trying to avoid using external tools like xmllint
. The script should ideally be universal, but a semi-generic solution would also be acceptable.
It's not possible to create a "universal" solution without external tools such as xmllint
, and you'd have to define your requirements for "semi-generic" to be able to get a solution that can handle whatever subset of XML you find acceptable. A fixed hierarchy of non-overlapping, non-nested tags without comments or strings containing tag-like substrings such as you show would be possible to parse with an awk script or similar, but that's a long way from the "universal" case you want and may not cover the "semi-generic" cases you have in mind.
Having said that, try this using GNU awk for various extensions. It's far from a "universal" solution for parsing XML but it might be adequate for your needs if all of your XML files look like the 2 examples you provided:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { OFS="," }
match($0, /^\s*<([^<>]+)(\/|>(.*)<\/[^<>]+)>\s*$/, a) {
# The above regexp matches either of these formats of lines:
# <tag>value</tag>
# <tag/>
numFlds++
tags[numFlds] = a[1]
vals[numFlds] = a[3]
}
/^\s*<\/[^<>]+>\s*$/ {
# The above regexp matches this format of lines:
# </tag>
if ( !doneHdr++ ) {
prtCsv(tags)
}
prtCsv(vals)
numFlds = 0
}
function prtCsv(arr, i, out) {
for ( i=1; i<=numFlds; i++ ) {
out = arr[i]
gsub(/"/, "\"\"", out)
printf "\"%s\"%s", out, (i<numFlds ? OFS : ORS)
}
}
' "${@:--}"
$ ./tst.sh faculty.xml
"faculty","building","room","capacity"
"Faculty of Art","B59","R1000","240"
$ ./tst.sh students.xml
"student_name","student_id","student_email","programme","year","address","contact","module_id","module_name","module_leader","lecturer1","lecturer2","faculty","building","room","exam_mark","coursework1","coursework2","coursework3"
"Michael Jennings","273644","MJ1Y24@UoNEHT.ac.uk","Fine Art","4","08832 Kim Groves Suite 335 North Mathewview, RI 43333","3976680297","FINE4401","Theoretical Sketching","Maria Tyler","","","Faculty of Art","B81","R1002","","33","7","26"