For various reasons, I have been tasked with importing data from a large gml file into a MySQL database. The gml files I am looking at are the ones at https://use-land-property-data.service.gov.uk/datasets/inspire/download.
Basically, all I am interested in is the coordinates of each object (polygon). I can import the coordinates, as they appear in the gml file (lat1 long1 lat2 long2...). I also need to include the maximum and minimum latitude and longitude as four separate columns (this needs to be calculated; it is not included in the file).
My attempt so far is as follows:
#!/bin/bash
gmlfile=$1
gmldir="/home/user.name/gml_files"
gml_short=$(echo $gmlfile | sed 's/\.gml//')
echo "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES" > gml_sql_query.sql
xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file
while read coordinates; do
latitude=$(echo $coordinates | tr ' ' '\n' | paste -d' ' - - | cut -d" " -f1) #odd elements
longitude=$(echo $coordinates | tr ' ' '\n' | paste -d' ' - - | cut -d" " -f2) #even elements
max_lat=$(echo "$latitude" | tr " " '\n' | sort -n | tail -1)
min_lat=$(echo "$latitude" | tr " " '\n' | sort -n | head -1)
max_long=$(echo "$longitude" | tr " " '\n' | sort -n | tail -1)
min_long=$(echo "$longitude" | tr " " '\n' | sort -n | head -1)
echo "('$gml_short','$coordinates','$max_lat','$max_long','$min_lat','$min_long')," >> gml_sql_query.sql
done < "$gmldir/$gml_short-coords.txt"
sed -i '$s/,$/;/' gml_sql_query.sql #Replace last , with ;
mysql < gml_sql_query.sql
The xmllint line and mysql import itself are both fast. The population of the "gml_sql_query.sql" file is really, really slow; around 1000 every 30 seconds. Every one of those gml files has tens of 1000s of coordinates, so this is taking a long, long time. Which parts of my script are the most inefficient, and is there any way to speed this up?
Two asides:
lat_long
and lat_process
(etc) variables that did most of the piping and just did a final pipe for the first/second column or whatever. I didn't find that had any improvement.mapfile -t lat_array <<< "$latitude"
max_lat=0
min_lat=9999999.0
for num in "${lat_array[@]}"; do
if [[ $(echo "$num > $max_lat" | bc -l) -eq 1 ]]; then
max_lat=$num
fi
if [[ $(echo "$num < $min_lat" | bc -l) -eq 1 ]]; then
min_lat=$num
fi
done
mapfile -t long_array <<< "$longitude"
max_long=0
min_long=9999999.0
for num in "${long_array[@]}"; do
if [[ $(echo "$num > $max_long" | bc -l) -eq 1 ]]; then
max_long=$num
fi
if [[ $(echo "$num < $min_long" | bc -l) -eq 1 ]]; then
min_long=$num
fi
done
EDIT: example input data:
525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339
525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9
525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65
Desired MySQL insert statement:
INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES
('Adur_District_Council','525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339','525875.301','104966.97','525773.63','104945.234'),
('Adur_District_Council','525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9','525685.085','105125.2','525661.29','105117.9'),
('Adur_District_Council','525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65','525709.949','105128.05','525684.406','105118.65');
Using any awk:
$ cat tst.sh
#!/usr/bin/env bash
gmlfile='Adur_District_Council.gml'
gml_short="${gmlfile%%.gml}"
awk -v q="'" -v gml_short="$gml_short" '
BEGIN {
OFS = q "," q
print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
ORS = ""
}
{
max_lat = min_lat = $1
max_lon = min_lon = $2
for ( i=3; i<NF; i+=2 ) {
lat = $i
lon = $(i+1)
max_lat = (lat > max_lat ? lat : max_lat)
min_lat = (lat < min_lat ? lat : min_lat)
max_lon = (lon > max_lon ? lon : max_lon)
min_lon = (lon < min_lon ? lon : min_lon)
}
print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
sep = ",\n"
}
END {
print ";\n"
}
' coords.txt
$ cat coords.txt
525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339
525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9
525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65
$ ./tst.sh
INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES
('Adur_District_Council','525873.122 104945.339 525873.13 104945.234 525872.916 104949.942 525873.308 104950.628 525874.582 104951.118 525875.268 104951.804 525875.301 104951.902 525871.972 104955.36 525847.95 104958.6 525824.152 104961.864 525821.7 104962.2 525798.7 104964.55 525778.14 104966.53 525773.63 104966.97 525778.15 104958.65 525781.863 104952.578 525783.98 104948.99 525797.06 104948.84 525819.82 104948.3 525845.38 104948.07 525856.52 104948.02 525871.5 104945.55 525871.69 104945.52 525872.5 104945.4 525873.122 104945.339','525875.301','104966.97','525773.63','104945.234'),
('Adur_District_Council','525661.75 105117.9 525685.085 105118.801 525684.99 105119.75 525684.5 105124.8 525684.47 105125.2 525681.819 105124.974 525680.35 105124.85 525671.99 105124.09 525664.24 105123.45 525661.29 105123.2 525661.75 105117.9','525685.085','105125.2','525661.29','105117.9'),
('Adur_District_Council','525685.1 105118.65 525686.35 105118.85 525698.949 105119.35 525700.4 105119.45 525701.699 105119.5 525703.15 105119.5 525709.949 105119.85 525709.29 105128.05 525707.25 105127.85 525699.55 105127.15 525697.9 105127.05 525689.8 105126.35 525685.819 105126 525684.686 105125.905 525684.406 105125.882 525684.47 105125.2 525684.5 105124.8 525684.99 105119.75 525685.1 105118.65','525709.949','105128.05','525684.406','105118.65');
So your full script would become (untested):
#!/usr/bin/env bash
gmlfile="$1"
gmldir='/home/user.name/gml_files'
gml_short="${gmlfile%%.gml}"
xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file
awk -v q="'" -v gml_short="$gml_short" '
BEGIN {
OFS = q "," q
print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
ORS = ""
}
{
max_lat = min_lat = $1
max_lon = min_lon = $2
for ( i=3; i<NF; i+=2 ) {
lat = $i
lon = $(i+1)
max_lat = (lat > max_lat ? lat : max_lat)
min_lat = (lat < min_lat ? lat : min_lat)
max_lon = (lon > max_lon ? lon : max_lon)
min_lon = (lon < min_lon ? lon : min_lon)
}
print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
sep = ",\n"
}
END {
print ";\n"
}
' "$gmldir/$gml_short-coords.txt" > gml_sql_query.sql
mysql < gml_sql_query.sql
or if you'd rather not have that apparently temporary gml_sql_query.sql
file:
$ cat tst.sh
#!/usr/bin/env bash
gmlfile="$1"
gmldir='/home/user.name/gml_files'
gml_short="${gmlfile%%.gml}"
xmllint --xpath "//*[local-name()='posList']" "$gmldir/$gmlfile" | cut -d">" -f2 | cut -d"<" -f1 > "$gmldir/$gml_short-coords.txt" #Dump all coordinates to coordinate file
mysql < <(
awk -v q="'" -v gml_short="$gml_short" '
BEGIN {
OFS = q "," q
print "INSERT IGNORE INTO gmldata.uk (region,coordinates,maxLat,maxLong,minLat,minLong) VALUES"
ORS = ""
}
{
max_lat = min_lat = $1
max_lon = min_lon = $2
for ( i=3; i<NF; i+=2 ) {
lat = $i
lon = $(i+1)
max_lat = (lat > max_lat ? lat : max_lat)
min_lat = (lat < min_lat ? lat : min_lat)
max_lon = (lon > max_lon ? lon : max_lon)
min_lon = (lon < min_lon ? lon : min_lon)
}
print sep "(" q gml_short, $0, max_lat, max_lon, min_lat, min_lon q ")"
sep = ",\n"
}
END {
print ";\n"
}
' "$gmldir/$gml_short-coords.txt"
)