I have a very large text file in which some entries are missing. The pattern is consistent: the first line of each "section" has the correct entries, and every line after this initial line is missing them. I'm trying to fill in every incomplete line with the information from the initial line until a new "initial information line" is found, and then continue with that new data.
I have built a solution in bash with the help of sed, but the process is very, very slow and takes hours to complete. I suspect the reason is that I'm reading the file line by line, processing each line in bash and writing it to a new file. My guess is that a single sed script with variables, run against the file itself (-f), could speed up the process dramatically (see the rough sketch after my script below), but I'm not an expert in these advanced uses of sed. I'm open to other suggestions or tools, too, as long as they can be called from a bash script, since this is part of an automation.
The example inputfile:
{"Initial line with more information like headers, unimportant, really only one line"
"Alpha","OldTheme","Some more text"]
"","","Another rest text"]
"","","Yet another text"]
"Yadda","NewTheme","Crazy Text"]
"","","More crazy text"]
The expected result:
"Alpha","OldTheme","Some more text"]
"Alpha","OldTheme","Another rest text"]
"Alpha","OldTheme","Yet another text"]
"Yadda","NewTheme","Crazy Text"]
"Yadda","NewTheme","More crazy text"]
And here's my working (but very slow) bash script:
#!/bin/bash
first=0
cat inputfile | \
while IFS= read -r line; do
    # skip the single header line
    if [ ${first} -eq 0 ]; then
        first=1; continue
    fi
    # everything from the first "," separator to the end of the line
    partline=$(echo "${line}" | grep -o '","\(.*\)')
    # first field without its quotes; empty when the field is missing
    newinitial=$(echo "${line}" | sed 's/",".*//; s/^"//')
    if [ ! -z "${newinitial}" ]; then
        initial=${newinitial}
    fi
    # second field; empty when the field is missing
    newtheme=$(echo "${partline}" | sed 's/^","//; s/",".*//')
    if [ ! -z "${newtheme}" ]; then
        theme=${newtheme}
    fi
    # everything from the second "," separator to the end of the line
    restline=$(echo "${partline}" | sed 's/^","//' | grep -o '","\(.*\)')
    # reassemble the line with the remembered initial and theme values
    echo "\"${initial}\",\"${theme}${restline}"
done >outputfile
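For reference, the kind of self-contained sed program I was imagining looks roughly like this. It is only a sketch for GNU sed, not something I trust yet: it assumes the header is exactly one line and that, as in the sample, the first two fields are either both filled or both empty. It keeps the prefix of the last complete line in the hold space and splices it into every incomplete line:

sed '
# delete the single header line
1d
# incomplete line: append the saved prefix from the hold space and splice it in
/^"","",/{
G
s/^"","",\(.*\)\n\(.*\)$/\2\1/
b
}
# complete line: print it unchanged, but remember its "initial","theme", prefix
h
s/^\("[^"]*","[^"]*",\).*/\1/
x
' inputfile > outputfile

The same script could of course live in a file and be run with sed -f; I just don't know whether this is correct or actually faster, which is part of what I'm asking.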
Clarification: The input is NOT well-formed CSV. The first line really does start with a { and every line ends with a ], so any library that expects pure CSV input won't work. I've updated the input data, sorry for the confusion.
This is really hacky, but then so are the input and output formats... Use with One True Awk or GNU awk 5.3+:
awk --csv -v OFS='","' \
'NR>1 {for (i=1;i<NF;i++) {a[i]=$i=$i?$i:a[i]; gsub(/"/,"\"\"",$i)} print "\"" $0}'
"Alpha","OldTheme","Some more text"]
"Alpha","OldTheme","Another rest text"]
"Alpha","OldTheme","Yet another text"]
"Yadda","NewTheme","Crazy Text"]
"Yadda","NewTheme","More crazy text"]