im having some challenges in 1 CSV files im trying to decode. as a delimiter i have pipe to identified column numbers and double quote to identify big string column.
I need to do prep processing script to remove this unwanted chars and for the sample2 add the missing double quote.
sample 1|Pending|"XXX LPA. LTD. - GridBird TEST (Monthly) (Collage Maker | Photo Editor)"|749|...
sample 2|Pending|"Take Five Entertainment" Limited Liability Partnership - Pack with cats "Halloween" (Cat Simulator: My Pets)|...
desire output:
sample 1|Pending|"XXX LPA. LTD. - GridBird TEST (Monthly) (Collage Maker Photo Editor)"|749|...
sample 2|Pending|"Take Five Entertainment Limited Liability Partnership - Pack with cats Halloween (Cat Simulator: My Pets)"|749|...
im trying to regex to identify the first sample and replace with sed but it seems im missing some sed behavior how it works as first i need to identify the string pattern and then replace only that chars in this case any | or "
^.*\|\".*(\||\").*(\"\|\d|\|\d)
on this regex I'm identifying the string between |" and ("|digits or just |digits).
Next step is what I'm confused how to do with Sed.
Best Regards
Solution for this based on new samples i got. Thank you @Ed Morton
awk '
{
{
gsub(/\{/, " ");
gsub(/\}/, " ");
# First split the record on `"`s and replace every `|` (plus
# any surrounding spaces) inside a quoted field with a blank.
FS=OFS="\""; $0=$0
for ( i=2; i<=NF; i+=2 ) {
#print $i
gsub(/[[:space:]][|][[:space:]]*/, " ", $i)
}
FS=OFS="\""; $0=$0
for ( i=2; i<=NF; i+=2 ) {
#print $i
if ($i ~ /[#\-\+\?A-Za-z][|][#\-\+\?A-Za-z]*/)
{
#print $i
sub(/[|]{1}/, " ", $i)
}
}
# Now split on `|`s, remove `"`s within fields and
# add `"`s around every field that contained `"`s
FS=OFS="|"; $0=$0;
for ( i=1; i<=NF; i++ ) {
gsub(/"/, "", $i);
}
print
}
Here's one way to get the output you provided from the sample input you provided using any POSIX awk:
$ awk '
{
# First split the record on `"`s and replace every `|` (plus
# any surrounding spaces) inside a quoted field with a blank.
FS=OFS="\""; $0=$0
for ( i=2; i<=NF; i+=2 ) {
gsub(/[[:space:]]*[|][[:space:]]*/, " ", $i)
}
# Now split on `|`s, remove `"`s within fields and
# add `"`s around every field that contained `"`s
FS=OFS="|"; $0=$0;
for ( i=1; i<=NF; i++ ) {
if ( gsub(/"/, "", $i) ) {
$i = "\"" $i "\""
}
}
print
}
' file
sample 1|Pending|"XXX LPA. LTD. - GridBird TEST (Monthly) (Collage Maker Photo Editor)"|749|...
sample 2|Pending|"Take Five Entertainment Limited Liability Partnership - Pack with cats Halloween (Cat Simulator: My Pets)"|...