I'm currently learning awk and trying to normalize a CSV file: the goal is to send a CSV file for processing and receive a new CSV file back.
I need to make all duplicate emails get a number before the @.
Source lines:
id,location_id,name,title,email,department
1,10,Sharon Petersen,Administrator,spetersen@email.com,
2,11,Sharon Petersen,Administrator,spetersen@email.com,
3,14,Andres Espinoza,"Manager, Commanding Officer",,
4,8,Elizabeth Feeney,CEO,e.feeney@gmail.org,Operations
5,8,Marilyn Baker-Venturini,Director,,
Expected lines:
id,location_id,name,title,email,department
1,10,Sharon Petersen,Administrator,spetersen1@email.com,
2,11,Sharon Petersen,Administrator,spetersen2@email.com,
3,14,Andres Espinoza,"Manager, Commanding Officer",aespinoza@email.com,
4,8,Elizabeth Feeney,CEO,efeeney@email.com,Operations
5,8,Marilyn Baker-Venturini,Director,mbaker-venturini@email.com,
I have already made a few changes to normalize the names into emails:
Sharon Petersen ---> spetersen@email.com
awk -v FPAT='([^,]*)|(\"[^\"]+\")' -v OFS=',' '{ split($3,a," "); print $1 "," $2 "," toupper(substr(a[1], 1, 1)) substr(a[1], 2) " " toupper(substr(a[2], 1, 1)) substr(a[2], 2) "," $4 "," tolower(substr(a[1], 1, 1) substr(a[2], 1)) "@email.com" "," $6 }' file.csv
I haven't had any success adding the numbering of email duplicates to that awk command line. =(
I partially managed to solve the issue with
awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '++arr1[$5]>1{$5=$5"-"(++arr[$5])}1'
but that appends the counter after the address (e.g. spetersen@email.com-1 for the second occurrence) instead of inserting it before the @.
I really need advice. =) I've been through lots of articles and ready-made solutions without luck.
Given your updated input and requirements, here's how I'd really do everything you're trying to do using a 2-pass approach with GNU awk for FPAT:
$ cat tst.sh
#!/usr/bin/env bash
awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '
    NR == FNR {
        # First pass: map the header names to field numbers, then for every
        # data row build the normalized name and email and count how many
        # rows produce each email.
        if ( FNR == 1 ) {
            for ( i=1; i<=NF; i++ ) {
                f[$i] = i
            }
        }
        else {
            numSpaceSeps = split(tolower($(f["name"])), spaceSeps, " ")
            name = email = sep = ""
            for ( i=1; i<=numSpaceSeps; i++ ) {
                spaceSep = spaceSeps[i]
                # email = first initial of the first word + the remaining words
                email = ( i==1 ? substr(spaceSep,1,1) : email spaceSep )
                # name = each hyphen-separated part capitalized
                numHyphenSeps = split(spaceSep, hyphenSeps, "-")
                for ( j=1; j<=numHyphenSeps; j++ ) {
                    hyphenSep = hyphenSeps[j]
                    name = name sep toupper(substr(hyphenSep,1,1)) substr(hyphenSep,2)
                    sep = "-"
                }
                sep = " "
            }
            fnr2name[FNR] = name
            fnr2email[FNR] = email
            emailCnts[email]++
        }
        next
    }
    {
        # Second pass: number only the emails that occur more than once,
        # then overwrite the name and email fields and print.
        if ( FNR > 1 ) {
            email = fnr2email[FNR]
            if ( emailCnts[email] > 1 ) {
                email = email (++order[email])
            }
            $(f["name"]) = fnr2name[FNR]
            $(f["email"]) = email "@email.com"
        }
        print
    }
' "$1" "$1"    # same file read twice: pass 1 collects info, pass 2 prints
$ ./tst.sh file
id,location_id,name,title,email,department
1,10,Sharon Petersen,Administrator,spetersen1@email.com,
2,11,Sharon Petersen,Administrator,spetersen2@email.com,
3,14,Andres Espinoza,"Manager, Commanding Officer",aespinoza@email.com,
4,8,Elizabeth Feeney,CEO,efeeney@email.com,Operations
5,8,Marilyn Baker-Venturini,Director,mbaker-venturini@email.com,
Unlike your original script, the above will handle:

- name fields that include any number of middle names (John Jacob Smith) and/or hyphens (Jon-bob walton-dude, Sheryl-Ann Jones, Bob The-Big-Cheese),
- name and/or email columns occurring in any order in the input, as it just reads those strings from the header line to determine their position rather than hard-coding it, and
- quoted fields that contain commas and/or escaped quotes ("Manager, ""Commanding"" Officer",) courtesy of the improved FPAT value.

You may also want to take a look at What's the most robust way to efficiently parse CSV using awk?.
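As a quick sanity check of that FPAT value (just an illustration, not part of the script above), a quoted title containing a comma stays together as a single field:

$ echo '3,14,Andres Espinoza,"Manager, Commanding Officer",,' |
    awk -v FPAT='[^,]*|("([^"]|"")*")' '{ print $4 }'
"Manager, Commanding Officer"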
Original answer:
Here's a 2-pass approach, assuming that email addresses that are not duplicates do not get a number 1 inserted before the @ (they are just left alone), and that the input isn't guaranteed to be sorted on the email address field and you want to retain the input order in the output (otherwise you probably wouldn't need a 2-pass approach):
$ awk 'BEGIN{FS=OFS=","} NR==FNR{tot[$5]++; next} tot[$5]>1{sub(/@/,++cnt[$5]"&",$5)} 1' file file
1,10,Sharon Petersen,Administrator,spetersen1@email.com,
2,11,Sharon Petersen,Administrator,spetersen2@email.com,
Set FS as you see fit to handle quoted fields, etc.; I'm just showing how to handle the email field after you isolate it, as you already do.
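For example, if you also want the quoted-field handling from the updated answer, the same idea carries over with FPAT in place of FS; a minimal sketch (assuming, as in your data, that the email is field 5):

$ awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '
    NR==FNR   { tot[$5]++; next }                   # pass 1: count each email
    tot[$5]>1 { sub(/@/, ++cnt[$5] "&", $5) }       # pass 2: insert the counter before the @
    1' file file

Only rows whose email field is modified get rebuilt with OFS, and since FPAT keeps the surrounding quotes as part of the field, quoted titles pass through intact.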