bashawk

Adding sequential numbers before duplicates in email


I'm currently learning awk and trying to normalize csv file. The result - send a scv file for processing and receive a new one scv.

I need to make all duplicate emails get a number before the @.

Source lines:

id,location_id,name,title,email,department
1,10,Sharon Petersen,Administrator,spetersen@email.com,
2,11,Sharon Petersen,Administrator,spetersen@email.com,
3,14,Andres Espinoza,"Manager, Commanding Officer",,
4,8,Elizabeth Feeney,CEO,e.feeney@gmail.org,Operations
5,8,Marilyn Baker-Venturini,Director,,

Expected lines:

id,location_id,name,title,email,department
1,10,Sharon Petersen,Administrator,spetersen1@email.com,
2,11,Sharon Petersen,Administrator,spetersen2@email.com,
3,14,Andres Espinoza,"Manager, Commanding Officer",aespinoza@email.com,
4,8,Elizabeth Feeney,CEO,efeeney@email.com,Operations
5,8,Marilyn Baker-Venturini,Director,mbaker-venturini@email.com,

I have already made few changes to normalize:

awk -v FPAT='([^,]*)|(\"[^\"]+\")' -v OFS=',' '{ split($3,a," "); print $1 "," $2 "," toupper(substr(a[1], 1, 1)) substr(a[1], 2) " " toupper(substr(a[2], 1, 1)) substr(a[2], 2) "," $4 "," tolower(substr(a[1], 1, 1) substr(a[2], 1)) "@email.com" "," $6 }' file.csv

I haven't had success adding numbering of email duplicates to the current awk command line. =( I partially managed to solve the issue using the awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '++arr1[$5]>1{$5=$5"-"(++arr[$5])}1' command, but I was unable to add numbering before @.

I really need advice. =)

Lots of articles and ready-made solutions.


Solution

  • Given your updated input and requirements, here's how I'd really do everything you're trying to do using a 2-pass approach with GNU awk for FPAT:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    awk -v FPAT='[^,]*|("([^"]|"")*")' -v OFS=',' '
        NR == FNR {
            if ( FNR == 1 ) {
                for ( i=1; i<=NF; i++ ) {
                    f[$i] = i
                }
            }
            else {
                numSpaceSeps = split(tolower($(f["name"])), spaceSeps, " ")
                name = email = sep = ""
                for ( i=1; i<=numSpaceSeps; i++ ) {
                    spaceSep = spaceSeps[i]
                    email = ( i==1 ? substr(spaceSep,1,1) : email spaceSep )
                    numHyphenSeps = split(spaceSep, hyphenSeps, "-")
                    for ( j=1; j<=numHyphenSeps; j++ ) {
                        hyphenSep = hyphenSeps[j]
                        name = name sep toupper(substr(hyphenSep,1,1)) substr(hyphenSep,2)
                        sep = "-"
                    }
                    sep = " "
                }
                fnr2name[FNR] = name
                fnr2email[FNR] = email
                emailCnts[email]++
            }
            next
        }
        {
            if ( FNR > 1 ) {
                email = fnr2email[FNR]
                if ( emailCnts[email] > 1 ) {
                    email = email (++order[email])
                }
                $(f["name"]) = fnr2name[FNR]
                $(f["email"]) = email "@email.com"
            }
            print
        }
    ' "$1" "$1"
    

    $ ./tst.sh file
    id,location_id,name,title,email,department
    1,10,Sharon Petersen,Administrator,spetersen1@email.com,
    2,11,Sharon Petersen,Administrator,spetersen2@email.com,
    3,14,Andres Espinoza,"Manager, Commanding Officer",aespinoza@email.com,
    4,8,Elizabeth Feeney,CEO,efeeney@email.com,Operations
    5,8,Marilyn Baker-Venturini,Director,mbaker-venturini@email.com,
    

    Unlike your original script, the above will handle:

    1. name fields that include any number of middle names (John Jacob Smith)
    2. Correctly upper-casing the first letter of each part of hyphenated names (Jon-bob walton-dude)
    3. Hyphenated first and/or middle names (Sheryl-Ann Jones),
    4. Multiply-hyphenated names (Bob The-Big-Cheese)
    5. The name and/or email columns occurring in any order in the input as it just reads those strings from the header line to determine their position rather than hard-coding it, and
    6. Escaped quotes within quoted fields (,"Manager, ""Commanding"" Officer",) courtesy of the improved FPAT value.

    You may also want to take a look at What's the most robust way to efficiently parse CSV using awk?.


    Original answer:

    Here's a 2-pass approach, assuming that email addresses that are not duplicates do not get a number 1 inserted before the @, they are just left alone, and that the input isn't guaranteed to be sorted on the email address field and you want to retain the input order in the output (otherwise you probably wouldn't need a 2-pass approach):

    $ awk 'BEGIN{FS=OFS=","} NR==FNR{tot[$5]++; next} tot[$5]>1{sub(/@/,++cnt[$5]"&",$5)} 1' file file
    1,10,Sharon Petersen,Administrator,spetersen1@email.com,
    2,11,Sharon Petersen,Administrator,spetersen2@email.com,
    

    Set FS as you see fit to handle quoted fields, etc., I'm just showing how to handle the email field after you isolate it as you already do.