excel

Data cleanup for large data set


I am currently doing this in excel 64bit, and excel becomes un-responsive when I enter the data. The only current option is to break the data into sections and do those sections 1 by 1, which would be time consuming enough to make the task not worthwhile.

Basically, I have a huge list of addresses that do not follow the same rules when entered. Ex - some entries will have "41 st", some will have "41st st". So I have created a few formulas to identify those differences and remove them.

This formula will pull the Street / Ave # with the the suffix ‑st, ‑nd, ‑rd, ‑th

Original Data in A7

In Cell D7

=IFERROR(TEXTAFTER(TEXTBEFORE(A7," Ave",1,1,0,0)," 
",1,1,0,TEXTAFTER(TEXTBEFORE(A7," St",1,1,0,0),"
",1,1,0,TEXTAFTER(TEXTBEFORE(A7," CT",1,1,0,0),"
",1,1,0,TEXTAFTER(TEXTBEFORE(A7," CIR",1,1,0,0),"
",1,1,0,TEXTAFTER(TEXTBEFORE(A7," St CIR",1,1,0,0),"
",1,1,0,TEXTAFTER(TEXTBEFORE(A7," Ave CIR",1,1,0,0),"
",1,1,0,"")))))),"")

Then, it runs through this formula. Basically it looks at a table of numbers with suffixes and returns either nothing or a # without a suffix.

In Cell E7

=IFERROR(VLOOKUP(D7,$L$1:$M$302,1,FALSE),IFERROR(VLOOKUP(D7,$K$1:$L$302,2,FALSE),D7))

Final Formula simply looks at D7 and E7 to see if they are different, then substitutes if they are different with the desired value.

in Cell F7 - (ignore the IF for I7, I've set the if to be true for this example)

=IFERROR(IF(I7=1,TRIM(SUBSTITUTE(A7," "&D7," "&E7,1)),TRIM(SUBSTITUTE(A7," "&D7," "&D7,1))),"")

So - Trying to run this on a data set with 700k Rows only 1 column wide is causing Excel to hang / not respond for hours. Any thoughts on how I can make these entries more efficient, or thoughts on moving to something that can handle things like this better than excel? Or thoughts on how to do this efficiently in sections?

I have already broken this spreadsheet out to be a standalone with nothing else on it aside from the above. My PC is an i7-9700k with 32 gigs of ram, operating on SDD. Auto-save is off, and calculations are set to manual. File is downloaded to my device, but is also on my onedrive, but that shouldn't matter with autosave off, right?


Solution

  • Use a regular expression. If you have Office 360 the RegExpMatch function is built-in. If not, you can add a RegExpMatch function as follows:

    Step 1 - enable the VBSscript regex library in Your Excel sheet. Get into a VBA for applications window and select the Tools..References menu picks. Scroll down and check the box next to "Microsoft VBScript Regular Expressions 5.5" and press OK.

    Step 2 - Still inside the VBA for applications window, select Insert..Module. Then paste in the following VBA code:

    '  Some function wrappers to make the VBScript RegExp reference Library useful in both VBA code and in Excel & Access formulas
    '
    Private rg As RegExp    'All of the input data to control the RegExp parsing
    '  RegExp object contains 3 Boolean options that correspond to the 'i', 'g', and 'm' options in Unix-flavored regexp
    '    IgnoreCase - pretty self-evident.  True means [A-Z] matches lowercase letters and vice versa, false means it won't
    '    IsGlobal - True means after the first match has been processed, continue on from the current point in DataString and look to process more matches.  False means stop after first match is processed.
    '    MultiLine - False means ^ and $ match only Start and End of DataString, True means they match embedded newlines.  This provides an option to process line-by-line when Global is true also.
    '
    '  Returns true/false: does DataString match pattern?  IsGlobal=True makes no sense here
    Public Function RegExpMatch(DataString As String, Pattern As String, Optional IgnoreCase As Boolean = True, Optional IsGlobal As Boolean = False, Optional MultiLine As Boolean = False) As Boolean
    If rg Is Nothing Then Set rg = New RegExp
    rg.IgnoreCase = IgnoreCase
    rg.Global = IsGlobal
    rg.MultiLine = MultiLine
    rg.Pattern = Pattern
    RegExpMatch = rg.Test(DataString)
    End Function
    '
    '  Find <pattern> in <DataString>, replace with <ReplacePattern>
    '       Default IsGlobal=True means replace all matching occurrences.  Call with False to replace only first occurrence.
    '
    Public Function RegExpReplace(DataString As String, Pattern As String, ReplacePattern As String, Optional IgnoreCase As Boolean = True, Optional IsGlobal As Boolean = True, Optional MultiLine As Boolean = False) As String
    If rg Is Nothing Then Set rg = New RegExp
    rg.IgnoreCase = IgnoreCase
    rg.Global = IsGlobal
    rg.MultiLine = MultiLine
    rg.Pattern = Pattern
    RegExpReplace = rg.Replace(DataString, ReplacePattern)
    End Function
    

    Now save this module under a name but do not name it RegExpMatch or RegExpReplace. This VB code is a wrapper around the VBScript that turns them into usable Worksheet Functions.

    Now a simple =RegExpMatch(A7, "Ave( CIR)?|St( CIR)?|CIR|CT") formula will return a True if any of your alternatives are present or a False if not present.

    This might still take a while to recalc but not hours.

    You can also save this module in a separate Excel file as an "Add-in" and install it on your Excel instance so it is always imported every time you start up Excel. I do this and I always have RegExp functions available to use in formulas.