excelvbadatabaseoptimizationindex-match

VBA Speeding Up Index-Match for Large Dataset


I'm writing some code to reformat some datasets. At the end, I have some empty columns in my output dataset, which I need to populate by looking up values from another sheet (Index-Match effectively).

I have tried many approaches, and two were successful. The problem is, they are far too slow, and unfortunately I am limited to using VBA. To prefix, I'm aware this problem could be solved by simply writing the Index-Match formula into the cells manually at the end and dragging them down, but I am trying to make life as simple as possible for the people who will need to perform this task monthly (also to reduce room for error).

Background info:

ws_ps example Column B as lookup value, with Column E to populate if a row in Column B is not empty

ws_priips example Column G corresponds to the lookup value in ws_ps Column B, with the value to look for coming from Column E

Approach 1: Looping through each row and using WorksheetFunction - 12 min runtime

    On Error Resume Next
    last_row_ps = ws_ps.UsedRange.Rows.Count
    For ps_row = 2 To last_row_ps
        ws_ps.Cells(ps_row, 5).Value = WorksheetFunction.IfError(WorksheetFunction.Index(ws_priips.Range("A:G"), _
        WorksheetFunction.Match(ws_ps.Cells(ps_row, 2), ws_priips.Range("G:G"), 0), 5), "")
    Next ps_row
    On Error GoTo -1

Approach 2: Loading worksheets into arrays and writing to worksheet if conditions are met - 15min runtime

    last_row_ps = ws_ps.UsedRange.Rows.Count
    last_row_priips = ws_priips.UsedRange.Rows.Count
    ps_array = ws_ps.Range("A1:X" & last_row_ps).Value
    priips_array = ws_priips.Range("A1:AZ" & last_row_priips).Value
    
    For ps_row = 2 To UBound(ps_array, 1)
        For priips_row = 2 To UBound(priips_array, 1)
            If ps_array(ps_row, 2) <> "" Then
                If ps_array(ps_row, 2) = priips_array(priips_row, 7) Then
                    ws_ps.Cells(ps_row, 5).Value = priips_array(priips_row, 5)
                    GoTo SkipLoop
                End If
            Else
                GoTo SkipLoop
            End If
        Next priips_row
SkipLoop:
    Next ps_row

Am I out of luck? So far I've only implemented these solutions for 1 column, but I will need to apply it to around 10. Is there any way to drastically speed things up, without resorting to typing the functions into the worksheet or using python? I would be grateful for any pointers. I am still very much a beginner.


Solution

  • Read the data from the lookup sheet into a 2D array, then build a dictionary object with the ISIN value as the key, and the array "row" number as the value. Use the dictionary to extract matched row values from the array to populate to the sheet being filled. Faster if you first pull the "to be filled" and ISIN columns from the output dataset sheet, fill them using the dictionary and the lookup array, then at the end write them back to the output sheet. Basically try to avoid any cell-by-cell operations.

    Using the code below, with both tables containing 150k rows (second table sorted randomly), I filled two columns in the first table in 3-4 sec.

    Note if you need cross-platform support you can use a Collection in place of the dictionary, which is not available on Mac.

    Sub Tester()
    
        Dim arrDest, arrSrc, ws As Worksheet, isin As String
        Dim dict As Object, r As Long, rMatch As Long
        Dim rngDest As Range
        
        'both my tables are on one sheet for testing
        Set ws = ThisWorkbook.Worksheets("Sheet1")
        'table to be filled
        Set rngDest = ws.Range("A1").CurrentRegion
        arrDest = rngDest.Value 'read to 2D array
        
        'data used as lookup table, as an array
        arrSrc = ws.Range("I1").CurrentRegion.Value
    
        Set dict = CreateObject("scripting.dictionary")
        'map isin to row number in lookup table array
        For r = 2 To UBound(arrSrc, 1)
            isin = arrSrc(r, 1)
            If Len(isin) > 0 Then dict(isin) = r 'assumes no duplicates in lookup table
        Next r
        
        'loop the report table and try to match values in the lookup table
        For r = 2 To UBound(arrDest, 1)
            isin = arrDest(r, 1)
            If Len(isin) > 0 Then
                If dict.Exists(isin) Then 'have a match?
                    rMatch = dict(isin) 'matched row
                    arrDest(r, 3) = arrSrc(rMatch, 2) 'copy a couple of values
                    arrDest(r, 5) = arrSrc(rMatch, 3)
                End If 'have match
            End If     'not zero-length
        Next r
        
        rngDest.Value = arrDest 'replace data with updated array
    
    End Sub
    

    My test tables (shorter version):
    test data

    Added: for anyone who needs cross-platform support (Windows/Mac) here's the basic "row map" approach using a Collection in place of a Dictionary. Loading the collection with 150k items takes about 10% longer than loading a dictionary of the same size, but retrieval by key is actually about 5x faster using a collection. Also seems like the Collection performance scales better for larger numbers of keys - for 400k items load and read is approx. 2.2 and 0.7 sec but for dictionary it's approx. 10.5 and 10 sec (much slower)

    'Using a Collection like a dictionary for mapping column values
    '   to their position in a dataset
    Sub CollectionTest()
    
        Dim arr, r As Long, col As Collection, t, k As String, v
        
        Set col = New Collection
        
        'Source range A1:A150000 filled with "Val_000001", "Val_000002", etc
        '   sorted randomly
        arr = Range("A1").CurrentRegion.Value
        
        t = Timer
        For r = 1 To UBound(arr, 1)
            k = arr(r, 1)
            'Add the row number as value and cell content as key
            '  note your row keys should be unique
            If IsEmpty(KeyValue(col, k)) Then col.Add r, k
        Next r
        Debug.Print col.Count & " items"
        Debug.Print "Loaded row map in " & Timer - t
        
        t = Timer
        For r = 1 To 150000
            k = "Val_" & Format(r, "000000")
            v = KeyValue(col, k)
            
            If r < 5 Then Debug.Print "Key " & k & " at row# " & v
        
        Next r
        Debug.Print "Retrieved values in " & Timer - t
    
    End Sub
    
    'Retrieve value for key `k` from collection `col`
    '  Returns Empty if there's no such key
    Function KeyValue(col As Collection, k As String)
        On Error Resume Next 'ignore error if no match for `k`
        KeyValue = col.Item(k)
    End Function