arrayspowershellcollectionsactive-directorypowershell-5.1

How can I speed up the counting of matching entries between two large arrays in Powershell?


I am trying to count the number of matching elements between two large Powershell arrays/collections generated by capturing Active Directory ("AD") data.
However, it is very slow.

Concept

Note: feel free to skip ahead to the Application section.

Current concept

I am looking to count the number of matching instances of

Suppose array A contains [-30, -2, 1, 3, 4, 6, 20, 50] and the subset of interest is all positive numbers, and array B contains [50, 6, 80, 10, -2, 30, 6, 6, 50]. Then the correct count would be 2 (6 & 50 appear in both arrays; -2 is ignored).

Note that in the real implementation (see below):

Alternative concept

Alternatively, I could easily create a new array a = [1, 3, 4, 6, 20, 50] comprising the subset of A of interest and extract an ordered list of unique entries from B to form b = [-2, 6, 10, 30, 50, 80].
Then I just need to add up the number of elements appearing in both a and b (correct answer is still 2).

Basic description of algorithm

Array A has roughly 450,000 string elements, of which roughly 400,000 are of interest.
Array B has roughly 20,000 string elements, comprising roughly 19,000 unique values.
Currently I iterate through all elements of array A, I use if to check whether the element is in the subset of interest, and, if so, then I test whether the element appears in array B using the -contains operator.

Problem with algorithm

This takes over an hour to run.

Possibilities to speed up algorithm

I wondered whether it would be faster to iterate through unique elements of array B (in other words, the elements of array b), and test whether they appear in array A (or perhaps array a). But I haven't tried it, because logically it seems as though it should take roughly the same amount of time.

Alternatively, rather than using basic arrays, is there some cleverer object like an OrderedList class that would be faster to search?

If there's no OrderedList data type, then would it be practical to split the data into subarrays myself? For example, I could split array b into beta_1 = [-2, 6, 10] and beta_2 = [30, 50, 80]. Thus, if the selected element of array A has a value less than 20 I will look for it in array beta_1, and otherwise in array beta_2.
In the real problem I'm dealing with strings, and arrays with thousands of elements, so perhaps one subarray per initial (capitalised) letter of the alphabet (beta_A, beta_B, beta_C, ..., beta_Z) would be apt. I wouldn't want to have to hard code a switch statement to choose one of 26 subarrays, so — if necessary — I'd rather index into a collection of arrays: beta[0] rather than beta_A, through to beta[25] rather than beta_Z. And I would anticipate using the code value of each capital letter to select the appropriate one.

Application

Scenario

Some people are recorded in AD in both of the organisational units: Permanent and Visitors. But with different details recorded in each OU (so I cannot match on telephone, address, email, etc.). Nevertheless, as each pair of such records relate to a single individual person, it should be reasonably informative to match on full name. (In reality a small fraction may change their name, or provide a different version of their name.) Full name is stored in the Description attribute (aside from a scattering of data entry errors).

The ultimate intention is to calculate statistics on such multiple records that exist for certain individuals.

Example:

I parse $visitors.Description to tally frequency, by which it is ordered (descending):

I skip Betty Butternut (appears more than once in $descriptions), and then check whether the other $numUnique=4 names (Adam Aardvark, Cathy Cabbage, Daniel Dingo, or Enid Endive) appear in $permanents.Description.
The number of matches in this example would be $numMatching=2 (corresponding to Cathy Cabbage and Enid Endive); it doesn't matter if an element of $descriptions appears in $permanents.Description more than once, that's still just counted as one match. This is ultimately reported as 2 out of 4 being matched.

Current code

Here's a trimmed & anonymised version of my actual code.

# Acquire active visitors' data.  
$aGroup = "Visitors"
$sb  = "OU="+ $aGroup +",OU=Active,OU=Acme-Users,DC=acmeauth,DC=acme,DC=com" 
$visitors = Get-ADUser -Filter * -Properties Description -SearchBase $sb | select Name, Description

# Sort Description parameter for visitors by frequency.  
$descriptions = $visitors | Group-Object {$_.Description} -NoElement | Sort-Object -Property Count -Descending

# Acquire active permanent staff data.  
$aGroup = "Permanent"
$sb  = "OU="+ $aGroup +",OU=Active,OU=Acme-Users,DC=acmeauth,DC=acme,DC=com" 
$permanents = Get-ADUser -Filter * -Properties Description -SearchBase $sb | select Name, Description

# Iterate through _non-duplicated_ visitor records to find matching permanent staff Description parameters and tally.  
$numUnique = 0
$numMatching = 0

$maxIts = $descriptions.count

for ($index = 0; $index -lt $maxIts; $index++)
{
    $person = $descriptions[$index]
    
    if ($person.count -eq 1) {
        # This person has a unique Description among the (active) visitor cohort.  
        $numUnique++
        if ($permanents.Description -contains $person.name) {
            # This person has a matching Description among the (active) permanent staff cohort.  
            $numMatching++
        }
    }
}

In the real application:

Future code

It turns out that the Description field has data entry errors for a scattering of records (e.g. "Enid" in the example under Scenario), so I am planning to modify the code to instead use a concatenation of GivenName and Surname parameters (with a space character in between), assigned to arrays as simple strings using foreach.

In a nutshell

How can I optimise this code?

In particular, how can I get it to run faster?

Possible improvements

Tinkering with algorithm

Implement one of more of the conceptual modifications described under Alternative concept and Possibilities to speed up algorithm.

Parallel computation

Maybe a simple way would be to get the most computationally demanding & repetitive part(s) of the code to run in parallel?? I haven't looked into this in detail, but so far it's a little confusing: for Powershell 5.1 a post by js2010 on SO indicates a few alternatives, including "workflows" with a foreach -parallel command. However, Microsoft says "There is currently no parallel support for the foreach keyword." Maybe the latter statement was because the support was removed, so it's not "current" support??

And I don't know for sure whether it's feasible for this application to run in parallel, or (if so) which of the various alternatives would be best (easiest to implement would be my priority, if performance is similar), and whether it'd help much.

Other ideas

Anything in this SO thread relevant? (Excluding the alternative to use something outside of Powershell.)

System

There is a reasonably powerful workstation that this job can run on.

I do not want to put any unnecessary load on the AD server.


Solution

  • Specific to your example:

    $visitor = ConvertFrom-Csv @'
    Description
    Daniel Dingo
    Betty Butternut
    Adam Aardvark
    Betty Butternut
    Cathy Cabbage
    Enid Endive
    '@
    
    $Permanents = ConvertFrom-Csv @'
    Description
    Enid
    Enid Endive
    Betty Butternut
    Cathy Cabbage
    Fred Flamingo
    Cathy Cabbage
    '@
    

    I am not sure why you remove the duplicate descriptions completely from the $Visitors
    (I find this a weird constrain), anyways:

    $VisitorHash = @{}
    foreach ($Description in $Visitor.Description) { $VisitorHash[$Description]++ }
    $VisitorDescription = $VisitorHash.GetEnumerator().where{ $_.Value -eq 1 }.Name
    

    And not from the $Permanents (Cathy Cabbage) ???

    $PermanentsHash = @{}
    foreach ($Description in $Permanents.Description) { $PermanentsHash[$Description]++ }
    $PermanentsDescription = $PermanentsHash.GetEnumerator().where{ $_.Value -eq 1 }.Name
    
    $PermanentsDescription = $Permanents.Description
    

    Than you might simply use a where method or even a [Linq.Enumerable]::Intersect
    (See also Comparing two arrays & get the values which are not common):

    $Intersect = [Linq.Enumerable]::Intersect([string[]]$VisitorDescription, [string[]]$PermanentsDescription)
    $Intersect
    Cathy Cabbage
    Enid Endive
    

    You might count the results by forcing the Linq.Enumerable to an array (using the Array subexpression operator @( )):

    @($Intersect).count
    2