powershellparallel-processingforeach-object

Use COM Object inside ForEach-Object -Parallel


I wrote a script to convert over 1000 Excel files to PDF. In the past, I used the command line interface of LibreOffice to achieve it. Now, I am trying to use Microsoft Excel and I am hoping to parallelize the process with ForEach-Object -Parallel. My issue is that the COM Object cannot be accessed inside the parallel loop despite me employing $using:. Is it possible (at all) to use a COM Object in a parallel loop? Here is my code:

# Get working directory
$wd = pwd

# Output directory for converted PDFs
$output_dir = "$wd\data\sample_curriculums_fall2023\pdf"

# Excel fixed format (important to specify the conversion target (i.e. PDF))
$ExcelFixedFormat = “Microsoft.Office.Interop.Excel.xlFixedFormatType” -as [type]

# Get Excel files
$excel_files = Get-ChildItem -Path "$wd\data\sample_curriculums_fall2023\excel\" -Filter *xlsm

# Create COM Object for Excel and make it invisible (i.e. headless)
$ExcelObject = New-Object -ComObject Excel.Application
$ExcelObject.Visible = $false

$excel_files | ForEach-Object -ThrottleLimit 20 -Parallel {

        $file = $_
        $obj = $using:ExcelObject
        $fm = $using:ExcelFixedFormat

        # Make name for PDF output
        $output = Join-Path -Path $using:output_dir -ChildPath ($file.BaseName + ".pdf")
    
        # Open Excel file to convert
        $workbook = $obj.Workbooks.Open($file.FullName, 3) # PROBLEM!!!!
        $workbook.Saved = $true
        $workbook.ExportAsFixedFormat($fm::xlTypePDF, $output)
        $obj.Workbooks.Close()
    
}

$ExcelObject.Quit()

The error is thrown at this line inside the parallel loop:

$workbook = $obj.Workbooks.Open($file.FullName, 3)

I get the following error:

You cannot call a method on a null-valued expression.

Which suggests that the $obj variable does not contain the COM Object and is null. Note that the script works perfectly with a regular foreach function.

Thank you in advance for your help.


Solution

  • However, you can try to batch your input files, so that you only create a limited number of Excel processes that each process a batch of input files:

    # How many Excel instances to run in parallel.
    # Tweak this number based on your system's CPU count and memory.
    $throttleLimit = 4
    
    # Collect all input files.
    $files = 1..100
      # Get-ChildItem "$wd\data\sample_curriculums_fall2023\excel\" -Filter *xlsm
    
    # Determine how many files to pass to each Excel instance.
    $chunkSize = [Math]::Ceiling($files.Count / $throttleLimit)
    
    # Batch the input files and process each batch with ForEach-Object -Parallel
    $files |
      ForEach-Object `
        -Begin { $i = 0; $chunk = [System.Collections.Generic.List[object]]::new($chunkSize) } `
        -Process { 
          $chunk.Add($_)
          if (++$i -eq $chunkSize) {
            , $chunk.ToArray()
            $i = 0; $chunk.Clear()
          }
        } `
        -End {
          if ($i) {
            , $chunk.ToArray
          }
        } |
      ForEach-Object -ThrottleLimit $throttleLimit -Parallel {
        $xl = New-Object -ComObject Excel.Application
        foreach ($file in $_) {
          $output = Join-Path $using:output_dir ($file.BaseName + '.pdf')
          $workbook = $xl.Workbooks.Open($file.FullName, 3)
          $workbook.Saved = $true
          $workbook.ExportAsFixedFormat(0, $output) # 0 = [Microsoft.Office.Interop.Excel.xlFixedFormatType]::xlTypePDF
          $xl.Workbooks.Close()
        }
        $xl.Quit()
      }
    

    Note the use of an auxiliary ForEach-Object call to batch (chunk) the array of files to process.
    Potentially building this functionality into PowerShell itself is the subject of GitHub issue #8270.