I wrote a script to convert over 1000 Excel files to PDF. In the past, I used the command-line interface of LibreOffice to achieve this. Now I am trying to use Microsoft Excel, and I am hoping to parallelize the process with ForEach-Object -Parallel. My issue is that the COM object cannot be accessed inside the parallel loop, even though I employ $using:. Is it possible (at all) to use a COM object in a parallel loop? Here is my code:
# Get working directory
$wd = pwd
# Output directory for converted PDFs
$output_dir = "$wd\data\sample_curriculums_fall2023\pdf"
# Excel fixed format (important to specify the conversion target (i.e. PDF))
$ExcelFixedFormat = "Microsoft.Office.Interop.Excel.xlFixedFormatType" -as [type]
# Get Excel files
$excel_files = Get-ChildItem -Path "$wd\data\sample_curriculums_fall2023\excel\" -Filter *xlsm
# Create COM Object for Excel and make it invisible (i.e. headless)
$ExcelObject = New-Object -ComObject Excel.Application
$ExcelObject.Visible = $false
$excel_files | ForEach-Object -ThrottleLimit 20 -Parallel {
$file = $_
$obj = $using:ExcelObject
$fm = $using:ExcelFixedFormat
# Make name for PDF output
$output = Join-Path -Path $using:output_dir -ChildPath ($file.BaseName + ".pdf")
# Open Excel file to convert
$workbook = $obj.Workbooks.Open($file.FullName, 3) # PROBLEM!!!!
$workbook.Saved = $true
$workbook.ExportAsFixedFormat($fm::xlTypePDF, $output)
$obj.Workbooks.Close()
}
$ExcelObject.Quit()
The error is thrown at this line inside the parallel loop:
$workbook = $obj.Workbooks.Open($file.FullName, 3)
I get the following error:
You cannot call a method on a null-valued expression.
This suggests that the $obj variable does not contain the COM object and is null. Note that the script works perfectly with a regular foreach loop.
Thank you in advance for your help.
The Microsoft Office COM Automation servers aren't thread-safe, so you cannot share a given instance between multiple threads. ForEach-Object -Parallel runs each iteration on its own thread, in its own runspace, where the instance you pass in via $using: is unusable: $obj.Workbooks evaluates to $null there, which is why the subsequent .Open() call fails with the null-valued expression error.
Creating a separate instance with New-Object -ComObject Excel.Application for each file to convert isn't worth doing, because each such call creates a new Excel child process, which is costly both in terms of performance and memory use.
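For comparison, that per-file-instance variant would look something like the sketch below (reusing the $excel_files and $output_dir variables from your script); it works, but it spawns a fresh Excel process for every file, which is exactly the overhead you want to avoid with 1000+ files:
$excel_files | ForEach-Object -ThrottleLimit 4 -Parallel {
    # Each iteration runs on its own thread, so it must create its own
    # Excel instance, that is, a new Excel process per file. Costly!
    $xl = New-Object -ComObject Excel.Application
    try {
        $output = Join-Path $using:output_dir ($_.BaseName + '.pdf')
        $workbook = $xl.Workbooks.Open($_.FullName, 3)
        $workbook.Saved = $true
        $workbook.ExportAsFixedFormat(0, $output) # 0 = xlTypePDF
        $workbook.Close($false) # close without saving
    }
    finally {
        $xl.Quit() # terminate this iteration's Excel process
    }
}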
However, you can try to batch your input files, so that you only create a limited number of Excel processes that each process a batch of input files:
# How many Excel instances to run in parallel.
# Tweak this number based on your system's CPU count and memory.
$throttleLimit = 4
# Collect all input files.
$files = Get-ChildItem "$wd\data\sample_curriculums_fall2023\excel\" -Filter *xlsm
# (For a quick demo of the chunking logic alone, you could use: $files = 1..100)
# Determine how many files to pass to each Excel instance;
# e.g., 1000 files with a throttle limit of 4 yields batches of 250.
$chunkSize = [Math]::Ceiling($files.Count / $throttleLimit)
# Batch the input files and process each batch with ForEach-Object -Parallel
$files |
  ForEach-Object `
    -Begin { $i = 0; $chunk = [System.Collections.Generic.List[object]]::new($chunkSize) } `
    -Process {
      $chunk.Add($_)
      if (++$i -eq $chunkSize) {
        # Emit the full chunk as a single array object (the unary "," prevents unrolling).
        , $chunk.ToArray()
        $i = 0; $chunk.Clear()
      }
    } `
    -End {
      # Emit any remaining, incomplete chunk.
      if ($i) {
        , $chunk.ToArray()
      }
    } |
  ForEach-Object -ThrottleLimit $throttleLimit -Parallel {
    # One Excel process per batch, reused for all files in the batch.
    $xl = New-Object -ComObject Excel.Application
    foreach ($file in $_) {
      $output = Join-Path $using:output_dir ($file.BaseName + '.pdf')
      # 3 = UpdateLinks: update all external references.
      $workbook = $xl.Workbooks.Open($file.FullName, 3)
      # Mark as saved so that closing doesn't prompt to save changes.
      $workbook.Saved = $true
      $workbook.ExportAsFixedFormat(0, $output) # 0 = [Microsoft.Office.Interop.Excel.xlFixedFormatType]::xlTypePDF
      $workbook.Close($false) # close this workbook without saving
    }
    $xl.Quit()
  }
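One caveat that isn't specific to the parallel approach: EXCEL.EXE processes sometimes linger after .Quit() while .NET still holds references to the COM automation objects. If you see orphaned Excel processes after the run, a common mitigation, sketched below, is to release the COM reference explicitly and force finalization at the end of each parallel script block:
# At the end of the -Parallel script block, after $xl.Quit():
$null = [System.Runtime.InteropServices.Marshal]::ReleaseComObject($xl)
# Force .NET to finalize any remaining runtime-callable wrappers (RCWs).
[GC]::Collect()
[GC]::WaitForPendingFinalizers()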
Note the use of an auxiliary ForEach-Object call to batch (chunk) the array of files to process.
Potentially building this functionality into PowerShell itself is the subject of GitHub issue #8270.
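As an aside: on PowerShell 7.3 or later (which runs on .NET 7+), you can already sidestep the manual chunking by calling LINQ's Enumerable.Chunk method directly. A minimal sketch, assuming $files, $chunkSize, and $throttleLimit are defined as above:
# Partition $files into arrays of at most $chunkSize elements each,
# replacing the manual -Begin/-Process/-End chunking shown above.
[System.Linq.Enumerable]::Chunk([object[]] $files, $chunkSize) |
  ForEach-Object -ThrottleLimit $throttleLimit -Parallel {
    # $_ is one batch (an object[]); process it as in the parallel block above.
    "Processing a batch of $($_.Count) file(s)"
  }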