I am trying to read the data from CSV file which has 2200000 records using PowerShell and storing each record in JSON file, but this takes almost 12 hours.
Sample CSV Data:
We will only concern about the 1st column value's.
Code:
function Read-IPData
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object{ $_.FullName }
Write-Host "file path - $dbFilePath"
Write-Host "Reading..."
$data = Get-Content -Path $dbFilePath | Select-Object -Skip 1
Write-Host "Reading data finished"
$count = $data.Count
Write-host "Total $count records found"
return $data
}
function Convert-NumbetToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
try
{
$w = [int64]($number/16777216)%256
$x = [int64]($number/65536)%256
$y = [int64]($number/256)%256
$z = [int64]$number%256
$ipAddress = "$w.$x.$y.$z"
Write-Host "IP Address - $ipAddress"
return $ipAddress
}
catch
{
Write-Host "$_"
continue
}
}
Write-Host "Getting IP Addresses from $dbFileName"
$data = Read-IPData
Write-Host "Checking whether output.json file exist, if not create"
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
if(!(Test-Path $outputFile))
{
Write-Host "$outputFile doestnot exist, creating..."
New-Item -Path $outputFile -type "file"
}
foreach($item in $data)
{
$row = $item -split ","
$ipNumber = $row[0].trim('"')
Write-Host "Converting $ipNumber to ipaddress"
$toIpAddress = Convert-NumbetToIP -number $ipNumber
Write-Host "Preparing document JSON"
$object = [PSCustomObject]@{
"ip-address" = $toIpAddress
"is-vpn" = "true"
"@timestamp" = (Get-Date).ToString("o")
}
$document = $object | ConvertTo-Json -Compress -Depth 100
Write-Host "Adding document - $document"
Add-Content -Path $outputFile $document
}
Could you please help optimize the code or is there a better way to do it. or is there a way like multi-threading.
Here is a possible optimization:
function Get-IPDataPath
{
$dbFilePath = Get-ChildItem -Path $rootDir -Filter "IP2*.CSV" | ForEach-Object FullName | Select-Object -First 1
Write-Host "file path - $dbFilePath"
$dbFilePath # implicit output
}
function Convert-NumberToIP
{
param(
[Parameter(Mandatory=$true)][string]$number
)
[Int64] $numberInt = 0
if( [Int64]::TryParse( $number, [ref] $numberInt ) ) {
if( ($numberInt -ge 0) -and ($numberInt -le 0xFFFFFFFFl) ) {
# Convert to IP address like '192.168.23.42'
([IPAddress] $numberInt).ToString()
}
}
# In case TryParse() returns $false or the number is out of range for an IPv4 address,
# the output of this function will be empty, which converts to $false in a boolean context.
}
$dbFilePath = Get-IPDataPath
$outputFile = Join-Path -Path $rootDir -ChildPath "output.json"
Write-Host "Converting CSV file $dbFilePath to $outputFile"
$object = [PSCustomObject]@{
'ip-address' = ''
'is-vpn' = 'true'
'@timestamp' = ''
}
# Enclose foreach loop in a script block to be able to pipe its output to Set-Content
& {
foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
{
$row = $item -split ','
$ipNumber = $row[0].trim('"')
if( $ip = Convert-NumberToIP -number $ipNumber )
{
$object.'ip-address' = $ip
$object.'@timestamp' = (Get-Date).ToString('o')
# Implicit output
$object | ConvertTo-Json -Compress -Depth 100
}
}
} | Set-Content -Path $outputFile
Remarks for improving performance:
Get-Content
, especially for line-by-line processing it tends to be slow. A much faster alternative is the File.ReadLines
method. To skip the header line, use the Linq.Enumerable.Skip()
method.ReadLines
in a foreach
loop does lazy enumeration, i. e. it reads only one line per loop iteration. This works because it returns an enumerator instead of a collection of lines.try
and catch
if exceptions occur often, because the "exceptional" code path is very slow. Instead use Int64.TryParse()
which returns a boolean
indicating successful conversion.IPAddress
class which has a constructor that takes an integer number. Use its method .GetAddressBytes()
to get an array of bytes in network (big-endian) order. Finally use the PowerShell -join
operator to create a string of the expected format.[pscustomobject]
for each row, which has some overhead. Create it once before the loop and inside the loop only assign the values.Write-Host
(or any output to the console) within inner loops.Unrelated to performance:
New-Item
call to create the output file, which isn't necessary because Set-Content
automatically creates the file if it doesn't exist.[ ]
and insert a comma ,
between each row.Modified processing loop to write a regular JSON file instead of NDJSON file:
& {
'[' # begin array
$first = $true
foreach( $item in [Linq.Enumerable]::Skip( [IO.File]::ReadLines( $dbFilePath ), 1 ) )
{
$row = $item -split ','
$ipNumber = $row[0].trim('"')
if( $ip = Convert-NumberToIP -number $ipNumber )
{
$object.'ip-address' = $ip
$object.'@timestamp' = (Get-Date).ToString('o')
$row = $object | ConvertTo-Json -Compress -Depth 100
# write array element delimiter if necessary
if( $first ) { $row; $first = $false } else { ",$row" }
}
}
']' # end array
} | Set-Content -Path $outputFile