
Parse EML text with a regular expression


Could you please help me parse EML text with a regular expression?

I want to extract two things separately:

1) The text between Content-Transfer-Encoding: base64 and --=_alternative, when the line directly above is Content-Type: text/html

2) The text between Content-Transfer-Encoding: base64 and --=_related, when the line two lines above is Content-Type: image/jpeg

Please take a look at this piece of PowerShell code:

$text = @"
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64

111111111111111111111111111111111111111111111111111111

--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64

222222222222222222222222222222222222222222222222222222
--=_alternative XXXXXXXXXXXXXX_=--
--=_related XXXXXXXXXXXXXX_=--_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64

333333333333333333333333333333333333333333333333333333
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
444444444444444444444444444444444444444444444444444444

--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64

555555555555555555555555555555555555555555555555555555
--=_related XXXXXXXXXXXXXX_=--
"@

$regex1 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_alternative"
$text1 = ([regex]::Matches($text,$regex1) | foreach {$_.groups[1].value})
Write-Host "text1 : " -fore red
Write-Host  $text1

#I want the output as separate elements (of an array, maybe, or one after another)
#1) text between Content-Transfer-Encoding: base64 and --=_alternative, when the line above is Content-Type: text/html
#this
#1111111111111111111111111111111111111111111111111111111
#then this
#2222222222222222222222222222222222222222222222222222222

$regex2 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_related"
$text2 = ([regex]::Matches($text,$regex2) | foreach {$_.groups[1].value})
#I want the output as separate elements (of an array, maybe, or one after another)
#2) text between Content-Transfer-Encoding: base64 and --=_related, when the line two lines above is Content-Type: image/jpeg
#this
#3333333333333333333333333333333333333333333333333333333
#then this
#4444444444444444444444444444444444444444444444444444444
#then this
#5555555555555555555555555555555555555555555555555555555
Write-Host "text2 : " -fore red
Write-Host  $text2

Thanks for your help. Have a nice day.

P.S. Based on Jessie Westlake's code, here is a slightly edited version of the regex that worked for me:

$files = Get-ChildItem -Path "\\<SERVER_NAME>\mailroot\Drop"
Foreach ($file in $files){
    # -Raw reads the whole file as one string and keeps its line breaks;
    # without it Get-Content returns an array of lines
    $text = Get-Content -LiteralPath $file.FullName -Raw

    $RegexText = '(?:Content-Type: text/html.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
    $RegexImage = '(?:Content-Type: image/jpeg.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'

    $TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
    $ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)

    # Count is 0 when nothing matched, so an empty result is never indexed
    If ($TextMatches.Count -gt 0)
    {
        Write-Host "Found $($TextMatches.Count) Text Matches:"
        # Trim() drops the line breaks captured around each payload
        Write-Output $TextMatches.ForEach({$_.Groups[1].Value.Trim()})
    }
    If ($ImageMatches.Count -gt 0)
    {
        Write-Host "Found $($ImageMatches.Count) Image Matches:"
        Write-Output $ImageMatches.ForEach({$_.Groups[1].Value.Trim()})
    }
}

Solution

  • TL;DR: just go to the code at the bottom.

    The code below is pretty ugly, so forgive me.

    Essentially, I created a regular expression that starts matching at Content-Type: text/html. It then lazily matches anything that follows until it reaches a line break: a newline \n, a carriage return \r, or the two together \r\n.

    You have to wrap those alternatives in parentheses in order to use the "or" operator |. We don't want to capture or return any of those groups, so we use the non-capturing group syntax (?:text-to-match), here and elsewhere in the pattern. Capturing and non-capturing groups can also be nested inside each other.
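
    As a rough illustration of the two group types, here is a small sketch. The sample string is made up for this example and is not taken from the question; only the parenthesized part without ?: shows up as a numbered group.

    ```powershell
    # Hypothetical sample: one header line followed by one payload line
    $sample = "Content-Transfer-Encoding: base64`r`nQUJD"

    # (?:...) groups the line-break alternatives without capturing them;
    # (\w+) is the only capturing group in the pattern
    $m = [regex]::Match($sample, 'base64(?:\r\n|\n|\r)(\w+)')

    $m.Groups.Count      # 2: group 0 (the whole match) and group 1 (the payload)
    $m.Groups[1].Value   # QUJD
    ```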

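    Continuing on: after matching the line break, we want to see Content-Transfer-Encoding: base64, which appears in each of your examples. Because the pattern runs in Singleline mode, the lazy .+? can also skip intermediate header lines such as Content-ID until it reaches the line break directly before that header.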

    After that we match the line break again, just like before, except that this time we match one or more by using +. We need more than one because the data you want to save is sometimes preceded by an empty line. Since it is sometimes not preceded by one, we follow the plus with a question mark, +?, to make it "lazy" so that it matches as few line breaks as needed.
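
    To see what "lazy" means in practice, consider this minimal sketch with a made-up sample string (three newlines between the header and the data). On its own, the lazy quantifier stops after one newline, but it still expands when the rest of the pattern needs it to.

    ```powershell
    # Hypothetical sample: three newlines before the data
    $sample = "base64`n`n`nDATA"

    [regex]::Match($sample, 'base64(\n+?)').Groups[1].Length     # 1 (lazy: as few as possible)
    [regex]::Match($sample, 'base64(\n+)').Groups[1].Length      # 3 (greedy: as many as possible)

    # The lazy quantifier grows until the following (\w+) can match
    [regex]::Match($sample, 'base64\n+?(\w+)').Groups[1].Value   # DATA
    ```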

    After that comes the part where we will be capturing your actual data. This will be the first time we use an actual capturing group, versus a non-capturing group (i.e. no question mark followed by a colon).

    We want to capture anything that is not a line break, because your data is sometimes followed by an empty line and sometimes not. Not allowing the capture to contain line breaks also forces the previous group to consume any extra line breaks that precede the data. That capturing group is ([^(?:\n|\n\r)]+)

    What we are doing there is wrapping the expression in parentheses in order to capture it. We place the expression inside brackets because we want to create our own "class" of characters: any of the characters inside the brackets is what the pattern looks for. The difference with ours is that we put a caret ^ as the first character inside the brackets, which means NOT any of these characters. We want to match everything up to the next line, so we capture anything that is not a line break, once or more, as many times as possible. One caveat: inside brackets, grouping and alternation syntax has no special meaning, so (, ?, :, | and ) are treated as literal characters and are excluded as well. That is harmless for base64 data, which never contains them, and [^\r\n]+ would express the intent more directly.
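
    The behavior of that character class can be checked with a quick sketch. The test strings are invented for the example, and the ^ and $ anchors are added here only so that the whole string has to match.

    ```powershell
    # The class as written in the answer, and a simpler class with the same intent
    $original = '^[^(?:\n|\n\r)]+$'
    $simpler  = '^[^\r\n]+$'

    # base64 text contains none of the excluded characters, so both classes accept it
    'QUJDRA==' -match $original   # True
    'QUJDRA==' -match $simpler    # True

    # Parentheses and | are literal members of the original class, so they are rejected
    'a(b|c)' -match $original     # False
    'a(b|c)' -match $simpler      # True
    ```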

    We then make sure the regex is anchored to some ending text. That starts with another line-break group matching at least one line break, but as few as required to make the capture succeed: (?:\n|\r|\r\n)+?. Together with the extra line-break group that follows it in the final pattern, this requires at least two line-break characters before the boundary, which a single \r\n satisfies but a single \n does not.

    Lastly, we anchor to the text that reliably marks where the data ends, which is --=_. I wasn't sure whether it would be followed by "alternative" or "related", so I didn't match any further.

    THE KEY TO IT ALL

    We wouldn't be able to match across line breaks without the regular expression "Singleline" mode, which makes . match newline characters as well. One way to enable it is to pass a value of the [System.Text.RegularExpressions.RegexOptions] enumeration to the .NET [Regex]::Matches method, as done below; the inline (?s) flag at the start of the pattern has the same effect. Note that "Multiline" is a different option: it only changes where ^ and $ match.
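
    A small sketch of the difference, using a made-up two-line string:

    ```powershell
    # Hypothetical two-line sample
    $sample = "first`nsecond"

    # By default . does not match a newline, so the pattern cannot span both lines
    [regex]::IsMatch($sample, 'first.+second')    # False

    # Singleline passed as a RegexOptions value
    [regex]::IsMatch($sample, 'first.+second', [System.Text.RegularExpressions.RegexOptions]::Singleline)    # True

    # The inline (?s) flag does the same and also works with the -match operator
    [regex]::IsMatch($sample, '(?s)first.+second')    # True
    $sample -match '(?s)first.+second'                # True
    ```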

    I create a separate regex for the text/html and the image/jpeg searches. We save the results of those matches into their respective variables.

    We can test whether anything matched by indexing into element [0] of the result, which is the first match object, and reading its .Success property, a boolean. The number of matches is available through the .Count property, and checking it first avoids indexing into an empty result. To get at the captured text, we go through each match's .Groups collection: since there is only one capturing group and the rest are non-capturing, Groups[0] holds the entire matched text and Groups[1] holds our capture. Because each group is an object, we read its .Value property.
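
    For reference, here is how those properties behave on a tiny, unrelated example. The input string and pattern are invented purely to show the object model.

    ```powershell
    # Two matches, each with two capturing groups
    $found = [regex]::Matches('a=1; b=2', '(\w)=(\d)')

    $found.Count                               # 2
    $found[0].Success                          # True
    $found[0].Value                            # a=1  (same as Groups[0].Value)
    $found[1].Groups[2].Value                  # 2
    $found.ForEach({ $_.Groups[1].Value })     # a, then b

    # No match at all: Count is 0, so there is nothing to index into
    $none = [regex]::Matches('a=1', 'z=(\d)')
    $none.Count                                # 0
    ```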

    Obviously the below code will require your $text variable to contain the data to search.

    # One pattern per content type; the single capturing group holds the base64 payload
    $RegexText = '(?:Content-Type: text/html.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
    $RegexImage = '(?:Content-Type: image/jpeg.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
    
    # Singleline lets . match line breaks, so .+? can span header lines
    $TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
    $ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)
    
    # Check Count first so an empty result is never indexed
    If ($TextMatches.Count -gt 0 -and $TextMatches[0].Success)
    {
        Write-Host "Found $($TextMatches.Count) Text Matches:"
        Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
    }
    If ($ImageMatches.Count -gt 0 -and $ImageMatches[0].Success)
    {
        Write-Host "Found $($ImageMatches.Count) Image Matches:"
        Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
    }
    

    The code above prints the following output to the screen:

    Found 2 Text Matches:
    111111111111111111111111111111111111111111111111111111
    222222222222222222222222222222222222222222222222222222
    Found 3 Image Matches:
    333333333333333333333333333333333333333333333333333333
    444444444444444444444444444444444444444444444444444444
    555555555555555555555555555555555555555555555555555555
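
    If the input may use plain \n line endings, a somewhat tighter variant of the same idea could look like the sketch below. This is not the code from the answer: Get-Base64Part is a hypothetical helper name, the sample is a shortened stand-in for the question's data, and like the pattern above it assumes each payload sits on a single line.

    ```powershell
    # Hypothetical helper: returns the base64 payload of every part whose
    # Content-Type header starts with the given type
    function Get-Base64Part {
        param(
            [Parameter(Mandatory)] [string] $Text,
            [Parameter(Mandatory)] [string] $ContentType
        )
        $nl = '(?:\r\n|\n|\r)'   # one line break in any of the three styles
        $pattern = @(
            'Content-Type: ' + [regex]::Escape($ContentType) + '[^\r\n]*' + $nl   # the Content-Type line
            '(?:[^\r\n]+' + $nl + ')*?'                   # further header lines, as few as possible
            'Content-Transfer-Encoding: base64' + $nl + '+'   # encoding header and line break(s)
            '([^\r\n]+)'                                  # the payload: one line without line breaks
            $nl + '+--=_'                                 # line break(s), then the next boundary
        ) -join ''
        foreach ($m in [regex]::Matches($Text, $pattern)) { $m.Groups[1].Value }
    }

    # Shortened stand-in for the question's data, joined with \n only
    $sample = @(
        '--=_alternative X_='
        'Content-Type: text/html; charset="KOI8-R"'
        'Content-Transfer-Encoding: base64'
        ''
        'AAAA'
        '--=_alternative X_=--'
        '--=_related X_='
        'Content-Type: image/jpeg'
        'Content-ID: <_2_X>'
        'Content-Transfer-Encoding: base64'
        'BBBB'
        ''
        '--=_related X_=--'
    ) -join "`n"

    Get-Base64Part -Text $sample -ContentType 'text/html'    # AAAA
    Get-Base64Part -Text $sample -ContentType 'image/jpeg'   # BBBB
    ```

    Since this sketch never lets . cross a line break, it does not need Singleline mode, and the header loop cannot run past an empty line into another part's body.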