I have a file that looks like the one below. I want to scan this text file for the keywords PageType=PERSONAL_ID, STUB, PAGE and ENVELOPE and, for each match, print the .tif names from the next 2 lines (Filename="***.tif"; I don't need the Filename part, just the 00010.tif) to a new text file. I want to loop over the different PageTypes and print the .tif names that appear below each keyword and before the next one. If the PageType is ENVELOPE, the loop should end, and a new loop should start when PERSONAL_ID appears again, repeating until the end of the file.
Edit: can I also print "first image" or "second image" beside each .tif name?
if ($line =~ /PERSONAL_ID/) {
    # print the .tifs that are listed on the lines below this occurrence
    # This is the first image 00010.tif
    # This is the second image 00011.tif
}
elsif ($line =~ /STUB/) {
    # print the .tifs that are listed on the lines below this occurrence
    # This is the first image 00020.tif
    # This is the second image 00021.tif
}
elsif ($line =~ /PAGE/) {
    # This is the first image 00030.tif
    # This is the second image 00031.tif
}
elsif ($line =~ /ENVELOPE/) {
    # print the .tif that is on the next line
    # This is the first image 00040.tif
}
elsif ($line =~ /PERSONAL_ID/) {
    # This is the first image 00050.tif
    # This is the second image 00051.tif
}
This last elsif (or else) is what I can't figure out: if PageType=PERSONAL_ID, STUB, PAGE or ENVELOPE appears again, I want the .tifs to be printed below that occurrence, and this should repeat to the end of the file. There can be any number of .tif files.
Output
PERSONAL_ID
00010.tif
00011.tif
STUB
00020.tif
00021.tif
ENVELOPE
00040.tif
Loop New Occurrence
PERSONAL_ID
00050.tif
00051.tif
<?xml version="1.0" encoding="utf-8"?>
<Batch FormatVersion="" BaseMachine="MODEL_72" ScanDevice="" SoftwareVersion="06.00.17.00" TransportId="4" BatchIdentifier="" JobType="UNSTRUCTURED" OperatingMode="MANUAL_SCAN" JobName="" OperatorName="OPEX Technician" StartTime="2025-02-26 11:05:30" ReceiveDate="2025-02-26" ProcessDate="2025-02-26" ImageFilePath="" PluginMessage="" DeveloperReserved="04,01">
<Transaction TransactionID="0001">
<Page BatchSequence="1" TransactionSequence="1" ScanSequence="1" ItemStatus="VALID" IsVirtual="NO" PageType="PERSONAL_ID" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="5.977 IN" Height="2.730 IN" EnvelopeDetect="NO" AverageThickness="0.90" SkewDegrees="-0.06" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00010.tif" Filesize="32126" Length="1793" Height="819" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00011.tif" Filesize="672" Length="1794" Height="820" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Micr Status="" RtStatus="" CheckType="" Side="FRONT" Value="d091000080d7355495c1465"/>
<Ocr Index="1" Side="FRONT" Name="OCR 1"/>
</Page>
<Page BatchSequence="2" TransactionSequence="2" ScanSequence="2" ItemStatus="VALID" IsVirtual="NO" PageType="STUB" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.413 IN" Height="3.593 IN" EnvelopeDetect="NO" AverageThickness="0.80" SkewDegrees="-0.12" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00020.tif" Filesize="29118" Length="2524" Height="1078" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00021.tif" Filesize="13972" Length="2527" Height="1080" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Ocr Index="1" Name="OCR 1"/>
</Page>
<Page BatchSequence="3" TransactionSequence="3" ScanSequence="3" ItemStatus="VALID" IsVirtual="NO" PageType="PAGE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.437 IN" Height="7.257 IN" EnvelopeDetect="NO" AverageThickness="0.80" SkewDegrees="0.23" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00030.tif" Filesize="18582" Length="2531" Height="2177" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00031.tif" Filesize="4550" Length="2533" Height="2177" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
<Page BatchSequence="4" TransactionSequence="4" ScanSequence="4" ItemStatus="VALID" IsVirtual="NO" PageType="ENVELOPE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="9.460 IN" Height="4.137 IN" EnvelopeDetect="YES" AverageThickness="1.80" SkewDegrees="-0.06" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00040.tif" Filesize="64590" Length="2838" Height="1241" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
</Transaction>
<Transaction TransactionID="0002">
<Page BatchSequence="5" TransactionSequence="1" ScanSequence="5" ItemStatus="VALID" IsVirtual="NO" PageType="PERSONAL_ID" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="5.977 IN" Height="2.740 IN" EnvelopeDetect="NO" AverageThickness="1.00" SkewDegrees="-0.09" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00050.tif" Filesize="26258" Length="1793" Height="822" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00051.tif" Filesize="746" Length="1795" Height="824" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Micr Status="" RtStatus="" CheckType="" Side="FRONT" Value="d074000201d1355503c1473"/>
<Ocr Index="1" Side="FRONT" Value="d074000201d 1355503c 1473" Name="OCR 1"/>
</Page>
<Page BatchSequence="6" TransactionSequence="2" ScanSequence="6" ItemStatus="VALID" IsVirtual="NO" PageType="STUB" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.440 IN" Height="3.720 IN" EnvelopeDetect="NO" AverageThickness="0.70" SkewDegrees="-0.15" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00060.tif" Filesize="11880" Length="2532" Height="1116" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00061.tif" Filesize="4644" Length="2532" Height="1119" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Ocr Index="1" Side="FRONT" Value="7000000000000000001000050124000000006660000634419001612020908000001" Name="OCR 1"/>
</Page>
<Page BatchSequence="7" TransactionSequence="3" ScanSequence="7" ItemStatus="VALID" IsVirtual="NO" PageType="PAGE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.463 IN" Height="7.260 IN" EnvelopeDetect="NO" AverageThickness="0.90" SkewDegrees="-0.10" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00070.tif" Filesize="33196" Length="2539" Height="2178" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00071.tif" Filesize="4962" Length="2546" Height="2177" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
<Page BatchSequence="8" TransactionSequence="4" ScanSequence="8" ItemStatus="VALID" IsVirtual="NO" PageType="ENVELOPE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="9.460 IN" Height="4.140 IN" EnvelopeDetect="YES" AverageThickness="1.90" SkewDegrees="-0.09" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00080.tif" Filesize="65086" Length="2838" Height="1242" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
</Transaction>
<Transaction TransactionID="0003">
<Page BatchSequence="9" TransactionSequence="1" ScanSequence="9" ItemStatus="VALID" IsVirtual="NO" PageType="PERSONAL_ID" PageName="Null" SubPageName="" OperatorSelect="NO" Bin="1" Length="5.980 IN" Height="2.737 IN" EnvelopeDetect="NO" AverageThickness="1.00" SkewDegrees="-0.07" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00090.tif" Filesize="30110" Length="1794" Height="821" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00091.tif" Filesize="706" Length="1792" Height="820" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Micr Status="" RtStatus="" CheckType="" Side="FRONT" Value="d074000201d1355495c1465"/>
<Ocr Index="1" Side="FRONT" Value="d074000201d 1355495c 1465" Name="OCR 1"/>
</Page>
<Page BatchSequence="10" TransactionSequence="2" ScanSequence="10" ItemStatus="VALID" IsVirtual="NO" PageType="STUB" PageName="Null" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.470 IN" Height="3.740 IN" EnvelopeDetect="NO" AverageThickness="0.90" SkewDegrees="-0.02" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00100.tif" Filesize="13138" Length="2541" Height="1122" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00101.tif" Filesize="6372" Length="2539" Height="1122" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Ocr Index="1" Side="FRONT" Value="8000000000000000001000051524000000118393200412484002327230107008007" Name="OCR 1"/>
</Page>
<Page BatchSequence="11" TransactionSequence="3" ScanSequence="11" ItemStatus="VALID" IsVirtual="NO" PageType="STUB" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.460 IN" Height="3.543 IN" EnvelopeDetect="NO" AverageThickness="1.00" SkewDegrees="-0.18" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00110.tif" Filesize="19664" Length="2538" Height="1063" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00111.tif" Filesize="838" Length="2538" Height="1062" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Ocr Index="1" Side="FRONT" Value="7000000000000000000100060124000000054212500023862000015312305005009" Name="OCR 1"/>
</Page>
<Page BatchSequence="12" TransactionSequence="4" ScanSequence="12" ItemStatus="VALID" IsVirtual="NO" PageType="STUB" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.453 IN" Height="3.530 IN" EnvelopeDetect="NO" AverageThickness="0.90" SkewDegrees="-0.19" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00120.tif" Filesize="16900" Length="2536" Height="1059" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00121.tif" Filesize="896" Length="2536" Height="1061" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Ocr Index="1" Side="FRONT" Value="7000000000000000000100060124000000002649400075106000008090709008008" Name="OCR 1"/>
</Page>
<Page BatchSequence="13" TransactionSequence="5" ScanSequence="13" ItemStatus="VALID" IsVirtual="NO" PageType="ENVELOPE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="9.463 IN" Height="4.137 IN" EnvelopeDetect="YES" AverageThickness="1.90" SkewDegrees="-0.06" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00130.tif" Filesize="63664" Length="2839" Height="1241" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
</Transaction>
<Transaction TransactionID="0004">
<Page BatchSequence="14" TransactionSequence="1" ScanSequence="14" ItemStatus="VALID" IsVirtual="NO" PageType="PERSONAL_CHECK" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.223 IN" Height="3.003 IN" EnvelopeDetect="NO" AverageThickness="1.10" SkewDegrees="0.04" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00140.tif" Filesize="30256" Length="2467" Height="901" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00141.tif" Filesize="818" Length="2467" Height="900" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Micr Status="" RtStatus="" CheckType="" Side="FRONT" Value="c000419cd061000146d6300482c"/>
<Ocr Index="1" Side="FRONT" Value="c000419c d061000146d 6300482c" Name="OCR 1"/>
</Page>
<Page BatchSequence="15" TransactionSequence="2" ScanSequence="15" ItemStatus="VALID" IsVirtual="NO" PageType="STUB" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.457 IN" Height="3.847 IN" EnvelopeDetect="NO" AverageThickness="0.80" SkewDegrees="-0.11" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00150.tif" Filesize="14444" Length="2537" Height="1154" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00151.tif" Filesize="3050" Length="2539" Height="1155" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Ocr Index="1" Side="FRONT" Value="7000000000000000000100060124000000256271300016730003122090605001009" Name="OCR 1"/>
</Page>
<Page BatchSequence="16" TransactionSequence="3" ScanSequence="16" ItemStatus="VALID" IsVirtual="NO" PageType="PAGE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="8.467 IN" Height="7.253 IN" EnvelopeDetect="NO" AverageThickness="0.90" SkewDegrees="-0.90" DeskewStatus="YES" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00160.tif" Filesize="25808" Length="2540" Height="2176" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
<Image Index="2" Status="GOOD" Side="BACK" Type="FULL" Depth="1" Format="TIFF" Filename="00161.tif" Filesize="4168" Length="2551" Height="2177" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
<Page BatchSequence="17" TransactionSequence="4" ScanSequence="17" ItemStatus="VALID" IsVirtual="NO" PageType="ENVELOPE" PageName="NULL" SubPageName="" OperatorSelect="NO" Bin="1" Length="9.467 IN" Height="4.137 IN" EnvelopeDetect="YES" AverageThickness="1.80" SkewDegrees="-0.08" DeskewStatus="NO" FrontStreakDetectStatus="INACTIVE" BackStreakDetectStatus="INACTIVE" PlugInPageMessage="">
<Image Index="1" Status="GOOD" Side="FRONT" Type="FULL" Depth="1" Format="TIFF" Filename="00170.tif" Filesize="64182" Length="2840" Height="1241" OffsetLength="0" OffsetHeight="0" ResolutionLength="300" ResolutionHeight="300"/>
</Page>
</Transaction>
</Batch>
As mentioned by @Shawn in a comment, this doesn't look like a job for a regex. If you wrap your content in a tag (any tag), the script below finds every element with one of the PageType values you mentioned and prints the Filename attribute of its child elements if it exists and ends with .tif:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Wrap the pasted elements in a single root tag so they parse as one document.
my $data_in_tags = sprintf("<xml>%s</xml>", join("", <DATA>));
my $dom = XML::LibXML->load_xml(string => $data_in_tags);

# Select every element whose PageType is one of the four requested values.
my @nodes = $dom->findnodes('//*[@PageType="PERSONAL_ID" or @PageType="STUB" or @PageType="PAGE" or @PageType="ENVELOPE"]');

# Running image count per page type, across the whole input.
my %page_type_image_counter = ();

foreach my $node (@nodes) {
    my $page_type = $node->getAttribute("PageType");
    foreach my $child ($node->findnodes('./*')) {
        # Skip children without a Filename ending in .tif (e.g. <Micr>, <Ocr>).
        my $filename = $child->getAttribute('Filename');
        next unless defined $filename && $filename =~ /\.tif$/;
        $page_type_image_counter{$page_type}++;
        printf(
            "PageType %s image %d: %s\n",
            $page_type,
            $page_type_image_counter{$page_type},
            $filename
        );
    }
}
__DATA__
<Transaction ....>
[your content here]
</Transaction>
Output running this on your example:
› perl main.pl
PageType PERSONAL_ID image 1: 00010.tif
PageType PERSONAL_ID image 2: 00011.tif
PageType STUB image 1: 00020.tif
PageType STUB image 2: 00021.tif
PageType PAGE image 1: 00030.tif
PageType PAGE image 2: 00031.tif
PageType ENVELOPE image 1: 00040.tif
PageType PERSONAL_ID image 3: 00050.tif
PageType PERSONAL_ID image 4: 00051.tif
PageType STUB image 3: 00060.tif
PageType STUB image 4: 00061.tif
PageType PAGE image 3: 00070.tif
PageType PAGE image 4: 00071.tif
PageType ENVELOPE image 2: 00080.tif
PageType PERSONAL_ID image 5: 00090.tif
PageType PERSONAL_ID image 6: 00091.tif
PageType STUB image 5: 00100.tif
PageType STUB image 6: 00101.tif
PageType STUB image 7: 00110.tif
PageType STUB image 8: 00111.tif
PageType STUB image 9: 00120.tif
PageType STUB image 10: 00121.tif
PageType ENVELOPE image 3: 00130.tif
PageType STUB image 11: 00150.tif
PageType STUB image 12: 00151.tif
PageType PAGE image 5: 00160.tif
PageType PAGE image 6: 00161.tif
PageType ENVELOPE image 4: 00170.tif
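The output above numbers images per page type across the whole file. If you would rather have the layout from the question (a PageType heading, the .tif names below it, and a separator before each new group), one possible variation is sketched below. It assumes that each <Transaction> element corresponds to one "loop" in the question, which holds for the sample but is my assumption, not something the file format guarantees. The function name tif_report, the "image N" label and the trimmed-down $sample data are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;

# Page types to report, taken from the question.
my %wanted = map { $_ => 1 } qw(PERSONAL_ID STUB PAGE ENVELOPE);

# Build the report lines for one XML string.
# Assumption: one <Transaction> equals one "loop" from the question.
sub tif_report {
    my ($xml) = @_;
    my $dom = XML::LibXML->load_xml(string => $xml);
    my @lines;
    my $seen_transaction = 0;
    foreach my $transaction ($dom->findnodes('//Transaction')) {
        # Separator before every transaction except the first.
        push @lines, 'Loop New Occurrence' if $seen_transaction++;
        foreach my $page ($transaction->findnodes('./Page')) {
            my $page_type = $page->getAttribute('PageType') // '';
            next unless $wanted{$page_type};
            push @lines, $page_type;
            my $n = 0;    # image position within this page
            foreach my $image ($page->findnodes('./Image')) {
                my $filename = $image->getAttribute('Filename') // '';
                next unless $filename =~ /\.tif\z/;
                push @lines, sprintf('%s image %d', $filename, ++$n);
            }
        }
    }
    return @lines;
}

# Cut-down stand-in for the real file, with only the attributes used here.
my $sample = <<'XML';
<Batch>
  <Transaction TransactionID="0001">
    <Page PageType="PERSONAL_ID">
      <Image Index="1" Filename="00010.tif"/>
      <Image Index="2" Filename="00011.tif"/>
      <Micr Value="x"/>
    </Page>
    <Page PageType="ENVELOPE">
      <Image Index="1" Filename="00040.tif"/>
    </Page>
  </Transaction>
  <Transaction TransactionID="0002">
    <Page PageType="PERSONAL_CHECK">
      <Image Index="1" Filename="00140.tif"/>
    </Page>
    <Page PageType="STUB">
      <Image Index="1" Filename="00150.tif"/>
    </Page>
  </Transaction>
</Batch>
XML

print "$_\n" for tif_report($sample);
```

Grouping by <Transaction> rather than waiting for the next PERSONAL_ID also covers a transaction like 0004, which starts with PERSONAL_CHECK. To run this on a real file, you could read the file into a string and pass it to tif_report, then redirect the script's output to the new text file.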