I am trying to split the <Description> text by Bit number and put each part into the element for that Bit number. Here is the file I am parsing.
<Register>
<Name>abc</Name>
<Abstract></Abstract>
<Description>Bit 6 random description
Bit 5 msg octet 2
Bit 4-1
Bit 0 msg octet 4
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
</Description>
<Field>
<Name>qwe</Name>
<Description></Description>
<BitFieldOffset>6</BitFieldOffset>
<Size>1</Size>
<AccessMode>Read/Write</AccessMode>
</Field>
<Field>
<Name>qwe</Name>
<Description></Description>
<BitFieldOffset>5</BitFieldOffset>
<Size>1</Size>
<AccessMode>Read/Write</AccessMode>
</Field>
<Field>
....
</Field>
</Register>
<Register>
<Name>xyz</Name>
<Abstract></Abstract>
<Description>Bit 3 msg octet 1
Bit 2 msg octet 2
Bit 1 msg octet 3
Bit 0 msg octet 4
These registers.
They should be written only after the cleared by hardware.
</Description>
<Field>
....
</Field>
<Field>
....
</Field>
</Register>
The expected output would be:
<Register>
<long_description>
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
</long_description>
<bit_field position="6" width=" 1">
<long_description>
<p> random description</p>
</long_description>
<bit_field position="5" width=" 1">
<long_description>
<p>...</p>
</long_description>
<bit_field position="1" width=" 4">
<long_description>
<p>...</p>
</long_description>
</Register>
<Register>
.
.
.
</Register>
I am using the XML::Twig package to parse this file, but I got stuck on the splitting.
foreach my $register ( $twig->get_xpath('//Register') ) # get each <Register>
{
my $reg_description= $register->first_child('Description')->text;
.
.
.
foreach my $xml_field ($register->get_xpath('Field'))
{
.
.
my @matched = split ('Bit\s+[0-9]', $reg_description);
.
.
}
}
I do not know how to create the <bit_field> elements accordingly and keep the text other than the Bit lines in the <long_description> of the <Register>. Can anyone please help here?
Edits:
A Bit entry in <Description> can span multiple lines. E.g. in the following example, the description of Bit 10-9 runs until the start of Bit 8:
<Description>Bit 11 GOOF
Bit 10-9 Clk Selection:
00 : 8 MHz
01 : 4 MHz
10 : 2 MHz
11 : 1 MHz
Bit 8 Clk Enable : 1 = Enable CLK
</Description>
If I understood everything correctly, you can process the whole text block line by line.
Use a regular expression to check whether a line matches the pattern for a bit, and capture the relevant parts. Collect the bits one by one in an array of hashes, each hash storing the details of one bit.
Buffer the lines that don't match the bit pattern. If a line matching the bit pattern follows, the buffer must belong to the most recent bit, so append it there. All remaining lines must be part of the overall description. Note: this cannot tell additional description lines of the last bit apart from the overall description. If the last bit has such lines, they will end up at the beginning of the overall description. (But you said such things aren't in your data.)
Proof of concept:
#!/usr/bin/perl
use strict;
use warnings;
my $description_in = 'Bit 6 random description
Bla bla additional line bla bla
bla bla
Bit 5 msg octet 2
Empty line below
Bla bla set to gain instant world domination bla bla
Bit 4-1
Bit 0 msg octet 4
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
Empty line above
Bla bla bla...';
my @bits = ();
my $description_overall = '';
my $line_buffer = '';
foreach my $line (split("\n", $description_in)) {
    # if line
    #   begins with optional white spaces
    #   followed by "Bit"
    #   followed by at least one white space
    #   followed by at least one digit (we capture the digits)
    #   followed by an optional sequence of optional white spaces, "-", optional white spaces and at least one digit (we capture the digits)
    #   followed by an optional sequence of at least one white space and any characters (we capture the characters)
    #   followed by the end of the line
    if ($line =~ m/^\s*Bit\s+(\d+)(?:\s*-\s*(\d+))?(?:\s+(.*?))?$/) {
        my ($position_begin, $position_end, $description) = ($1, $2, $3);
        my $width;
        # if there already are bits we've processed and lines have been buffered
        # (checking the buffer avoids appending a stray "\n" when nothing was buffered)
        if (scalar(@bits) && length($line_buffer)) {
            # the buffered lines belong to the bit before the current one, so append them to its description
            $bits[$#bits]->{description} .= (length($bits[$#bits]->{description}) ? "\n" : '') . $line_buffer;
            # and reset the line buffer to collect the additional lines of the current bit
            $line_buffer = '';
        }
        # $position_end is defined only if it was a "Bit n-m"
        # otherwise set it to $position_begin
        $position_end = defined($position_end) ? $position_end : $position_begin;
        $width = abs($position_end - $position_begin) + 1;
        # set description to the empty string if not defined (i.e. no description was found)
        $description = defined($description) ? $description : '';
        # push a ref to a new hash with the keys position, description and width into the list of bits
        push(@bits, { position => (sort({$a <=> $b} ($position_begin, $position_end)))[0], # always take the lower position
                      description => $description,
                      width => $width });
    }
    else {
        # it's not a bit pattern, so just buffer the line
        $line_buffer .= (length($line_buffer) ? "\n" : '') . $line;
    }
}
# anything still in the buffer must belong to the overall description
$description_overall .= $line_buffer;
print("<Register>\n <long_description>\n$description_overall\n </long_description>\n");
foreach my $bit (@bits) {
    print(" <bit_field position=\"$bit->{position}\" width=\"$bit->{width}\">\n <long_description>\n$bit->{description}\n </long_description>\n </bit_field>\n");
}
print("</Register>\n");
Prints:
<Register>
<long_description>
These registers containpart of the Upstream Message.
They should be written only after the cleared by hardware.
Empty line above
Bla bla bla...
</long_description>
<bit_field position="6" width="1">
<long_description>
random description
Bla bla additional line bla bla
bla bla
</long_description>
</bit_field>
<bit_field position="5" width="1">
<long_description>
msg octet 2
Empty line below
Bla bla set to gain instant world domination bla bla
</long_description>
</bit_field>
<bit_field position="1" width="4">
<long_description>
</long_description>
</bit_field>
<bit_field position="0" width="1">
<long_description>
msg octet 4
</long_description>
</bit_field>
</Register>
I wrote it as a standalone script so that I could test it. You'll have to adapt it to your script.
You may also want to post-process the overall description to eliminate those long sequences of white space.
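As a starting point for that adaptation, the loop can be wrapped in a sub. This is only a sketch: the name parse_description is hypothetical, the sample text is the second register from the question with indentation that I assumed, and the clean-up at the end is just one possible way to drop the long runs of white space.

```perl
use strict;
use warnings;

# Hypothetical helper: same line-by-line idea as the proof of concept.
# Returns a ref to the list of bits and the overall description.
sub parse_description {
    my ($text) = @_;
    my (@bits, @buffer);
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*Bit\s+(\d+)(?:\s*-\s*(\d+))?(?:\s+(.*?))?\s*$/) {
            my ($begin, $end, $desc) = ($1, $2, $3);
            $end = $begin unless defined $end;
            # buffered lines belong to the previous bit, if there is one
            push @{ $bits[-1]{lines} }, splice(@buffer) if @bits;
            push @bits, {
                position => ($begin < $end ? $begin : $end),   # lower position
                width    => abs($end - $begin) + 1,
                lines    => [ (defined $desc && length $desc) ? $desc : () ],
            };
        }
        else {
            push @buffer, $line;   # not a bit line, keep it for later
        }
    }
    # join the collected lines of each bit into one description string
    $_->{description} = join("\n", @{ delete $_->{lines} }) for @bits;
    # whatever is still buffered is treated as the overall description
    return (\@bits, join("\n", @buffer));
}

# Sample input: indentation is an assumption, not taken from the real file
my $text = join "\n",
    'Bit 3 msg octet 1',
    'Bit 2 msg octet 2',
    'Bit 1 msg octet 3',
    'Bit 0 msg octet 4',
    '      These registers.',
    '      They should be written only after the cleared by hardware.';

my ($bits, $overall) = parse_description($text);

# One possible clean-up: strip leading indentation, squeeze runs of blanks
(my $clean = $overall) =~ s/^[ \t]+//mg;
$clean =~ s/[ \t]{2,}/ /g;

print "position=$_->{position} width=$_->{width}: $_->{description}\n" for @$bits;
print "overall:\n$clean\n";
```

Inside the XML::Twig loop from the question, the argument would be the text you already fetch with $register->first_child('Description')->text. Building the <bit_field> elements from the returned list is left out here.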
First I tried a continuing pattern (while ($x =~ m/^...$/gc)), but that somehow ate the line endings, so only every second line matched. Lookarounds, meant to keep the line endings out of the actual match, didn't work either (Perl said it wasn't implemented; I guess I'll have to check the Perl on this computer), so explicitly splitting into lines is a workaround.
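For what it's worth, a continuing pattern can probably be made to work. The following is only a sketch under assumptions: it relies on the /m modifier so that ^ matches after every newline, and on a negative lookahead to pull continuation lines into the description. I can't say for sure why the attempt above skipped lines. The "not implemented" message is possibly the one Perl gives for variable-length lookbehind, which a lookahead like this one is not affected by. Unlike the proof of concept, this variant attaches any text after the last bit line to that last bit.

```perl
use strict;
use warnings;

# Multi-line example from the question
my $text = join "\n",
    'Bit 11 GOOF',
    'Bit 10-9 Clk Selection:',
    '00 : 8 MHz',
    '01 : 4 MHz',
    '10 : 2 MHz',
    '11 : 1 MHz',
    'Bit 8 Clk Enable : 1 = Enable CLK';

my @found;
while ($text =~ m/
        ^[ \t]*Bit[ \t]+(\d+)           # "Bit n" at the start of a line
        (?:[ \t]*-[ \t]*(\d+))?         # optional "-m"
        [ \t]*
        (                               # description:
            .*                          #   rest of the bit line
            (?:\n(?![ \t]*Bit[ \t]+\d).*)*   # following lines that are not bit lines
        )
    /mgx) {
    my ($begin, $end, $desc) = ($1, $2, $3);
    $end = $begin unless defined $end;
    push @found, [$begin, $end, $desc];
}

print "Bit $_->[0]-$_->[1]:\n$_->[2]\n\n" for @found;
```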
It might also be possible to shorten it using grep(), map() or the like. But I think the verbose version demonstrates the ideas better, so I didn't look into that.
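One shorter route, sketched here rather than tested against your real data: split returns the text matched by capturing parentheses in its pattern along with the fields, so the bit numbers don't get lost the way they do with the split('Bit\s+[0-9]', ...) from the question. The sketch assumes the text starts with a Bit line, and it shares the limitation that everything after the last Bit line stays attached to that bit.

```perl
use strict;
use warnings;

my $text = join "\n",
    'Bit 6 random description',
    'Bit 5 msg octet 2',
    'Bit 4-1',
    'Bit 0 msg octet 4',
    'These registers.';

# Without capturing parentheses the separators, and with them the
# bit numbers, are thrown away.
my @plain = split /Bit\s+[0-9]/, $text;

# With a capturing group, split also returns what the group matched,
# so the list alternates between bit range and description.
my @fields = split /^[ \t]*Bit[ \t]+(\d+(?:-\d+)?)[ \t]*/m, $text;
shift @fields;    # drop the empty field before the first "Bit" line

my @pairs;
while (@fields) {
    my ($range, $body) = splice(@fields, 0, 2);
    $body = '' unless defined $body;   # split drops trailing empty fields
    $body =~ s/\s+\z//;                # remove the trailing newline
    push @pairs, [$range, $body];
}

print "[$_->[0]] $_->[1]\n" for @pairs;
```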