I'm stuck. For the past couple of days I've been trying to parse this text (see the bottom of the post), but I can't figure some things out. First, the text is formatted as a tree with fixed-width columns, but the exact column width depends on the widest field.
I'm using Ruby. I first tried the Treetop gem and made some progress, but then decided to try Parslet, which I'm using now. It seems like it should be easier, but detailed documentation for it is hard to find.
Currently I parse each line individually and build an array of parsed entries, but that's not correct because I lose the structure. I need to parse it recursively and keep track of depth.
Here's my current code. It works, but all the data is flattened. My current idea is to parse recursively whenever the current line's start position (i.e. its width) is greater than the previous one's, since that means we should go one level deeper. I actually managed to get that working, but then I couldn't get back out to the shallower levels properly, so I removed that code.
require 'pp'
require 'parslet'
require 'parslet/convenience'

class TextParser < Parslet::Parser
  @@width = 5

  root :text

  rule(:text) { (line >> newline).repeat }
  rule(:line) { left >> ( topline | subline ).as(:entry) }

  rule(:topline) {
    float.as(:number) >> str('%') >> space >> somestring.as(:string1) >> space >> specialstring.as(:string2) >> space >> specialstring.as(:string3)
  }

  rule(:subline) {
    dynamic { |source, context|
      width = context.captures[:width].to_s.length
      width = width-1 if context.captures[:width].to_s[-1] == '|'
      if width > @@width
        # should be recursive
        result = ( specialline | lastline | otherline | empty )
      else
        result = ( specialline | lastline | otherline | empty )
      end
      @@width = width
      result
    }
  }

  rule(:otherline) {
    somestring.as(:string1)
  }

  rule(:specialline) {
    float.as(:number) >> str('%') >> dash >> space? >> specialstring.as(:string1)
  }

  rule(:lastline) {
    float.as(:number) >> str('%') >> dash >> space? >> str('[...]')
  }

  rule(:empty) {
    space?
  }

  rule(:left) { seperator.capture(:width) >> dash?.capture(:dash) >> space? }
  rule(:somestring) { match['0-9A-Za-z\.\-'].repeat(1) }
  rule(:specialstring) { match['0-9A-Za-z&()*,\.:<>_~'].repeat(1) }
  rule(:space) { match('[ \t]').repeat(1) }
  rule(:space?) { space.maybe }
  rule(:newline) { space? >> match('[\r\n]').repeat(1) }
  rule(:seperator) { space >> (str('|') >> space?).repeat }
  rule(:dash) { space? >> str('-').repeat(1) }
  rule(:dash?) { dash.maybe }
  rule(:float) { (digits >> str('.') >> digits) }
  rule(:digits) { match['0-9'].repeat(1) }
end

parser = TextParser.new
file = File.open("text.txt", "rb")
contents = file.read.to_s
file.close
pp parser.parse_with_debug(contents)
The text looks like this (https://gist.github.com/davispuh/4726538):
1.23% somestring specialstring specialstring
|
--- specialstring
|
|--12.34%-- specialstring
| specialstring
| |
| |--12.34%-- specialstring
| | specialstring
| | |
| | |--12.34%-- specialstring
| | --1.12%-- [...]
| |
| --2.23%-- specialstring
| |
| |--12.34%-- specialstring
| | specialstring
| | specialstring
| | |
| | |--12.34%-- specialstring
| | | specialstring
| | | specialstring
| | --1.23%-- [...]
| |
| --1.23%-- [...]
|
--1.05%-- [...]
1.23% somestring specialstring specialstring
2.34% somestring specialstring specialstring
|
--- specialstring
specialstring
specialstring
|
|--23.34%-- specialstring
| specialstring
| specialstring
--34.56%-- [...]
|
--- specialstring
specialstring
|
|--12.34%-- specialstring
| |
| |--100.00%-- specialstring
| | specialstring
| --0.00%-- [...]
--23.34%-- [...]
thanks :)
I was going to say the same thing as "the Tin Man". There has to be another format you can generate the data in.
If you want to parse this anyway... Parslet works like a map/reduce algorithm. Your first pass (parsing) is not intended to give you your final output, just to capture all the information you need from your source document.
Once you have that stored in a tree, you can then transform it to get the output you want.
So... I would write a parser that records each whitespace character as a node, as well as matching the text and percentages you need. I would group the whitespace nodes in an "indentation" node.
I would then use a transform to replace the whitespace nodes with a count of those nodes, which gives you the indentation of each line.
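To illustrate just the counting step, here is a minimal sketch in plain Ruby rather than a Parslet transform. It assumes that depth is determined by the leading run of spaces and '|' characters on each line; the `indent_width` helper and the sample strings are hypothetical, not part of Parslet or of the original data.

```ruby
# Hypothetical helper: width of the leading "tree drawing" prefix of a line,
# i.e. the run of spaces and '|' characters before the first other character.
def indent_width(line)
  # The pattern can match an empty string, so the result is never nil.
  line[/\A[ |]*/].length
end

# Made-up sample lines; "|--" and " --" branches at the same level
# end up with the same width because both ' ' and '|' are counted.
puts indent_width("  |--12.34%-- specialstring")
puts indent_width("   --1.05%-- [...]")
puts indent_width("  | |--12.34%-- specialstring")
```

Counting spaces and pipes together avoids the special case of subtracting one when the prefix ends in '|'.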
Remember: Parslet generates a standard Ruby hash. You can then write whatever code you like to make sense of this tree.
The parser is just converting the text file into a data structure you can manipulate.
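As one possible way to "make sense" of the flat result afterwards, here is a rough sketch that nests entries by indentation using a stack of open parents. The entry shape (`:indent`, `:text`) and the `nest` method are assumptions made up for illustration, not what Parslet actually emits for the grammar above.

```ruby
# Nest a flat list of entries by indentation, using a stack of open parents.
# Assumed (hypothetical) entry shape: { indent: Integer >= 0, text: String }.
def nest(entries)
  root = { indent: -1, text: nil, children: [] }
  stack = [root]
  entries.each do |entry|
    node = { indent: entry[:indent], text: entry[:text], children: [] }
    # Pop back out until the top of the stack is shallower than this entry.
    stack.pop while stack.last[:indent] >= node[:indent]
    stack.last[:children] << node
    stack.push(node)
  end
  root[:children]
end

flat = [
  { indent: 0, text: "a" },
  { indent: 4, text: "b" },
  { indent: 8, text: "c" },
  { indent: 4, text: "d" },
  { indent: 0, text: "e" },
]
tree = nest(flat)
```

The popping loop is what handles getting back out of a deeper level: an entry that is no deeper than the current top of the stack closes every open node at its own depth or deeper before being attached.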
Just to reiterate though: I think "the Tin Man" has the right answer. Generate the data in a machine-readable way instead.
Update:
For an alternative approach you can check out: Indentation sensitive parser using Parslet in Ruby?