ruby text

Plain text table data organization


I have a plain text table like the one in the sample data below. I need to split each result line so that every value ends up in its respective column.

I can split the string (one line) on a space and then I will get an array like:

["2", "1/47", "M4044", "25:03*", "856", "12:22", "12:41", "17.52", "Some", "Name", "Yo", "Prairie", "Inn", "Harriers", "Runni", "25:03"]

I can also split on two spaces, which gets closer but is still inconsistent, as you can see with the name:

["2", " 1/47", "M4044", " 25:03*", "856", " 12:22", " 12:41", "17.52 Some Name Yo", "", "", "", "", "", "", "Prairie Inn Harriers Runni", " 25:03 "]

I can specify which indexes to join on, but I need to grab possibly thousands of files just like this, and the columns are not always going to be in the same order.

The one constant is that the column data is never longer than the divider between column name and data (the ====). I tried to use this to my advantage, but found some loopholes.

I need to write an algorithm to detect what stays in the name column and what stays in whatever other 'word' columns. Any thoughts?


Solution

  • First we set up the problem:

    data = <<EOF
    Place Div/Tot Div   Guntime  PerF 1sthalf 2ndhalf 100m   Name                      Club                       Nettime 
    ===== ======= ===== =======  ==== ======= ======= ====== ========================= ========================== ======= 
        1   1/24  M3034   24:46   866   12:11   12:35  15.88 Andy Bas                  Prairie Inn Harriers         24:46 
        2   1/47  M4044   25:03*  856   12:22   12:41  17.52 Some Name Yo              Prairie Inn Harriers Runni   25:03 
    EOF
    lines = data.split "\n"
    

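    I like to build a format string for String#unpack from the divider line. Each run of = becomes an A directive of that width (a string with trailing spaces removed), and each space between columns becomes an x (skip one byte):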

    format = lines[1].scan(/(=+)(\s+)/).map{|f, s| "A#{f.size}" + 'x' * s.size}.join
    #=> A5xA7xA5xA7xxA4xA7xA7xA6xA25xA26xA7x
    

    The rest is easy:

    headers = lines[0].unpack format
    lines[2..-1].each do |line|
      puts Hash[headers.zip line.unpack(format).map(&:strip)]
    end
    #=> {"Place"=>"1", "Div/Tot"=>"1/24", "Div"=>"M3034", "Guntime"=>"24:46", "PerF"=>"866", "1sthalf"=>"12:11", "2ndhalf"=>"12:35", "100m"=>"15.88", "Name"=>"Andy Bas", "Club"=>"Prairie Inn Harriers", "Nettime"=>"24:46"}
    #=> {"Place"=>"2", "Div/Tot"=>"1/47", "Div"=>"M4044", "Guntime"=>"25:03", "PerF"=>"856", "1sthalf"=>"12:22", "2ndhalf"=>"12:41", "100m"=>"17.52", "Name"=>"Some Name Yo", "Club"=>"Prairie Inn Harriers Runni", "Nettime"=>"25:03"}
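The unpack approach above counts bytes rather than characters, and as far as I can tell an x directive that points past the end of a line raises an ArgumentError, so names with multibyte characters or lines whose trailing spaces were trimmed could trip it up. Below is a minimal sketch of an alternative that slices each line by character ranges taken from the ==== divider. The helper names (column_ranges, slice_row, parse_table) and the small sample table are made up for illustration, and it assumes the header is always the first line and the divider the second.

```ruby
# A sketch, not the original answer's code: helper names are hypothetical.

# Character range covered by each run of '=' in the divider line.
def column_ranges(divider)
  ranges = []
  divider.scan(/=+/) do
    m = Regexp.last_match            # match for the current run of '='
    ranges << (m.begin(0)...m.end(0))
  end
  ranges
end

# Cut one line into stripped cells. A column that starts past the end
# of a short line becomes "" instead of raising.
def slice_row(line, ranges)
  ranges.map { |r| (line[r] || "").strip }
end

# Assumes line 0 is the header and line 1 is the ==== divider.
def parse_table(text)
  lines = text.lines.map(&:chomp).reject { |l| l.strip.empty? }
  ranges = column_ranges(lines[1])
  headers = slice_row(lines[0], ranges)
  lines[2..-1].map { |line| headers.zip(slice_row(line, ranges)).to_h }
end

# Made-up sample: the last row has a multibyte name and no Time value.
table = <<~EOF
  Place Name       Time
  ===== ========== =====
      1 Andy Bas   24:46
      2 Zoë
EOF

rows = parse_table(table)
rows.each { |row| p row }
```

Like the unpack version, this only keeps what sits directly under a run of =. In the sample data from the question, the * after 25:03 appears to fall in the two-space gap between the Guntime and PerF dividers, so either approach drops it. If such markers matter, the ranges would need to be widened to cover the gaps.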