ruby regex rubular

Unable to split data properly with Ruby Regex Rubular


I am trying to organize and break up the contents of emails that have been retrieved through Net::POP3. In the code, when I use

p mail.pop

I get

****************************\r\n>>=20\r\n>>11) <> Summary: Working with Vars on Social Influence =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n>>=20\r\n>> Category: Social Psychology=20\r\n>>=20\r\n>> Email: information@example.com =\r\n<mailto:information@example.com>=20\r\n>>=20\r\n>> Journal News: Saving Grace \r\n>>=20\r\n>> Deadline: 10:00 PM EST - 15 February=20\r\n>>=20\r\n>> Query:=20\r\n>>=20\r\n>> Lorem ipsum dolor sit amet \r\n>> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n>>=20\r\n>> Duis aute irure dolor in reprehenderit in voluptate \r\n>> velit esse cillum dolore eu fugiat nulla pariatur. =20\r\n>>=20\r\n>> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.=20\r\n>> Requirements:=20\r\n>>=20\r\n>> Psychologists; anyone with good knowdledge\r\n>> with sociology and psychology.=20\r\n>>=20\r\n>> Please do send me your article and profile\r\n>> you want to be known as well. Thank you!=20\r\n>> Back to Top <x-msg://30/#top> Back to Category Index =\r\n<x-msg://30/#SocialPsychology>\r\n>>-----------------------------------\r\n>>=20\r\n>> 

I am trying to break it up and organize it to

11) Summary: Working with Vars on Social Influence 

Name: Megumi Lindon 

Category: Social Psychology 

Email: information@example.com 

Journal News: Saving Grace 

Deadline: 10:00 PM EST - 15 February

Questions:Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Requirements: Psychologists; anyone with good knowdledge with sociology and psychology.

So far, I have been using Rubular, but with varying results, as I am still learning how to use regex, gsub and split properly. My code so far is below.

  p mail.pop.scan(/Summary: (.+) Name:/)
  p mail.pop.scan(/Name: (.+) Category:/)
  p mail.pop.scan(/Category: (.+) Email:/) 
  p mail.pop.scan(/Email: (.+) Journal News:/)     
  p mail.pop.scan(/Journal News: (.+) Deadline:/)       
  p mail.pop.scan(/Deadline: (.+) Questions:/)    
  p mail.pop.scan(/Questions:(.+) Requirements:/) 
  p mail.pop.scan(/Requirements:(.+) Back to Top/)  

But I have been getting empty arrays.

[]
[]
[]
[]
[]
[]
[]
[]

Wondering how I can do this better. Thanks in advance.


Solution

  • Oh, my! What a mess!

    There are many ways to approach this problem, of course, but I expect they all involve multiple steps and lots of trial and error. I can only say how I went about it.

    Lots of little steps are a good thing, for a couple of reasons. Firstly, they break the problem down into manageable tasks whose solutions can be tested individually. Secondly, the parsing rules may change in the future. If you have several steps you may only have to change and/or add one or two operations. If you have few steps and complex regular expressions, you may as well start over, particularly if the code was written by someone else.
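
    As an aside, here is a small sketch of why the scans in the question come back empty. The string `body` is a hypothetical, shortened stand-in for the real message. In Ruby, `.` does not match "\n" unless the regex carries the `/m` flag, and the field names sit on different lines, so `(.+) Name:` can never match. (The question also scans for "Questions:" while the text says "Query:".)

    ```ruby
    # Hypothetical, shortened stand-in for the output of mail.pop.
    body = "Summary: Working with Vars\r\n>>=20\r\n>> Name: Megumi Lindon \r\n"

    # Without /m the dot stops at "\n", so " Name:" is never reached.
    single_line = body.scan(/Summary: (.+) Name:/)

    # With /m the dot also matches "\n", but the capture still contains the debris.
    multi_line = body.scan(/Summary: (.+) Name:/m)

    p single_line
    p multi_line
    ```

    Even with `/m` the captured text still needs cleaning, which is why I prefer to tidy the string first.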

    Let's say text is a variable containing your string.

    Firstly, I don't like all those newlines, because they complicate regexes, so the first thing I'd do is get rid of them:

    s1 = text.gsub(/\n/, '')
    

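    Next, there are many occurrences of "20\r", which can be troublesome, as we may want to keep other text that contains numbers. We can remove those by deleting any run of digits that is immediately followed by "\r" (this also catches something like "7941\r"):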

    s2 = s1.gsub(/\d+\r/, '') 
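
    To see what these two substitutions do, here is a minimal sketch on a short, made-up string (`sample` is hypothetical, not the real message). Note that "10:00" survives because its digits are not followed by "\r".

    ```ruby
    # Hypothetical sample with the same kind of debris as the email body.
    sample = "Category: Social Psychology=20\r\n>>=20\r\n>> Deadline: 10:00 PM\r\n"

    no_newlines = sample.gsub(/\n/, '')           # drop line feeds only
    no_digits   = no_newlines.gsub(/\d+\r/, '')   # drop digits that sit right before "\r"

    p no_digits
    ```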
    

    Now let's look at the fields you want and the immediately-preceding and immediately-following text:

    puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
      # <> Summary: Working with V
      #=>> Name: Megumi Lindon 
      #=>> Category: Social Psychol
      #=>> Email: information@ex
      #<mailto:information@exa
      #=>> Journal News: Saving Grace 
      #=>> Deadline: 10:00 PM EST -
      #=>> Query:=>>=>> Lorem ip
      #=>> Requirements:=>>=>> Psycholo
      # <x-msg://30/#top> Back
      #<x-msg://30/#SocialPsy
    

    We see that the fields of interest begin with "> " and the field name is followed by ": " or ":=". Let's simplify by changing ":=" to ": " after the field name and "> " to " :" before the field name:

    s3 = s2.gsub(/(?<=\w):=/, ": ")
    s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
    

    In the regex for s3, (?<=\w) is a "positive lookbehind": the match must be immediately preceded by a word character (which is not included as part of the match); in the regex for s4, (?=(?:\w+\s+)*\w+: ) is a "positive lookahead": the match must be immediately followed by one or more words followed by a colon then a space. Note that s3 and s4 must be calculated in the given order.
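
    The two lookarounds, and the reason the order matters, can be sketched on a short made-up string (`sample` is hypothetical):

    ```ruby
    # Hypothetical fragment in the state the text is in after the first two steps.
    sample = "=>> Name: Megumi =>> Query:=>> Lorem"

    # Lookbehind: ":=" is replaced only when a word character precedes it.
    step3 = sample.gsub(/(?<=\w):=/, ": ")

    # Lookahead: "> " is replaced only when a field name and ": " follow it.
    step4 = step3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

    # Running the second substitution first misses "Query", which still ends in ":=".
    wrong_order = sample.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

    p step4
    p wrong_order
    ```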

    We can now remove every character other than letters, digits, spaces and a handful of punctuation characters:

    s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
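
    One thing to be aware of: inside the character class, `!-(` is read as a range from "!" to "(", not as three separate characters. That is why "#" is kept and "-" is removed in the results below. The sketch here compares the class as written with a variant of my own (an assumption about what one might want, not what the rest of this answer uses) in which the hyphen is escaped so that it is kept. In both cases "@" is removed.

    ```ruby
    # Hypothetical sample containing the characters of interest.
    sample = "10:00 PM EST - 15 February <x-msg://30/#top> a@b.com"

    # The class as written: "!-(" is a range, so "#" is kept and "-" is removed.
    as_written = sample.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")

    # Variant with an escaped, literal hyphen: "-" is kept and "#" is removed.
    literal_hyphen = sample.gsub(/[^a-zA-Z0-9 :;.?!()\[\]{}\-]/, "")

    p as_written
    p literal_hyphen
    ```

    If you want to keep "@" in the email address, you would have to add it to the class as well.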
    

    and then (finally) split on the fields:

    a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
      # => ["11)  :", "Summary: ", "Working with Vars on Social Influence platform :",
      #     "Name: ", "Megumi Lindon  :",
      #     "Category: ", "Social Psychology :",
      #     "Email: ", "informationexample.com mailto:informationexample.com :",
      #     "Journal News: ", "Saving Grace  :",
      #     "Deadline: ", "10:00 PM EST  15 February :",
      #     "Query:  ", "Lorem ipsum ...laborum. :",
      #     "Requirements:  ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"] 
    

    Note that I have enclosed (?<= :)(?:\w+\s+)*\w+:\s+ in a capture group, so that String#split will include the bits that it splits on in the resulting array.
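
    A small sketch of the difference the capture group makes, using a made-up string (`sample` is hypothetical):

    ```ruby
    # Hypothetical fragment in the state the text is in before the split.
    sample = "11)  :Name: Megumi Lindon  :Category: Social Psychology"

    # With the capture group, the field names are kept in the result.
    with_group = sample.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)

    # Without it, the field names are thrown away.
    without_group = sample.split(/(?<= :)(?:\w+\s+)*\w+:\s+/)

    p with_group
    p without_group
    ```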

    All that remains is some cleaning-up:

    a2 = a1.map { |s| s.chomp(':') }
    a2[0] = a2.shift + a2.first
      #=> "11)  Summary: "
    a3 = a2.each_slice(2).to_a
      #=> [["11)  Summary: ", "Working with Vars on Social Influence platform "],
      #    ["Name: ", "Megumi Lindon  "],
      #    ["Category: ", "Social Psychology "],
      #    ["Email: ", "informationexample.com mailto:informationexample.com "],
      #    ["Journal News: ", "Saving Grace  "],
      #    ["Deadline: ", "10:00 PM EST  15 February "],
      #    ["Query:  ", "Lorem...est laborum. "],
      #    ["Requirements:  ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]] 
    
    idx = a3.index { |n,_| n =~ /Email: / }
      #=> 3 
    a3[idx][1] = a3[idx][1][/.*?\s/] if idx
      #=> "informationexample.com " 
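
    The same clean-up can be tried on a shortened, made-up array (`parts` is hypothetical) to see what each line does:

    ```ruby
    # Hypothetical, shortened version of the array produced by the split step.
    parts = ["11)  ", "Summary: ", "Working with Vars ", "Email: ", "ab.com mailto:ab.com "]

    # shift removes and returns the leading "11)  ", which is then glued onto
    # the element that has just become the first one.
    parts[0] = parts.shift + parts.first

    # Pair each field name with its value.
    pairs = parts.each_slice(2).to_a

    # Keep only the text up to and including the first whitespace character.
    idx = pairs.index { |name, _| name =~ /Email: / }
    pairs[idx][1] = pairs[idx][1][/.*?\s/] if idx

    p pairs
    ```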
    

    Join the strings and remove extra spaces:

    a4 = a3.map { |b| b.join(' ').split.join(' ') }
      #=> ["11) Summary: Working with Vars on Social Influence platform",
      #    "Name: Megumi Lindon",
      #    "Category: Social Psychology",
      #    "Email: informationexample.com",
      #    "Journal News: Saving Grace",
      #    "Deadline: 10:00 PM EST 15 February",
      #    "Query: Lorem...laborum.",
      #    "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"] 
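
    The `split.join(' ')` idiom deserves a quick illustration: `split` with no argument splits on runs of whitespace and drops leading and trailing whitespace, so joining with a single space squeezes everything. The `pair` below is a made-up example.

    ```ruby
    # Hypothetical name/value pair with doubled and trailing spaces.
    pair = ["Deadline: ", "10:00 PM EST  15 February "]

    joined   = pair.join(' ')            # still has runs of spaces
    squeezed = joined.split.join(' ')    # single spaces, no trailing space

    p joined
    p squeezed
    ```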
    

    "Requirements" is still problematic, but without additional rules, nothing more can be done. We cannot limit every field's value to a single sentence because "Query" can have more than one. If you wish to limit "Requirements" to one sentence:

    idx = a4.index { |s| s =~ /Requirements: / }
      #=> 7
    a4[idx] = a4[idx][/.*?[.!?]/] if idx
      # => "Requirements: Psychologists; anyone with good knowdledge with sociology and psychology."
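
    Be aware that "first sentence" here simply means "everything up to the first '.', '!' or '?'". A rough sketch with made-up strings shows both the intended effect and a case where it cuts too early:

    ```ruby
    # Hypothetical field with trailing text we want to drop.
    line = "Requirements: Psychologists; anyone with sociology. Please send me your article. Thank you!"
    first_sentence = line[/.*?[.!?]/]

    # An abbreviation contains a full stop, so the match ends there.
    abbreviated = "Requirements: Ph.D. holders only."[/.*?[.!?]/]

    p first_sentence
    p abbreviated
    ```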
    

    If you wish to combine these operations:

    def parse_it(text)
      a1 = text.gsub(/\n/, '')
               .gsub(/\d+\r/, '') 
               .gsub(/(?<=\w):=/, ": ")
               .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
               .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
               .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
               .map { |s| s.chomp(':') }
    
      a1[0] = a1.shift + a1.first
    
      a2 = a1.each_slice(2).to_a
      idx = a2.index { |n,_| n =~ /Email: / }
      a2[idx][1] = a2[idx][1][/.*?\s/] if idx
    
      a3 = a2.map { |b| b.join(' ').split.join(' ') }    
      idx = a3.index { |s| s =~ /Requirements: / }
      a3[idx] = a3[idx][/.*?[.!?]/] if idx
    
      a3
    end
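
    One more thought, offered as an assumption rather than something I have checked against your mail: the "=20" sequences and the "=" at the ends of lines look like quoted-printable encoding, where "=20" is an encoded space and a trailing "=" is a soft line break. If that is what the message body uses, Ruby can decode it with the "M" directive of String#unpack before any regex work. The string `raw` below is a made-up fragment, and I normalise "\r\n" to "\n" first to keep the decoding step simple.

    ```ruby
    # Hypothetical fragment of the raw message body, assumed to be quoted-printable.
    raw = "Category: Social Psychology=20\r\n" \
          "Email: information@example.com =\r\n" \
          "<mailto:information@example.com>=20\r\n"

    # Normalise line endings, then decode: "=20" becomes a space and a
    # trailing "=" joins the line with the next one.
    decoded = raw.gsub("\r\n", "\n").unpack("M").first

    p decoded
    ```

    This keeps "@" and "-" intact, so it may reduce how much the later substitutions have to remove.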