ruby-on-rails regex ruby sanitize

Extract the mailto value and remove any HTML tags from a string


I want to extract the mailto value from the given string and remove the HTML tags at the same time.

ex -> "<mailto:demomail@gmail.com|demomail@gmail.com> helo<p> bye </p>"
output -> demomail@gmail.com helo bye

If I use this -> gsub(/<[^>]*>/,'')
output -> helo bye

If I use this -> ActionView::Base.full_sanitizer.sanitize(html_string, :tags => %w(img br p), :attributes => %w(src style))
output -> helo bye

Can you suggest how I can get my expected output?
expected output -> demomail@gmail.com helo bye


Solution

  • The problem is that the mailto value is itself inside an HTML-style tag, so when you remove the tags, you remove the mailto value as well. It is definitely possible to construct a single complex regular expression that handles this, but I think it's much easier to extract the mailto value separately from the rest of the string. I would do this with a capturing group that extracts the value between "mailto:" and "|". Then you can get the rest of the output by processing the full string with the gsub call you already have.

    s = "<mailto:demomail@gmail.com|demomail@gmail.com> helo<p> bye </p>"
    
    # Find the "mailto" value
    s.match(/mailto:([^|]*)/)
    # => #<MatchData "mailto:demomail@gmail.com" 1:"demomail@gmail.com">
    
    # Full result with the matched email and the rest of the string with HTML tags removed
    s.match(/mailto:([^|]*)/)[1] + s.gsub(/<[^>]*>/, "")
    # => "demomail@gmail.com helo bye "
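
Two details of the snippet above are worth handling if the input varies: `match(...)[1]` raises NoMethodError when the string contains no mailto tag, and the result keeps a trailing space that the expected output does not have. A minimal sketch of one way to deal with both, assuming it is acceptable to collapse runs of whitespace into single spaces (the variable names are illustrative only):

```ruby
s = "<mailto:demomail@gmail.com|demomail@gmail.com> helo<p> bye </p>"

# String#[] with a regex and a capture index returns nil (instead of raising)
# when nothing matches. [^|>]* also stops at ">" in case the "|" part is absent.
email = s[/mailto:([^|>]*)/, 1]

# Remove every tag, then collapse whitespace runs and trim both ends.
text = s.gsub(/<[^>]*>/, "").split.join(" ")

# compact drops the nil email when the string has no mailto tag.
result = [email, text].compact.join(" ")
puts result
```

With the sample string this yields the expected output without the trailing space, and a string without any mailto tag simply comes back with its tags removed.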
    

    If the string starts with something other than the <mailto> tag, you could replace the whole tag with just the matched email address and then get rid of the other tags after that:

    s = "this is <mailto:demomail@gmail.com|demomail@gmail.com> helo<p> bye </p>"
    
    # Replace mailto tag with the email, then process the rest
    # '\1' is a backreference to the first capture group
    s.gsub(/<mailto:([^|]*)[^>]*>/, '\1').gsub(/<[^>]*>/, "")
    # => "this is demomail@gmail.com helo bye "
    
    # Alternatively, you can just process the mailto tag differently in the gsub block
    s.gsub(/<[^>]*>/) do |tag|
      tag.include?("mailto:") ? tag.match(/mailto:([^|]*)/)[1] : ""
    end
    # => "this is demomail@gmail.com helo bye "
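
If you need this in more than one place, the block form can be wrapped in a small method. The following is only a sketch: `strip_tags_keep_mailto` is a hypothetical name, and it assumes that collapsing whitespace is fine and that every mailto tag has the `<mailto:address|label>` shape from the question.

```ruby
# Hypothetical helper, not part of Rails or Ruby: removes tags but keeps the
# address from any <mailto:address|label> tag.
def strip_tags_keep_mailto(str)
  replaced = str.gsub(/<[^>]*>/) do |tag|
    # For a mailto tag, keep the part between "mailto:" and "|" (or ">").
    # For any other tag the lookup is nil, and nil.to_s is "".
    tag[/\A<mailto:([^|>]*)/, 1].to_s
  end
  # Collapse whitespace runs and trim, so no trailing space is left over.
  replaced.split.join(" ")
end

puts strip_tags_keep_mailto("this is <mailto:demomail@gmail.com|demomail@gmail.com> helo<p> bye </p>")
puts strip_tags_keep_mailto("<mailto:a@x.com|a@x.com> and <mailto:b@y.org|b@y.org>")
```

Because the replacement happens per tag, this also handles strings with several mailto tags or none at all. Like the original gsub, it is a regex-level cleanup rather than a real HTML parser, so it is best kept to simple, trusted input.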