ruby web-scraping web-crawler nokogiri open-uri

Need to fetch the email ID and phone number from a page via web scraping


require 'open-uri'
require 'nokogiri'

def scrape(url)
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
  html = URI.open(url).read
  nokogiri_doc = Nokogiri::HTML(html)
  final_array = []

  nokogiri_doc.search("a").each do |element|
    final_array << element.text
  end

  final_array.each do |text|
    puts text
  end
end


scrape('http://www.infranetsol.com/')

With this I only get the text of the a tags, but I need the email ID and phone number exported to an Excel file.


Solution

  • All you have is text, so what you can do is keep only the strings that look like an email address or a phone number.

    For instance, if you keep your result in an array:

    a = scrape('http://www.infranetsol.com/')
    

    You can get the elements containing an email (strings with an '@'):

    a.select { |s| s.match(/.*@.*/) }
    

    You can get the elements containing a phone number (strings with at least 5 consecutive digits):

    a.select{ |s| s.match(/\d{5}/) }
    

    The whole code:

    require 'open-uri'
    require 'nokogiri'
    
    def scrape(url)
      # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
      html = URI.open(url).read
      nokogiri_doc = Nokogiri::HTML(html)
    
      # Collect the text of every <a> element
      final_array = nokogiri_doc.search("a").map(&:text)
    
      final_array.each { |text| puts text }
    
      # Return the array explicitly so the caller can filter it
      final_array
    end
    
    
    a = scrape('http://www.infranetsol.com/')
    email = a.select { |s| s.match(/.*@.*/) }
    phone = a.select{ |s| s.match(/\d{5}/) }
    
    # In your example, you will have two emails in email
    # and, unfortunately, a complex string for phone.
    # You can use scan to extract the phone numbers from the text and flat_map
    # to get a flat array without sub-arrays.
    # But keep in mind this will only work with this particular text.
    
    phone.flat_map{ |elt| elt.scan(/\d[\d ]*/) }
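
The question also asks for the result in an Excel file, which the filtering above does not cover. Below is a minimal sketch of one way to do it, assuming a CSV file (which Excel can open) is good enough. The names `EMAIL_PATTERN`, `PHONE_PATTERN`, `extract_contacts` and `write_contacts_csv` are hypothetical, the patterns are deliberately loose and would need adjusting for real pages, and the sample strings stand in for the array returned by `scrape`.

```ruby
require 'csv'
require 'tmpdir'

# Illustrative patterns only (assumptions, not a full email/phone grammar).
EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/
PHONE_PATTERN = /\+?\d[\d ()-]{5,}\d/

# Pull every email-like and phone-like substring out of an array of strings.
def extract_contacts(texts)
  emails = texts.flat_map { |t| t.scan(EMAIL_PATTERN) }.uniq
  phones = texts.flat_map { |t| t.scan(PHONE_PATTERN) }.uniq
  { emails: emails, phones: phones }
end

# Write one row per value, with a header row, to a CSV file.
def write_contacts_csv(contacts, path)
  CSV.open(path, 'w') do |csv|
    csv << %w[type value]
    contacts[:emails].each { |email| csv << ['email', email] }
    contacts[:phones].each { |phone| csv << ['phone', phone] }
  end
end

# Sample data standing in for the link texts returned by scrape(url).
sample = [
  'Home',
  'info@example.com',
  'Call us: 0123 456 789 or 0987 654 321',
  'Mail info@example.com'
]

contacts = extract_contacts(sample)
csv_path = File.join(Dir.tmpdir, 'contacts.csv')
write_contacts_csv(contacts, csv_path)
```

In the real script you would pass the array returned by `scrape` instead of `sample`. If a native .xlsx file is required rather than CSV, a spreadsheet gem would be needed, which is outside this sketch.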