ruby web-scraping uri

URI Extract escaping at colons, any way to avoid this?


I have the function below, which normally returns a URL such as path.com/p/12345.

Sometimes, when a tweet contains a colon before the URL, such as

RT: Something path.com/p/123

the function will return:

personName:
path.com/p/12345

My function:

$a = 10

def grabTweets()
  tweet = Twitter.search("[pic] "+" path.com/p/", :rpp => $a, :result_type => "recent").map do |status|
    tweet = "#{status.text}" #class = string
    urls = URI::extract(tweet) #returns an array of strings
  end
end

My goal is to find any tweet with a colon before the URL and remove that result from the loop so that it is not returned to the array that is created.


Solution

  • You can restrict the extraction to HTTP URLs by passing the scheme as the second argument:

    URI.extract("RT: Something http://path.com/p/123")
      # => ["RT:", "http://path.com/p/123"]
    
    URI.extract("RT: Something http://path.com/p/123", "http")
      # => ["http://path.com/p/123"]
    

    Your method can also be cleaned up quite a bit; it has a lot of superfluous local variables:

    def grabTweets
      Twitter.search("[pic] "+" path.com/p/", :rpp => $a, :result_type => "recent").map do |status|
        URI.extract(status.text, "http")
      end
    end
    

    I would also strongly discourage the use of a global variable ($a); pass the count in as a method argument instead.
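As a rough sketch of both suggestions together, the version below takes the result count as an argument rather than reading $a, and accepts both http and https schemes. The name grab_tweet_urls, the count and schemes parameters, and the block that stands in for the Twitter.search call are all assumptions made for illustration, so that the example runs without the Twitter gem. It also uses flat_map, which returns one flat array of URLs instead of the array of arrays that map would produce.

```ruby
require "uri"

# Hypothetical refactor: the result count is an argument instead of the
# global $a, and the search is supplied as a block that returns tweet texts.
def grab_tweet_urls(count = 10, schemes = %w[http https])
  # In real use the block would wrap something like
  # Twitter.search(..., :rpp => count, ...).map(&:text)
  texts = yield(count)

  # Passing schemes means a stray "RT:" is never treated as a URI.
  texts.flat_map { |text| URI.extract(text, schemes) }
end

# Stand-in for the real search: canned tweet texts, limited to count.
canned = [
  "RT: Something http://path.com/p/123",
  "Look https://path.com/p/456",
  "never reached"
]

urls = grab_tweet_urls(2) { |count| canned.first(count) }
puts urls
```

With the canned texts above, the "RT:" prefix is skipped because it does not match either of the allowed schemes.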