ruby-on-rails ruby http ruby-on-rails-4 typhoeus

Fetching a large number of images, determining if they are broken or not


I have about 600,000 posts in my database, all of which contain a link to an image. In about 1% of these posts, the images are broken (they've been taken down or moved or whatever). I need a fast way to go through all the images and remove posts that have broken images. Here's my code thus far:

class Post < ActiveRecord::Base

  # ... unrelated code truncated

  def self.clean_broken_images
    Post.with_image.find_each do |post|
      response = HTTP.get(post.image)
      # Compare the numeric status code, not the response object itself
      post.destroy if response.code == 404
    end
  end

end

This works, but as you might expect, it's insanely slow (I haven't actually let it run to completion yet).

Is there a faster way to do it? For example, could I request only the response headers and delete the post if the status is 404? Or use Typhoeus/Hydra (I'm not sure it would cope with this many posts)? I should also mention that I am running this with delayed jobs.

Thanks!


Solution

  • Do you need to proactively remove the posts from your database? You could instead wait until a post is requested and use some JavaScript to load its image. If the image can't be found, have the script issue a DELETE request to your server for that post.
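
If the cleanup does have to happen up front, the header-only idea from the question could look roughly like the sketch below. It uses Ruby's standard Net::HTTP rather than the HTTP gem from the question, and the helper name `image_missing?` and the 5-second timeouts are assumptions made for illustration. It sends a HEAD request, so no image body is downloaded, and it reports a post's image as missing only on an explicit 404.

```ruby
require "net/http"
require "openssl"
require "uri"

# Hypothetical helper (not part of the original code): returns true only
# when the server answers a HEAD request with status 404.
# Any other outcome (other status codes, invalid URLs, timeouts, connection
# errors) returns false, so a temporary outage does not get a post deleted.
def image_missing?(url, open_timeout: 5, read_timeout: 5)
  uri = URI.parse(url)
  # Only http:// and https:// URLs can be checked this way.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: open_timeout,
                             read_timeout: read_timeout) do |http|
    # HEAD asks for the status line and headers only, never the image data.
    http.head(uri.request_uri)
  end

  # Net::HTTP exposes the status code as a String.
  response.code == "404"
rescue StandardError
  # Treat every failure as "not confirmed missing".
  false
end
```

In the loop from the question this would become `post.destroy if image_missing?(post.image)`. Some servers reject HEAD (for example with 405) or redirect to a placeholder image instead of returning 404; this sketch leaves those posts alone. The requests are still issued one at a time, so it reduces the data transferred per check but does not add any concurrency.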