rubyshort-url

How to implement a short URL like the URLs in Twitter?


If there is a long url, I want to generate a short URL like those in Twitter. Is there some way to implement this in Ruby?


Solution

  • The easiest way is to:

    1. keep a database of all URLs
    2. when you insert a new URL into the database, find out the id of the auto-incrementing integer primary key.
    3. encode that integer into base 36 or 62 (digits + lowercase alpha or digits + mixed-case alpha). Voila! You have a short url!

    Encoding to base 36/decoding from base 36 is simple in Ruby:

    12341235.to_s(36)
    #=> "7cik3"
    
    "7cik3".to_i(36)
    #=> 12341235
    

    Encoding to base 62 is a bit tricker. Here's one way to do it:

    module AnyBase
      ENCODER = Hash.new do |h,k|
        h[k] = Hash[ k.chars.map.with_index.to_a.map(&:reverse) ]
      end
      DECODER = Hash.new do |h,k|
        h[k] = Hash[ k.chars.map.with_index.to_a ]
      end
      def self.encode( value, keys )
        ring = ENCODER[keys]
        base = keys.length
        result = []
        until value == 0
          result << ring[ value % base ]
          value /= base
        end
        result.reverse.join
      end
      def self.decode( string, keys )
        ring = DECODER[keys]
        base = keys.length
        string.reverse.chars.with_index.inject(0) do |sum,(char,i)|
          sum + ring[char] * base**i
        end
      end
    end
    

    ...and here it is in action:

    base36 = "0123456789abcdefghijklmnopqrstuvwxyz"
    db_id = 12341235
    p AnyBase.encode( db_id, base36 )
    #=> "7cik3"
    p AnyBase.decode( "7cik3", base36 )
    #=> 12341235
    
    base62 = [ *0..9, *'a'..'z', *'A'..'Z' ].join
    p AnyBase.encode( db_id, base62 )
    #=> "PMwb"
    p AnyBase.decode( "PMwb", base62 )
    #=> 12341235
    

    Edit

    If you want to avoid URLs that happen to be English words (for example, four-letter swear words) you can use a set of characters that does not include vowels:

    base31 = ([*0..9,*'a'..'z'] - %w[a e i o u]).join
    base52 = ([*0..9,*'a'..'z',*'A'..'Z'] - %w[a e i o u A E I O U]).join
    

    However, with this you still have problems like AnyBase.encode(328059,base31) or AnyBase.encode(345055,base31) or AnyBase.encode(450324,base31). You may thus want to avoid vowel-like numbers as well:

    base28 = ([*'0'..'9',*'a'..'z'] - %w[a e i o u 0 1 3]).join
    base49 = ([*'0'..'9',*'a'..'z',*'A'..'Z'] - %w[a e i o u A E I O U 0 1 3]).join
    

    This will also avoid the problem of "Is that a 0 or an O?" and "Is that a 1 or an I?".