indexingdatabase-designarchitectureurl-shortenersystem-design

How indexing is done in URL shortening service


I am learning to design systems. First use case is URL shortening service. I have read that we can store short_url,long_url pair in RDBMS. Since as per the requirement, we should be able to map a short_url to long_url (so that this can be sent to calling client) and also from long_url to short_url (so that existing mapped long_urls are not shortened again). My question is how these mappings/look-ups are efficiently done in RDBMS. Straightforward answer would be that index on short_url as well as index on long_url is maintained. I would like to explore its details like generally what indexing techniques in both cases are done in RDBMS.


Solution

  • From what I can gather from your description, you aren't actually "mapping" anything, you simply need to store a URL and its short equivalent.

    So barring any other information I would lean towards something like

    create table URLs (
    Id int identity(1,1),
      LongURL varchar(200),
      ShortURL varchar(20),
      CreateDate datetime default (getdate())
    )
    
    create unique clustered index IdxShortURL on URLs (ShortURL)
    create nonclustered index IdxLongURL on URLs (LongURL)
    

    Primarily you will be looking up the long url given the short url, so this is indexed unique and clustered, the database can seek directly to it and return the full URL

    You can check for the existence of a URL already existing using the nonclustered index on the LongURL.

    Edit

    To expand on the indexing strategy - this really depends on the usage expected. Often the correct indexing strategy arises after monitoring query execution plans.

    Yes, both are similar character columns but - and this is my assumption - the long URL will be primarily looked up using the short URL far more frequently than the reverse situation, ie, I have the short URL, I need to lookup the long URL - in this instance the short URL is synonymous to the table's 'Id', it is the key through which you access other information.

    By specifying the short URL as unique and clustered, the data in the table is physically ordered by the short URL and, when a row is looked up using the short URL, the equivalent long URL is immediately available as part of the index's leaf level and a single seek operation can retrieve it.

    Conversely - again my assumption - you don't want to add a URL if it already exists and checking for a URL's existence will be done by looking for the long URL, so a separate index on this column will suffice; even if you need to retrieve the short URL as part of this lookup, because URLs are unique it is only ever a single bookmark lookup, something very quick and efficient.

    YMMV but hopefully that aligns with your intentions.