URL Shortener

Notes from Grokking the System Design Interview

Similar services: TinyURL, Bitly

Requirements / Goals

  • Functional
    • Generate a short, unique alias for a given URL; redirect users from the alias to the original URL
    • Option to choose custom short link
    • Short links expire after a default timespan; users can optionally specify their own
  • Non-functional
    • High availability
    • Minimal latency (URL redirection)
    • Alias should not be predictable
  • Extended
    • Analytics, e.g., the total number of redirections that occurred

Capacity Estimation / Constraints

  • 100:1 read to write ratio (Read-heavy system)
    • Reads = redirection requests
    • Writes = new URL shortenings
  • Scale? Thousands, millions, billions?
  • Traffic Estimates
    • Assume 500 M new URL shortenings / month (write)
      • Therefore, 100 * 500 M = 50 B redirections / month (read)
    • QPS
      • New URL shortenings
        • 500 M / (30 days * 24 hrs * 3600 s) = ~ 200 URL / s (write)
      • URL redirections
        • 100 * 200 URL / s = 20 K / s (read)
  • Storage Estimates
    • Assume storing for 5 years
      • 500 M * 5 yrs * 12 months = 30 B objects
      • 30 B objects * 500 bytes = 15 TB
  • Bandwidth Estimates
    • Write
      • 200 new URL shortenings / s * 500 bytes = 100 KB/s
    • Read
      • 20 K * 500 bytes = ~10 MB/s
  • Cache Memory Estimates
    • 80-20 rule: 20% of URLs generate 80% of the traffic, so cache that hot 20%
    • Daily traffic: 20 K requests/s * 86,400 s = ~1.7 Billion requests/day
    • Cache 20% of daily requests (arithmetic checked in the sketch below)
      • 0.2 * 1.7 Billion * 500 bytes = ~170 GB
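
A quick sanity check of the arithmetic in this section, as a minimal Python sketch; the 500 M/month volume, 100:1 ratio, 500-byte object size, and 80-20 rule are the assumptions stated above:

```python
# Back-of-the-envelope check of the capacity estimates above.
SECONDS_PER_MONTH = 30 * 24 * 3600              # ~2.6 M seconds
new_urls_per_month = 500_000_000                # assumed write volume
read_write_ratio = 100                          # 100:1 reads to writes
object_size = 500                               # assumed bytes per stored URL

write_qps = new_urls_per_month / SECONDS_PER_MONTH      # ~193, rounded to 200/s
read_qps = write_qps * read_write_ratio                 # ~20 K/s

objects_5y = new_urls_per_month * 12 * 5                # 30 B objects
storage_bytes = objects_5y * object_size                # 15 TB

write_bw = write_qps * object_size                      # ~100 KB/s
read_bw = read_qps * object_size                        # ~10 MB/s

reads_per_day = read_qps * 24 * 3600                    # ~1.7 B requests/day
cache_bytes = 0.2 * reads_per_day * object_size         # ~170 GB

print(f"write {write_qps:.0f}/s, read {read_qps:.0f}/s, "
      f"storage {storage_bytes / 1e12:.0f} TB, cache {cache_bytes / 1e9:.0f} GB")
```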

Summary

  Metric                      Value
  New URLs (write)            200 /s
  URL redirections (read)     20 K/s
  Incoming Data (write)       100 KB/s
  Outgoing Data (read)        10 MB/s
  Storage for 5 Years         15 TB
  Memory for Cache            170 GB

System APIs

REST service

  • create_url(api_dev_key: string, original_url: string, custom_alias: string, user_name: string, expire_date: string)
    • returns shortened URL or error
  • view_url()
  • delete_url(api_dev_key: string, url_key: string)
  • api_dev_key is used to throttle users and prevent abuse, e.g., by limiting the number of URL creations per key per period (a sketch of these endpoints follows)
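
A minimal sketch of how create_url/delete_url might behave, using an in-memory dict in place of the real database; the random-alias generation and the error message are illustrative assumptions (api_dev_key is accepted but unused here):

```python
import secrets
import string

BASE62 = string.ascii_letters + string.digits   # a-zA-Z0-9
url_store = {}                                  # key -> original URL (stand-in for the DB)

def create_url(api_dev_key, original_url, custom_alias=None,
               user_name=None, expire_date=None):
    """Returns the shortened key, or raises if the custom alias is taken."""
    key = custom_alias or "".join(secrets.choice(BASE62) for _ in range(6))
    if key in url_store:
        raise ValueError("alias already in use")
    url_store[key] = original_url
    return key

def delete_url(api_dev_key, url_key):
    """Removes the mapping if present."""
    url_store.pop(url_key, None)
```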

Database Design

  • Observations
    • Need to store billions of records
    • Read-heavy
    • Each record is small (< 1 KB)
    • No relationships between each record, except for storing which user created the short link
  • Schema (record shapes sketched after this section)
    • URL
      • Hash: varchar (PK)
      • OriginalURL: varchar
      • CreationDate: date
      • ExpirationDate: date
    • User
      • UserID: int (PK)
      • Name: varchar
      • Email: varchar
      • CreationDate: date
      • LastLogin: date
  • Best choice: NoSQL key-value store (DynamoDB / Cassandra / Riak), easier to scale
    • Store billions of rows with no relationships between objects
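
Since the chosen store is key-value, each table above becomes small records keyed by Hash and UserID; a sketch of the record shapes (field names follow the schema above, and user_id reflects the creator relationship noted earlier):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class URLRecord:            # stored under its `hash` key in the key-value store
    hash: str               # 6-character short key (PK)
    original_url: str
    creation_date: date
    expiration_date: date
    user_id: int            # the only cross-record relationship (creator)

@dataclass
class UserRecord:           # stored under its `user_id` key
    user_id: int            # PK
    name: str
    email: str
    creation_date: date
    last_login: date
```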

Basic System Design and Algorithm

  • Base62 encode (a-zA-Z0-9); Base64 additionally includes ‘+’ and ‘/’

  Base62                                     Base64
  62^6 = ~56.8 Billion Possible Strings      64^6 = ~68.7 Billion Possible Strings
  62^7 = ~3.5 Trillion Possible Strings      64^8 = ~281 Trillion Possible Strings
  • MD5 / base62 encode (sketched after this list)
    • MD5(long URL) → base62-encode the 128-bit hash value → 20+ characters; take the first 6 as the key (collisions must then be handled)
  • Key Generation Service (KGS)
    • Generate unique random keys of length 6 offline and store them in a database (key-DB)
    • Retrieve hash from key-DB
      • No need to worry about collisions
      • No need to perform encoding
    • Concurrency Issues
      • Multiple servers reading keys concurrently
      • KGS can use two tables (key-DB): unused keys, used keys
      • As soon as KGS hands a key to an application server, it moves it to the used keys table
      • KGS can also keep some keys in memory so it can quickly hand them out whenever a server needs them
        • As soon as keys are loaded into memory, move them to the used table
        • Ensures each server gets unique keys
        • If KGS dies before assigning all the loaded keys to a server, those keys are wasted - acceptable, since we have plenty of keys
      • Make sure not to give the same key to multiple servers
        • Must synchronize (or take a lock on) the data structure holding the keys before removing keys from it and handing them to a server (see the dispenser sketch after this list)
    • Size of key-DB
      • With base64 encoding there are 68.7 B unique six-letter keys
        • Assume 1 byte to store 1 alphanumeric character
        • 6 bytes * 68.7 B keys = ~412 GB
    • Is KGS a single point of failure?
      • Keep a standby replica of KGS; if the primary server dies, the standby takes over
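
To make the MD5-then-base62 approach concrete, a minimal sketch: encode the 128-bit digest in base62 and keep the first 6 characters (collision handling is left out, and the helper names are my own, not from the notes):

```python
import hashlib

BASE62 = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(BASE62[r])
    return "".join(reversed(digits)) or BASE62[0]

def short_key(long_url: str) -> str:
    """MD5 the URL (128 bits), base62-encode it (20+ chars), keep 6."""
    digest = int.from_bytes(hashlib.md5(long_url.encode()).digest(), "big")
    return base62_encode(digest)[:6]
```

And a sketch of the KGS hand-out path described above, assuming keys were preloaded from the unused-keys table; the lock is the synchronization that keeps two application servers from ever receiving the same key:

```python
import threading

class KeyGenerationService:
    """Hands out pre-generated 6-character keys; a sketch of the KGS idea."""

    def __init__(self, preloaded_keys):
        self._unused = list(preloaded_keys)  # mirrors the unused-keys table
        self._used = set()                   # mirrors the used-keys table
        self._lock = threading.Lock()

    def next_key(self) -> str:
        # Synchronize so concurrent application servers never share a key.
        with self._lock:
            key = self._unused.pop()
            self._used.add(key)              # mark used before handing out
            return key
```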

Data Partitioning and Replication

  • Store billions of URLs
  • Two ways to partition the DB and spread the data across servers:
    • Range-based
      • Store in partitions based on first letter of hash key
      • Drawback: unbalanced DB servers, unequal load
    • Hash-based
      • Take a hash of the object we are storing, then calculate which partition to use based on the hash
        • Can take the hash of the ‘key’ or the actual URL to determine the partition
      • The hashing function randomly distributes URLs into different partitions (e.g., it maps any key to a number between 1 and 256, which identifies the partition where the object is stored); a modulo version is sketched below
      • Drawback: some partitions can still end up overloaded
        • Solution: Consistent Hashing
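
A sketch of the hash-based scheme: hash the key and map it onto one of 256 partitions; consistent hashing would replace the simple modulo step shown here:

```python
import hashlib

NUM_PARTITIONS = 256

def partition_for(key: str) -> int:
    """Map a short-URL key to a partition number in 1..256."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1
```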

Cache

  • Memcached/Redis to store full URLs with respective hashes
  • 80-20 rule
    • Memory Requirements
      • 170 GB to cache 20% of daily traffic
      • Can easily fit into one machine
        • Alternatively, can use a couple of small servers to store all these hot URLs
  • Eviction Policy: Least Recently Used (LRU)
    • Use a Linked Hash Map or similar data structure (sketched after this list)
  • Replicate cache servers to distribute load between them
  • Whenever there is a cache miss, servers would hit the database
    • Update the cache and pass new entry to all cache replicas
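
A Linked Hash Map gives O(1) lookups plus recency ordering for LRU eviction; Python's OrderedDict is one such structure, so a minimal sketch (capacity is in entries; the 170 GB figure works out to roughly 340 M 500-byte entries):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self._data:
            return None              # cache miss: caller falls through to the DB
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used
```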

Load Balancing

  • In three places:
    • Between clients and application servers
    • Between application servers and database servers
    • Between application servers and cache servers
  • Start with Round Robin (RR)
    • Drawback: RR keeps sending requests to a server even when it is overloaded
    • A more intelligent load-balancing policy is required to avoid server overload
      • Least Connections / Least Response Time / Least Bandwidth (least connections sketched below)
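
The "more intelligent" policies differ only in how the next backend is chosen; for example, Least Connections as a one-line sketch (server names and connection counts are hypothetical):

```python
def pick_server(servers, active_connections):
    """Least Connections: route to the server with the fewest open connections."""
    return min(servers, key=lambda s: active_connections[s])

# e.g. with {"app1": 12, "app2": 3, "app3": 7}, app2 receives the next request
print(pick_server(["app1", "app2", "app3"], {"app1": 12, "app2": 3, "app3": 7}))
```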

Purging or DB Cleanup

  • Remove links once they reach their expiration time
  • Lazy cleanup (sketched after this list)
    • Whenever a user tries to access an expired link, delete the link and return an error
  • Separate cleanup service
    • Runs periodically, removing expired links from storage and cache
    • Should be lightweight; can run when traffic is low
    • After removing an expired link, put its key back into the key-DB to be reused
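
The lazy-cleanup path can live on the read path itself; a sketch, assuming each record carries its ExpirationDate and that recycled keys go back to the unused-keys list:

```python
from datetime import datetime

def resolve(key, store, unused_keys):
    """Lazy cleanup: purge expired links at access time and recycle their keys."""
    record = store.get(key)
    if record is None:
        return None                         # unknown key -> error upstream
    if record["expiration_date"] < datetime.now():
        del store[key]                      # remove the expired link
        unused_keys.append(key)             # return the key to the key-DB
        return None                         # expired -> error upstream
    return record["original_url"]
```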

Telemetry

  • Statistics to track
    • Country of visitor
    • Date and time of access
    • Web page that referred the click / platform from which the page was accessed

Security and Permissions

  • Create private URLs, or allow only a particular set of users to access a URL?
    • Store the permission level (public/private) with each URL in the DB
    • Create a separate table storing the UserIDs that have permission to see a specific URL
    • Return HTTP 401 if the user does not have access (check sketched below)
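
A sketch of the permission check on redirect, assuming the per-URL permission level and the allow-list table described above (field and table names are illustrative):

```python
def authorize(record, user_id, allowed_users):
    """True if the user may see this URL; otherwise the server returns HTTP 401."""
    if record["permission"] == "public":
        return True
    return user_id in allowed_users.get(record["hash"], set())
```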