URL Shortener

Notes from Grokking the System Design Interview

Similar services: TinyURL, Bitly

Requirements / Goals

  • Functional
    • Generate a short, unique alias for a given URL; redirect users from the alias to the original URL
    • Option to choose custom short link
    • Short links expire after a default timespan; users can optionally specify their own
  • Non-functional
    • High availability
    • Minimal latency (URL redirection)
    • Alias should not be predictable
  • Extended
    • Analytics, e.g., the total number of redirections that occurred

Capacity Estimation / Constraints

  • 100:1 read to write ratio (Read-heavy system)
    • Reads = redirection requests
    • Writes = new URL shortenings
  • Scale? Thousands, millions, billions?
  • Traffic Estimates
    • Assume 500 M new URL shortenings / month (write)
      • Therefore, 100 * 500 M = 50 B redirections / month (read)
    • QPS
      • New URL shortenings
        • 500 M / (30 days * 24 hrs * 3600 s) = ~ 200 URL / s (write)
      • URL redirections
        • 100 * 200 URL / s = 20 K / s (read)
  • Storage Estimates
    • Assume storing for 5 years
      • 500 M * 5 yrs * 12 months = 30 B objects
      • 30 B objects * 500 bytes = 15 TB
  • Bandwidth Estimates
    • Write
      • 200 new URL shortenings / s * 500 bytes = 100 KB/s
    • Read
      • 20 K * 500 bytes = ~10 MB/s
  • Cache Memory Estimates
    • 80-20 rule: 20% of URLs generate 80% of the traffic, so cache that hot 20%
    • Daily traffic: 20 K requests/s * 86,400 s = ~1.7 Billion requests/day
    • Cache 20% of daily requests (arithmetic checked in the sketch below)
      • 0.2 * 1.7 Billion * 500 bytes = ~170 GB
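
A quick sanity check of the arithmetic in this section, as a minimal Python sketch; the 500 M/month volume, 100:1 ratio, 500-byte object size, and 80-20 rule are the assumptions stated above:

```python
# Back-of-the-envelope check of the capacity estimates above.
SECONDS_PER_MONTH = 30 * 24 * 3600              # ~2.6 M seconds
new_urls_per_month = 500_000_000                # assumed write volume
read_write_ratio = 100                          # 100:1 reads to writes
object_size = 500                               # assumed bytes per stored URL

write_qps = new_urls_per_month / SECONDS_PER_MONTH      # ~193, rounded to 200/s
read_qps = write_qps * read_write_ratio                 # ~20 K/s

objects_5y = new_urls_per_month * 12 * 5                # 30 B objects
storage_bytes = objects_5y * object_size                # 15 TB

write_bw = write_qps * object_size                      # ~100 KB/s
read_bw = read_qps * object_size                        # ~10 MB/s

reads_per_day = read_qps * 24 * 3600                    # ~1.7 B requests/day
cache_bytes = 0.2 * reads_per_day * object_size         # ~170 GB

print(f"write {write_qps:.0f}/s, read {read_qps:.0f}/s, "
      f"storage {storage_bytes / 1e12:.0f} TB, cache {cache_bytes / 1e9:.0f} GB")
```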

Summary

  Metric                      Value
  New URLs (write)            200 /s
  URL redirections (read)     20 K/s
  Incoming Data (write)       100 KB/s
  Outgoing Data (read)        10 MB/s
  Storage for 5 Years         15 TB
  Memory for Cache            170 GB

System APIs

REST service

  • create_url(api_dev_key: string, original_url: string, custom_alias: string, user_name: string, expire_date: string)
    • returns shortened URL or error
  • view_url()
  • delete_url(api_dev_key: string, url_key: string)
  • api_dev_key is used to throttle users and prevent abuse, e.g., by limiting the number of URL creations per key per period (a sketch of these endpoints follows)
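
A minimal sketch of how create_url/delete_url might behave, using an in-memory dict in place of the real database; the random-alias generation and the error message are illustrative assumptions (api_dev_key is accepted but unused here):

```python
import secrets
import string

BASE62 = string.ascii_letters + string.digits   # a-zA-Z0-9
url_store = {}                                  # key -> original URL (stand-in for the DB)

def create_url(api_dev_key, original_url, custom_alias=None,
               user_name=None, expire_date=None):
    """Returns the shortened key, or raises if the custom alias is taken."""
    key = custom_alias or "".join(secrets.choice(BASE62) for _ in range(6))
    if key in url_store:
        raise ValueError("alias already in use")
    url_store[key] = original_url
    return key

def delete_url(api_dev_key, url_key):
    """Removes the mapping if present."""
    url_store.pop(url_key, None)
```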

Database Design

  • Observations
    • Need to store billions of records
    • Read-heavy
    • Each record is small (< 1 KB)
    • No relationships between each record, except for storing which user created the short link
  • Schema (record shapes sketched after this section)
    • URL
      • Hash: varchar (PK)
      • OriginalURL: varchar
      • CreationDate: date
      • ExpirationDate: date
    • User
      • UserID: int (PK)
      • Name: varchar
      • Email: varchar
      • CreationDate: date
      • LastLogin: date
  • Best choice: NoSQL key-value store (DynamoDB / Cassandra / Riak), easier to scale
    • Store billions of rows with no relationships between objects
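
Since the chosen store is key-value, each table above becomes small records keyed by Hash and UserID; a sketch of the record shapes (field names follow the schema above, and user_id reflects the creator relationship noted earlier):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class URLRecord:            # stored under its `hash` key in the key-value store
    hash: str               # 6-character short key (PK)
    original_url: str
    creation_date: date
    expiration_date: date
    user_id: int            # the only cross-record relationship (creator)

@dataclass
class UserRecord:           # stored under its `user_id` key
    user_id: int            # PK
    name: str
    email: str
    creation_date: date
    last_login: date
```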

Basic System Design and Algorithm

  • Base62 encode (a-zA-Z0-9); Base64 additionally includes ‘+’ and ‘/’

  Base62                                     Base64
  62^6 = ~56.8 Billion Possible Strings      64^6 = ~68.7 Billion Possible Strings
  62^7 = ~3.5 Trillion Possible Strings      64^8 = ~281 Trillion Possible Strings
  • MD5 / base62 encode (sketched after this list)
    • MD5(long URL) → base62-encode the 128-bit hash value → 20+ characters; take the first 6 as the key (collisions must then be handled)
  • Key Generation Service (KGS)
    • Generate unique random keys of length 6 offline and store them in a database (key-DB)
    • Retrieve hash from key-DB
      • No need to worry about collisions
      • No need to perform encoding
    • Concurrency Issues
      • Multiple servers reading keys concurrently
      • KGS can use two tables (key-DB): unused keys, used keys
      • As soon as KGS hands a key to an application server, it moves it to the used keys table
      • KGS can also keep some keys in memory so it can quickly hand them out whenever a server needs them
        • As soon as keys are loaded into memory, move them to the used table
        • Ensures each server gets unique keys
        • If KGS dies before assigning all the loaded keys to a server, those keys are wasted - acceptable, since we have plenty of keys
      • Make sure not to give the same key to multiple servers
        • Must synchronize (or take a lock on) the data structure holding the keys before removing keys from it and handing them to a server (see the dispenser sketch after this list)
    • Size of key-DB
      • With base64 encoding there are 68.7 B unique six-letter keys
        • Assume 1 byte to store 1 alphanumeric character
        • 6 bytes * 68.7 B keys = ~412 GB
    • Is KGS a single point of failure?
      • Keep a standby replica of KGS; if the primary server dies, the standby takes over
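
To make the MD5-then-base62 approach concrete, a minimal sketch: encode the 128-bit digest in base62 and keep the first 6 characters (collision handling is left out, and the helper names are my own, not from the notes):

```python
import hashlib

BASE62 = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(BASE62[r])
    return "".join(reversed(digits)) or BASE62[0]

def short_key(long_url: str) -> str:
    """MD5 the URL (128 bits), base62-encode it (20+ chars), keep 6."""
    digest = int.from_bytes(hashlib.md5(long_url.encode()).digest(), "big")
    return base62_encode(digest)[:6]
```

And a sketch of the KGS hand-out path described above, assuming keys were preloaded from the unused-keys table; the lock is the synchronization that keeps two application servers from ever receiving the same key:

```python
import threading

class KeyGenerationService:
    """Hands out pre-generated 6-character keys; a sketch of the KGS idea."""

    def __init__(self, preloaded_keys):
        self._unused = list(preloaded_keys)  # mirrors the unused-keys table
        self._used = set()                   # mirrors the used-keys table
        self._lock = threading.Lock()

    def next_key(self) -> str:
        # Synchronize so concurrent application servers never share a key.
        with self._lock:
            key = self._unused.pop()
            self._used.add(key)              # mark used before handing out
            return key
```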

Data Partitioning and Replication

  • Store billions of URLs
  • Two ways to partition the DB and spread the data across servers:
    • Range-based
      • Store in partitions based on first letter of hash key
      • Drawback: unbalanced DB servers, unequal load
    • Hash-based
      • Take a hash of the object we are storing, then calculate which partition to use based on the hash
        • Can take the hash of the ‘key’ or the actual URL to determine the partition
      • The hashing function randomly distributes URLs into different partitions (e.g., it maps any key to a number between 1 and 256, which identifies the partition where the object is stored); a modulo version is sketched below
      • Drawback: some partitions can still end up overloaded
        • Solution: Consistent Hashing
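
A sketch of the hash-based scheme: hash the key and map it onto one of 256 partitions; consistent hashing would replace the simple modulo step shown here:

```python
import hashlib

NUM_PARTITIONS = 256

def partition_for(key: str) -> int:
    """Map a short-URL key to a partition number in 1..256."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1
```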

Cache

  • Memcached/Redis to store full URLs with respective hashes
  • 80-20 rule
    • Memory Requirements
      • 170 GB to cache 20% of daily traffic
      • Can easily fit into one machine
        • Alternatively, can use a couple of small servers to store all these hot URLs
  • Eviction Policy: Least Recently Used (LRU)
    • Use a Linked Hash Map or similar data structure (sketched after this list)
  • Replicate cache servers to distribute load between them
  • Whenever there is a cache miss, servers would hit the database
    • Update the cache and pass new entry to all cache replicas
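
A Linked Hash Map gives O(1) lookups plus recency ordering for LRU eviction; Python's OrderedDict is one such structure, so a minimal sketch (capacity is in entries; the 170 GB figure works out to roughly 340 M 500-byte entries):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self._data:
            return None              # cache miss: caller falls through to the DB
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used
```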

Load Balancing

  • In three places:
    • Between clients and application servers
    • Between application servers and database servers
    • Between application servers and cache servers
  • Start with Round Robin (RR)
    • Drawback: RR keeps sending requests to a server even when it is overloaded
    • A more intelligent load-balancing policy is required to avoid server overload
      • Least Connections / Least Response Time / Least Bandwidth (least connections sketched below)
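
The "more intelligent" policies differ only in how the next backend is chosen; for example, Least Connections as a one-line sketch (server names and connection counts are hypothetical):

```python
def pick_server(servers, active_connections):
    """Least Connections: route to the server with the fewest open connections."""
    return min(servers, key=lambda s: active_connections[s])

# e.g. with {"app1": 12, "app2": 3, "app3": 7}, app2 receives the next request
print(pick_server(["app1", "app2", "app3"], {"app1": 12, "app2": 3, "app3": 7}))
```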

Purging or DB Cleanup

  • Remove links once they reach their expiration time
  • Lazy cleanup (sketched after this list)
    • Whenever a user tries to access an expired link, delete the link and return an error
  • Separate cleanup service
    • Runs periodically, removing expired links from storage and cache
    • Should be lightweight; can run when traffic is low
    • After removing an expired link, put its key back into the key-DB to be reused
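
The lazy-cleanup path can live on the read path itself; a sketch, assuming each record carries its ExpirationDate and that recycled keys go back to the unused-keys list:

```python
from datetime import datetime

def resolve(key, store, unused_keys):
    """Lazy cleanup: purge expired links at access time and recycle their keys."""
    record = store.get(key)
    if record is None:
        return None                         # unknown key -> error upstream
    if record["expiration_date"] < datetime.now():
        del store[key]                      # remove the expired link
        unused_keys.append(key)             # return the key to the key-DB
        return None                         # expired -> error upstream
    return record["original_url"]
```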

Telemetry

  • Statistics to track
    • Country of visitor
    • Date and time of access
    • Web page that referred the click / platform from which the page was accessed

Security and Permissions

  • Create private URLs, or allow only a particular set of users to access a URL?
    • Store the permission level (public/private) with each URL in the DB
    • Create a separate table storing the UserIDs that have permission to see a specific URL
    • Return HTTP 401 if the user does not have access (check sketched below)
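
A sketch of the permission check on redirect, assuming the per-URL permission level and the allow-list table described above (field and table names are illustrative):

```python
def authorize(record, user_id, allowed_users):
    """True if the user may see this URL; otherwise the server returns HTTP 401."""
    if record["permission"] == "public":
        return True
    return user_id in allowed_users.get(record["hash"], set())
```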