URL Shortener
Notes from Grokking the System Design Interview
Similar services: TinyURL, Bitly
Requirements / Goals
- Functional
  - Generate a shorter, unique alias for a URL and redirect users to the original URL
  - Option to choose a custom short link
  - Short links expire after a default or user-specified timespan
- Non-functional
  - High availability
  - Minimal latency for URL redirection
  - Aliases should not be predictable
- Extended
  - Analytics - total number of redirections that occurred
Capacity Estimation / Constraints
- 100:1 read-to-write ratio (read-heavy system)
  - Reads = redirection requests
  - Writes = new URL shortenings
- Scale? Thousands, millions, billions?
- Traffic Estimates
  - Assume 500 M new URL shortenings / month (write)
  - Therefore, 100 * 500 M = 50 B redirections / month (read)
- QPS
  - New URL shortenings: 500 M / (30 days * 24 hrs * 3600 s) = ~200 URLs/s (write)
  - URL redirections: 100 * 200 URLs/s = 20 K/s (read)
- Storage Estimates
  - Assume storing each object for 5 years
  - 500 M / month * 12 months * 5 yrs = 30 B objects
  - 30 B objects * 500 bytes = 15 TB
- Bandwidth Estimates
  - Write: 200 new URL shortenings / s * 500 bytes = ~100 KB/s
  - Read: 20 K redirections / s * 500 bytes = ~10 MB/s
- Cache Memory Estimates
  - 80-20 rule: 20% of URLs generate 80% of the traffic, so cache that hot 20%
  - 20 K redirections / s * 3600 s * 24 hrs = ~1.7 Billion requests / day
  - 0.2 * 1.7 Billion * 500 bytes = ~170 GB
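The arithmetic above can be reproduced in a few lines; this is a quick sanity-check script using only the assumptions already stated in this section (500 M writes/month, 100:1 read ratio, 500-byte objects, 5-year retention).

```python
# Back-of-the-envelope calculator for the estimates in this section.
new_urls_per_month = 500_000_000           # 500 M new shortenings / month (assumption)
read_write_ratio = 100                     # 100:1 reads to writes
object_size_bytes = 500                    # average stored object size
retention_years = 5
seconds_per_month = 30 * 24 * 3600

write_qps = new_urls_per_month / seconds_per_month           # ~193/s  -> ~200/s
read_qps = write_qps * read_write_ratio                      # ~19.3 K/s -> ~20 K/s

total_objects = new_urls_per_month * 12 * retention_years    # 30 B objects
storage_bytes = total_objects * object_size_bytes            # 15 TB

write_bandwidth = write_qps * object_size_bytes              # ~100 KB/s
read_bandwidth = read_qps * object_size_bytes                # ~10 MB/s

reads_per_day = read_qps * 24 * 3600                         # ~1.7 B requests / day
cache_memory = 0.2 * reads_per_day * object_size_bytes       # ~170 GB (80-20 rule)

print(f"writes ~{write_qps:.0f}/s, reads ~{read_qps:.0f}/s")
print(f"storage ~{storage_bytes / 1e12:.0f} TB, cache ~{cache_memory / 1e9:.0f} GB")
```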
Summary
| Metric | Value |
|---|---|
| New URLs (write) | 200 /s |
| URL redirections (read) | 20 K/s |
| Incoming Data (write) | 100 KB/s |
| Outgoing Data (read) | 10 MB/s |
| Storage for 5 Years | 15 TB |
| Memory for Cache | 170 GB |
System APIs
REST service
- create_url(api_dev_key: string, original_url: string, custom_alias: string, user_name: string, expire_date: string)
  - custom_alias, user_name, and expire_date can be optional
  - Returns the shortened URL on success, or an error code
- view_url()
- delete_url(api_dev_key: string, url_key: string)
- API key (api_dev_key) is used to control usage and prevent abuse
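A minimal sketch of what the create/delete calls might look like behind the REST endpoints. The in-memory `store` dict and the `generate_key` helper are illustrative stand-ins (key generation is covered later), not part of the original design.

```python
import secrets
import string

# Illustrative in-memory store; the real system would use the key-value DB
# discussed in the next section.
store = {}
ALPHABET = string.ascii_letters + string.digits   # base62 characters

def generate_key(length: int = 6) -> str:
    """Hypothetical stand-in for the key source (hashing or KGS, discussed later)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def create_url(api_dev_key: str, original_url: str, custom_alias: str = "",
               user_name: str = "", expire_date: str = "") -> str:
    """Returns the short key on success, or an error string."""
    key = custom_alias or generate_key()
    if key in store:
        return "error: alias already taken"
    store[key] = {
        "original_url": original_url,
        "created_by": user_name,
        "expire_date": expire_date,
        "api_dev_key": api_dev_key,
    }
    return key

def delete_url(api_dev_key: str, url_key: str) -> str:
    """Deletes a short link that was created with the same developer key."""
    entry = store.get(url_key)
    if entry is None or entry["api_dev_key"] != api_dev_key:
        return "error: not found or not permitted"
    del store[url_key]
    return "ok"

short_key = create_url("dev-key-123", "https://example.com/some/very/long/path")
print(short_key)
```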
Database Design
- Observations
  - Need to store billions of records
  - Read-heavy
  - Each record is small (< 1 KB)
  - No relationships between records, except for storing which user created each short link
- Schema
| URL | |
|---|---|
| Hash | varchar, PK |
| OriginalURL | varchar |
| CreationDate | date |
| ExpirationDate | date |
| UserID | int |

| User | |
|---|---|
| UserID | int, PK |
| Name | varchar |
| Email | varchar |
| CreationDate | date |
| LastLogin | date |
- Best choice: NoSQL key-value store (DynamoDB / Cassandra / Riak), easier to scale
  - We need to store billions of rows with no relationships between objects
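As a sketch of how the schema maps onto a key-value store, the records could be modeled roughly like this, with the 6-letter hash and the UserID acting as the keys; the class names are illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class UrlRecord:
    """Value stored under the 6-letter hash key."""
    original_url: str
    creation_date: date
    expiration_date: date
    user_id: int            # which user created the short link

@dataclass
class UserRecord:
    """Value stored under the UserID key."""
    name: str
    email: str
    creation_date: date
    last_login: date

# Key-value view of the schema: hash -> UrlRecord, UserID -> UserRecord.
urls = {}
users = {}
urls["a1B2c3"] = UrlRecord("https://example.com/x", date(2024, 1, 1),
                           date(2029, 1, 1), user_id=42)
```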
Basic System Design and Algorithm
- Base62 encode (a-zA-Z0-9); Base64 additionally includes '+' and '/', which need escaping in URLs

| Key length | Base62 | Base64 |
|---|---|---|
| 6 chars | 62^6 = ~56 Billion possible strings | 64^6 = ~68.7 Billion possible strings |
| 7 chars | 62^7 = ~3.5 Trillion possible strings | 64^7 = ~4.4 Trillion possible strings |
| 8 chars | 62^8 = ~218 Trillion possible strings | 64^8 = ~281 Trillion possible strings |
- MD5 / base62 encode
  - MD5(long URL) → base62-encode the 128-bit hash value → 21+ characters
  - Only the first 6-8 characters are used as the key, so different URLs can collide; collisions must be detected and resolved (see the sketch below)
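A sketch of this hashing approach, assuming we base62-encode the 128-bit MD5 digest and keep only the first 6 characters; collision handling against the database is left out.

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase   # 62 characters

def base62_encode(num: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def shorten(original_url: str, length: int = 6) -> str:
    """MD5 the URL, base62-encode the 128-bit digest, keep the first `length` chars.

    Different URLs can still collide on the truncated key, so the caller must
    check the key-DB and retry (e.g. append a sequence number) on a conflict.
    """
    digest = hashlib.md5(original_url.encode("utf-8")).hexdigest()
    encoded = base62_encode(int(digest, 16))    # ~22 characters for a 128-bit value
    return encoded[:length]

print(shorten("https://example.com/some/very/long/path"))
```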
- Key Generation Service (KGS)
  - Generate unique random six-character keys offline and store them in a database (key-DB)
  - Whenever we want to shorten a URL, take an already-generated key from the key-DB
  - No need to worry about collisions
  - No need to perform encoding at request time
- Concurrency Issues
  - Multiple servers may read keys concurrently
  - KGS can use two tables in the key-DB: unused keys and used keys
  - As soon as KGS hands a key to an application server, it moves that key to the used-keys table
  - KGS can also keep some keys in memory so it can quickly provide them whenever a server needs them
  - As soon as keys are loaded into memory, move them to the used-keys table
  - This ensures each server gets unique keys
  - If KGS dies before assigning all the loaded keys to some server, those keys are wasted - acceptable, since we have plenty of keys
  - KGS must never give the same key to multiple servers: it must synchronize (or take a lock on) the data structure holding the keys before removing keys and handing them to a server (see the KGS sketch after this list)
- Size of key-DB
  - With base64 encoding we can generate 68.7 B unique six-letter keys
  - Assuming 1 byte per alphanumeric character: 6 characters * 68.7 B = 412 GB
- Is KGS a single point of failure?
  - Yes - keep a standby replica of KGS; if the primary server dies, the standby takes over
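A toy, in-process sketch of the KGS idea described above: keys are pre-generated, handed out singly or in batches, moved to the used set immediately, and protected by a lock so no two application servers ever receive the same key. Class and method names are illustrative.

```python
import secrets
import string
import threading

ALPHABET = string.ascii_letters + string.digits   # base62 characters

class KeyGenerationService:
    """Toy in-process stand-in for the standalone KGS described above."""

    def __init__(self, pregenerate: int = 10_000, key_length: int = 6):
        self._lock = threading.Lock()
        self._unused = set()   # "unused keys" table
        self._used = set()     # "used keys" table
        # Offline step: generate unique random keys ahead of time (the key-DB).
        while len(self._unused) < pregenerate:
            self._unused.add(
                "".join(secrets.choice(ALPHABET) for _ in range(key_length))
            )

    def get_key(self) -> str:
        """Hand out one key; moving it to the used set prevents any reuse."""
        with self._lock:                        # synchronize concurrent app servers
            key = self._unused.pop()
            self._used.add(key)
            return key

    def get_batch(self, n: int) -> list:
        """App servers can fetch a batch of keys to cut round trips. If a server
        dies before using its batch, those keys are simply wasted (acceptable,
        since there are plenty of keys)."""
        with self._lock:
            batch = [self._unused.pop() for _ in range(n)]
            self._used.update(batch)
            return batch

kgs = KeyGenerationService()
print(kgs.get_key())
print(kgs.get_batch(3))
```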
Data Partitioning and Replication
- Store billions of URLs
- Partition the DB so data is spread across different servers; two approaches:
- Range-based partitioning
  - Store URLs in partitions based on the first letter of the hash key
  - Drawback: unbalanced DB servers and unequal load (some letters are far more common than others)
- Hash-based partitioning
  - Take a hash of the object we are storing (the 'key' or the original URL itself) and use it to determine the partition
  - The hashing function randomly distributes URLs into partitions (e.g. maps any key to a number between 1 and 256, which identifies the partition in which we store the object)
  - Drawback: some partitions can still become overloaded
  - Solution: Consistent Hashing (see the sketch below)
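A minimal consistent-hashing sketch, assuming MD5 as the ring hash and a configurable number of virtual nodes per partition; partition names and parameters are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to partitions so that adding or removing a partition only
    remaps a small fraction of the keys."""

    def __init__(self, partitions, vnodes: int = 100):
        self._ring = []                         # list of (hash, partition) points
        for p in partitions:
            for i in range(vnodes):             # virtual nodes smooth out the load
                self._ring.append((self._hash(f"{p}#{i}"), p))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def partition_for(self, key: str) -> str:
        """Walk clockwise on the ring to the first point at or after the key's hash."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]

ring = ConsistentHashRing(["db-1", "db-2", "db-3", "db-4"])
print(ring.partition_for("a1B2c3"))   # short-URL key -> owning partition
```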
Cache
- Memcached/Redis to store full URLs with respective hashes
- 80-20 rule: 20% of URLs generate 80% of the traffic
- Memory Requirements
  - ~170 GB to cache 20% of daily traffic
  - Can easily fit into one machine; alternatively, use a couple of smaller servers to store all these hot URLs
- Eviction Policy: Least Recently Used (LRU)
  - Use a Linked Hash Map or a similar data structure (see the sketch below)
- Replicate cache servers to distribute the load between them
- Whenever there is a cache miss, the servers hit the database, update the cache, and pass the new entry to all cache replicas
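A small LRU sketch in the spirit of the Linked Hash Map idea, using Python's OrderedDict; the capacity and sample entries are illustrative.

```python
from collections import OrderedDict
from typing import Optional

class LruCache:
    """Least-recently-used cache: a hit moves the entry to the 'recent' end,
    and inserts evict from the 'oldest' end once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()              # key -> original URL

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None                         # miss -> caller falls back to the DB
        self._data.move_to_end(key)             # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)      # evict the least recently used entry

# On a miss the application server reads the DB, then back-fills the cache
# (and its replicas) with the new entry.
cache = LruCache(capacity=2)
cache.put("a1B2c3", "https://example.com/some/long/path")
print(cache.get("a1B2c3"))
```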
Load Balancing
- Add a load-balancing layer in three places:
  - Between clients and application servers
  - Between application servers and database servers
  - Between application servers and cache servers
- Start with simple Round Robin (RR)
  - Drawback: RR ignores server load, so a busy server can keep receiving requests and become overloaded
  - A more intelligent load balancer periodically queries backend server load and adjusts traffic accordingly: Least Connections / Least Response Time / Least Bandwidth (see the sketch below)
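A sketch contrasting Round Robin with a Least Connections pick; in practice the connection counts would come from querying the backend servers, and the numbers here are made up.

```python
import itertools

servers = ["app-1", "app-2", "app-3"]

# Round Robin: rotate through the servers regardless of how busy they are.
round_robin = itertools.cycle(servers)
print(next(round_robin), next(round_robin))    # app-1 app-2

# Least Connections: pick the server currently handling the fewest requests,
# which avoids piling more traffic onto an already overloaded server.
open_connections = {"app-1": 120, "app-2": 35, "app-3": 80}   # illustrative counts
least_loaded = min(open_connections, key=open_connections.get)
print(least_loaded)                            # app-2
```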
Purging or DB Cleanup
- Remove links that have reached their expiration time
- Lazy cleanup
  - Whenever a user tries to access an expired link, delete the link and return an error (sketched below)
- Separate Cleanup Service
  - Runs periodically, removing expired links from storage and the cache
  - Kept lightweight and scheduled to run only when traffic is expected to be low
  - After removing an expired link, put its key back into the key-DB to be reused
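A sketch of the lazy-cleanup path on redirect, assuming illustrative in-memory stand-ins for the database, the cache, and the key-DB; a separate cleanup service would apply the same expiry check in bulk.

```python
from datetime import datetime, timezone
from typing import Optional

# Illustrative in-memory stand-ins for the key-value DB, the cache,
# and the KGS key-DB described above.
db = {
    "a1B2c3": {"original_url": "https://example.com/x",
               "expires_at": datetime(2020, 1, 1, tzinfo=timezone.utc)},
}
cache = {"a1B2c3": "https://example.com/x"}
unused_keys = set()

def resolve(key: str) -> Optional[str]:
    """Lazy cleanup: an expired link is deleted on access and an error (None) is returned."""
    entry = db.get(key)
    if entry is None:
        return None                            # unknown key -> error / 404
    if entry["expires_at"] <= datetime.now(timezone.utc):
        del db[key]                            # purge from storage ...
        cache.pop(key, None)                   # ... and from the cache
        unused_keys.add(key)                   # the key can go back to the key-DB
        return None                            # expired -> error
    return entry["original_url"]

print(resolve("a1B2c3"))   # None, since this sample link expired in 2020
```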
Telemetry
- Statistics to track
- Country of visitor
- Date and time of access
- Web page that referred the click / platform from which the page was accessed
Security and Permissions
- Can users create private URLs, or allow only a particular set of users to access a URL?
  - Store the permission level (public/private) with each URL in the DB
  - Use a separate table to store the UserIDs that have permission to see a specific URL
  - Return an HTTP 401 error if the user does not have access (see the sketch below)
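A small sketch of the permission check, assuming a permission level stored with each URL and a separate mapping of URL key to allowed UserIDs (both illustrative); it returns 401 when access is denied, as noted above.

```python
# Permission data as described above: level stored with each URL, plus a table
# of UserIDs allowed to see private URLs. All values are illustrative.
url_permissions = {"a1B2c3": "private", "x9Y8z7": "public"}
allowed_users = {"a1B2c3": {42, 77}}      # url_key -> set of permitted UserIDs

def authorize(url_key: str, user_id: int) -> int:
    """Return an HTTP status: 200 if the user may access the URL, 401 otherwise."""
    if url_permissions.get(url_key, "public") == "public":
        return 200
    return 200 if user_id in allowed_users.get(url_key, set()) else 401

print(authorize("a1B2c3", user_id=7))     # 401 (user 7 is not in the allowed set)
```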