System Design Advanced Technologies
A guide to advanced technologies commonly used in system design.
Table of Contents
Database Transactions
A transaction is a sequence of one or more database operations that are treated as a single logical unit of work. It ensures that all operations within the transaction are completed successfully before committing any changes to the database. If any operation fails, the entire transaction can be rolled back, leaving the database in its previous state.
ACID Properties
When a database supports transactions, its implementation typically adheres to the ACID properties; a transaction must comply with ACID to be considered robust and reliable.
- Atomicity: Ensures that all operations in a transaction are completed successfully; if any operation fails, the entire transaction fails and is rolled back.
- Consistency: Guarantees that a transaction brings the database from one valid state to another, maintaining all defined rules and constraints (e.g., integrity constraints).
- Isolation: Ensures that concurrently executed transactions do not interfere with each other, providing a clear and consistent view of the data.
- Durability: Ensures that once a transaction is committed, it remains so, even in the event of a system failure.
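The all-or-nothing behavior of atomicity can be sketched with Python's built-in sqlite3 module. The accounts table and transfer amounts below are invented for illustration:

```python
import sqlite3

# Minimal sketch of transaction atomicity using sqlite3 (Python stdlib).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE id = 2")
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass

# Both updates were rolled back together: the database is back in its prior state.
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 50}
```

If the exception were not raised, both updates would commit together; either way the database never exposes a state where only one half of the transfer happened.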
ANSI Isolation Levels
Transaction isolation levels determine how transaction integrity is visible to other transactions and how the interactions between competing transactions are managed. Different isolation levels provide varying balances of performance and consistency, impacting how transactions interact with one another.
- Read Uncommitted: Transactions can read data that has been modified but not yet committed by other transactions.
- Pros: Offers the highest level of concurrency and performance.
- Cons: Can lead to dirty reads, where transactions read uncommitted data that may be rolled back later.
- Read Committed: A transaction can only read data that has been committed at that moment. It cannot see uncommitted changes made by other transactions.
- Pros: Prevents dirty reads, ensuring that only committed data is read.
- Cons: Does not protect against non-repeatable reads or phantom reads, where the data read might change if read again within the same transaction.
- Repeatable Read: Ensures that if a transaction reads a row, it can read that row again and receive the same values, even if other transactions are modifying it.
- Pros: Prevents both dirty reads and non-repeatable reads.
- Cons: May still allow phantom reads, where new rows added by other transactions can be seen on subsequent reads.
- Serializable: The highest isolation level, where transactions are executed in such a way that they appear to be serial, meaning one transaction is completely executed before another begins.
- Pros: Prevents dirty reads, non-repeatable reads, and phantom reads.
- Cons: Most restrictive in terms of concurrency and can lead to performance issues due to increased locking and contention.
Common Default Isolation Levels
- PostgreSQL: Default is Read Committed.
- MySQL (InnoDB): Default is Repeatable Read.
Locking Mechanisms
Locking mechanisms are techniques used in databases to control concurrent access to data, ensuring data integrity and consistency during transactions. They can be broadly categorized into pessimistic and optimistic locking strategies. Locks should be held only briefly, to avoid deadlocks and keep the system responsive.
- Pessimistic Locking: This strategy involves locking the data being modified before the transaction begins, preventing other transactions from accessing it until the lock is released.
- Subtypes:
- Row-Level Locking: Locks individual rows in a table.
- Table-Level Locking: Locks the entire table, preventing access to all rows in the table.
- Pros & Cons:
- Pros:
- Ensures data integrity and consistency, preventing race conditions.
- Easy to implement; it avoids conflicts by serializing updates.
- Very useful when data contention is heavy.
- Cons:
- Can lead to deadlocks and reduced concurrency.
- Can be less efficient than optimistic locking, especially in low-concurrency environments.
- Best Practices:
- Lock only the necessary rows, so more transactions can proceed concurrently.
- Keep lock durations short; this reduces the risk of deadlocks and improves system responsiveness.
- Optimistic Locking: A locking strategy where transactions proceed without acquiring locks initially. Before committing, the transaction checks whether the data has been modified by other transactions. If a conflict is detected, the transaction is rolled back.
- Subtypes:
- Versioning Check-and-Set (CAS): Each row has a version number that increments with each update. Transactions check this version number during commit.
- Timestamping: Each row has a timestamp that is updated with each modification. Transactions check this timestamp during commit. However, server clock synchronization is required to ensure accurate timestamp comparison.
- Pros & Cons:
- Pros:
- Higher concurrency and performance, as it avoids unnecessary locking, especially in low-contention environments.
- Reduces the risk of deadlocks since locks are not held.
- Cons:
- Performance can degrade in high-contention environments due to frequent retries.
- Requires additional logic to handle conflict detection and resolution.
- Best Practices:
- Use optimistic locking in low-conflict scenarios to maximize performance.
- Monitor conflict rates to determine whether optimistic locking is appropriate for the workload.
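The version-based check-and-set flow can be sketched with sqlite3. The items table, column names, and quantities are invented for illustration:

```python
import sqlite3

# Sketch of optimistic locking with a version column (check-and-set).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER, version INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 10, 0)")
conn.commit()

def decrement_qty(conn, item_id):
    # Read current state, remembering the version we expect at commit time.
    qty, version = conn.execute(
        "SELECT qty, version FROM items WHERE id = ?", (item_id,)).fetchone()
    # CAS update: succeeds only if nobody bumped the version in the meantime.
    cur = conn.execute(
        "UPDATE items SET qty = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (qty - 1, item_id, version))
    conn.commit()
    return cur.rowcount == 1  # False means a conflict: caller should retry

assert decrement_qty(conn, 1) is True  # no conflict: qty 10 -> 9, version 0 -> 1

# Simulate a concurrent writer bumping the version behind our back:
qty, version = conn.execute("SELECT qty, version FROM items WHERE id = 1").fetchone()
conn.execute("UPDATE items SET version = version + 1 WHERE id = 1")
conn.commit()
cur = conn.execute(
    "UPDATE items SET qty = ?, version = version + 1 WHERE id = 1 AND version = ?",
    (qty - 1, version))
conn.commit()
print(cur.rowcount)  # 0 — the stale version was rejected; the caller must retry
```

No lock is ever held between the read and the write; the `WHERE ... version = ?` clause is what detects the conflict.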
Summary Table

Feature | Pessimistic Locking | Optimistic Locking |
---|---|---|
Definition | Locks data upfront to prevent access | Checks for conflicts before commit |
Sub-Locking Types | Row-level, Table-level, Shared, Exclusive | Versioning, Timestamp-based |
Pros | Guarantees data integrity | Higher concurrency, no deadlocks |
Cons | Reduced concurrency, potential for deadlocks | Possible rollbacks, conflict overhead |
Best Practices | Use row-level locks, implement timeouts | Monitor conflicts, provide feedback |
Use Cases | High-conflict environments (e.g., banking) | Low-conflict (e.g., read-heavy, collaborative tools) |

MVCC Alternative
Multi-Version Concurrency Control is a concurrency control mechanism that allows multiple versions of the same data to coexist, providing a way to manage concurrent access without blocking or locking. It is a more efficient approach for read-heavy workloads and allows for better concurrency. Modern databases like PostgreSQL and MySQL use MVCC instead of plain locking to achieve better concurrency. The underlying implementation differs by database: PostgreSQL implements MVCC primarily through optimistic concurrency control (OCC), while Oracle and MySQL InnoDB pair MVCC with row-level locking, a more pessimistic approach.
How it works:
- When a transaction reads data, it reads the version that was current at the time of the read, a consistent snapshot of the data is returned.
- When a transaction writes data, it creates a new version of the data, and the old version is kept in the database. Multiple versions of the same data can coexist, allowing for better concurrency.
- The key benefit of MVCC is that readers do not block writers, and writers do not block readers, which is achieved through versioning.
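The snapshot behavior described above can be illustrated with a toy in-memory version store. This is a teaching sketch only; it is not how PostgreSQL or InnoDB actually implement MVCC:

```python
# Toy MVCC illustration: writers append new versions tagged with a transaction
# id; readers see a snapshot as of their start id. Old versions are retained.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (txid, value), oldest first
        self.next_txid = 1

    def begin(self):
        txid = self.next_txid
        self.next_txid += 1
        return txid

    def write(self, txid, key, value):
        # Append a new version; old versions stay for concurrent readers.
        self.versions.setdefault(key, []).append((txid, value))

    def read(self, txid, key):
        # Return the newest version created at or before this snapshot.
        for ver, value in reversed(self.versions.get(key, [])):
            if ver <= txid:
                return value
        return None

store = MVCCStore()
t1 = store.begin()
store.write(t1, "x", "v1")
reader = store.begin()           # snapshot taken after t1
writer = store.begin()
store.write(writer, "x", "v2")   # the writer does not block the reader
print(store.read(reader, "x"))   # v1 — the reader's snapshot is unaffected
print(store.read(writer, "x"))   # v2
```

The key point is visible in the last two lines: the write created a second version instead of overwriting the first, so the reader's snapshot stayed consistent while the writer proceeded.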
Database Constraints
Constraints are rules applied to columns in a table to ensure the integrity and validity of the data. They enforce specific conditions that data must adhere to when being inserted or updated, ensuring that the data maintains certain standards.
Types of Constraints
Constraint Type | Description |
---|---|
Primary Key | Ensures that each row in a table has a unique identifier. |
Foreign Key | Ensures that a value in one table corresponds to a value in another table. |
Unique | Ensures that all values in a column are unique. |
Check | Ensures that the values in a column meet certain conditions. |
Not Null | Ensures that a column cannot contain null values. |
Constraints and transactions are closely related. Constraints help ensure the integrity and consistency of the data being manipulated by transactions.
- Enforce Data Integrity: Ensure that the data being inserted, updated, or deleted in a transaction adheres to the defined rules and constraints.
- Rollback on Failure: If a transaction attempts to perform an operation that violates a database constraint, the transaction will be rolled back, and the database will remain in a consistent state.
- Performance Considerations: While database constraints are essential for maintaining data integrity, they can also impact the performance of the database, especially for high-concurrency workloads.
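The rollback-on-violation interaction between constraints and transactions can be sketched with sqlite3. The table, the CHECK rule, and the amounts are invented for illustration:

```python
import sqlite3

# Sketch: a CHECK constraint violation aborts the whole transaction.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

try:
    with conn:  # transaction: all-or-nothing
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE id = 1")
        # This second update would drive the balance negative, violating CHECK.
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
except sqlite3.IntegrityError:
    pass  # the database raised the violation; the transaction rolled back

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 100 — the whole transaction rolled back, not just the bad update
```

Note that the first, perfectly valid update is undone too: the constraint failure rolls back the transaction as a unit, leaving the database consistent.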
Database Indexes
Indexes are data structures used to optimize the performance of database queries by allowing the database to quickly locate specific rows based on the values of one or more columns.
Types of Indexes
Index Type | Description |
---|---|
Primary Key | A unique index on a single column that serves as the primary key for the table. |
Unique Index | Ensures that all values in a column are unique. |
Composite Index | An index on multiple columns, allowing for efficient queries that involve combinations of those columns. |
B-Tree Index | A balanced tree structure that supports efficient range queries on single-dimension data (e.g., WHERE column BETWEEN value1 AND value2), equality queries (e.g., WHERE column = value), and sorting (e.g., ORDER BY column). |
Hash Index | Uses a hash function to quickly locate rows based on the values of one or more columns; efficient for equality queries (e.g., WHERE column = value). |
Bitmap Index | Uses a bitmap to quickly locate rows based on the values of one or more columns. |
Spatial Index | Used for geospatial data, allowing for efficient queries that involve spatial relationships. |
Full-Text Index | Used for full-text search, allowing for efficient queries that involve searching for specific text patterns. |
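The effect of adding an index can be observed with sqlite3's EXPLAIN QUERY PLAN. The table and index names below are invented, and the exact plan wording varies slightly across SQLite versions, so this is a sketch rather than a guaranteed output format:

```python
import sqlite3

# Sketch: how creating an index changes the query plan for an equality lookup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

def plan(sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the plan detail

query = "SELECT id FROM users WHERE email = 'a@example.com'"
before = plan(query)   # no index on email yet: full table scan

conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # now resolved via idx_users_email

print("INDEX" in before, "INDEX" in after)  # False True
```

Before the index exists the planner reports a scan of the whole table; afterwards it reports an index search, which is the difference the table above describes.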
Full Text Search
A search-optimized database is designed specifically for full-text search, allowing users to efficiently search through large volumes of text data. Traditional databases rely on exact matches and can be slow and inefficient for text search (e.g., LIKE queries require scanning the entire table), whereas search-optimized databases use techniques such as inverted indexes, tokenization, and stemming to enhance search performance. Use cases include Ticketmaster, which needs to search through a large number of events to find relevant results, or a social media platform like Twitter, which needs to search through a large number of tweets.
Key Features
- Inverted Indexes: A data structure mapping words to documents, enabling quick lookups for search terms.
{ "running": [doc1, doc2, doc3], "jumping": [doc1, doc3], "jumper": [doc2] }
- Tokenization: The process of breaking text into individual words for mapping in the inverted index.
- Stemming: Reducing words to their root forms (e.g., "running" to "run"), allowing different forms of the same word to match.
- Fuzzy Search: Allows for finding similar terms, accommodating typos and slight misspellings through algorithms that measure edit distances.
- Relevance Scoring: Assigns a score to each result based on how well it matches the search query, allowing you to prioritize results based on relevance.
- Scalability: Designed to scale horizontally across many servers, accommodating large amounts of data and high traffic loads through techniques like sharding and consistent hashing.
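The first three features can be combined into a toy index. The suffix-stripping "stemmer" here is deliberately crude (real engines use analyzers such as Porter stemming), and the documents are invented:

```python
import re
from collections import defaultdict

# Toy inverted index with tokenization and a crude suffix-stripping stemmer.

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:
                word = word[:-1]  # collapse doubled consonant: "runn" -> "run"
            break
    return word

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)  # word root -> set of documents
    return index

docs = {
    "doc1": "Running and jumping in the park",
    "doc2": "A jumper who loves runs",
    "doc3": "She ran while jumping",
}
index = build_index(docs)
# "running" and "runs" both stem to "run", so a search for either finds both docs:
print(sorted(index[stem("running")]))  # ['doc1', 'doc2']
```

A query is answered by stemming its tokens and intersecting the posting sets, which is exactly the fast lookup the inverted index structure above enables.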
Examples
- Elasticsearch: The leading search optimized database, built on Apache Lucene, known for its speed and scalability, used by companies like Netflix and Uber.
- Solr: A search platform built on Apache Lucene, known for its scalability and reliability, used by companies like Walmart and eBay.
- Postgres: Offers GIN index for full-text search.
- Redis: Offers full-text search capabilities, but is considered less mature.
Geospatial Search
Geospatial search is a technique used to search for data based on its location. It is used in applications that require searching for data based on its location, such as Yelp that needs to search for restaurants near a specific location and Uber that needs to search for drivers near a specific location.
Problems with using traditional index:
- Traditional B-Tree index is efficient for one-dimensional data, not for two-dimensional.
- Geospatial data is two-dimensional (latitude and longitude); even with an index, the dataset returned for each dimension can still be huge, and intersecting the two result sets can be quite slow.
- Furthermore, a B-tree index cannot capture the spatial relationship between geospatial points. Thus we need a way to map the two-dimensional data into one dimension.
Type of Geospatial Index
In a broad sense, there are two types of geospatial indexing approaches: hash-based and tree-based. Even though the underlying implementations of these approaches are different, the high-level idea is the same: divide the map into smaller areas and then build indexes for fast search.
- Google S2: A tree-based index structure that uses a Hilbert curve to map two-dimensional data to one-dimensional data; it's widely used by companies like Google, Uber, and Tinder.
- R-Tree: A tree-based index structure optimized for indexing multi-dimensional information (like geographic data), using bounding rectangles for efficient spatial queries.
- PostGIS: Uses R-tree as its geospatial index.
- Quadtree: A tree-based index structure that recursively partitions a two-dimensional space into four quadrants until each quadrant contains no more than a certain number of locations.
- Only the leaf nodes store the actual location data.
- Quadtree is an in-memory data structure; the tree is built at server startup time.
- A quadtree index does not take much memory (a few GBs) and can easily fit on one server.
- To find nearby locations, we start from the root and traverse the tree until we reach the leaf node containing the search location. If there are enough locations in the leaf node, we can return them directly. Otherwise, we visit its neighboring nodes recursively until we find enough locations.
- Pros:
- Adaptive Precision/Dynamic Grid Size: Can adjust to the population density.
- Nearest Neighbor Search: Well suited for fetching the k nearest locations, whether or not they fall within a specified radius.
- Geofencing Support: Efficient for geofencing queries on complex polygons or custom geometric shapes.
- Cons:
- Operational Overhead: More complex to implement and maintain. Updating the quadtree index is more expensive than geohash; in general, an index update takes O(log n) time. If the index is accessed by a multi-threaded program, proper locking is required to avoid race conditions. Moreover, rebuilding the quadtree index is time-consuming. We can take a rolling-update approach, incrementally rebuilding the quadtree on a small portion of servers at a time across the cluster, which means some servers return stale data for a short period.
- Mitigation: This is generally acceptable for most applications. We can set up a business agreement with clients that new locations only become available after a certain period of time, or update the quadtree via a cron job during off-peak hours.
- Memory Overhead: Quadtree index takes more memory than geohash.
- Not enough locations in the grid: The current grid may not contain enough locations, leading to inaccurate results.
- Mitigation: Increase the search radius by exploring the neighboring grids. Two options for finding neighboring grids:
- Connect all leaf nodes with double-linked list.
- Keep a parent pointer in each node.
- Examples:
- Elasticsearch: Uses both Geohash and Quadtree as geospatial indexes.
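The recursive split-into-four-quadrants idea can be sketched in a few dozen lines. This toy version works on a flat plane with invented point coordinates; real geospatial quadtrees must also deal with lat/lng projection and node capacity tuning:

```python
# Toy quadtree: split a square into four children once a node exceeds
# `capacity` points; only leaves store points.

class Quadtree:
    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size  # lower-left corner + side length
        self.capacity = capacity
        self.points = []      # populated on leaf nodes only
        self.children = None  # four sub-quadrants after a split

    def insert(self, px, py):
        if self.children is not None:
            self._child_for(px, py).insert(px, py)
            return
        self.points.append((px, py))
        if len(self.points) > self.capacity:
            self._split()

    def _split(self):
        half = self.size / 2
        self.children = [
            Quadtree(self.x, self.y, half, self.capacity),
            Quadtree(self.x + half, self.y, half, self.capacity),
            Quadtree(self.x, self.y + half, half, self.capacity),
            Quadtree(self.x + half, self.y + half, half, self.capacity),
        ]
        for px, py in self.points:          # push points down to the leaves
            self._child_for(px, py).insert(px, py)
        self.points = []

    def _child_for(self, px, py):
        half = self.size / 2
        idx = (1 if px >= self.x + half else 0) + (2 if py >= self.y + half else 0)
        return self.children[idx]

    def query(self, qx, qy, qsize):
        """Return all points inside the axis-aligned query box."""
        if qx > self.x + self.size or qx + qsize < self.x or \
           qy > self.y + self.size or qy + qsize < self.y:
            return []  # query box does not overlap this quadrant: prune it
        if self.children is None:
            return [(px, py) for px, py in self.points
                    if qx <= px <= qx + qsize and qy <= py <= qy + qsize]
        return [p for child in self.children for p in child.query(qx, qy, qsize)]

tree = Quadtree(0, 0, 100)
for p in [(10, 10), (12, 11), (80, 80), (81, 82), (50, 50), (15, 14)]:
    tree.insert(*p)
print(sorted(tree.query(0, 0, 20)))  # [(10, 10), (12, 11), (15, 14)]
```

The query prunes whole quadrants that do not intersect the search box, which is where the speedup over scanning every point comes from.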
- Geohash: A hash-based index structure that encodes geographic coordinates (latitude and longitude, two-dimensional data) into a short string of characters (one-dimensional data).
- Geohash usually uses base32 encoding, a 5-bit scheme that maps to a set of 32 characters (0–9 and b–z, excluding "a", "i", "l", and "o" to avoid confusion). Each group of 5 bits turns into one character in the geohash string.
- The length of the geohash string determines the precision. Longer strings provide more specific locations, while shorter ones cover larger areas. Normally, we are only interested in geohashes with lengths between 4 and 6.
- Nearby locations share common prefixes in their geohash strings, making it efficient to find nearby points using string matching. The longer the shared prefix between two geohash strings, the closer the two locations are.
- Pros:
- Simple Implementation: Easier to implement and integrate.
- Storage Efficient: Compact string representation, consumes less memory than quadtree.
- Prefix Property: Nearby locations share common prefixes in their geohash strings, making it efficient to find nearby points using string matching.
- Great for Key-Value Store: Works well with existing key-value store infrastructure.
- Cons:
- Fixed Precision/ Fixed Grid Size: As the precision (level) of geohash is fixed, the size of the grid is fixed as well. It cannot dynamically adjust to the population density.
- Boundary Issues: Nearby locations can have different geohash strings as they are on different sides of a box.
- Mitigation: The geohashes of the neighboring grids can be calculated in constant time.
- Limited Query Types: Queries are mostly limited to rectangular or radius-based location queries.
- Not enough locations in the grid: The current grid and neighboring grids may not contain enough locations, leading to inaccurate results.
- Mitigation: Increase the search radius by removing the last character of the geohash string, then use the new geohash string to find nearby locations. We can repeat this process until we find enough locations.
- Examples:
- Redis: Utilizes Geohash in conjunction with its sorted set data structure to implement efficient geospatial search.
- Redis encodes each geographic coordinate into a geohash-style 52-bit integer.
- Redis stores each geospatial point in a sorted set, where the member is the location's identifier and the score is the geohash-encoded coordinate.
- The sorted set allows Redis to efficiently perform range queries and proximity searches, such as finding all points within a certain distance from a specified location.
- Elasticsearch:
- Geospatial Index Types:
- Geo-point: A single geographic point on the earth's surface, represented by latitude and longitude coordinates. Elasticsearch uses geohash to index the Geo-point.
- Use Case:
- Point-based proximity search, such as finding all points within a certain distance from a specified location.
- Query Type:
- geo_distance: Find points within a certain distance from a specified location.
- geo_bounding_box: Find points within a rectangular bounding box.
- Geo-shape: A polygon on the earth's surface, represented by a list of coordinates. Elasticsearch previously used Quadtree to index the Geo-shape.
- Use Case:
- Spatial relationship checks, such as finding all points within custom boundaries.
- Query Type:
- geo_polygon: Find points within a polygon.
- geo_shape: Find points within a complex shape.
- How to allow search by predefined location names such as city or neighborhood names, e.g., "Pizza in San Francisco"?
- Create an index or table that maps predefined location names to their polygon coordinates. This data can be obtained from open data sources. In Elasticsearch, we can use a geo-shape index to store this info.
- When a user searches for "Pizza in San Francisco", we can use the predefined location name to perform a query that finds all points within the polygon of San Francisco. Both Postgres (via the PostGIS extension) and Elasticsearch have functionality for working with polygons. In Elasticsearch, we can use a geo_polygon query to find all points within the polygon.
Further Optimization:
Instead of filtering on polygon for each request, we can pre-compute the areas for each location upon creation and store them as a list of location identifiers in our location table. Example location document:
{ "id": "123", "name": "Pizza Place", "location_names": ["bay_area","san_francisco", "mission_district"], "category": "restaurant" }
Now we can leverage the inverted index on location_names to quickly find all locations within the predefined location names. The combination of the geospatial index and the inverted index allows Elasticsearch to efficiently handle spatial queries and search by geographic parameters.
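The prefix property that makes geohash work can be demonstrated with a minimal encoder. This follows the standard base32 alphabet and bit-interleaving scheme; in practice you would use a library rather than rolling your own:

```python
# Minimal geohash encoder: interleave longitude and latitude bits (longitude
# first), emitting one base32 character per 5 bits.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, precision=6):
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, hashstr = [], []
    even = True  # geohash interleaves bits, starting with longitude
    while len(hashstr) < precision:
        rng, val = (lng_range, lng) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:       # binary-search the range, recording each decision
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
        if len(bits) == 5:   # every 5 bits become one base32 character
            hashstr.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(hashstr)

# Nearby points share a common prefix; distant points do not.
sf_downtown = geohash_encode(37.7749, -122.4194)
sf_mission  = geohash_encode(37.7599, -122.4148)
nyc         = geohash_encode(40.7128, -74.0060)
print(sf_downtown[:4] == sf_mission[:4], sf_downtown[:2] == nyc[:2])  # True False
```

Two San Francisco points share the prefix "9q8y" while New York's geohash starts with different characters entirely, which is exactly the property that lets a prefix match stand in for a proximity search.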
Database Storage Model
Relational Database Storage Model
The general principles of using data files, page structures, a write-ahead log (WAL), and specialized index structures are common across all modern RDBMSs.
Storage Components
- Data Files:
- The actual table data and index data are stored in data files, can be a single file or set of files.
- These files are organized in a way that facilitates efficient access and retrieval of data.
- Page Structures:
- Within the data file, data is stored in fixed-size blocks called pages or data blocks.
- A page is the smallest unit of data transferred between the buffer pool and the data file, typically 8 KB by default.
- Data rows in a table are stored within a page in compact format.
- Write-Ahead Log (WAL):
- Sequential log of all the changes that have been made to the database data
- Used to ensure data durability and recoverability by recording changes before they are applied to data files.
- WAL is just a plain file where new entries are appended to the end of the file.
- WAL has pure sequential read and write access pattern, where the disk performance is good.
Write Path
- Transaction Log (WAL) to Disk:
- Changes are first written to WAL on disk.
- This is a sequential write operation, which is fast.
- Once the changes are persisted to WAL, the transaction is considered to be committed.
- The WAL is critical for durability and disaster recovery.
- Once the data has been persisted into data files, the WAL is no longer needed and can be truncated to free up space.
- Buffer Cache Update in Memory:
- Changes on table and index data are then made to data pages stored in the memory buffer cache.
- When pages are modified, they are marked as dirty to indicate that they need to be flushed to disk eventually.
- Buffer cache is used to improve read performance by reducing frequent disk I/O operations.
- The buffer cache uses a hash table to store the mapping between page ids and the data pages.
- Background Flush from Memory to Disk:
- Dirty pages in memory are flushed to actual data files on disk periodically.
- This happens asynchronously through a background writer process, when the buffer cache is full or when the checkpoint is reached.
- This delayed flush strategy allows the system to batch multiple changes together, reducing the number of disk I/O operations and improving performance.
Read Path
- Buffer Cache Hit:
- If the data is in the buffer cache, it is read from the buffer cache.
- Buffer Cache Miss:
- If the data is not in the buffer cache, it is read from the data files on disk.
- RDBMSs use a B-tree to store the mapping from primary key to the row's position (data file and page) on disk for efficient lookup.
File ID + Page Number + Slot Number = Row Location
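The "log first, apply later, replay on crash" pattern can be sketched with a toy write-ahead log. The key=value log format and file names here are invented purely for illustration:

```python
import os
import tempfile

# Toy WAL: record each change in the log first, and replay the log on restart
# to recover updates that never made it into the data files.

class TinyWAL:
    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        # Sequential append + fsync: this is the durability/commit point.
        with open(self.path, "a") as f:
            f.write(f"{key}={value}\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Re-apply every logged change in order; later entries win.
        data = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    key, value = line.rstrip("\n").split("=", 1)
                    data[key] = value
        return data

wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWAL(wal_path)
wal.append("user:1", "alice")
wal.append("user:1", "alicia")   # later update wins on replay

# Simulate a crash before the in-memory pages were flushed to data files:
recovered = TinyWAL(wal_path).replay()
print(recovered)  # {'user:1': 'alicia'}
```

Because every change hit the sequential log before being acknowledged, nothing is lost even though the (simulated) data files were never updated.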
Summary
- The WAL-first approach is fundamental to how modern RDBMSs ensure data durability while maintaining good performance for write operations. The actual writing to data files on disk happens in the background asynchronously and is optimized with batching.
- Each index update follows the same steps as table data updates. This is why having many indexes on the same table can slow down the write significantly.
- The buffer cache is used to improve read performance by reducing frequent disk I/O operations.
NoSQL Database Cassandra Storage Model
Cassandra leverages a data structure called Log Structured Merge Trees (LSM Tree) to achieve high throughput write performance while maintaining reasonable read performance. The LSM tree is used in place of B-Tree, which is the index of choice for most databases (relational DBs, DynamoDB, etc).
Cassandra is designed for high write throughput, favoring write speed over read speed. Every write operation (create, update, delete) is treated as a new entry. Cassandra uses the ordering of these updates to determine the "state" of a row. The LSM tree enables Cassandra to efficiently understand the state of a row, while writing data to the database as almost entirely "append only" writes.
LSM Tree Components
- WAL:
- Write-Ahead Log, similar to the WAL in RDBMS.
- Ensure data durability and crash recovery.
- Memtable:
- In-memory data structure that stores the data in a sorted order (typically a B-Tree or Skip List).
- Accumulates writes until it is full, then flushes to disk as SSTables.
- SSTable:
- Sorted String Table, on-disk data structure that stores the data in a sorted order.
- SSTables are immutable files containing sorted key-value pairs.
- SSTables are merged and compacted in the background to optimize read performance.
- SSTable Indexing: Cassandra stores index files that map each primary key to its data position (byte offset) in the SSTable file, enabling faster retrieval of data on disk. This is somewhat similar to how a B-tree points to data on disk.
Write Path
- Write to WAL:
- All write operations are first written to the WAL.
- This is a fast sequential disk write, ensures data durability and crash recovery.
- Write to Memtable:
- Same data then written to the in-memory memtable.
- Data is sorted by the primary key.
- Flush Memtable to SSTable:
- When the memtable is full, it is flushed to disk as SSTables.
- Corresponding WAL is no longer needed and can be truncated to free up space.
- Background Compaction:
- SSTables are merged and compacted in the background to save disk space and improve read performance.
- Compaction also performs the actual deletion of data.
Read Path
- Read from Memtable:
- If the data is in the memtable, it is read from the memtable.
- Since data is sorted by the primary key, it is efficient to find a particular key.
- Read from SSTable:
- If the data is not in the memtable, it is read from the SSTable.
- Cassandra uses a Bloom Filter to quickly determine which SSTables contain the data.
- Data in SSTables is sorted by the primary key, so it is efficient to find a particular key.
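The memtable/SSTable read and write paths can be sketched with a toy LSM tree. This is a drastically simplified teaching model: the WAL, Bloom filters, and compaction are omitted, and the flush threshold is invented:

```python
import bisect

# Toy LSM tree: writes land in an in-memory memtable, which is flushed to an
# immutable sorted run (an "SSTable") when full; reads check the memtable
# first, then SSTables newest-first.

class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}    # most recent writes, in memory
        self.sstables = []    # list of sorted [(key, value)] runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the memtable as one immutable sorted run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):    # newest SSTable wins
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)  # binary search: runs are sorted
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2); db.put("c", 3)   # third put triggers a flush
db.put("a", 10)                                   # newer value lives in the memtable
print(db.get("a"), db.get("b"), db.get("z"))      # 10 2 None
```

Note how the update to "a" never touched the flushed run: the old value simply becomes shadowed, and background compaction (omitted here) is what eventually reclaims it.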
Redis Transactions
Despite Redis being single-threaded, multiple clients can interleave commands, creating race conditions that require explicit atomicity guarantees. There are 3 ways to achieve atomicity for multiple operations:
- Built-in Atomic Commands:
- Redis provides built-in atomic commands like `SET`, `INCR`/`DECR`, `SADD`, `ZADD`, etc.
- Redis provides specialized commands for distributed locking using the `SETNX` + `EXPIRE` pattern. Consider the following example: `SET lock:resource-name unique-value PX 10000 NX`
- `SET` is the command to set the value of the key.
- `lock:resource-name` is the key name.
- `unique-value` is the value to set the key to.
- `NX` - set the key only if it does not exist, ensuring only one client can acquire the lock.
- `PX 10000` - set the expiration time to 10 seconds, ensuring the lock is released if the client crashes.
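The acquire-only-if-absent-or-expired semantics of that lock pattern can be mimicked with a tiny in-memory stand-in. This only simulates the behavior; against a real deployment you would issue the actual SET ... NX PX command to a Redis server:

```python
import time

# Toy simulation of the Redis "SET key value NX PX ttl" lock semantics:
# acquisition succeeds only if the key is absent or its TTL has expired.

class FakeRedis:
    def __init__(self):
        self.store = {}  # key -> (value, expires_at)

    def set_nx_px(self, key, value, ttl_ms):
        now = time.monotonic()
        current = self.store.get(key)
        if current is not None and current[1] > now:
            return False  # key exists and has not expired: the lock is held
        self.store[key] = (value, now + ttl_ms / 1000)
        return True

r = FakeRedis()
print(r.set_nx_px("lock:order-42", "client-a", 10_000))  # True: lock acquired
print(r.set_nx_px("lock:order-42", "client-b", 10_000))  # False: already held
```

The TTL is what makes the pattern crash-safe: if client-a dies without releasing the lock, the key expires and another client can acquire it.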
- Redis Transactions (MULTI/EXEC):
- The MULTI command is used to group multiple commands together.
- The EXEC command is used to execute the commands in the group.
- No rollback capability: even if one command fails, the remaining commands continue to execute.
- Lua Scripts:
- Lua scripts are a way to execute multiple commands together.
- Send a single script to be executed atomically.
- The script is evaluated as a single Redis command, executed via the `EVAL` command.
- More powerful than MULTI/EXEC: allows loops, conditional execution, error handling, and more.
Redis Persistence
Persistence is the process of writing data to durable storage. Redis provides two types of persistence: RDB and AOF. You can choose to use both or just one of them.
- RDB: Redis Database (RDB) persistence performs point-in-time snapshots of the data at a specific interval.
- Pros:
- Performance: Faster write operations because the data is written to the disk in a separate process.
- Backup: RDB is a very compact single-file that perfect for backup and disaster recovery.
- Cons:
- Data Loss: If the server crashes, data written since the last snapshot is lost.
- AOF: Redis Append Only File persistence writes every write operation to a log file.
- Pros:
- Data Durability: AOF ensures better data durability than RDB; fsync is used to flush the data to disk.
- Automatic Compaction: Because AOF is an append-only file, Redis automatically rewrites it when it grows too large. This background process is safe and incurs no downtime, as a new file is created and swapped with the old one.
- Cons:
- Performance Overhead: Slower write operations, depending on the fsync frequency.
- Storage Overhead: AOF files are usually larger than equivalent RDB files for the same data.
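The append-then-replay idea behind AOF, including the rewrite compaction, can be sketched with a toy file-backed store. The `SET key value` log format is invented for illustration and is not Redis's actual AOF format:

```python
import os
import tempfile

# Toy append-only-file persistence: every write is appended to a log and
# replayed on restart; rewrite() compacts the log to one entry per key.

class TinyAOF:
    def __init__(self, path):
        self.path = path
        self.data = self._replay()

    def _replay(self):
        data = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    _, key, value = line.rstrip("\n").split(" ", 2)
                    data[key] = value          # later entries win
        return data

    def set(self, key, value):
        with open(self.path, "a") as f:        # append-only write path
            f.write(f"SET {key} {value}\n")
        self.data[key] = value

    def rewrite(self):
        # Compact: write the current state to a fresh file, then swap it in.
        tmp = self.path + ".rewrite"
        with open(tmp, "w") as f:
            for key, value in self.data.items():
                f.write(f"SET {key} {value}\n")
        os.replace(tmp, self.path)

path = os.path.join(tempfile.mkdtemp(), "appendonly.aof")
db = TinyAOF(path)
db.set("name", "alice")
db.set("name", "alicia")
db.rewrite()                                   # log shrinks to one entry per key
restarted = TinyAOF(path)                      # "restart": state comes from replay
print(restarted.data, sum(1 for _ in open(path)))  # {'name': 'alicia'} 1
```

The write-then-replace swap in rewrite() is what lets compaction run safely in the background: readers of the old file are unaffected until the atomic rename.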
Distributed Systems Architecture
Distributed systems involve organizing nodes in a system to manage data and requests efficiently across multiple machines. Two primary architectures are commonly used: Master-Slave (Leader-Follower) Architecture and Decentralized Peer-to-Peer Architecture.
- Master-Slave Architecture: A master-slave architecture is a design pattern where a single master node is responsible for managing the data and a set of slave nodes are responsible for reading the data.
- How it works:
- Master Node: Acts as the coordinator, maintaining the authoritative state and managing cluster configuration.
- Slave Nodes: Replicate data from the master and handle client requests. Write operations are forwarded to the master for processing.
- Failover: If the master fails, a new master is elected from the slaves.
- Pros:
- Centralized control simplifies consistency management.
- Efficient handling of failover scenarios.
- Easy to implement and understand.
- Cons:
- Single point of failure (the master node).
- Performance bottlenecks if the master becomes overloaded.
- Common Use Cases:
- Relational databases (e.g., MySQL, PostgreSQL)
- Distributed caching systems (e.g., Memcached, Redis)
- Decentralized Peer-to-Peer Architecture: A decentralized peer-to-peer architecture is a design pattern where each node is independent and self-contained, without a central master node.
- How it works:
- Equal Nodes: All nodes operate as peers with equal roles, sharing complete copies of the data.
- Consensus Protocols: Updates are synchronized across all nodes using protocols like Raft or Paxos, which maintain data consistency and fault tolerance.
- Dynamic Leadership: Leadership is elected dynamically for consensus, and client requests can be directed to any available node.
- Pros:
- High availability and scalability, with no single point of failure.
- Robust against node failures, as remaining nodes can continue operating.
- Cons:
- More complex implementation and management due to the need for consensus protocols.
- Requires mechanisms for conflict resolution to maintain data consistency across nodes.
- Common Use Cases:
- NoSQL databases (e.g., Apache Cassandra, DynamoDB)
- Messaging systems (e.g., Apache Kafka)
- Search engines (e.g., Elasticsearch)
Load Balancer
- L4 Load Balancer: Preferred for applications where performance and low latency are critical, particularly for protocols that maintain long-lived connections. L4 load balancers handle high traffic volumes efficiently because they never inspect request content.
- Key Concepts:
- Operate at the Transport Layer (TCP/UDP).
- Make routing decisions based on IP addresses and ports without inspecting packet content.
- Best suited for protocols requiring persistent connections, like WebSockets, WebRTC.
- L7 Load Balancer: Excels in scenarios where application-level routing is necessary, enabling advanced features like A/B testing, user session persistence, and traffic management based on request characteristics. L7 load balancers are particularly useful for web applications that rely heavily on HTTP.
- Key Concepts:
- Operate at the Application Layer, understanding protocols like HTTP.
- Can inspect the content of requests (e.g., URL, headers, cookies) to make intelligent routing decisions.
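The L4/L7 distinction comes down to what information the routing decision can see. A tiny sketch of an L7-style decision, with made-up pool names and rules for illustration:

```python
# L7 routing sketch: the balancer inspects path and headers to pick a
# backend pool. An L4 balancer could not make these decisions, since it
# only sees IP addresses and ports.
def route(request):
    path = request["path"]
    headers = request.get("headers", {})
    if path.startswith("/api/"):
        return "api-pool"
    if "experiment=b" in headers.get("Cookie", ""):
        return "variant-b-pool"            # A/B testing via cookie inspection
    return "web-pool"

print(route({"path": "/api/users"}))                                # api-pool
print(route({"path": "/", "headers": {"Cookie": "experiment=b"}}))  # variant-b-pool
```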
Client Realtime Updates
Simple Polling
-
How it works: Simple polling involves the client making regular HTTP requests to the server to check for updates. The client sends a request at fixed intervals (e.g., every 2 seconds), and the server responds with the current state of the data.
-
Http Headers: Using standard HTTP headers
- Connection: Keep-Alive - Keep the TCP connection open after the response is sent, so the client can reuse the same connection for multiple requests to reduce connection overhead.
- Keep-Alive: timeout=5, max=1000 - Configure the keep-alive timeout and maximum number of requests per connection. The server will close the connection after 5 seconds if there are no requests. The maximum number of requests is 1000.
-
Load Balancer for Polling:
- Polling is stateless, so a Layer 7 load balancer with round robin or least connections is usually sufficient.
-
Pros:
- Simplicity: Easy to implement and understand.
- Stateless: Does not require maintaining session state.
-
Cons:
- Latency: Updates can be delayed by the polling interval plus processing time.
- Limited Frequency: The update rate is constrained by the polling interval.
- Resource Intensive: Can strain server resources with many clients, as each request may require establishing new connections.
-
When to Use:
- Real-time updates and data precision are not critical; a slight delay is acceptable and won't affect the user experience.
- Data updates are predictable and consistent, which allows an efficient polling interval to be chosen.
- The need for updates is short-lived.
-
Discussion Points in Interview:
- Trade-offs: Clearly explain the decision to use polling over more complex methods, emphasizing its simplicity and how it allows focusing on other system components.
- Addressing Objections: Be prepared to discuss concerns about speed and efficiency. Justify your chosen polling frequency and suggest optimizations, such as using HTTP keep-alive connections to minimize connection overhead.
- We can further enhance the user experience by implementing a smart buffering system, which intentionally lags the data shown to the client by one or two intervals. This creates a smoother, more consistent experience with a more continuous animation, and the intentional lag also helps compensate for network latency.
- Jitter refers to the deliberate introduction of randomness to polling intervals. Instead of polling at fixed, regular intervals (e.g., exactly every 5 seconds), you add a small random variation to each interval. It helps to smooth out the load on the server and prevent the thundering herd problem.
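The polling loop with jitter can be sketched as follows; `fetch_updates` is a stand-in for a real HTTP GET over a keep-alive connection:

```python
import random
import time

def next_interval(base=5.0, jitter=0.5):
    # Add +/- `jitter` seconds of randomness so many clients don't all hit
    # the server at the same instant (thundering herd).
    return base + random.uniform(-jitter, jitter)

def poll(fetch_updates, rounds=3, base=5.0, jitter=0.5):
    # fetch_updates() stands in for an HTTP GET returning the current state.
    results = []
    for _ in range(rounds):
        results.append(fetch_updates())
        time.sleep(next_interval(base, jitter))
    return results
```

Each client waits a slightly different amount of time between requests, so request arrivals spread out instead of clustering at fixed tick boundaries.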
Long Polling
-
How it works: Long polling enhances the basic polling method to achieve near real-time updates. The client sends an HTTP request to the server, which holds the request open until new data is available. Once the server has new data, it responds to the client, which immediately sends a new request, creating a continuous cycle. If no new data arrives, the server may return an empty response after a timeout, prompting the client to make another request.
-
Http Headers: Using standard HTTP headers
- Connection: Keep-Alive
- Keep-Alive: timeout=60, max=1000 - Potentially longer server timeouts
-
Load Balancer for Long Polling:
- Long polling is stateless, but a Layer 7 load balancer is needed, with careful attention to connection timeout settings.
- For optimal long polling/SSE performance, you typically start with least connections for initial distribution, then rely on sticky sessions to maintain connection affinity throughout the session lifecycle. This ensures that a client maintains a consistent connection with the same backend server.
-
Pros:
- Standard HTTP: Leverages existing HTTP infrastructure, making it widely compatible.
- Stateless: Maintains a stateless server-side architecture.
- Reduced Bandwidth: Lower usage due to fewer requests.
- Near Real-Time Updates: Provides more consistent and near real-time updates.
-
Cons:
- Higher Latency: Can introduce delays, especially for high-frequency updates, due to the round-trip nature of requests.
- HTTP Overhead: Increased overhead from frequent HTTP requests.
- Resource Intensive: Can strain server resources with many clients, as each request may require establishing new connections.
-
When to Use:
- Ideal for applications requiring near real-time updates where updates are infrequent.
- Particularly useful for scenarios like payment processing, where you need to know the status of a long-running process as soon as it completes.
- If the latency of simple polling is a concern, long polling provides an effective upgrade with minimal added complexity.
-
Discussion Points in Interview:
- Infrastructure Compatibility: Emphasize that long polling uses standard HTTP, minimizing the need for additional infrastructure.
- Polling Frequency: Be specific about the expected polling frequency and ensure that your infrastructure can handle long-lived requests. This is crucial to avoid issues with load balancers timing out connections prematurely (e.g., if they close connections after 30 seconds while your server can hold them for longer).
- Trade-offs: Discuss the balance between the simplicity of long polling and its limitations, particularly in terms of latency and resource usage.
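The request-hold-respond cycle above can be sketched with a queue standing in for the server's pending data; in a real system the "server" side would be an HTTP handler holding the request open:

```python
import queue
import threading

# Long polling sketch: the "server" blocks until data arrives or a timeout
# expires; the client immediately re-issues the request after each response.
updates = queue.Queue()

def long_poll(timeout=30.0):
    # Stand-in for an HTTP handler that holds the request open.
    try:
        return updates.get(timeout=timeout)   # block until new data arrives
    except queue.Empty:
        return None                           # empty response after timeout

def client_loop(rounds=2):
    received = []
    for _ in range(rounds):
        data = long_poll(timeout=1.0)
        if data is not None:
            received.append(data)
        # a real client would immediately loop and send the next request
    return received

# Simulate the server completing a long-running process after 100 ms:
threading.Timer(0.1, lambda: updates.put("payment: completed")).start()
print(client_loop())    # ['payment: completed']
```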
Server Sent Events (SSE)
-
How it works: SSE is a server-side streaming technology built on top of HTTP that allows the server to send a continuous stream of data to the client over a single HTTP connection. SSE connections are designed to be long-lived, with automatic reconnection handling that uses the "last event ID" to resume the data flow seamlessly. Browsers support SSE natively via the EventSource API.
-
Http Headers: Standard HTTP headers with specialized content type
- Connection: Keep-Alive
- Cache-Control: no-cache - The client should not cache the response
- Last-Event-ID: 123 - The client can send this header to the server to resume the stream from the last event ID.
- Content-Type: text/event-stream - Response header to indicate that the server sends the data in the response body as a stream of events.
- Transfer-Encoding: chunked - Response header to indicate that the response body is a series of unknown-sized chunks.
-
Load Balancer for SSE:
- SSE is stateful due to the persistent connection.
- Layer 7 load balancers are preferred because they can manage persistent HTTP connections effectively.
- For optimal long polling/SSE performance, you typically start with least connections for initial distribution, then rely on sticky sessions to maintain connection affinity throughout the session lifecycle. This ensures that a client maintains a consistent connection with the same backend server.
-
Pros:
- Standard HTTP: Leverages existing HTTP infrastructure, making it widely compatible.
- Reduced Overhead: Reduces overhead from connection initiation and teardown compared to long polling.
- Automatic Reconnection: Clients can automatically reconnect if the connection drops.
- Built-in Browser Support: Modern browsers natively support SSE through the EventSource object.
- Real-Time Updates: Provides near real-time updates.
-
Cons:
- One-Way Communication: Only allows data to flow from the server to the client, not vice versa.
- Proxy Issues: Some proxies and networking equipment may not support streaming responses, complicating debugging.
- Limited Browser Support: While most modern browsers support SSE, they impose limits on the number of concurrent SSE connections per domain (typically a maximum of 6 over HTTP/1.1).
- Mitigation:
- Using HTTP/2 can help as it allows multiplexing with multiple streams over a single connection
- Using Websocket instead, which typically has higher connection limits
-
When to Use:
- SSE is ideal for applications requiring efficient, near real-time updates where the server needs to push data to the client without frequent client requests. It’s particularly suitable for scenarios like AI applications that need to stream data (e.g., words or tokens) to maintain a responsive user experience. However, ensure that the entire infrastructure supports streaming responses, as some proxies may not handle it well.
-
Discussion Points in Interview:
- Infrastructure Compatibility: Highlight that SSE utilizes existing HTTP infrastructure, reducing the need for additional setup.
- Connection Management: Discuss how SSE connections are typically short-lived (30-60 seconds) and how clients handle reconnections using the "last event ID" to resume data flow seamlessly.
- Limitations and Considerations: Be prepared to mention the potential issues with proxies and load balancers that may not support streaming, and how this can affect application performance and debugging.
- SSE can leverage HTTP/2-3 multiplexing naturally since it's standard HTTP, so multiple SSE subscriptions (different event sources) can share the same connection without blocking each other.
- HTTP/2-3 multiplexing allows multiple concurrent requests/responses over a single connection.
- Reference Architecture: Consider using a dedicated SSE layer (realtime servers) that sits between clients and your application servers. This isolates connection management concerns from your business logic services, which is valuable when you need to deploy application updates frequently without disrupting client connections.
WebSockets
-
How it works: WebSocket is a bidirectional streaming technology that upgrades an HTTP connection to the WebSocket protocol. It provides a persistent, bi-directional communication channel over a single, long-lived connection between the client and the server. This allows for real-time, two-way communication, making it ideal for high-frequency read and write operations.
The client begins with a standard HTTP request (the initial HTTP handshake) asking for a WebSocket upgrade. If the server supports WebSockets, it responds with a 101 Switching Protocols response to complete the handshake. The same TCP connection remains open, but the protocol is switched from HTTP to WebSocket.
Other bidirectional streaming technologies, such as HTTP/2 streaming and gRPC over HTTP/2, are also possible. However, both require HTTP/2 and do not work with HTTP/1.1, which can demand fundamental architecture changes.
- gRPC depends on HTTP/2's binary framing protocol, stream multiplexing, and full-duplex capabilities.
-
Initial HTTP Handshake Headers:
- Connection: Upgrade - Indicates that the client is requesting a WebSocket connection.
- Upgrade: websocket - Indicates that the client is requesting a WebSocket connection.
-
Load Balancer for WebSockets:
- WebSockets are stateful due to the persistent connection.
- L4 load balancers support WebSockets natively, since all traffic for a session flows over the same TCP connection.
-
Pros:
- Full-Duplex: Supports simultaneous read and write operations.
- Low Latency: Reduces overhead compared to HTTP due to the absence of headers in each message.
- Efficient for Frequent Messages: Optimized for scenarios with frequent data exchange.
- Wide Browser Support: Broadly supported by modern browsers, with higher concurrent connection limits per domain (approximately 256).
-
Cons:
- Complex Implementation: More challenging to implement compared to simpler alternatives.
- Special Infrastructure: Requires infrastructure that supports WebSocket connections.
- Stateful Connections: Stateful nature complicates load balancing and scaling.
- Reconnection Handling: Requires logic to handle connection drops and reconnections; unlike SSE, WebSockets do not automatically recover from connection timeouts.
-
When to Use:
- WebSockets are the best choice when you need high-frequency, bi-directional communication.
- However, consider whether you truly need bi-directional communication before opting for WebSockets, as simpler solutions like SSE or polling might suffice for scenarios primarily involving server-to-client data flow. A very common pattern is to have SSE subscriptions for updates and do writes over simple HTTP POST/PUT whenever they occur.
-
Discussion Points in Interview:
- Connection Management: Be prepared to discuss how you'll manage connections, handle reconnections, and address server restarts during deployments.
- State Management: Address how to minimize the spread of state across your architecture, especially in senior-level interviews.
- Scalability: Discuss strategies for scaling WebSocket servers, such as using a "least connections" load balancing strategy and offloading intensive processing to separate, scalable services.
- Reference Architecture: Consider using a dedicated WebSocket layer (realtime servers) that sits between clients and your application servers. This isolates connection management concerns from your business logic services, which is valuable when you need to deploy application updates frequently without disrupting client connections.
WebRTC: The Peer-to-Peer Solution
- How it works: WebRTC enables direct peer-to-peer communication between browsers, making it ideal for video/audio calls and collaborative applications. Clients use a central signaling server to discover peers and exchange connection information. WebRTC uses STUN (Session Traversal Utilities for NAT) for NAT traversal via "hole punching," and TURN (Traversal Using Relays around NAT) as a relay service when direct connections aren't possible.
In practice, the signaling server is relatively lightweight and isn't handling much of the bandwidth as the bulk of the traffic is handled by the peer-to-peer connections. But interestingly the signaling server does effectively act as a real-time update system for its clients (so they can find their peers and update their connection info) so it either needs to utilize WebSockets, SSE, or some other approach detailed above.
-
Setup Process:
- Peers connect to a signaling server to discover each other.
- Peers exchange connection information (ICE candidates).
- A direct peer connection is established, using STUN/TURN if needed.
- Audio/video streams or data are sent directly between peers.
-
Pros:
- Direct Peer Communication: Enables direct connections between clients.
- Lower Latency: Reduces latency by eliminating intermediary servers for data transmission.
- Reduced Server Costs: Offloads traffic from servers to peer connections.
- Native Audio/Video Support: Built-in support for audio and video streaming.
-
Cons:
- Complex Setup: More complex to set up than WebSockets.
- Requires Signaling Server: Needs a signaling server for peer discovery.
- NAT/Firewall Issues: Vulnerable to issues related to NAT and firewalls.
- Connection Setup Delay: Involves connection setup delays due to NAT traversal.
-
When to Use: WebRTC is suitable for video/audio calls, screen sharing, gaming, and collaborative applications where low latency and direct peer communication are important. It can also be used to reduce server load by having clients communicate directly, as seen in applications like collaborative document editing.
-
Discussion Points in Interview:
- Justification: Clearly explain why WebRTC is necessary, such as for video conferencing or to handle scale by enabling direct client communication.
- Infra Requirements: Demonstrate knowledge of infrastructure requirements like STUN/TURN servers and signaling servers.
- Communication Patterns: Discuss communication patterns between peer clients and synchronization with a central server for data storage or reporting.
- Trade-offs: Be prepared to discuss the trade-offs of using WebRTC versus other solutions, especially in scenarios with constraints like limited resources.
Client Realtime Updates Overview
To choose the best method for server-to-client event delivery, consider these options:
- Simple Polling: Use if latency isn't a concern.
- SSE (Server-Sent Events): Opt for this if you need server-to-client communication but not bi-directional.
- WebSocket: Choose this for frequent, bi-directional communication.
- WebRTC: Use this specifically for audio/video calls.
Server Realtime Updates
After choosing a method for server-to-client updates (Simple Polling, Long-Polling, SSE, WebSockets, WebRTC), the next step is to consider how updates propagate from their source to the server. There are three common patterns for triggering these updates:
- Pulling via Polling: The server periodically checks for updates.
- Pushing via Consistent Hashes: Updates are pushed to the server using a consistent hashing strategy.
- Pushing via Pub/Sub: Updates are pushed to the server through a publish-subscribe system.
Pulling via Polling
-
How it works:
Clients repeatedly ask the server for updates. The server maintains a database of updates and responds to client requests.
-
Pros:
- Simple to implement.
- State is constrained to the database.
- No special infrastructure is needed.
-
Cons:
- High latency.
- Excessive database load if updates are infrequent and polling is frequent.
-
When to use:
- When real-time updates are not required and some delay is acceptable.
-
Discussion Points in Interview:
- Discuss how you are storing updates and querying the database.
- Consider the read volume from frequent polling, especially with a large number of clients (e.g., a million clients polling every 10 seconds results in 100k TPS of read volume).
Pushing via Consistent Hashing
-
How it works: This approach ensures updates reach the correct clients, especially with persistent connections (long-polling, SSE, WebSockets). To enable this, you need to ensure that:
- The server can determine which client is connected to which server:
-
Simple Hashing: Initially, a simple hashing method (using modulo) is applied to assign users to servers. A central service (like Zookeeper or etcd) keeps track of server assignments and user connections. This service helps servers know which user is connected to which server.
- Connection Process:
- When a client connects, it can either connect directly to the appropriate server or randomly connect to any server, which then redirects them to the correct one.
- The server looks up the user's hash in the coordination service to determine the correct server and redirects the client accordingly.
-
Consistent Hashing: Addresses the scaling issues of simple hashing by mapping servers onto a ring. This minimizes the number of re-mappings when servers are added or removed.
- Consistent hashing reduces the dependency on centralized coordination for the hashing process itself.
- Each node in the system can independently calculate the hash ring and determine where data should be stored or retrieved without relying on a central authority.
- However, centralized services like ZooKeeper or etcd may still be needed for cluster membership management, failure detection, and configuration management.
- Cluster Membership Management: While consistent hashing handles data redistribution efficiently, it does not manage cluster membership (e.g., detecting when a node joins, leaves, or fails). A service like ZooKeeper or etcd can provide consensus and maintain an authoritative list of active nodes in the cluster.
- Failure Detection and Recovery: In distributed systems, nodes may fail unpredictably. A coordination service can help detect failures and notify other nodes to update their hash rings accordingly.
- Configuration Management: Centralized services can store shared configurations (e.g., replication factors or backup policies) required by all nodes.
- Alternatively, you can use decentralized approaches like gossip protocol for node discovery and failure detection rather than relying on ZooKeeper.
-
- Direct communication between servers:
- Message Forwarding: When a user sends a message to another user, the sending server must forward that message to the server where the receiving user is connected. This requires direct communication between servers. For example, if User A is connected to Server 1 and sends a message to User C, who is connected to Server 2, Server 1 needs to communicate with Server 2 to deliver the message.
- Decoupling of Servers: Direct communication allows servers to operate more independently, reducing bottlenecks and enhancing scalability. If each server only communicates via a central hub, it could become a performance bottleneck.
- Protocol Choice:: Websocket or gRPC over HTTP/2 is a good choice for inter-server communication, provides low-latency, real-time communication between servers.
-
Pros:
- Decentralized approach that scales well.
- Direct communication between servers, giving extremely low latency.
- No central coordination needed for routing.
- Minimal connection disruption (re-mappings) during scaling.
- Works well with stateful connections.
-
Cons:
- Complex to implement correctly.
- Requires an additional coordination service (like Zookeeper) for node discovery and failure detection.
- Can lead to uneven distribution if the hash function isn't well-designed.
- All servers need to maintain routing information.
-
When to use:
- When persistent connections are needed, the system needs to scale dynamically, and you can tolerate some connection loss during server failures.
- Also suitable when maintaining a lot of state associated with a given connection (e.g., collaborative document editing).
-
Discussion Points in Interview:
- Discuss how updates are routed using a coordination service like Zookeeper or etcd.
- Explain the scaling process:
- Signaling the beginning of a scaling event.
- Recording both the old and new server assignments.
- Slowly disconnecting clients from the old server and having them reconnect to their newly assigned server.
- Signaling the end of the scaling event and updating the coordination service with the new server assignments.
- In the interim, messages are sent to both the old and new server until clients are fully transitioned.
- The mechanics of discovering the initial server assignments is also interesting. There are performance tradeoffs associated with redirecting clients to the correct server or requiring them to do a round-trip to a central server to look up the correct one. Especially during scaling events, any central registration service may become a bottleneck so it's important to discuss the tradeoffs with your interviewer.
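The hash ring described above can be sketched in a few lines. This is an illustrative implementation (server names, virtual-node count, and the MD5 choice are all assumptions, not a specific system's design):

```python
import bisect
import hashlib

# Minimal consistent hash ring: servers are hashed onto a ring (with
# virtual nodes for more even distribution) and each client key routes to
# the next server clockwise. Every node can compute this independently,
# and adding/removing a server only remaps nearby keys.
class HashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []                        # sorted list of (hash, server)
        for s in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{s}#{i}"), s))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def lookup(self, client_id):
        # Find the first virtual node clockwise from the client's hash.
        idx = bisect.bisect(self.keys, self._hash(client_id)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["ws-1", "ws-2", "ws-3"])
server = ring.lookup("user-42")   # every node computes the same answer
```

Because the mapping is a pure function of the server list, any server that knows the current membership (from Zookeeper/etcd or gossip) can route a message to the right peer without a central router.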
Pushing via Pub/Sub
-
How it works:
- Architecture Overview:
- A single Pub/Sub service (e.g., Kafka, Redis) collects updates and broadcasts them to interested clients.
- The pub/sub service becomes the biggest source of state for our realtime updates.
- Endpoint servers act as lightweight intermediaries that subscribe to topics and forward updates to clients.
- Client Connection:
- Clients connect to any available endpoint server instead of a specific one.
- The endpoint server registers the client with the Pub/Sub service by creating a topic, subscribing to it, and keeping a mapping from topics to the connections associated with them.
- Message Sending:
- When an update occurs, it is published to the relevant topic in the Pub/Sub service.
- The Pub/Sub service pushes the update to all endpoint servers that subscribe to the topic, then endpoint servers will forward the message to the appropriate clients over existing connections.
-
Pros:
- Simpler to understand and implement compared with pushing via consistent hashing.
- Decoupled Communication via Pub/Sub system, simplify the architecture.
- Efficient Broadcasting: Capable of broadcasting updates to a large number of clients efficiently.
- Minimal State Spread: Reduces the amount of state information that needs to be managed across the system.
- Load Management: Easy to manage load on endpoint servers using load balancing strategies like "least connections" for the initial incoming connection establishment.
-
Cons:
- Centralized Approach with SPOF: The centralized Pub/Sub service can become a bottleneck and a single point of failure.
- Increased Latency: The additional layer of indirection can introduce latency in message delivery.
- Subscriber State: Difficulty in tracking whether subscribers are connected or when they disconnect.
- At-Most-Once Delivery: Redis Pub/Sub does not guarantee message delivery; if a subscriber is offline or misses a message, it will not be re-delivered. Consider using Redis Streams for guaranteed delivery.
- No Persistence: If a subscriber is offline during message publication, it will miss the message since Redis Pub/Sub does not persist messages. For guaranteed delivery, you can use Redis Streams, which persist messages and allow replay.
- Complex Connections: Many-to-many connections between Pub/Sub service hosts and endpoint servers can complicate the architecture.
-
When to use:
- Ideal For: Scenarios requiring efficient broadcasting to a large number of clients with minimal overhead on endpoint servers. The latency impact is minimal (<10ms).
- Not Ideal For: Applications needing to respond to connect/disconnect events or maintain significant state associated with clients.
-
Discussion Points in Interview:
- Single Point of Failure: Be prepared to discuss the implications of the Pub/Sub service being a single point of failure and potential bottlenecks. Redis cluster is a popular way to scale pub/sub service, which involves sharding subscriptions across multiple hosts.
- Connection Management: Introducing a cluster for the Pub/Sub component means you'll manage the many-to-many connections between the pub/sub service and the endpoint servers (each endpoint server will be connected to all hosts in the cluster). In reality, the number of nodes in the cluster is small.
- Load Balancing: For initial inbound connections to the endpoint servers, you'll probably want to use a load balancer with a "least connections" strategy. This helps distribute the load across the servers in the cluster. Since the connections themselves are effectively the only resource consumed by the endpoint servers, load balancing based on connections is a great way to manage the load. Once a connection is established, it is crucial that clients remain connected to the same endpoint server for the lifetime of their session; this is where sticky sessions (or session affinity) come into play.
Server Realtime Updates Overview
When propagating updates to endpoint servers, consider these options:
- Polling Model:
- The simplest method if feasible.
- Pub/Sub Model:
- A great alternative if polling isn't possible.
- Minimizes state management, simplifies scaling, and efficiently broadcasts updates to many clients.
- Suitable for most applications.
- Consistent Hashing:
- A specialized approach for specific scenarios.
- Ideal when creating deep infrastructure components or when maintaining substantial state associated with a client.
- Useful if your application requires a lot of state/resources for each connection
Stream Processing
Stream processing is a technique for processing data as it flows in, rather than waiting for all the data to arrive. It is a way to process data in real-time, as it is generated.
Apache Spark
- Spark is a distributed unified engine that supports batch and stream processing.
- Spark adopts a micro-batch aggregation strategy, which can introduce latency, making it less suitable for ultra-low latency applications.
- The Spark window duration should be a multiple of the micro-batch interval, which determines the flush interval.
- Spark has a checkpointing mechanism to ensure fault tolerance, which can introduce significant latency due to its disk-based approach.
- Spark has larger user base and community support, and more comprehensive libraries for different types of data processing tasks, including machine learning, graph processing, and SQL.
Apache Flink
- Flink adopts a real-time, event-based aggregation strategy: as soon as data arrives for a specific window, it can be processed immediately and flushed out.
- Flink is a true stream processing engine, it is designed for ultra-low latency and high throughput.
- Flink allows configuring the flush interval independently of the window duration, enabling lower latency by emitting results more frequently.
- Flink supports complex event-driven applications with native support for stateful computations.
- Flink's checkpointing mechanism is more efficient than Spark's, with lower overhead because it uses a hybrid approach: state is managed in memory and periodically snapshotted to disk.
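To make the windowing concept concrete, here is a toy tumbling-window count in plain Python. It illustrates the aggregation both engines perform; real Spark or Flink jobs would express this through their DataFrame/DataStream APIs rather than a dictionary:

```python
from collections import defaultdict

# Toy tumbling-window aggregation: bucket each event into the window that
# contains its event time, then count per (window, key).
def tumbling_window_counts(events, window_seconds):
    # events: iterable of (event_time_seconds, key)
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1, "click"), (3, "click"), (6, "click"), (7, "view")]
print(tumbling_window_counts(events, 5))
# {(0, 'click'): 2, (5, 'click'): 1, (5, 'view'): 1}
```

The engines differ in *when* these counts are emitted: Spark flushes at micro-batch boundaries, while Flink can emit as each event updates its window's state.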
Collaborative Editing Techniques
In collaborative editing, multiple users can edit the same document concurrently. One of the biggest challenges is resolving conflicts in real time while ensuring data consistency. There are several techniques to achieve this, including:
Operational Transformation (OT)
-
OT treats each user's edit as an operation, then transforms these operations so they are consistent with the edits of all other users.
-
The transformation ensures that operations do not conflict with each other and can be applied in any order, with the final result remaining the same.
How it works:
- Editing Operations: When a user makes an edit (e.g., inserting, deleting, or modifying text), the client generates an operation that represents this edit. This operation contains information about the edit, such as the position, content, and metadata. The operation is sent to the server.
- Server Transforming Operations: The server receives the operations from all the users and transforms them in a way that is consistent with the edits of all the users. The server then broadcasts the transformed operations to all the users.
- Client Transforming Operations: Upon receiving the transformed operations, the client needs to transform them again to incorporate the effects of its own local operations.
- Applying Operations: The client applies the transformed operations to its local document.
-
Compaction is a critical process to ensure the system remains performant:
- Each document may contain millions of operations; transferring and transforming the full record for every newly connected client is not efficient.
- Compaction means merging a huge number of operations into as few operations as possible (potentially a single insert operation), saving both disk and memory space.
- To mitigate conflicts with live updates, we can adopt offline compaction, with a dedicated background worker performing compaction periodically. Compacted operations are written into a new version, and the version is flipped once compaction completes.
- Alternatively, the collaboration service can perform online compaction: since it already captures recent operations and is aware of live connections, it can safely compact when the document is idle. Compaction can be offloaded to a separate lower-CPU-priority process, which writes results into a new version and flips the version when completed.
Pros:
- Memory efficient
- Low latency
- Relatively easy to understand
Cons:
- OT requires a central server to coordinate transformations and provide a final ordering of operations, which can become a performance bottleneck or lead to a single point of failure (SPOF)
- It's less scalable than CRDTs.
- Complex to implement correctly.
Real World Use Cases:
- Google Docs
- Collaborative code editors
Conflict Free Replicated Data Types (CRDTs)
- CRDTs represent edits as a set of operations that can be applied in any order while producing the same final result.
- More complex to implement and understand than OT.
Pros:
- More distributed, no need for a central server. Ideal for decentralized (peer-to-peer) applications.
- More scalable than OT, supporting a large number of collaborators.
- More easily adaptable to offline use cases.
Cons:
- Computation and memory intensive
- More complex to implement than OT
Real World Use Cases:
- Trello
- Collaborative drawing applications like Figma, Miro
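As an illustration of the CRDT idea, a grow-only counter (G-Counter) is one of the simplest CRDTs: each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of merge order; this is a sketch, not any particular library's API.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        # Each replica only ever writes its own slot, so increments never conflict.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Merge is commutative, associative, and idempotent: element-wise max.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)   # replicas exchange state in any order...
b.merge(a)
assert a.value() == b.value() == 5  # ...and still converge
```

Because merges commute, no central server is needed to order updates, which is exactly why CRDTs suit decentralized and offline-first applications.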
Video Streaming
Video Elements
- Video Version: A video can have multiple versions with different video resolutions (e.g. 360p, 720p, 1080p)
- Video Segment (Chunk): A video can be divided into multiple ordered segments, each segment is a chunk of video data.
- Segments are typically 2-10 seconds long.
- Segments are designed to be independently playable and are the basic unit of transfer in video streaming.
- During video uploading and downloading, videos are chunked into segments, which are the smallest unit of transfer.
- This allows for resumable and parallel uploads and downloads, as well as partial download and playback.
- GOP (Group of Pictures): Each video segment contains multiple GOPs, arranged in a specific order.
- GOP is a group of frames, arranged in a specific order. Each frame is a single image in video sequence.
- GOP is independently decodable and playable due to the way video codecs are designed.
- GOPs are the fundamental unit during video compression. The GOP structure is designed to organize frames in a way that allows for efficient compression and decompression.
- Thumbnail: Thumbnails are static images extracted from a video that provide a visual preview of the video content. They serve as a visual reference point that helps users identify video content without having to play it.
- Thumbnails are small files, typically under 100KB. For mobile apps, thumbnails are further compressed to under 50KB. The size is affected by the resolution (e.g. 1080p, 360p) and format (e.g. JPEG, PNG) of the thumbnail.
- Often include multiple thumbnails per video at different resolutions
- Thumbnails are typically stored in an optimized image format (in a dedicated database) separate from video segments
- Thumbnails can be cached in CDN for faster loading
- Thumbnails can also be cached in browsers
Video Codecs
A video codec (encoder/decoder) is an algorithm that compresses and decompresses video data to reduce the size of the video file while preserving its quality. Video transcoding encodes the original video into a different format or a compatible bitrate for different devices. Codecs usually trade off on the following:
- Compression ratio (how much the original video is reduced in size)
- Compression Speed (how fast the video is encoded and decoded)
- Video quality (how much quality is lost in the compressed video)
- Platform (what platforms the codec is supported on)
Video Bitrate
The bitrate is the number of bits transmitted over a period of time, typically measured in kbps (kilobits per second) or Mbps (megabits per second). The size and quality of a video affect the bitrate: high-resolution videos have higher bitrates, while low-resolution videos have lower bitrates.
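For instance, the average bitrate of a stored file follows directly from its size and duration (the clip size and length below are illustrative):

```python
def average_bitrate_kbps(file_size_bytes, duration_seconds):
    """Average bitrate in kbps: bytes -> bits, then per-second, then kilo."""
    return file_size_bytes * 8 / 1000 / duration_seconds

# A hypothetical 90 MB, 120-second 1080p clip:
assert average_bitrate_kbps(90_000_000, 120) == 6000.0  # 6 Mbps
```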
Manifest Files
Manifest files are text-based documents that give details about video streams. There are typically two types of manifest files:
- primary manifest: lists all the available versions of a video (e.g. 360p, 720p, 1080p). This is the "root" file that points to the different media manifest files.
- media manifest: lists the links to all the video segments (e.g. video chunks) under a specific video version. It serves as an index for the video segments.
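As a concrete example, an HLS primary manifest (master playlist) pointing at three media manifests might look like this; the bandwidth values and URIs are illustrative:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
```

The player reads this root file once, then fetches the media manifest for whichever rendition fits the current network conditions.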
Video Streaming Protocols
Modern video streaming protocols are designed to be adaptive, where video bitrate is adjusted automatically based on the network conditions and device capabilities. They use the manifest files to determine the available video versions and select the best bitrate segments under the current network conditions. Popular video streaming protocols include:
- HTTP Live Streaming (HLS)
- Dynamic Adaptive Streaming over HTTP (DASH)
- Smooth Streaming
Video segments are the basic unit of transfer in video streaming, which allows the following features:
- Players to switch between different quality versions (like 1080p to 720p) at segment boundaries
- Viewers to seek to different parts of the video without downloading the entire file
- More efficient delivery and caching of video content via CDN
While HLS, DASH, and Smooth Streaming provide a "streaming" experience to the end-user, their underlying mechanism is a clever use of standard HTTP request-response interactions, driven by the client, which periodically polls for manifest updates and requests discrete media segments.
How Video Streaming Works
- The server prepares the video manifest file and video segments
- The client requests the manifest file, containing the list of video segments and their URLs, from the server
- The client then sequentially requests the individual video segments using standard HTTP GET requests.
- As playback progresses, the client may switch between different video versions and bitrates based on the network conditions and device capabilities, and continues requesting the subsequent segments.
- For live streaming, the server continuously generates new video segments as the content is being recorded, and the client periodically re-requests the manifest file (or an updated version of it) to discover new segments as they become available. This re-requesting of the manifest and subsequent new segments is a form of polling, potentially long-polling.
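The quality-switching step above boils down to picking the highest rendition whose bitrate fits the measured throughput. The sketch below is a simplification with an assumed safety factor; real players also weigh buffer level and device capabilities.

```python
def select_rendition(renditions_kbps, measured_throughput_kbps, safety_factor=0.8):
    """Pick the highest bitrate that fits within a fraction of measured throughput.

    Falls back to the lowest rendition when even that exceeds the budget.
    """
    budget = measured_throughput_kbps * safety_factor
    affordable = [r for r in sorted(renditions_kbps) if r <= budget]
    return affordable[-1] if affordable else min(renditions_kbps)

renditions = [800, 2800, 5000]  # e.g. 360p, 720p, 1080p in kbps
assert select_rendition(renditions, 4000) == 2800   # 4000*0.8=3200 -> 720p fits
assert select_rendition(renditions, 10000) == 5000  # ample bandwidth -> 1080p
assert select_rendition(renditions, 500) == 800     # fall back to lowest
```

The safety factor leaves headroom so that throughput fluctuations do not immediately stall playback.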
Unique ID Generation
UUID
- UUID is an easy way to generate unique IDs; no coordination is needed between servers
- A UUID is a 128-bit value, which is 16 bytes
- UUIDs are not numeric, so they are not suitable for sorting and range queries
- UUIDs do not carry timestamp information, so they do not increase over time
- UUIDs have a very low probability of collision
Snowflake
- Snowflake is a distributed unique ID generator that is commonly used in distributed systems
- It divides the 64-bit value into 4 parts:
- 1 bit for sign
- 41 bits for timestamp (milliseconds since a custom epoch), which covers around 69 years
- 10 bits for machine ID (which can be further divided into datacenter ID and machine ID), giving 1024 unique machines
- 12 bits for sequence number, allowing a maximum of 4096 IDs to be generated per millisecond per machine. The number is reset to 0 every millisecond
- The length of different parts can be adjusted based on the needs of the system.
- Snowflake has timestamp information in the front, which is suitable for sorting and range queries by time
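The bit layout described above can be sketched as follows. The epoch value is illustrative (Twitter's original epoch), and a production generator would also need to handle clock drift and persistence across restarts.

```python
import time

class Snowflake:
    """Sketch of a Snowflake-style 64-bit ID:
    41-bit timestamp | 10-bit machine ID | 12-bit sequence."""

    EPOCH_MS = 1_288_834_974_657  # custom epoch (Twitter's, for illustration)

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024      # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self):
        now = int(time.time() * 1000)
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
            if self.sequence == 0:                        # exhausted: wait for next ms
                while now <= self.last_ms:
                    now = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - self.EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

gen = Snowflake(machine_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs are monotonically increasing per machine
```

Because the timestamp occupies the high bits, sorting IDs numerically also sorts them by creation time.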
Short URL
- Short URL is a way to represent a long URL in a shorter format
- By applying base62 encoding to the unique ID, we get a compacted string representation.
- For 1 billion IDs, the base62 short URL is only 6 characters long.
- For over 3 trillion IDs, the base62 short URL is only 7 characters long.
- For the original ID generation, we can use the same algorithm as Snowflake, within each machine
- Use a database self-incrementing ID as the unique ID
- Use the Redis `INCR` command to generate the unique ID
- For high availability, we can use two Redis instances to generate odd and even IDs respectively
- To reduce the load on the instance, we can use a technique called "counter batching", where the Redis `INCRBY` command reserves multiple IDs at once.
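The base62 step can be sketched as below; the alphabet ordering (digits, then lowercase, then uppercase) is one common convention, and services may choose a different one.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n):
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def base62_decode(s):
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

assert base62_decode(base62_encode(123_456_789)) == 123_456_789
# 62^5 ≈ 916M < 1 billion < 62^6 ≈ 56.8B, so 1 billion needs 6 characters:
assert len(base62_encode(1_000_000_000)) == 6
```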
Consistent Hashing
To achieve horizontal scaling, it is important to distribute the request or data evenly across multiple servers. Consistent hashing is a commonly used technique to achieve this goal.
Problems with Modulo Hashing
If you have N servers, a simple way to balance the load is to use modulo N to hash the key to one of the servers.
serverIndex = hash(key) % N
- This approach works well when the number of servers is fixed.
- However, when a server is added or removed, most keys need to be redistributed due to the change in modulo N. This is quite expensive.
- This is also not cache friendly, which may cause a storm of cache misses and degrade the performance of the system.
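The redistribution cost is easy to demonstrate: on average only about 1/(N+1) of keys keep their server when growing from N to N+1 servers. The sketch below uses MD5 as a stand-in hash so the result is deterministic across runs.

```python
import hashlib

def modulo_server(key, n):
    """Hash a key to one of n servers via simple modulo hashing."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"key-{i}" for i in range(10_000)]
# Count keys that change servers when growing from 4 to 5 servers:
moved = sum(1 for k in keys if modulo_server(k, 4) != modulo_server(k, 5))
# Roughly 80% of keys are remapped, illustrating why this is so expensive.
assert moved > len(keys) * 0.7
```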
How Consistent Hashing Works
Consistent hashing is meant to minimize data redistribution when a server is added or removed.
- Assume we have a hash function like `SHA-1`; by connecting both ends of the hash value range, we get a hash ring.
- We can map a server to the hash ring by hashing the server's unique identifier.
- We can map the key to the hash ring by using the same hash function.
- The key will be assigned to the server that is the closest to it in the clockwise direction.
- Any server that knows the hash function can calculate which server should handle a given key without requiring a central mapping. This is also one of the major advantages of consistent hashing.
- By using efficient data structure like a binary search tree or a sorted array with binary search to find the appropriate server on the hash ring, we can achieve O(log N) time complexity for key assignment and server lookup.
- When adding or removing a server, only a small portion of the keys (those that fall into the range between the old server and the new server) require redistribution.
- To mitigate the impact of hash ring imbalance or hot spot, we can use a technique called "virtual nodes".
- Each server is represented by multiple virtual nodes on the hash ring.
- Different servers may have different numbers of virtual nodes due to the heterogeneity of the servers.
- Each server will end up being responsible for multiple partitions (or key ranges) of the hash ring.
- To further mitigate the hot spot problem, where traffic for a specific key is significantly larger than for others:
- E.g. edits to a popular collaborative document, or comments on a popular video stream
- Adopt the hot-write key sharding approach with suffixes (docId-1, docId-2) to shard the traffic among multiple servers
- For those servers to exchange data or information, in order to deliver the full record to their connected clients:
- Leverage a centralized Pub/Sub system to subscribe to the events published by other servers
- Or enable direct communication between servers via WebSocket or gRPC
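A hash ring with virtual nodes can be sketched with a sorted array and binary search (Python's `bisect`), giving the O(log N) lookup described above; MD5 stands in for the ring hash function, and the virtual-node count is illustrative.

```python
import bisect
import hashlib

def _hash(key):
    """Map a key onto the ring using MD5 (stand-in for SHA-1 etc.)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hash ring with virtual nodes; lookup via binary search."""

    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                # Each server appears at many positions, smoothing the load.
                self.ring.append((_hash(f"{server}#{i}"), server))
        self.ring.sort()

    def get_server(self, key):
        h = _hash(key)
        # First virtual node clockwise from the key's position (wrap around).
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["s1", "s2", "s3"])
# Any node with the same hash function computes the same mapping:
assert ring.get_server("doc-42") == HashRing(["s1", "s2", "s3"]).get_server("doc-42")
```

Removing a server only reassigns the keys that fell on its virtual nodes; every other key keeps its server, which is the property modulo hashing lacks.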
Consistent Hashing in Practice
- Used in load balancers to distribute traffic evenly across the servers.
- Used in database sharding or partitioning to distribute data evenly across the shards or partitions.
- Used in distributed systems like Redis and Kafka to distribute data evenly across the nodes.