System Design Key Technologies
A beginner-friendly guide to the fundamental technologies commonly used in system design.
API Design
Security
From a security perspective, it's often best to avoid passing sensitive information like user IDs in the request body. Instead, pass them as headers.
- Exposure Risk: Request bodies can be logged or intercepted, and sensitive information like user IDs could be exposed.
- Prevention of Manipulation: If user IDs are passed in the request body, there's a risk that malicious users could manipulate them.
- Mitigation: It's more secure to use authentication tokens (like JWT) or session IDs in headers that can securely identify the user. The server then validates the token and extracts the userId from it, as sketched below.
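As a minimal sketch of this server-side validation, the snippet below uses the PyJWT library to verify an HS256-signed token from the Authorization header; the secret and the `userId` claim name are illustrative assumptions, not a prescribed scheme.

```python
# Minimal sketch: extracting the user ID from a JWT in the Authorization
# header using the PyJWT library. The secret, algorithm, and "userId"
# claim name are illustrative assumptions.
import jwt  # pip install PyJWT

SECRET = "replace-with-a-real-secret"

def get_user_id(headers: dict) -> str:
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise PermissionError("missing bearer token")
    token = auth[len("Bearer "):]
    # Raises jwt.InvalidTokenError if the signature is bad or the token expired.
    payload = jwt.decode(token, SECRET, algorithms=["HS256"])
    return payload["userId"]
```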
Idempotent APIs
Idempotent APIs are designed so that multiple identical requests have the same effect as a single request. Idempotency plays a critical role in building reliable and robust web services: it makes a system easier to maintain and use, and gives clients predictable outcomes even in the face of errors, retries, or network issues.
Key Characteristics of Idempotent APIs
- Same Effect: Multiple identical requests should have the same effect as a single request.
- Repeatability: Operations can safely be repeated without unintended side effects.
- Consistent Outcome: The outcome of the operation is the same regardless of how many times it is repeated.
Examples of Idempotent APIs
- GET Requests: Fetching a resource does not change the server state.
- PUT Requests: Updating a resource to a specific state; repeated calls with the same data do not change the outcome.
- DELETE Requests: Deleting a resource; repeated calls to delete the same resource yield the same final state (the resource is gone).
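GET, PUT, and DELETE are naturally idempotent, but POST is not; a common remedy is a client-supplied idempotency key. Below is a minimal sketch of that idea, with an in-memory dict standing in for a durable store:

```python
# Sketch: making a POST-style operation idempotent with a client-supplied
# idempotency key. The in-memory dict stands in for a durable store.
results: dict[str, dict] = {}

def create_payment(idempotency_key: str, amount: int) -> dict:
    # A replayed request (same key) returns the stored result instead of
    # charging the customer twice.
    if idempotency_key in results:
        return results[idempotency_key]
    payment = {"id": idempotency_key, "amount": amount, "status": "charged"}
    results[idempotency_key] = payment
    return payment

assert create_payment("key-1", 100) == create_payment("key-1", 100)
```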
Pagination
API pagination is a technique used to divide a large set of data into smaller, manageable chunks or "pages" that can be retrieved sequentially. This approach is essential for optimizing performance, reducing load times, and improving the user experience when dealing with large datasets.
Types of Pagination
- Offset-Based Pagination: This method uses an offset (or page number) and a limit (number of items per page) to retrieve data. For example, if you want to access the second page of results with 10 items per page, you would use an offset of 10 (i.e., starting after the first 10 results).
Example:
GET /api/items?offset=10&limit=10
- Pros:
- Simplicity: Easy to implement and understand.
- Cons:
- Performance Issue: As the offset increases, query performance can degrade, especially in large datasets, since the database has to scan through all rows preceding the offset and discard them on each query.
- Result Inconsistency: If the underlying data changes while paginating (e.g., new items are inserted or deleted), results can be inconsistent (e.g., items might be missed or duplicated).
- Cursor-Based Pagination: A better approach is to use cursor pagination. The cursor is a unique identifier for a specific position in the dataset. For example, if you want to access the second page of results, you would use a cursor that points to the position after the first page.
Example:
GET /api/items?cursor={last_item_id}&limit=10
- Pros:
- Performance: Generally performs better with large datasets, as it does not require the database to count or skip records, since the cursor points to the specific position in the dataset (assuming we built an index on the cursor field).
- Result Consistency: More resilient to changes in the underlying dataset, since results are always returned relative to the cursor (the last item fetched).
- Cons:
- Complexity: Implementation can be more complex.
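To make the difference concrete, here is a small cursor-pagination sketch using Python's built-in sqlite3; the `items` table and the id-based cursor are illustrative assumptions, and the cursor column (here, the primary key) is what carries the index.

```python
# Sketch of cursor-based pagination using Python's built-in sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO items (name) VALUES (?)",
                [(f"item-{i}",) for i in range(25)])

def get_page(cursor, limit=10):
    # WHERE id > ? seeks straight to the cursor position instead of
    # scanning and discarding OFFSET rows.
    rows = con.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (cursor or 0, limit),
    ).fetchall()
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor

page1, cur = get_page(None)   # first 10 items
page2, cur = get_page(cur)    # next 10, stable even if earlier rows change
```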
Database
Interview Tips
Avoid comparing relational and NoSQL databases directly in your interview responses, as this may indicate a lack of experience. Instead of broad statements about why to choose one over the other, concentrate on the specific database you're familiar with and how it addresses the problem at hand. If comparisons are necessary, focus on the differences relevant to your experience and their impact on your design. A strong statement might include specifics, such as highlighting the ACID properties of Postgres for data integrity.
For interviews, it’s best to choose a specific database type to focus on. If you're preparing for product design interviews, opt for a relational database (like Postgres). If you're preparing for infrastructure design interviews, choose a NoSQL database (such as DynamoDB).
Relational Databases
Key Features
- SQL Joins:
- Combine data from multiple tables, enabling complex queries.
- Joins can be a performance bottleneck, so they should be minimized when possible.
- Indexes:
- Improve query performance by allowing faster data retrieval.
- Common implementation: B-Trees or Hash Tables
- Support for multiple indexes, including multi-column and specialized types (e.g., geospatial, full-text)
- Indexes come with additional storage costs and write-performance overhead
- Transactions:
- Group multiple operations into a single atomic operation, ensuring data integrity.
- ACID compliant to ensure data integrity and reliability.
- Transactions introduce inherent overhead due to locking mechanisms, isolation management, and transaction logging (a minimal sketch follows below).
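As referenced above, here is a minimal transaction sketch using Python's built-in sqlite3 for illustration; the same pattern applies to Postgres via a driver such as psycopg. The `accounts` table is an illustrative assumption.

```python
# Minimal sketch: two updates grouped into one atomic transaction.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # both updates were rolled back together; no partial transfer
```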
Pros & Cons
Pros
- Data Integrity: Strong ACID (Atomicity, Consistency, Isolation, Durability) properties ensure reliable transactions.
- Structured Data: Well-defined schema allows for clear organization and relationships.
- Powerful Querying: SQL provides a robust language for complex queries and data manipulation.
- Mature Ecosystem: Extensive tools and community support for popular RDBMS like Postgres and MySQL.
Cons
- Scalability Limitations: Can struggle with horizontal scaling compared to NoSQL databases.
- Rigid Schema: Changes to the database schema can be complex and time-consuming.
- Performance Bottlenecks: Joins and complex queries can lead to performance issues, especially with large datasets.
NoSQL Databases
Key Features
- Flexible Schema: Support for various data structures without a fixed schema, allowing for easy adaptation to changing data requirements.
- Scalability: Designed to scale horizontally across many servers, accommodating large amounts of data and high traffic loads through techniques like sharding and consistent hashing.
- Diverse Consistency Models: Offer a range of consistency options, from strong consistency (ensuring all nodes have the same data at the same time) to eventual consistency (where all nodes will eventually converge on the same data).
- Indexing: Support for indexing (e.g., B-Tree, Hash Table) to enhance query performance, similar to relational databases.
- Variety of Data Models:
- Key-Value Stores: Fast access and simple data retrieval (e.g., Redis, DynamoDB; see the sketch after this list).
- Document Stores: Flexible and schema-less, ideal for JSON-like data (e.g., MongoDB).
- Column-Family Stores: Optimized for high write performance and scalability (e.g., Cassandra).
- Graph Databases: Efficiently manage and query relationships between data points (e.g., Neo4j).
- Time-Series Databases: Store data indexed by time, allowing efficient time-range queries; ideal for applications monitoring metrics and events over time (e.g., Prometheus).
- Geospatial Databases: Designed to store and query spatial data, with support for geographic information systems (GIS), enabling location-based queries and analysis (e.g., PostGIS, an extension for PostgreSQL).
- Search Engines: Optimized for full-text search capabilities and complex search queries, allowing users to index and retrieve data efficiently based on text patterns. (e.g., Elasticsearch).
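As a concrete example of the key-value model referenced above, the sketch below uses boto3 with DynamoDB; the `users` table (partition key `user_id`) is assumed to already exist, and credentials are assumed to come from the environment.

```python
# Sketch of key-value access against DynamoDB with boto3.
import boto3

table = boto3.resource("dynamodb").Table("users")

# Write: attributes beyond the key are schemaless.
table.put_item(Item={"user_id": "u123", "name": "Alice", "plan": "pro"})

# Read by primary key: fast lookups regardless of table size.
item = table.get_item(Key={"user_id": "u123"}).get("Item")
```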
Pros & Cons
Pros
- Flexibility: Easily accommodates varying data types and structures without the need for a predefined schema.
- High Scalability: Capable of handling large-scale applications with high throughput and low latency.
- Performance: Optimized for specific use cases, such as write-heavy workloads or real-time analytics.
- Diverse Use Cases: Suitable for applications dealing with big data, real-time web apps, and evolving data models.
Cons
- Consistency Trade-offs: Eventual consistency may lead to stale reads, which can be problematic for certain applications requiring real-time accuracy.
- Limited ACID Transactions: Many NoSQL databases do not fully support ACID transactions, which can be a drawback for applications needing strong transactional guarantees.
Blob Storage
Blob storage is a service designed for storing large, unstructured data blobs such as images, videos, and files. It is more cost-effective and efficient than traditional databases for handling these types of data. Popular services such as Amazon S3, Google Cloud Storage, and Azure Blob Storage allow users to upload blobs and retrieve them via URLs, and often integrate with Content Delivery Networks (CDNs) to enable fast global access.
Key Patterns
- Use Case: Blob storage is ideal for applications like YouTube (videos), Instagram (images), and Dropbox (files), where metadata is stored in a core database while the actual blobs are stored in blob storage.
- Architecture: Typically involves a core database (e.g., Postgres, DynamoDB) that stores metadata and URLs pointing to the blobs in blob storage.
Upload/Download Process
- Upload:
- The user issues an upload request to the server.
- The server registers the upload request with status `pending` in the database and returns a presigned URL to the user.
- The user uploads the data to blob storage using the presigned URL.
- The blob storage triggers a notification event to the server, which updates the record's status to `completed`.
- Download:
- The user issues a download request to the server.
- The server returns a presigned URL to the user.
- The user uses the presigned URL to download the data via CDN, which proxies the request to the underlying blob storage.
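A minimal sketch of generating the presigned URLs used in both flows, via boto3; the bucket and object key are illustrative, and the URLs are signed with the server's credentials rather than the end user's.

```python
# Sketch: presigned upload and download URLs for S3 with boto3.
import boto3

s3 = boto3.client("s3")

# Upload: the client PUTs the file directly to S3 before the URL expires.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-bucket", "Key": "videos/v1.mp4"},
    ExpiresIn=3600,  # seconds
)

# Download: same mechanism with get_object.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "videos/v1.mp4"},
    ExpiresIn=3600,
)
```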
Key Features
- Durability: Blob storage services are designed to be extremely durable (S3, for example, advertises 99.999999999% durability). They ensure data safety through replication and erasure coding.
- Scalability: Services like AWS S3 are highly scalable, capable of handling unlimited data and requests.
- Cost-Effectiveness: Much cheaper than traditional databases for large objects (e.g., AWS S3 standard storage costs only a few cents per GB-month, far less than storing the same data in a database like DynamoDB).
- Security: Built-in encryption at rest and in transit, along with access control features, protects data.
- Direct Client Interaction: Clients can upload and download files directly using presigned URLs. Presigned URLs are temporary URLs signed with the credentials of the entity that creates them (typically the server), allowing controlled upload or download access. When a presigned URL is created, the authentication information is included in the query string, enabling access to otherwise private objects. This is useful for applications that need to store and retrieve large blobs of data, like images or videos.
- Chunking: When uploading large files, it's common to use chunking to upload the file in smaller pieces. This allows for resumable uploads, where an interrupted upload can continue without starting from the beginning, as well as parallel uploads. S3 supports this out of the box with multipart uploads (see the sketch after this list).
- CDN: Integrates with Content Delivery Networks (CDNs) for fast global access; the actual download is served from CDN edge locations around the world that cache the file.
- Versioning: Supports versioning of files, allowing for easy rollback to previous versions.
- Lifecycle Management: Supports lifecycle management policies to automatically transition files between storage classes, reducing costs.
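As referenced in the Chunking item, here is a sketch of a multipart upload with boto3; the bucket, key, file name, and 8 MB part size are illustrative (S3 requires every part except the last to be at least 5 MB).

```python
# Sketch of a chunked upload using S3 multipart uploads via boto3.
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "videos/big.mp4"

mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
parts, part_number = [], 1
with open("big.mp4", "rb") as f:
    while chunk := f.read(8 * 1024 * 1024):
        resp = s3.upload_part(Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
                              PartNumber=part_number, Body=chunk)
        parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
        part_number += 1

# Parts can be retried or uploaded in parallel; completion stitches them together.
s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```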
Message Queues
Message queues are data structures that act as buffers for managing bursty traffic and distributing workloads across systems. They allow a producer (such as a compute resource) to send messages and forget about them, while a pool of workers processes these messages at their own pace. This mechanism helps smooth out system loads and decouples the producer from the consumer, enabling independent scaling.
Key Functions
- Buffer for Bursty Traffic: Queues can handle sudden spikes in requests without dropping messages.
- Distribute Work: Queues distribute tasks among worker nodes, ensuring efficient resource utilization.
- Backpressure: This mechanism prevents overwhelming the system by slowing down message production when the queue is full, helping to avoid bottlenecks. It may not be provided by queue services out of the box, but it can be implemented at the application level.
Common Queue Technologies
- SQS (Simple Queue Service): A fully managed queue service provided by AWS, designed for ease of use and integration with other AWS services.
Key Features:
- Scalability: SQS can automatically scale to handle an increasing number of messages, accommodating high volumes of traffic without pre-provisioning resources.
- Message Retention: Messages can be retained in the queue for a configurable period (from a few minutes up to 14 days), allowing consumers to process messages at their own pace.
- FIFO Queues: SQS provides FIFO (First-In-First-Out) queues to guarantee the order of message delivery, which is critical in scenarios where the order of messages matters.
- Message Visibility Timeout: After a message is retrieved from a queue, it becomes invisible to other consumers for a defined period. This ensures that while the consumer is processing the message, no other consumer can see or retrieve it. If processing succeeds, the consumer deletes the message from the queue via the `DeleteMessage` API. If processing fails, the message becomes visible again after the visibility timeout expires.
- At-Least-Once Delivery: SQS guarantees that each message is delivered at least once, so no message is lost (though duplicates are possible, so consumers should be idempotent).
- Manage Long-Running Tasks: Set an appropriate visibility timeout for messages that require extended processing time. If the consumer needs more time to process a message, it can use the `ChangeMessageVisibility` API to extend the visibility timeout.
- Delay Queues: Delay queues let you postpone the delivery of new messages to consumers for a specified period, from 0 seconds to 15 minutes. This delay is set at the queue level and applies to all messages sent to the queue. Useful for scheduling tasks.
- Delayed Messages: To set different delay times for individual messages, use the `DelaySeconds` parameter when sending a message. It allows a delay of 0 to 900 seconds (15 minutes) for that specific message. Useful for scheduling tasks.
- Dead Letter Queues (DLQ): SQS supports DLQs, where messages that cannot be processed after a specified number of attempts are sent. This is useful for debugging and handling failed messages without losing them.
- Batched Operations: SQS allows sending, receiving, and deleting messages in batches, which improves efficiency and reduces costs by minimizing API calls.
- Long Polling: SQS supports long polling, which reduces the number of empty responses and helps consumers receive messages more efficiently by waiting for messages to arrive instead of continuously polling (see the consumer sketch after this list).
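Putting several of these features together, here is a sketch of a worker loop using boto3; the queue URL and the `process()` handler are illustrative assumptions.

```python
# Sketch of an SQS worker combining long polling, the visibility timeout,
# and batched receives via boto3.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

def process(body: str) -> None:
    print("processing", body)  # stand-in for real work

while True:
    # Long polling: wait up to 20s for messages instead of busy-polling.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                               MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        try:
            process(msg["Body"])
            # Delete only after success; otherwise the message reappears
            # once the visibility timeout expires and is retried.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
        except Exception:
            pass  # leave the message to become visible again
```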
Challenges
- Complexity: Introducing a message queue adds architectural complexity, requiring developers to understand and manage an additional system and its configuration.
- Latency: While message queues enable asynchronous processing, they also introduce latency, as messages are placed in the queue and retrieved later by consumers.
Event Streams & Event Sourcing
Streams are used for processing large amounts of data in real-time and supporting complex processing scenarios, such as event sourcing. Event sourcing involves storing changes in application state as a sequence of events, allowing for state reconstruction, detailed auditing, and transaction replay.
Streams are essential for real-time data processing and event-driven architectures, providing flexibility, scalability, and fault tolerance in modern applications. Kafka is particularly prominent in system design discussions due to its robust features and widespread use.
Use Cases of Event Streams
- Real-Time Data Processing: Streams are ideal for applications that require immediate processing of high-volume data, like a social media platform needing real-time analytics of user interactions (likes, comments, shares). Stream processing systems (e.g., Apache Flink, Spark Streaming) can handle these events efficiently.
- Event Sourcing: In systems like banking, where every transaction must be recorded, streams enable event sourcing. Each transaction is treated as an event that can be stored, processed, and replayed, allowing for real-time processing, auditing, and state reconstruction of accounts.
- Multiple Consumers: Streams support multiple consumers reading from the same data source simultaneously. For example, in a real-time chat application, messages sent to a stream are distributed to all participants, facilitating instant communication through a publish-subscribe pattern.
Common Stream Technologies
- Kafka: A distributed streaming platform that can function as a queue, known for its scalability and complex ordering capabilities (see the sketch after this feature list).
Key Features:
- High Throughput: Kafka is designed to handle high volumes of data with minimal latency, allowing for the processing of millions of messages per second.
- Durability: Messages in Kafka are persisted on disk and can be replicated across multiple brokers, ensuring data durability and fault tolerance.
- Scalability: Kafka can easily scale horizontally by adding more brokers to the cluster. Topics can be partitioned, allowing for parallel processing of messages.
- Event Replay: Consumers can reprocess events by simply reading the event stream from a specific offset, facilitating debugging, data recovery, and processing of historical data.
- Consumer Groups: Kafka allows multiple consumers to work as part of a consumer group, enabling load balancing. Each message is consumed by only one consumer within a group, allowing for scalable message processing.
- Message Retention: Kafka retains messages for a configurable amount of time, allowing consumers to read messages at their own pace, even if they fall behind.
- Stream Processing: Kafka provides support for stream processing through Kafka Streams, which allows for powerful, real-time processing of data streams.
- Message Partitioning: Kafka topics can be divided into partitions, which enable parallel processing and help maintain order within a partition. Each partition can be hosted on different brokers.
- Idempotent Producer: Kafka supports idempotent producers, which ensures that messages are not duplicated in the event of a producer failure.
- Delivery Semantics: Kafka supports at-least-once and exactly-once delivery semantics, ensuring that messages are not lost or duplicated.
- Schema Registry: While not part of Kafka core, the Kafka ecosystem often includes a Schema Registry to manage message schemas, ensuring compatibility and preventing issues with data format changes.
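As referenced above, a minimal producer/consumer sketch using the kafka-python client; the broker address, topic, and group id are illustrative. Messages with the same key land on the same partition, which is what preserves per-key ordering.

```python
# Sketch: Kafka producer and consumer with kafka-python.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", key=b"user-123", value=b'{"action": "like"}')
producer.flush()

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumers in a group split the partitions
    auto_offset_reset="earliest",  # replay from the start if no committed offset
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```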
Challenges
- Complexity of Cluster Management: Setting up and managing a Kafka cluster can be complex, requiring expertise in distributed systems, configuration tuning, and ongoing maintenance.
- Operational Overhead: Running a Kafka cluster requires operational vigilance, including monitoring metrics such as throughput, latency, and disk usage, which can add to the maintenance burden.
Distributed Caching
A distributed cache is a system that stores data in memory across multiple servers, helping to scale applications and reduce latency. It is particularly useful for storing data that is expensive to compute or retrieve from a database.
Benefits
- Reduced Latency: Frequently accessed data can be retrieved from memory much faster than from disk or a remote database, significantly decreasing response times for applications.
- Increased Scalability: Distributed caches can easily scale horizontally by adding more nodes, allowing systems to handle increasing loads and support more concurrent users efficiently.
- Offloading Backend Databases: By caching frequently accessed data, the load on primary databases is reduced, leading to lower resource consumption and improved overall performance.
- Increased Performance: Applications can handle more requests per second due to faster read access, improving user experience and responsiveness.
- Data Locality: Distributed caches can be located closer to application servers, reducing network latency and speeding up data access.
Challenges
- Consistency Issues: Maintaining consistency between the cache and the underlying data source can be difficult, especially in systems with frequent updates.
- Cache Invalidation Complexity: Implementing effective cache invalidation strategies is complex and may lead to stale data being served if not managed properly.
- Data Loss Risks: If the distributed cache does not persist data to disk or has replication issues, there is a risk of data loss in case of node failures.
- Network Overhead: Distributed caches rely on the network to communicate between nodes and with applications, introducing potential latency and vulnerabilities to network issues.
- Increased System Complexity: Adding a distributed cache increases overall system architecture complexity, requiring additional management and monitoring.
- Operational Overhead: Managing a distributed cache infrastructure requires additional resources for setup, configuration, monitoring, and maintenance, which can increase operational costs.
- Eviction Policies: The effectiveness of caching can be impacted by how eviction policies are managed.
Key Concepts
- Eviction Policy: Determines which items are removed when the cache is full. Common policies include:
- Least Recently Used (LRU): Evicts the least recently accessed items. The most commonly used policy, and memory-efficient.
- First In, First Out (FIFO): Evicts items in the order they were added.
- Least Frequently Used (LFU): Removes items that are least frequently accessed.
- Cache Patterns (a minimal cache-aside sketch appears after this Key Concepts list):
- Cache-Aside (Lazy Loading): The application explicitly retrieves data from the cache. If the data is not found (cache miss), it fetches it from the underlying data store and populates the cache for future requests.
- Use Case: Ideal for read-heavy applications like e-commerce sites where product details are requested frequently but change infrequently.
- Pros:
- Reduces load on the database by caching only the necessary data.
- Simple implementation; the cache is populated only when needed.
- Cons:
- Cache may become stale if not updated frequently.
- Initial read requests may incur higher latency due to the need to fetch from the database.
- Write-Through Cache: The application writes data to both the cache and the data store simultaneously, ensuring that both are always in sync.
- Use Case: Suitable for applications requiring strong consistency, such as financial systems where real-time data accuracy is critical.
- Pros:
- Ensures that the cache and data store are always consistent.
- Reduces the risk of stale data in the cache.
- Cons:
- Higher latency due to the additional write operation.
- Can lead to performance bottlenecks during write-heavy periods.
- Cache Invalidation: The application invalidates the cache entry when the underlying data store is updated, ensuring that stale data is promptly removed when changes occur.
- Use Case: Suitable for applications where the data store is updated frequently, such as a social media platform where new posts are constantly being added.
- Pros:
- Helps maintain the accuracy and relevance of the data in the cache.
- Reduces the risk of stale data in the cache.
- Cons:
- Complexity in managing invalidation strategies and synchronization.
- Can lead to performance bottlenecks during write-heavy periods.
- Data Structure: Be explicit about the data stored in the cache and the data structure used (e.g., sorted sets for lists of events) to optimize retrieval and processing.
- Hashes: Used for storing key-value pairs, allowing for efficient retrieval of specific data points.
- Sorted Sets: In Redis, a sorted set is a data structure that stores elements associated with a score, keeping the elements ordered by that score.
Use Cases:
- Leaderboard: Create leaderboards where each score represents a user's points or ranking. The sorted set keeps the scores in order, allowing quick retrieval of the top scorers.
- Time-Series Data: Store time-series data in sorted order by timestamp. Allows for efficient range queries, such as retrieving data points within a certain time frame.
- Priority Queue: Sorted sets can function as priority queues where items with higher scores (priority) are served before those with lower scores.
- Geospatial Indexing: Store geographic coordinates in sorted order by geohash. Allows for efficient range queries and proximity searches, such as retrieving all points within a certain distance of a specified location.
Inner Implementation:
- A hash table maps each element to its score.
- A skip list stores the elements in sorted order by score.
- Skip lists offer average-case O(log N) time complexity for search, insert, and delete operations, making ordered retrieval efficient.
- For range queries, the time complexity is O(log N + M), where N is the number of elements in the sorted set and M is the number of elements in the requested range.
- The skip list consists of a base sorted linked list plus multiple levels of indexes; each higher level is built by skipping every other node in the level below.
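As referenced under Cache Patterns, here is a minimal cache-aside sketch with redis-py; the key naming, 5-minute TTL, and `fetch_user_from_db` helper are illustrative assumptions.

```python
# Sketch: cache-aside (lazy loading) with Redis.
import json
import redis

r = redis.Redis()

def fetch_user_from_db(user_id: str) -> dict:
    return {"id": user_id, "name": "Alice"}  # stand-in for a real query

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached:                           # cache hit
        return json.loads(cached)
    user = fetch_user_from_db(user_id)   # cache miss: go to the database
    r.setex(key, 300, json.dumps(user))  # populate with a TTL to limit staleness
    return user
```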
Common Solutions
- Redis: A popular in-memory data structure store that supports various data structures and provides a rich set of commands for data manipulation.
- Memcached: A simple key-value store that primarily supports strings and binary objects.
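As an example of the sorted-set use cases discussed above, a small leaderboard sketch with redis-py; key and member names are illustrative.

```python
# Sketch: leaderboard on a Redis sorted set.
import redis

r = redis.Redis()
r.zadd("leaderboard", {"alice": 120, "bob": 95, "carol": 150})
r.zincrby("leaderboard", 10, "bob")  # bob scores 10 more points

# Top 3 players, highest score first: O(log N + M) thanks to the skip list.
top = r.zrevrange("leaderboard", 0, 2, withscores=True)
```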
CDN & Edge Caching
A Content Delivery Network (CDN) is a system of distributed servers designed to deliver content to users based on their geographic location, improving load times and user experience. CDNs cache content, such as static files (images, videos, HTML) and dynamic content (like API responses), on servers closer to users. When a user requests content, the CDN serves it from the nearest server if available; if not, it retrieves it from the origin server, caches it, and then delivers it.
Edge caching is a technique that caches content at the "edge" of the network, on servers located geographically closer to end users. It is a fundamental technique used in CDNs; however, CDNs encompass more than edge caching, including routing algorithms, encryption, load balancing, and security features.
Key Points
- CDNs enhance performance by reducing latency and improving load times for global users.
- They cache not only static assets but also dynamic content that changes infrequently, such as blog posts.
- CDNs can cache API responses, alleviating server load and boosting API performance.
- Eviction policies manage cached content, determining when to remove items based on rules like time-to-live (TTL) or content changes.
Challenges
- Operational Overhead: Managing a CDN infrastructure requires ongoing maintenance, including monitoring, updating, and scaling the network as needed.
- Cache Invalidation: CDNs must manage cache invalidation to ensure that stale data is not served to users. This can be complex and may lead to trade-offs between consistency and performance.
- Cost: CDNs can be expensive, especially for high-traffic applications, due to the need for a global network of servers and the associated operational costs.
Distributed Lock
Distributed locks are mechanisms used to temporarily lock resources across different systems or processes, ensuring that only one process can access a resource at a time. They are particularly useful in scenarios where multiple users or systems might try to access the same resource simultaneously, such as in ticket sales or e-commerce.
Traditional databases with ACID properties use transaction locks to keep data consistent, which is great, but they're not designed for longer-term locking. This is where distributed locks come in handy.
Typically implemented using distributed key-value stores like Redis or Zookeeper, distributed locks utilize atomic operations to ensure that a resource can only be locked by one process at a time. For example, setting a key (e.g., ticket-123) to a "locked" state prevents other processes from acquiring the same lock until it is released.
Key Concepts
- Locking Mechanisms: Ensure that only one process can acquire the lock at a time. Be familiar with implementations like Redlock, which uses multiple Redis instances for safety and consistency.
- Lock Expiry: Distributed locks can be configured to expire after a certain period, which helps avoid situations where a lock remains active indefinitely due to process crashes. This feature ensures that resources can be reclaimed after a timeout, allowing other processes to acquire the lock.
- Locking Granularity: Distributed locks can be used to lock a single resource or a group of resources. For example, a distributed lock can be used to lock a single ticket or a group of tickets.
Challenges
- Deadlocks: When multiple processes attempt to acquire locks on multiple resources, they may end up in a deadlock situation where each process is waiting for the other to release a lock.
Mitigation:
- Timeouts: Setting a timeout for the lock ensures that the lock will be released if the process crashes.
- Try-Lock: Implement a "try-lock" technique that allows a process to attempt to acquire a lock without waiting. If it fails to acquire the lock, the process can then back off, release other locks, and retry.
- Lock-Free: Where feasible, design algorithms that do not require locks. Instead use alternative concurrency control methods, such as optimistic concurrency control (OCC).
- Lock Hierarchy: Establish a strict order in which locks must be acquired. Ensure that all processes acquire locks in the same predefined sequence.
- Deadlock-Detection: Implement mechanisms to detect and resolve deadlocks. For example, use a centralized service to monitor lock requests and detect when a deadlock is imminent.
- Performance: Distributed locks can introduce latency, as they require network communication to coordinate locks across different nodes.
- Consistency: Distributed locks must ensure consistency across different nodes, which can be difficult to achieve in highly concurrent systems.
- Scalability: Ensuring that distributed locks are scalable and fault-tolerant can be challenging, especially in large-scale systems with many nodes.
- Complexity: Implementing and managing distributed locks can be complex, requiring careful consideration of failure modes and ensuring that locks are released in all cases, even in the event of system failures.
Common Use Cases
- E-Commerce Checkout: Locking high-demand items in a user’s cart during checkout to prevent double-selling.
- Ride-Sharing Matchmaking: Locking a driver to a rider request to avoid multiple matches.
- Distributed Cron Jobs: Ensuring that scheduled tasks are executed by only one server at a time to avoid duplication.
- Online Auction Bidding: Briefly locking an item during the final moments of bidding to process bids without conflicts.
Common Solutions
- Redis: A popular distributed key-value store that supports distributed locks in various approaches.
- SETNX and TTL (Single Redis Instance): SETNX ("set if not exists") sets a key only if it does not already exist, and a TTL (Time To Live) bounds how long the lock is held (see the sketch after this list).
- Pros and Cons: Simple implementation, but not fault-tolerant.
- Redlock (Multiple Redis Instances): A distributed locking algorithm that uses multiple Redis instances for safety and consistency. A client must acquire the majority of locks (e.g., at least 3 out of 5) within a short timeframe to ensure the lock is valid.
- Pros and Cons: More complex setup, but provides fault-tolerance and consistency.
- AWS DynamoDB: A distributed database that supports distributed locking through the DynamoDB lock client, which uses a persistent table with additional features like automatic heartbeating.
- Zookeeper: A distributed coordination service that provides distributed locking through its coordination primitives (ephemeral, sequential znodes). Its replicated, file-system-like namespace stores the locks, providing durability and fault tolerance.
Compute Options
In system design, compute options refer to the different ways to execute code in a system. Here are some of the most common compute options:
Containers
Containers are similar to VMs in that they provide an isolated environment for running code, but they are much more lightweight and faster to start up. Let's break down the key differences between VMs and containers:
- Isolation: Containers share the kernel of the host machine, while VMs have their own kernel and virtual hardware.
- Resource Utilization: Containers are more resource-efficient than VMs, as they do not need to run a full virtual machine.
- Lightweight: Containers are much lighter than VMs; they include only the application and its dependencies, so they can be started and stopped much faster.
When it comes to production, containers are often used in conjunction with orchestration tools like Kubernetes or ECS to manage the lifecycle of the containers.
Pros:
- Cost-effective for steady workloads; costs can be reduced further by using spot instances.
- Better suited for long-running processes, since they can maintain application state.
Cons:
- More operational overhead compared to serverless, due to managing the lifecycle of the containers and the orchestration platform.
- Cannot scale as elastically as serverless.
Serverless
Serverless functions like AWS Lambda are small, stateless, event-driven functions that run in response to triggers (e.g., an HTTP request). They are managed by a cloud provider and automatically scale up or down based on demand, making them a great option for bursty or unpredictable workloads.
Pros:
- Minimal operational overhead, as the cloud provider manages the infrastructure.
- Ideal for short-lived tasks under 15 minutes.
- Automatically scales up or down based on demand.
- No need to manage the lifecycle of the compute resources.
Cons:
- Has cold start time, which can introduce latency for the first invocation.
- Has resource limits, which can impact the performance of long running or CPU intensive tasks.
- More expensive than containers for steady workloads.
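For context, this is the general shape of a minimal Lambda handler behind an HTTP trigger such as API Gateway; the event fields shown are illustrative.

```python
# Sketch: minimal AWS Lambda handler for an HTTP trigger.
import json

def handler(event, context):
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```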
Push Notification
Push notifications are a way to send messages to users from a server to their devices. They are particularly useful for real-time communication, such as in chat applications or social media platforms.
3rd Party Push Notification Services
- Apple Push Notification Service (APNS): A messaging service provided by Apple to push notifications to iOS devices.
- Firebase Cloud Messaging (FCM): A messaging service provided by Google to push notifications to Android devices.
- SMS Service: 3rd party services like Twilio to send SMS notifications to users.
- Email Service: 3rd party services like Resend to send email notifications to users.