System Design - Storage/Database Systems 方法论框架

Contents

System Design 存储/数据库系统面试方法论

System Design Storage/Database Systems Methodology

🎯 核心问题:这类题目的特征

Core Question: Characteristics of These Problems

存储系统是系统设计的基础! 几乎所有系统都需要存储,选择合适的存储方案是关键。

Storage systems are the foundation of system design! Almost all systems need storage, and choosing the right storage solution is crucial.

关键特征

Key Characteristics

维度 / Dimension Storage/Database 系统
核心功能 / Core Function 存储和检索数据
输入 / Input 写入请求(write requests)
输出 / Output 读取结果(read results)
关键挑战 / Key Challenge 读写性能、一致性、扩展性
典型题目 / Typical Problems Design URL Shortener, Design Key-Value Store, Design File Storage

📊 决策树:识别 Storage/Database 题目

Decision Tree: Identify Storage/Database Problems

面试题目分析
Interview Problem Analysis
    │
    ├─ 是否主要关注"存储"、"数据库"、"数据持久化"?
    │   Is it mainly about "storage", "database", "data persistence"?
    │   │
    │   ├─ YES → Storage/Database 系统
    │   │   └─ 继续判断具体类型...
    │   │
    │   └─ NO → 可能是其他类型
    │
    ├─ 读写比例?
    │   Read/Write Ratio?
    │   │
    │   ├─ 读多写少(100:1)→ 读优化存储
    │   │   Read-heavy → Read-optimized storage
    │   ├─ 写多读少(1:100)→ 写优化存储
    │   │   Write-heavy → Write-optimized storage
    │   └─ 读写均衡 → 平衡型存储
    │       Balanced → Balanced storage
    │
    ├─ 数据一致性要求?
    │   Data Consistency Requirements?
    │   │
    │   ├─ 强一致性 → ACID 数据库
    │   │   Strong consistency → ACID database
    │   ├─ 最终一致性 → NoSQL/CAP
    │   │   Eventual consistency → NoSQL/CAP
    │   └─ 可接受不一致 → 缓存 + 数据库
    │       Accept inconsistency → Cache + Database
    │
    └─ 数据规模?
        Data Scale?
        │
        ├─ 小规模(< 1TB)→ 单机数据库
        ├─ 中规模(1TB-100TB)→ 分片数据库
        └─ 大规模(> 100TB)→ 分布式存储

🔍 核心特征识别

Core Characteristics Identification

典型题目关键词

Typical Problem Keywords

  1. Key-Value Store 问题
    • Design URL Shortener
    • Design Distributed Cache
    • Design Key-Value Store
    • Design Pastebin
  2. File Storage 问题
    • Design File Storage System
    • Design Image Storage
    • Design Video Storage
    • Design Dropbox
  3. Database 问题
    • Design Database Sharding
    • Design Distributed Database
    • Design Time-series Database

核心需求模式

Core Requirement Patterns

输入:写入请求
Input: Write Requests
- Key-Value pairs
- Files/Objects
- Records/Transactions

输出:读取结果
Output: Read Results
- Value by key
- File content
- Query results

关键挑战:
Key Challenges:
- 读写性能(QPS)
  Read/Write performance (QPS)
- 数据一致性(ACID vs BASE)
  Data consistency (ACID vs BASE)
- 扩展性(水平扩展)
  Scalability (horizontal scaling)
- 持久化(数据不丢失)
  Persistence (no data loss)

🏗️ 标准架构模式

Standard Architecture Patterns

模式一:Key-Value Store 架构

Pattern 1: Key-Value Store Architecture

适用场景: Use Cases:

  • URL Shortener
  • Distributed Cache
  • Session Storage

核心组件: Core Components:

Client
    ↓
Load Balancer
    ↓
Application Server
    ├─→ Key Generation Service
    ├─→ Cache Layer (Redis)
    └─→ Database (MySQL/NoSQL)
    ↓
Storage Layer
    └─→ Object Storage (S3) [for large values]

关键设计点: Key Design Points:

  1. Key 生成策略 Key Generation Strategy:
    • Base62 encoding (URL shortener)
    • UUID
    • Snowflake ID
    • Hash-based
  2. 存储选择 Storage Selection:
    • 小 Value (< 1MB) → Database
    • 大 Value (> 1MB) → Object Storage (S3)
    • 热点数据 → Cache (Redis)
  3. 扩展策略 Scaling Strategy:
    • 水平分片(按 key hash)
    • 一致性哈希(Consistent Hashing)
    • 读写分离

模式二:File Storage 架构

Pattern 2: File Storage Architecture

适用场景: Use Cases:

  • File Storage System
  • Image/Video Storage
  • Dropbox-like System

核心组件: Core Components:

Client
    ↓
API Gateway
    ↓
File Service
    ├─→ Metadata DB (PostgreSQL)
    ├─→ File Storage (S3/HDFS)
    └─→ CDN (for read optimization)

关键设计点: Key Design Points:

  1. 元数据 vs 文件内容分离 Metadata vs Content Separation:
    • Metadata: 数据库存储(文件名、路径、大小)
    • Content: 对象存储(S3、HDFS)
  2. 上传优化 Upload Optimization:
    • 分块上传(Chunked upload)
    • 断点续传(Resumable upload)
    • 并行上传
  3. 下载优化 Download Optimization:
    • CDN 缓存
    • 预取(Prefetching)
    • 压缩(Compression)

📋 核心设计决策点

Core Design Decision Points

1. 数据库类型选择

1. Database Type Selection

SQL Database (PostgreSQL, MySQL)

优点:
Advantages:
- ACID 事务支持
  ACID transaction support
- 强一致性
  Strong consistency
- 复杂查询支持
  Complex query support

缺点:
Disadvantages:
- 扩展性有限
  Limited scalability
- 写入性能受限
  Limited write performance

适用场景:
Use Cases:
- 需要事务的场景
  Scenarios requiring transactions
- 复杂查询需求
  Complex query requirements
- 强一致性要求
  Strong consistency requirements

NoSQL Database (Cassandra, DynamoDB)

优点:
Advantages:
- 高写入性能
  High write performance
- 水平扩展
  Horizontal scaling
- 最终一致性
  Eventual consistency

缺点:
Disadvantages:
- 不支持复杂查询
  No complex queries
- 一致性较弱
  Weaker consistency

适用场景:
Use Cases:
- 高写入负载
  High write load
- 简单查询模式
  Simple query patterns
- 可以接受最终一致性
  Can accept eventual consistency

Key-Value Store (Redis, Memcached)

优点:
Advantages:
- 极快速度(内存)
  Extremely fast (in-memory)
- 简单 API
  Simple API

缺点:
Disadvantages:
- 内存限制
  Memory limitations
- 数据可能丢失(除非持久化)
  Data may be lost (unless persisted)

适用场景:
Use Cases:
- 缓存
  Caching
- 会话存储
  Session storage
- 临时数据
  Temporary data

2. 分片策略选择

2. Sharding Strategy Selection

Range-based Sharding

优点:
Advantages:
- 实现简单
  Simple implementation
- 范围查询高效
  Efficient range queries

缺点:
Disadvantages:
- 热点问题(某些分片负载高)
  Hotspot issues (some shards have high load)
- 数据分布不均
  Uneven data distribution

适用场景:
Use Cases:
- 有序数据
  Ordered data
- 范围查询多
  Many range queries

Hash-based Sharding

优点:
Advantages:
- 数据分布均匀
  Even data distribution
- 避免热点
  Avoids hotspots

缺点:
Disadvantages:
- 范围查询困难
  Difficult range queries
- 重新分片复杂
  Complex resharding

适用场景:
Use Cases:
- 随机访问模式
  Random access patterns
- 不需要范围查询
  No range queries needed

Directory-based Sharding

优点:
Advantages:
- 灵活的分片策略
  Flexible sharding strategy
- 易于重新分片
  Easy resharding

缺点:
Disadvantages:
- 需要额外的查找服务
  Requires additional lookup service
- 单点故障风险
  Single point of failure risk

适用场景:
Use Cases:
- 复杂分片需求
  Complex sharding requirements
- 需要动态调整
  Need dynamic adjustment

3. 一致性策略选择

3. Consistency Strategy Selection

强一致性(Strong Consistency)

实现方式:
Implementation:
- 同步复制
  Synchronous replication
- 两阶段提交(2PC)
  Two-phase commit

优点:
Advantages:
- 数据始终一致
  Data always consistent
- 读取最新数据
  Read latest data

缺点:
Disadvantages:
- 延迟高
  High latency
- 可用性可能受影响
  Availability may be affected

适用场景:
Use Cases:
- 金融交易
  Financial transactions
- 关键业务数据
  Critical business data

最终一致性(Eventual Consistency)

实现方式:
Implementation:
- 异步复制
  Asynchronous replication
- 向量时钟(Vector Clocks)
  Vector clocks

优点:
Advantages:
- 低延迟
  Low latency
- 高可用性
  High availability

缺点:
Disadvantages:
- 可能读到旧数据
  May read stale data
- 冲突解决复杂
  Complex conflict resolution

适用场景:
Use Cases:
- 社交网络
  Social networks
- 内容分发
  Content distribution

4. 缓存策略选择

4. Caching Strategy Selection

Cache-Aside (Lazy Loading)

流程:
Flow:
1. 查询缓存
   Check cache
2. 缓存未命中 → 查询数据库
   Cache miss → Query database
3. 写入缓存
   Write to cache

优点:
Advantages:
- 实现简单
  Simple implementation
- 缓存故障不影响系统
  Cache failure doesn't affect system

缺点:
Disadvantages:
- 缓存未命中延迟高
  High latency on cache miss
- 可能缓存不一致
  Possible cache inconsistency

Write-Through

流程:
Flow:
1. 写入数据库
   Write to database
2. 同时写入缓存
   Write to cache simultaneously

优点:
Advantages:
- 缓存始终最新
  Cache always up-to-date
- 读取性能好
  Good read performance

缺点:
Disadvantages:
- 写入延迟高
  High write latency
- 写入不常读的数据浪费
  Wastes cache on rarely-read writes

Write-Behind (Write-Back)

流程:
Flow:
1. 写入缓存
   Write to cache
2. 异步写入数据库
   Asynchronously write to database

优点:
Advantages:
- 写入延迟低
  Low write latency
- 批量写入优化
  Batch write optimization

缺点:
Disadvantages:
- 数据可能丢失(缓存故障)
  Data may be lost (cache failure)
- 实现复杂
  Complex implementation

🎯 标准解题流程

Standard Problem-Solving Process

Step 1: 需求澄清(Requirements Clarification)

Step 1: Requirements Clarification

必须明确的问题: Questions to Clarify:

  1. 数据特征 Data Characteristics:
    • 数据大小(小 Value vs 大 File)
    • 数据访问模式(随机 vs 顺序)
    • 数据生命周期(临时 vs 永久)
  2. 读写特征 Read/Write Characteristics:
    • 读写比例(读多写少?写多读少?)
    • QPS 要求
    • 延迟要求
  3. 一致性要求 Consistency Requirements:
    • 强一致性 vs 最终一致性
    • 是否可以接受旧数据
    • 事务需求
  4. 扩展性要求 Scalability Requirements:
    • 数据规模(当前、未来)
    • 是否需要水平扩展
    • 增长预期

Step 2: 估算规模(Scale Estimation)

Step 2: Scale Estimation

关键指标: Key Metrics:

写入负载:
Write Load:
- Writes/second
- 平均数据大小
  Average data size
- 峰值 vs 平均值
  Peak vs average

读取负载:
Read Load:
- Reads/second
- 读取模式(随机/顺序)
  Read patterns (random/sequential)

存储需求:
Storage Requirements:
- 总数据量
  Total data volume
- 数据增长率
  Data growth rate
- 保留时间
  Retention period

示例计算(URL Shortener): Example Calculation (URL Shortener):

Writes = 100M URLs/day = 1.2k/second
Reads = 100:1 ratio = 120k/second

Storage (5 years):
100M/day * 365 * 5 = 182.5B URLs
182.5B * 500 bytes = 91.25 TB

Step 3: 基础设计(Basic Design)

Step 3: Basic Design

最小可行方案: Minimum Viable Solution:

1. Key Generation
   - 生成唯一 key
     Generate unique key

2. Storage
   - 存储 key-value 映射
     Store key-value mapping

3. Retrieval
   - 根据 key 查询 value
     Query value by key

承认问题: Acknowledge Issues:

  • “这个方案在规模上会有问题:单机数据库无法支撑”
  • “This solution will have scale issues: single DB can’t handle it”

Step 4: 优化设计(Optimized Design)

Step 4: Optimized Design

核心优化方向: Core Optimization Directions:

  1. 存储优化 Storage Optimization:
    • 分片(Sharding)
    • 读写分离(Read replicas)
    • 缓存层(Cache layer)
  2. 性能优化 Performance Optimization:
    • CDN(内容分发)
    • 预取(Prefetching)
    • 批量操作(Batching)
  3. 可靠性优化 Reliability Optimization:
    • 复制(Replication)
    • 备份(Backup)
    • 故障恢复(Failover)

📚 典型题目分类

Problem Categories

Key-Value Store 问题

  1. Design URL Shortener
    • 核心:短 URL 生成和重定向
    • 关键:Key 生成、高读取负载
    • 挑战:扩展性、缓存策略
  2. Design Distributed Cache
    • 核心:分布式键值存储
    • 关键:一致性哈希、缓存策略
    • 挑战:缓存失效、一致性

File Storage 问题

  1. Design File Storage System
    • 核心:文件上传、存储、下载
    • 关键:元数据分离、CDN
    • 挑战:大文件处理、扩展性

🎯 面试策略总结

Interview Strategy Summary

开场策略

Opening Strategy

1. 识别题目类型
   Identify problem type
   "这是一个存储系统设计问题"
   "This is a storage system design problem"

2. 明确核心需求
   Clarify core requirements
   "需要存储和检索 [key-value/files]"
   "Need to store and retrieve [key-value/files]"

3. 询问关键参数
   Ask key parameters
   - 读写比例
     Read/write ratio
   - 数据大小
     Data size
   - 一致性要求
     Consistency requirements

📝 快速检查清单

Quick Checklist

需求澄清 Checklist

  • 数据特征(大小、类型、生命周期)
  • 读写特征(比例、QPS、延迟)
  • 一致性要求(强一致 vs 最终一致)
  • 扩展性要求(规模、增长)

设计 Checklist

  • 数据库选择(SQL/NoSQL/Key-Value)
  • 分片策略(Range/Hash/Directory)
  • 缓存策略(Cache-Aside/Write-Through)
  • 复制策略(同步/异步)
  • 扩展性(水平扩展、负载均衡)

记住:这类题目的核心是选择合适的存储方案、分片策略、一致性模型。重点是读写性能、扩展性、一致性权衡! Remember: The core of these problems is choosing the right storage solution, sharding strategy, and consistency model. Focus on read/write performance, scalability, and consistency trade-offs!

Contents