2017年11月27日星期一

Site Reliability Engineering - How Google Runs Production Systems

Ch10: Practical Alerting from Time-Series Data

Borgmon instances can stream time-series data to each other: an upper-tier Borgmon filters the data from the lower tier and aggregates over it. The aggregation hierarchy thus provides metrics at different granularities (DC -> campus -> global).
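A toy Python sketch (not Borgmon's actual rule language; the metric values and label names here are made up) of what each tier of the hierarchy does: sum the series it receives, grouped by a coarser label, and export only the aggregates upward.

```python
from collections import defaultdict

def aggregate(records, level):
    """Sum metric values grouped by one label (e.g. 'dc' or 'campus')."""
    totals = defaultdict(float)
    for labels, value in records:
        totals[labels[level]] += value
    return dict(totals)

# Lower tier: per-task samples collected inside each datacenter.
task_records = [
    ({"dc": "dc1", "campus": "us-west"}, 120.0),
    ({"dc": "dc2", "campus": "us-west"}, 80.0),
    ({"dc": "dc3", "campus": "eu"}, 50.0),
]

per_dc = aggregate(task_records, "dc")          # {'dc1': 120.0, 'dc2': 80.0, 'dc3': 50.0}
per_campus = aggregate(task_records, "campus")  # {'us-west': 200.0, 'eu': 50.0}
global_total = sum(per_campus.values())         # 250.0
```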

Ch19: Load Balancing at the Frontend

Traffic types differ: search requests want low latency, while video requests want high throughput.
  • DNS layer LB (see the sketch after this list)
    • Clients don't know the closest IP -> anycast, or a map of all networks' physical locations (but how to keep it updated?)
    • DNS middlemen (recursive resolvers) cache responses (we can't tell how many users sit behind one cache) (need a low TTL so changes propagate)
    • DNS needs to know each DC's capacity
  • Virtual IP (VIP) layer LB: the network load balancer (LVS) forwards packets with GRE encapsulation (the original IP packet inside)
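A toy sketch of the DNS-layer decision described above, with invented data (the resolver-to-location map, capacity numbers, and VIPs are all hypothetical): return the VIP of the closest datacenter that still has spare capacity, with a short TTL so resolver caches don't pin users to a drained site.

```python
# Hypothetical static data: closest-first DC list per resolver subnet, plus capacity.
DATACENTERS = {
    "us-west": {"vip": "198.51.100.10", "capacity_qps": 1000, "load_qps": 950},
    "us-east": {"vip": "198.51.100.20", "capacity_qps": 1000, "load_qps": 400},
}
NEAREST_DCS = {"203.0.113.0/24": ["us-west", "us-east"]}  # per recursive resolver

def resolve(resolver_subnet):
    """Pick the nearest DC with spare capacity; answer with a short TTL."""
    candidates = NEAREST_DCS[resolver_subnet]
    for name in candidates:
        dc = DATACENTERS[name]
        if dc["load_qps"] < dc["capacity_qps"]:
            return {"answer": dc["vip"], "ttl_seconds": 30}
    # Every nearby DC is full: fall back to the closest one anyway.
    return {"answer": DATACENTERS[candidates[0]]["vip"], "ttl_seconds": 30}

print(resolve("203.0.113.0/24"))  # us-west VIP while it still has headroom
```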

Ch20: Load Balancing in the Datacenter

  1. Active-request limit: the client marks a backend as bad once its backlog of outstanding requests is too high. But a client issuing many slow, long-running queries hits the limit quickly even when the backends are healthy
  2. A backend has states: healthy, unresponsive, and lame duck ("stop sending me new requests")
  3. Subsetting: each client talks only to a limited subset of backends, which caps connection counts and contains a "backend killer" client (sketch below)
  4. Weighted Round Robin using backend-provided load information
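A minimal Python sketch of deterministic subsetting in the spirit of this chapter: clients are grouped into rounds, each round shuffles the backend list with the round number as the seed, so every client gets a small, stable subset while load stays spread across all backends. Function and variable names are mine, not from the book.

```python
import random

def deterministic_subset(backends, client_id, subset_size):
    """Return the stable subset of backends this client should connect to."""
    subset_count = len(backends) // subset_size   # subsets per shuffle round
    round_id = client_id // subset_count
    rng = random.Random(round_id)                 # same seed -> same shuffle for the round
    shuffled = list(backends)
    rng.shuffle(shuffled)
    start = (client_id % subset_count) * subset_size
    return shuffled[start:start + subset_size]

backends = [f"backend-{i}" for i in range(12)]
print(deterministic_subset(backends, client_id=0, subset_size=3))
print(deterministic_subset(backends, client_id=5, subset_size=3))
```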

Ch21: Handling Overload

  • Client-side adaptive throttling: P_reject = max(0, (requests - K * accepts) / (requests + 1)) (sketch after this list)
  • Each request is tagged with a priority/criticality
  • When to retry
    • per-request retry budget: 3 attempts
    • per-client retry budget: retries capped at about 10% of requests (total <= 110%), to avoid a 3x load amplification
    • the backend returns "overloaded; don't retry" when its histogram of retry-attempt counts shows heavy retrying upstream
  • Put a batch proxy in front of the backends if batch clients would otherwise hold too many connections
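A minimal sketch of the client-side adaptive throttling formula above, with K = 2 as suggested in the chapter; a real implementation would track requests and accepts over a sliding window rather than forever, and the class and method names here are mine.

```python
import random

class AdaptiveThrottle:
    """Client-side rejection: p_reject = max(0, (requests - K*accepts) / (requests + 1))."""

    def __init__(self, k=2.0):
        self.k = k
        self.requests = 0.0   # requests the client tried to send
        self.accepts = 0.0    # requests the backend actually accepted

    def reject_probability(self):
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_send(self):
        """Decide locally whether to send, before spending backend resources."""
        return random.random() >= self.reject_probability()

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1

throttle = AdaptiveThrottle()
if throttle.should_send():
    accepted = True  # placeholder for the real RPC result
    throttle.record(accepted)
```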

Ch23: Managing Critical State: Distributed Consensus for Reliability

Whenever you see leader election, critical shared state, or distributed locking, use a distributed consensus system.
Algorithms differ in their failure assumptions: crash-fail or crash-recover? Byzantine or non-Byzantine failures?
The fundamental building block is the replicated state machine (RSM): a system that executes the same set of operations, in the same order, on several processes.
The consensus algorithm deals with agreement on the sequence of operations, and the RSM executes the operations in that order.
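A minimal sketch of the RSM idea: the consensus layer (not shown) delivers an identical, gap-free sequence of log entries to every replica, and each replica applies them strictly in order, so all replicas converge on the same state. The key/value operations are just an illustrative choice.

```python
class ReplicatedStateMachine:
    """One replica of a consensus-backed key/value state machine."""

    def __init__(self):
        self.state = {}          # the replicated application state
        self.last_applied = 0    # index of the last applied log entry

    def apply(self, index, op):
        """Apply one agreed-upon log entry; ops must arrive in order, exactly once."""
        if index != self.last_applied + 1:
            raise ValueError("out-of-order or duplicate log entry")
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)
        self.last_applied = index

# Every replica applies the same log and ends in the same state.
log = [(1, ("put", "leader", "node-a")), (2, ("put", "epoch", 7)), (3, ("delete", "leader", None))]
replica = ReplicatedStateMachine()
for index, op in log:
    replica.apply(index, op)
print(replica.state)  # {'epoch': 7}
```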

What can an RSM build?
  • Replicated datastores and configuration stores: the RSM provides the consistency semantics; other (non-consensus-based) systems rely on timestamps (cf. Spanner)
  • Highly available processing: leader election, e.g., GFS electing a single master to coordinate workers
  • A barrier: blocking a group of processes from proceeding until some condition is met -> MapReduce's phase transitions; also usable for distributed locking
  • Atomic broadcast: for queuing; ensures messages are received reliably and delivered in the same order

Performance

  • Network RTT: use regional proxies to hold persistent TCP/IP connections; proxies can also encapsulate sharding and load-balancing strategies, as well as discovery of cluster members and leaders
  • Fast Paxos: each client sends Propose messages directly to each member
  • Stable leaders: fewer round trips, but the leader can become a bottleneck
  • Batching: bundle multiple operations into a single proposal/transaction
  • Disk: consider batching writes and combining the consensus transaction log with the application's transaction log into a single log

Ch24: Distributed Periodic Scheduling with Cron

  1. Decouple processes from machines: send RPC requests (e.g., to a datacenter scheduler) instead of launching processes on a specific machine
  2. Tracking the state of cron jobs:
    1. where
      1. Store data externally like GFS or HDFS
      2. Store internally as part of the cron service (3 replicas): keep snapshots locally; do not store logs on GFS (too slow)
    2. what
      1. when it is launched
      2. when it has finished
  3. Use Paxos:
    1. A single leader launches cron jobs (possibly through another DC scheduler); it launches only after the state change reaches Paxos quorum (synchronously) - is that too slow?
    2. Followers need to keep each job's finish time up to date in case the leader dies
  4. Resolving partial failures:
    1. All external operations must be idempotent, or their state must be stored externally and unambiguously
    2. Must record the scheduled launch time to prevent double scheduling; there is a trade-off between risking a double launch and skipping a launch (sketch below)
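A sketch of the partial-failure rules above, with hypothetical helpers (commit stands in for a synchronous Paxos quorum write, launch_rpc for the idempotent call to the datacenter scheduler): the leader records its intent to launch a given (job, scheduled_time) pair before acting, so a new leader after failover knows whether that tick was already attempted, trading a possible double launch for never silently skipping one.

```python
from dataclasses import dataclass, field

@dataclass
class CronState:
    """Replicated cron state; in reality every mutation is a Paxos-committed log entry."""
    intents: set = field(default_factory=set)    # (job_name, scheduled_time) launch attempts
    finished: set = field(default_factory=set)   # completed launches

def maybe_launch(state, commit, launch_rpc, job_name, scheduled_time):
    key = (job_name, scheduled_time)
    if key in state.intents:
        return                       # this tick was already attempted: never double-schedule
    commit(("intent", key))          # persist intent to the quorum before the side effect
    state.intents.add(key)
    launch_rpc(job_name)             # external operation: must be idempotent
    commit(("finish", key))          # record completion so followers can track finish times
    state.finished.add(key)
```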

Ch25: Data Processing Pipelines

How do you implement a periodic pipeline? Concerns: latency, throughput, fast startup, even work distribution, and the thundering-herd problem.

Google Workflow uses the model-view-controller pattern:
  • Task Master (model): holds all job states (as pointers) in RAM; the actual input/output data is stored in a common filesystem such as HDFS
  • Workers (view): stateless
  • Controller: auxiliary system activities such as runtime scaling, snapshotting, work-cycle state control, and rolling back pipeline state
How to guarantee correctness?
  • Worker output through configuration tasks creates barriers on which to predicate work
  • Committing any work requires a currently valid lease held by the worker (see the sketch after this list)
  • Output files are uniquely named by the workers
  • The client and server validate the Task Master itself by checking a server token on every operation
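A sketch of the lease and naming rules above, with a hypothetical task_master API: the worker writes to a uniquely named output file (so a stale duplicate worker can never clobber another worker's result) and commits only while it still holds a valid lease on the task.

```python
import uuid

def commit_work(task_master, worker_id, task_id, produce_output):
    """Produce output under a unique name, then commit only under a valid lease."""
    # Unique output name: a stale or duplicate worker writes elsewhere, never overwrites.
    output_file = f"/pipeline/output/{task_id}-{worker_id}-{uuid.uuid4().hex}"
    produce_output(output_file)

    lease = task_master.get_lease(task_id)           # hypothetical API
    if lease is None or lease.owner != worker_id or lease.expired():
        raise RuntimeError("lease lost: abandon this output, do not commit")
    task_master.commit(task_id, output_file, lease)  # atomically records the chosen output
```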
Failure model:
  • The Task Master stores its journals on Spanner, using it as a globally available, globally consistent (but low-throughput) filesystem
  • Chubby is used to elect the writer, and the result is persisted in Spanner
  • Globally distributed Workflows employ two or more local Workflows running in distinct clusters
  • Each task has a "heartbeat" helper peer; if the task's heartbeats time out, the peer resumes the task in its place