Monday, November 27, 2017

Site Reliability Engineering - How Google Runs Production Systems

Ch10: Practical Alerting from Time-Series Data

Borgmon instances can stream time-series data to one another: an upper tier filters the data from the tier below and aggregates on it. The aggregation hierarchy thus provides metrics at different granularities. (DC -> campus -> global)

Ch19: Load Balancing at the Frontend

Traffic types differ: search requests care about latency, while video requests care about throughput.
  • DNS layer LB
    • Clients don't know the closest IP -> anycast -> a map of all networks' physical locations (but how do you keep it updated?)
    • Intermediate DNS resolvers cache responses (we don't know how many users sit behind each cache) (a low TTL is needed for fast propagation)
    • DNS needs to know each DC's capacity
  • Virtual IP layer LB: LVS with GRE packet encapsulation (the original IP packet inside)

ch20: Load Balancing in the Datacenter

  1. Active-request limit: the client marks a backend as bad if its backlog of active requests is high. But a client issuing many long-running queries will trip the limit quickly even when the backends are healthy
  2. Backend has states: healthy, unresponsive, lame duck (stop sending requests to me)
  3. Subsetting: each client talks to only a limited set of backends, preventing a single "backend killer" client from overwhelming every backend (see the sketch after this list)
  4. Weighted Round Robin on backend-provided load information
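
A minimal sketch of the deterministic subsetting idea from item 3 above (the round-based shuffle and the names here are illustrative assumptions, not Google's exact algorithm): every client in the same "round" shuffles the backend list identically and takes a disjoint slice, so connections stay evenly spread while each client keeps a small, stable subset.

    import random

    def deterministic_subset(backends, client_id, subset_size):
        # Clients are grouped into rounds; each round uses one deterministic shuffle,
        # and every client in the round takes a disjoint slice of that shuffle.
        subset_count = len(backends) // subset_size
        round_id = client_id // subset_count
        shuffled = list(backends)
        random.Random(round_id).shuffle(shuffled)   # same order for the whole round
        subset_id = client_id % subset_count
        start = subset_id * subset_size
        return shuffled[start:start + subset_size]

    # e.g. deterministic_subset(["be%d" % i for i in range(300)], client_id=42, subset_size=30)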

ch21: Handling Overload

  • Client-side adaptive throttling: reject locally with probability P_reject = max(0, (requests - K * accepts) / (requests + 1)) (sketch after this list)
  • Request is tagged with priority
  • When to retry
    • per-request retry budget is 3
    • per-client retry budget is 10% of requests (so total load stays within ~110%, avoiding a 3X retry storm)
    • backend returns "overloaded; don't retry" if its histogram of retry counts shows heavy retrying
  • Put a batching proxy in front if too many clients hold connections to the backends
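
A sketch of the adaptive throttling formula above (a simplified assumption: plain running counters instead of a trailing time window, and K = 2 as a reasonable default):

    import random

    class AdaptiveThrottler:
        # Rejects requests locally with probability
        # max(0, (requests - K * accepts) / (requests + 1)).
        def __init__(self, k=2.0):
            self.k = k
            self.requests = 0
            self.accepts = 0

        def should_reject_locally(self):
            p_reject = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
            return random.random() < p_reject

        def record_result(self, accepted):
            self.requests += 1
            if accepted:
                self.accepts += 1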

ch23: Managing Critical State: Distributed Consensus for Reliability

Whenever you see leader election, critical shared state, or distributed locking, use a distributed consensus system.
Different algorithms: crash-fail or crash-recover? Byzantine or non-Byzantine failures?
The fundamental building block is the replicated state machine (RSM): a system that executes the same set of operations, in the same order, on several processes.
The consensus algorithm deals with agreement on the sequence of operations, and the RSM executes the operations in that order.
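
A toy sketch of the RSM idea (illustrative only; the log stands in for whatever sequence the consensus layer has already agreed on):

    class ReplicatedStateMachine:
        def __init__(self):
            self.state = {}
            self.applied = 0           # index of the next committed entry to apply

        def apply_committed(self, log):
            # Apply, in order, every agreed-upon entry not applied yet.
            for op, key, value in log[self.applied:]:
                if op == "set":
                    self.state[key] = value
                elif op == "delete":
                    self.state.pop(key, None)
                self.applied += 1
            return self.state

    # Two replicas fed the same log converge to the same state:
    log = [("set", "a", 1), ("set", "b", 2), ("delete", "a", None)]
    assert ReplicatedStateMachine().apply_committed(log) == ReplicatedStateMachine().apply_committed(log)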

What can an RSM be used to build?
  • Replicated Datastores and Configuration Stores: the RSM provides the consistency semantics. Other (non-distributed-consensus-based) systems rely on timestamps (see Spanner)
  • Highly Available Processing: Leader election of GFS to ensure only one leader for coordinating workers
  • A barrier: blocking a group of processes from proceeding until some condition is met -> MapReduce's phases and distributed locking
  • Atomic broadcast: for queueing, ensure messages are received reliably and in the same order

Performance

  • Network RTT: use regional proxies to hold persistent TCP/IP connections, and it could also serve for encapsulating sharding and load balancing strategies, as well as discovery of cluster members and leaders
  • Fast Paxos: each client sends its Propose directly to every member
  • Stable Leaders: but leader will be a bottleneck
  • Batching: combine multiple client operations into a single transaction
  • Disk: consider batching writes / combining multiple transaction logs into a single log

ch24: Distributed Periodic Scheduling with Cron

  1. Decouple processes from machines: sending RPC requests instead of launching jobs
  2. Tracking the state of cron jobs:
    1. where
      1. Store data externally like GFS or HDFS
      2. Store internally as part of the cron service (3 replicas); store snapshots locally, do not store logs on GFS (too slow)
    2. what
      1. when it is launched
      2. when it has finished
  3. Use Paxos:
    1. A single leader launches cron jobs (or delegates to another DC scheduler), and only launches a job after the Paxos quorum has acknowledged it (synchronously). Is that slow?
    2. Followers need to track each job's finish time in case the leader dies
  4. Resolving partial failures:
    1. all external operations must be idempotent, or their state must be stored externally and unambiguously
    2. must record the scheduled launch time to prevent double scheduling. Trade-off between risking a double launch and risking a skipped launch (see the sketch below)
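
A sketch of the "record the intended launch before launching" choice (store and launch_rpc are hypothetical stand-ins for a quorum-replicated state store and an idempotent launch RPC); this ordering risks skipping a launch after a crash, while recording only after launching would risk a double launch instead:

    class CronLeader:
        def __init__(self, store, launch_rpc):
            self.store = store            # hypothetical quorum-replicated job state
            self.launch_rpc = launch_rpc  # idempotent RPC that actually starts the job

        def maybe_launch(self, job_name, scheduled_time):
            last_recorded = self.store.get(job_name)   # last launch time recorded via quorum
            if last_recorded is not None and last_recorded >= scheduled_time:
                return                                 # this tick was already claimed
            self.store.put(job_name, scheduled_time)   # synchronously record intent first
            self.launch_rpc(job_name, scheduled_time)  # idempotent: safe to repeat after a crash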

ch25: Data Processing Pipelines

How to implement a periodic pipeline? (latency, throughput, fast startup, even work distribution, thundering-herd issues)

Google Workflow uses model-view-controller pattern
  • Task Master (model): holds all job states (pointers) in RAM; the actual input/output data is stored in a common HDFS
  • Workers(view): stateless
  • Controller: auxiliary system activities: runtime scaling, snapshotting, workcycle state control, rolling back pipeline state
How to guarantee correctness?
  • Worker output through configuration tasks creates barriers on which to predicate work
  • All work committed requires a currently valid lease held by the worker
  • Output files are uniquely named by the workers
  • The client and server validate the Task Master itself by checking a server token on every operation
Failure model:
  • The Task Master stores journals on Spanner, using it as a globally available, globally consistent, low-throughput filesystem
  • Use Chubby to elect writer and persist result in Spanner
  • Globally distributed Workflows employ 2+ local Workflows running in distinct clusters
  • Each task has a "heartbeat" peer; if the peer times out, this task resumes the peer's work instead

Monday, September 11, 2017

Network

  • Network fundamentals series (网络基本功系列:细说网络那些事儿)
  • https://github.com/alex/what-happens-when#dns-lookup (how a browser reaches Google)
  • Proxy Digest Authentication?
  • Why TChannel?
    • Why not HTTP? HTTP connections are uni-directional, can only be used for one concurrent request, and are built for delivering web pages with cache hints. HTTP tightly couples encoding and transport concerns (the path implies both the procedure name and sometimes even parts of the request body), and has a single header namespace with variable widths. It is not ideal for RPC, but it's fantastic for web pages.
    • HTTP2:
    • HTTP2 vs WebSocket
      • WebSocket is bidirectional; HTTP2 is client/server + server push
    • TChannel: RPC, multiplexing, bi-directional message passing, fast forwarding, work shedding with deadlines, cancellation, speculative execution, and communicates more clearly about when and when not to retry.
  • Improve Web Speed: client -> DNS resolver -> DNS -> DNS resolver -> client --req--> Web server
    • Speed = DNS latency + request latency. Which DNS provider has the lowest latency? It can't be measured directly because the DNS resolver delegates the name resolution. Solution: create a special hostname to resolve (avoiding caches); DNS latency = server access time - special-hostname resolve time (see the sketch below)
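
A rough sketch of the special-hostname trick (the probe domain is hypothetical; in practice you would use a unique label under a zone you control so every probe misses the resolver cache):

    import socket, time, uuid

    def resolve_ms(hostname):
        start = time.monotonic()
        try:
            socket.getaddrinfo(hostname, 80)   # goes through the configured resolver
        except socket.gaierror:
            pass                               # NXDOMAIN is fine; we only time the lookup
        return (time.monotonic() - start) * 1000

    # A unique label per probe defeats resolver caching (hypothetical domain):
    probe = uuid.uuid4().hex + ".probe.example.com"
    print("uncached DNS latency: %.1f ms" % resolve_ms(probe))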

DNS resolvers


The DNS server configured on a client (e.g., 8.8.4.4) is usually a recursive resolver: it takes full responsibility for the client's query until it returns a final answer. Between DNS servers themselves, queries are usually iterative.

  1. clear the DNS cache (macOS): sudo killall -HUP mDNSResponder
  2. check DNS client: sudo lsof -i -P | grep LISTEN
  3. http://whoismydns.com/Index.html
  4. dig @127.0.0.1 -p 53 google.com +tcp +trace





TCP

  • Connection Reset: the peer rejects it with an RST to let you know it isn't listening.
  • A successful send() only means the data was copied into the kernel's socket send buffer, not that it was transmitted or received by the peer

Load Balancer

Extra reading

GFW


Thursday, September 7, 2017

Gödel, Escher, Bach

Gödel's incompleteness theorems


  • Two theorems
  • Hilbert's program: its main goal was to provide a secure foundation for all of mathematics. Specifically, this foundation should include:
    • Formalization of all mathematics: all mathematical statements should be written in a single, rigorously formal language and manipulated according to a fixed set of rules.
    • Completeness: we must prove that, after formalization, every true mathematical statement can be proved (by the rules above).
    • Consistency: we must prove that no contradiction can be derived within this formalism and its rules.
    • Conservation: we must prove that any result about "real objects" obtained using "ideal objects" (such as uncountable sets) can also be proved without them.
    • Decidability: there should be an algorithm for deciding whether each formalized statement is true or false.
  • Hilbert's second problem, one of his 23 problems, concerns the consistency of an axiom system: deciding whether the propositions within an axiomatic system are mutually consistent. Hilbert hoped to prove, in a rigorous way, the consistency of any axiomatic system.

        Monday, August 14, 2017

        Linux & OS & Linux Kernel Development & Computer Systems: A Programmer's Perspective

        Unix / Linux systems and internals

        Preparations

        Operating Systems: Know about processes, threads and concurrency issues. Know about locks and mutexes and semaphores and monitors and how they work. Know about deadlock and livelock and how to avoid them. Know what resources a process needs, and a thread needs, and how context switching works, and how it's initiated by the operating system and underlying hardware. Know a little about scheduling. The world is rapidly moving towards multi-core, so know the fundamentals of "modern" concurrency constructs.

        Linux internals
        • Process Execution and/or Threads
        • Memory Usage
        • RAID Levels
        • The kernel and how it interacts with other system components
        • System Calls
        • Signals and Signal Handlers
        • Modern Web Architectures and Webservers
        Preparations:
        • Review userspace / Kernel space boundaries and interactions.
          Examples might include: ioctls, sysctls, context switches.
        • Review troubleshooting tools for system-level performance issues.
        • Review troubleshooting tools for debugging application-level performance issues or bugs.
        file system processing, file properties, user permissions, text parsing



        Linux

        • Process
        • Context switch
          • Process context switch: need to switch the virtual memory mapping, which requires flushing the TLB (expensive)
          • User mode to kernel mode: a mode transition, usually much cheaper than a full process context switch
        • Kernel
          • How the Kernel Manages Your Memory
          • Kernel can preempt a task running in the kernel so long as it does not hold a lock. Because the kernel is SMP-safe, if a lock is not held, the current code is reentrant and capable of being preempted
        • Permissions
          • chmod
            • sticky bit, 1, t: in a directory, only a file's owner (or root) has the final say; others cannot delete it
            • setgid, 2, s: new files in the directory inherit the directory's group ownership
            • suid, 4, s
            • chmod -R a+rX . makes all subdirectories searchable (regular files only retain their existing executable permission)
          • chattr
            • i, immutable
          • filetype
          • umask: mask that determines the default permissions of newly created files
        • TCP/UDP packet size: 64KB, MTU: 1500 B
        • Zero Copy: From kernel to NIC, no user space
        • https://fabiokung.com/2014/03/13/memory-inside-linux-containers/
        • executing machine code in memory: Doable, think JIT runtime and python interpreter
        • HDD vs SSD:
          • An SSD's sequential-read advantage over a spinning disk is not dramatic; SSDs shine on read-heavy and random read/write workloads
          • erase-before-write: an SSD must erase a block before writing to it
          • Because SSDs suffer from write wear, repeatedly erasing and rewriting the same cells (e.g., Oracle redo) both hurts write performance and greatly shortens the drive's life, so a load-balancing algorithm is needed to ensure every cell is used evenly; this is wear leveling -- combined with offline erase
          • Traditional database logging is sequential logging (repeated writes to consecutive locations); this wears an SSD out quickly, so SSDs favor in-place logging: data and log in the same block; log files are still a better fit for HDDs
          • When an SSD is used as a flashcache, is its data valid after a power loss? It is treated as invalid
        • 4K alignment is a technique that aligns the filesystem layout with the disk's physical layer, improving drive lifetime and space efficiency; it is named for aligning each filesystem cluster to 4096-byte physical sectors. Traditional mechanical hard drives generally use 512-byte sectors

        Filesystem

        Linux Kernel Development (《Linux内核设计与实现》)

        • Overview
          • In Linux, each processor is doing exactly one of three things at any given moment
            • In user-space, executing user code in a process
            • In kernel-space, in process context, executing on behalf of a specific process
            • In kernel-space, in interrupt context, not associated with a process, handling an interrupt
              • the top half runs with interrupts disabled
          • Kernel memory is not pageable
        • Process Management
          • Process includes
            • open files
            • pending signals
            • internal kernel data
            • processor state
            • memory address space (multiple memory mappings) page tables
            • data section containing global variables
          • Thread includes
            • unique program counter
            • process stack
            • set of processor registers
          • The kernel stores the list of processes in a circular doubly linked list called the task list.
          • Fork() only overhead: duplication of the parent's page tables & creation of a unique process descriptor for the child
            • ZOMBIE only has kernel stack, thread_info structure, and task_struct structure
            • When a task is ptraced, it is temporarily reparented to the debugging process
          • kernel threads do not have an address space (not normal process)
        • Process Scheduling: low latency (process response time) and high throughput (system utilization)
          • Context switch, the switching from one runnable task to another:
            • calls switch_mm() to switch virtual memory mapping
            • calls switch_to() to switch the processor state: saving/restoring stack information and processor registers and others
          • User preemption can occur:
            • when returning to user-space from a system call
            • when returning to user-space from an interrupt handler
          • Kernel preemption: Linux 2.6 is a fully preemptive kernel; it is possible to preempt a task at any point, so long as the kernel is safe to reschedule.
            • NO LOCK: locks are used as markers of regions of nonpreemptibility
            • scheduler_tick() will check flag need_resched (which is per-process)
          • Kernel preemption can occur:
              • When an interrupt handler exits, before returning to kernel-space
              • When kernel code becomes preemptible again (NO LOCK, preempt_count is 0)
              • If a task in the kernel explicitly calls schedule()
              • If a task in the kernel blocks (which results in calling schedule())
          • System Calls
            • syscall (interrupt vector 128): an exception or trap to enter the kernel
            • syscall must be reentrant
          • Interrupts and Interrupt Handlers
            • Why interrupt? Processor polling incurs overhead; an interrupt is how hardware signals the kernel to get its attention
            • Interrupt vs Exception:
              • interrupt handlers (top halves) are executed by the kernel asynchronously in immediate response to hardware interrupt
              • exceptions occur synchronously with respect to the processor clock, so they are called synchronous interrupts (e.g., divide by zero, a page fault)
            • Special interrupt context: it is called atomic context as code executing in this context is unable to block (Top Halves)
            • Interrupt handlers can form only the first half of any interrupt processing solution due to these limitations:
              • Interrupt handlers run asynchronously and thus interrupt other, potentially important, code, including other interrupt handlers. (need to run fast)
              • Interrupt handlers run with at best the current interrupt level disabled, and at worst all interrupts on the current processor disabled. But disabling interrupts prevents hardware from communicating with the operating system
              • Interrupt handlers do not run in process context; therefore, they cannot block
              • They are often timing-critical
          • Kernel Synchronization
            • With only a single processor, data could be accessed concurrently only if an interrupt occurred or if kernel code explicitly rescheduled and let another task run
            • Causes of concurrency:
              • Interrupts: An interrupt can occur asynchronously at almost any time, interrupting the currently executing code
              • Softirqs and tasklets: The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the currently executing code
              • Kernel preemption: Because the kernel is preemptive, one task in the kernel can preempt another
              • Sleeping and synchronization with user-space: A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process
              • Symmetrical multiprocessing: Two or more processors can execute kernel code at exactly the same time (per CPU memory)
            • Lock data not code
            • Deadlock
              • Implement lock ordering
              • Prevent starvation. Ask yourself, does this code always finish? If foo does not occur, will bar wait forever?
              • Do not double acquire the same lock
              • Allow only one process to move at a time (prevents livelock)
          • Memory Management
            • A page may be used by page cache, private data, or as a mapping in a process's page table
            • the kernel page structure is associated with physical pages, not virtual pages
            • high memory: physical memory that the architecture cannot permanently map into the kernel address space

          Computer Systems: A Programmer's Perspective (《深入理解计算机系统》)


          Linking
          • Static Linking: ld copies functions from AR file
          • Dynamic Linking:
            • At loading:
              • compile to get partial executable object file from shared object (mark symbols relocatable)
              • execve() runs ld-linux.so, which relocates the target object file and the relocatable shared object libc.so, relocating text and data
            • At runtime: dlopen() loads functions from a shared library directly into the process's memory (part of SHR in top is shared-library pages)
          • 3 kinds of object files:
            • Executable object file
            • Relocatable object file
              • Shared object (shared library)
          Exception handling
          • interrupt (keyboard), trap, fault (divide by zero), abort (machine check)
          • syscall is trap 128
          • User vs kernel mode is controlled by a mode bit in a control register
          • context: general-purpose registers, floating-point registers, PC, user stack, status register, kernel stack, page table, process table, file table
          • Shortcomings of signals:
            • pending signals of the same type are blocked
            • pending signals are not queued: at most one pending signal per type (queue = 1)
            • system calls can be interrupted: read may return EINTR
          • fork() vs execve()
            • fork marks every page of parent and child read-only, and marks each section as private copy-on-write
            • execve: delete the existing user areas -> map private areas (text, data, bss, stack, heap) -> map shared areas (libc.so) -> set the PC
              • execve inherits open file descriptors
          • Data structures for open files
            • descriptor table: per process, points into the file table
            • file table: shared by all processes; holds the current file position, a reference count (how many descriptors point to it), and a pointer into the v-node table
            • v-node table: shared by all processes; file access, size, type


          Linux startup process

          Boot

          1. Power On Self Test 
          2. Boot from CD, USB, HD 
          3. Load the kernel and initramfs into memory
            1. BIOS -> Master Boot Record (first sector) -> GRUB -> active partition -> load the kernel
            2. UEFI (a minimal low-level OS) -> GPT -> /boot/efi/boot.efi -> GRUB/kernel
              1. EFI system partition is based on FAT, contains boot loader or kernel image
            3. Why GRUB has so many stages?
              1. stage 1, boot.img, 446 B, within first sector: to locate stage 2
              2. stage 1.5, core.img, 25 KB, between MBR and first partition: contains a few common filesystem drivers (Not all filesystems)
              3. stage 2, /boot/grub2, The kernels are located in the /boot directory, along with an initial RAM disk image, and device maps of the hard drives
          4. Why can't the BIOS load the kernel directly? It has no filesystem driver for addressing the kernel on disk
          Startup
          1. Kernel loads systemd/init, then mounts filesystems in /etc/fstab
          2. Kernel calls start_kernel() to set up system functions: hardware and memory paging, interrupt handlers, the rest of memory management, device and driver initialization
            1. idle process: power saving mode
            2. scheduler: Completely Fair Scheduler, O(log N)
            3. init/systemd who mounts /etc/fstab
          3. init starts services for the configured runlevel
          4. display manager and login manager -> session manager
          Shutdown
          1. Close down user space functionality
          2. init terminates
          3. kernel shutdown



          Monday, July 17, 2017

          Cassandra

          Cassandra
          Node repair 2.1
          • Frequent data deletions and downed nodes are common causes of data inconsistency. Manual repair: Anti-entropy repair
          • Parallel repair will only repair each token range ONCE. If you are using “nodetool repair -pr” you must run it on EVERY node in EVERY data center; no skipping allowed.

          Thursday, May 11, 2017

          한국어 (Korean)

          처음 뵙(拜)겠습니다 Nice to meet you (first meeting)
          잘 부탁(付托)드리겠습니다 Please take good care of me
          천만(千万) 에요 Not at all / you're welcome
          ? A table for two
          저 미안(未安)합니다만 Excuse me
          메뉴를 보여 주세요 Please show me the menu
          물 좀 더 주세요 Please give me a little more water
          ? It's all right
          조심(操心)하세요 Take care

          옆을 짧게 깎아 주세요 Please trim the sides short

          Learning Korean through Japanese: https://www.zhihu.com/question/19830338/answer/93071137
          Taeyeon's vocal skills: https://www.zhihu.com/question/21424874/answer/115636610

          Saturday, March 18, 2017

          Quantum Mechanics


          • Reading notes on 上帝掷骰子 (a popular book on quantum physics)
          • 走进量子纠缠 PDF (on quantum entanglement)
          • Introduction to quantum mechanics
            • The uncertainty principle
          • The EPR paradox
            • Hidden-variable theory: Einstein's interpretation, because he believed in locality. Bell proposed Bell's inequality intending to support Einstein, but it ended up showing that quantum mechanics is not local
            • Wave-function collapse theory
          • Interpretations of quantum mechanics
            • Einstein
              • Locality. Faster than light? Quantum mechanics has no local realism
                • Would the time reversal of special relativity break causality?
              • Realism: "Does the moon still exist when no one is looking at it?" Realists, following Plato, believe that everything we take as real is only an approximation of true existence; the world our senses perceive is only a projection of reality, not reality itself
              • Determinism
            • Wave-function collapse
              • The many-worlds interpretation holds that the wave function never collapses because of observation
              • Observation causes collapse: as Paul Davies put it, "reality is in the observations, not in the electron"
          • Applications: quantum computing, quantum optics, quantum communication

          Friday, March 10, 2017

          Finance and Economics

          The Impossible Trinity (Trilemma)

          • Free movement of capital (capital mobility)


          Wednesday, March 1, 2017

          Distributed System Design

          Unix / Linux systems and internals
          https://tasteturnpike.blogspot.com/2021/04/database.html
          http://tasteturnpike.blogspot.com/2017/09/network.html
          https://tasteturnpike.blogspot.com/2017/11/site-reliability-engineering-how-google.html

          System Design
          https://puncsky.com/hacking-the-software-engineer-interview/?isAdmin=true
          https://github.com/checkcheckzz/system-design-interview
          http://www.aosabook.org/en/distsys.html

          Companies
          http://highscalability.com/blog/2015/9/14/how-uber-scales-their-real-time-market-platform.html
          https://www.quora.com/What-is-YouTubes-architecture

          Estimation / back-of-the-envelope calculations: http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html

          Design Paradigms




          1. Compare 2PC to Paxos. What is different? What is similar? Will you ever need both together in a system?
          2. Raft
          3. Clock


          CockroachDB: The Resilient Geo-Distributed SQL Database


          1. shared-nothing architecture: all nodes are used for both data storage and computation
          2. single node is layered architecture:
            1. SQL: parser, optimizer, SQL execution engine (convert SQL to low-level read write requests to KV store)
            2. Transactional KV: ensures atomicity of changes spanning multiple KV pairs; isolation guarantee
            3. Distribution: a monolithic logical key space ordered by key. chunk = Range = 64MB. The Distribution layer is responsible for identifying which Ranges should handle which subset of each query, and routes the subsets accordingly.
            4. Replication: consensus-based replication
            5. Storage: RocksDB -> Pebble
          3. Fault Tolerance and High Availability
            1. Replication using Raft: Replicas of a Range form a Raft group. Each replica individually applies command. Usually Raft group leader acts as the leaseholder. All writes go through the leaseholder. Leases for user Ranges are tied to the liveness of the node the leaseholder is on. Heartbeat is every 4.5 seconds.
            2. Membership changes and automatic load rebalancing. Use peer-to-peer gossip protocol to determine node liveness
            3. Replica placement. Each nodes has its own attributes like capability (CPU, RAM) or locality (country, region)
          4. Transactions
            1. Execution at the transaction coordinator: coordinator receives a series of requested KV operations from the SQL layer. Optimization: batch multi-statement SQL transactions to complete with one replication.
              1. Write Pipelining: if no overlapping, operations on different keys can be "pipelined". if operation depends on earlier in-flight operation, execution must wait for earlier in-flight execution
              2. Parallel Commits: naively, a transaction needs at least two sequential rounds of consensus before it can commit (one for its writes, one for the commit record). Optimization: write the transaction record with a staging status and replicate it in parallel with the verification of the outstanding writes
            2. Execution at the leaseholder:
              1. Acquires latches on the keys and dependencies
              2. Write operations are replicated. After consensus is reached, each replica applies the command to local storage engine
              3. Release latches
            3. Transaction record: A metadata which is a special key stores the current disposition of transaction: pending, staging, committed or aborted
            4. Concurrency Control: A total ordering of all transactions in the system, representing a serializable execution. Conflicts may require commit timestamp adjustments: the transaction typically tries to prove that its prior reads remain valid at the new timestamp. If concurrent transactions collide, one is pushed to a later timestamp to stagger them and still obtain a serializable execution. Transactions need not execute in the order they were born; the system respects their physical arrival time: first to arrive, first to execute
              1. Deadlock: CRDB employs a distributed deadlock-detection algorithm to abort one transaction from a cycle of waiters.
              2. Write-read conflicts: the write came first; the read waits for it to complete
              3. Read-write conflicts: the later read arrived and executed first; the write waits and is moved to a timestamp after the read
              4. Write-write conflicts: the later write arrived and executed first; the earlier write is moved to a timestamp after it
            5. Read Refresh: Certain types of conflicts described above require advancing the commit timestamp of a transaction. To maintain serializability, the read timestamp must be advanced to match the commit timestamp
              1. My understanding: under concurrency we do not need to preserve the real order of requests from different users, but we must preserve the order of a single user's transactions: if one user's read-then-write arrives as write-then-read, the read cannot simply be advanced, which triggers a read refresh
              2. Advancing a transaction’s read timestamp from ta to tb > ta is possible if we can prove that none of the data that the transaction read at ta has been updated in the interval (ta,tb ]. If the data has changed, the transaction needs to be restarted. If no results from the transaction have been delivered to the client, CRDB retries the transaction internally. If results have been delivered, the client is informed to discard them and restart the transaction
              3. To determine whether the read timestamp can be advanced, CRDB maintains the set of keys in the transaction’s read set (up to a memory budget) This involves re-scanning the read set and checking whether any MVCC values fall in the given interval. This process is equivalent to detecting the rw-antidependencies that PostgreSQL tracks for its implementation of SSI. Similar to PostgreSQL, our implementation may allow false positives (forcing a transaction to abort when not strictly necessary) to avoid the overhead of maintaining a full dependency graph
            6. Follower Reads: CRDB allows non-leaseholder replicas to serve requests for read-only queries with timestamps sufficiently in the past. Each leaseholder tracks the timestamps of all incoming requests and periodically emits a closed timestamp, the timestamp below which no further writes will be accepted. Every node keeps a record of its latency with all other nodes in the system. When a node in the cluster receives a read request at a sufficiently old timestamp (closed timestamps typically trail current time by ~2 seconds), it forwards the request to the closest node with a replica of the data.
          5. Clock Synchronization
            1. Hybrid-Logical Clocks (see the sketch after this list):
              1. timestamps that are a combination of physical and logical time
              2. maximum allowable offset defaults to 500 ms
              3. HLCs provide causality tracking through their logical component upon each inter-node exchange. The most important of these is a lease disjointness invariant similar to that in Spanner: for each Range, each lease interval is disjoint from every other lease interval. This is enforced on cooperative lease handoff with causality transfer through the HLC and is enforced on non-cooperative lease acquisition through a delay equal to the maximum clock offset between lease intervals
              4. Strictly monotonic timestamp allocation ensures that two causally dependent transactions originating from the same node are given timestamps that reflect their ordering in real time.
            2. Uncertainty Intervals:
              1. CRDB does not support strict serializability because there is no guarantee that the ordering of transactions touching disjoint key sets will match their ordering in real time. In practice, this is not a problem for applications unless there is an external low-latency communication channel between clients that could potentially impact activity on the DBMS.
              2. CRDB satisfies single-key linearizability for reads and writes.
              3. A transaction can carry an earlier timestamp even though it is not earlier in real time: if the transaction sees a future key within its uncertainty interval, it performs an uncertainty restart
            3. Behavior under Clock Skew: If any node exceeds the configured maximum offset by more than 80% compared to a majority of other nodes, it self-terminates
              1. What if a rogue node's clock is off by an hour and it does not self-terminate? Certificates are used to verify a node's identity
            4. https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
              1. The strongest guarantee of consistency is called “external consistency”. To confuse things further, this is what folks interchangeably refer to as “linearizability” or “strict serializability”: for sequentially occurring [T1, T2], T2 executes later yet carries an earlier timestamp, so when looking back at history it appears that T2 executed first (note that this can only happen when the two transactions access a disjoint set of keys)
                1. If T1 and T2 touch the same key, the conflict will certainly be detected within the clock offset
                  1. Spanner chooses to wait 7ms, within which T2 is guaranteed to arrive. Making every operation wait 7ms is impractical; recent research reduces this to about 1ms
                2. If they share no key, CRDB considers this case unimportant
              2. Spanner (TrueTime) provides linearizability, CockroachDB only goes as far as to claim serializability
              3. Spanner always waits after writes, CockroachDB sometimes retries reads.
              4. Although CockroachDB does not treat linearizability as essential, the commit timestamp matters: it must be later than the clocks on all nodes, otherwise the transaction might fail to read already-committed data - an unacceptable breach of consistency
                1. Spanner simply uses the current TrueTime, which guarantees that currently committed transactions completed at least 7ms earlier
                2. CockroachDB makes use of a “causality token”, which is just the maximum timestamp encountered during a transaction. It’s passed from one actor to the next in a causal chain, and serves as a minimum timestamp for successive transactions to guarantee that each has a properly ordered commit timestamp.
                3. When CockroachDB starts a transaction, it chooses a provisional commit timestamp based on the current node’s wall time. It also establishes an upper bound on the selected wall time by adding the maximum clock offset for the cluster: [commit timestamp, commit timestamp + maximum clock offset]. This time interval represents the window of uncertainty. When it encounters a timestamp later than its own (within that window), it performs an uncertainty restart: bumping the provisional commit timestamp just above the timestamp encountered, while keeping the same upper bound -- Transactions reading constantly updated data from many nodes may be forced to restart multiple times, though never for longer than the uncertainty interval, nor more than once per node.
                  1. If it’s unlucky, the read may have to be retried more than once. There is an upper limit on how many retries can occur based on how clocks are synchronized. For NTP, this could be 250ms, so even the most unlucky transactions won’t have to retry for more than 250ms for clock-related reasons.
          6. SQL
            1. Primary index is keyed on primary key. Secondary indexes are keyed on index key.
            2. Vectorized execution engine: data from disk is transposed from row to column format as it is being read from CRDB’s KV layer, and transposed again from column to row format right before it is sent back to the end user. The overhead of this process is minimal.
            3. Schema Changes: schema change becomes a sequence of incremental changes. In this protocol, the addition of a secondary index requires two intermediate schema versions between the initial and final ones to ensure that the index is being updated on writes across the entire cluster before it becomes available for reads.
          7. Lessons Learned
            1. Raft
              1. Reducing Chatter: Too many heartbeats for a deployment of thousands of consensus groups (one per Range)
                1. combine heartbeat messages into one per node (not one per-RPC)
                2. pause heartbeat from Raft groups with no recent write activity
              2. Joint Consensus: rebalancing replicas across regions can lose availability during a single-region outage (the replica count temporarily becomes 2 or 4)
                1. In Joint Consensus, an intermediate configuration exists, but requires instead the quorum of both the old and new majority for writes.
            2. Removal of Snapshot Isolation: hard to implement simply and correctly
              1. The only safe mechanism to enforce strong consistency under snapshot isolation is pessimistic locking, via the explicit locking modifiers FOR SHARE and FOR UPDATE on queries. To guarantee strong consistency across concurrent mixed isolation levels, CRDB would need to introduce pessimistic locking for any row updates, even for SERIALIZABLE transactions
            3. Multi Version: Propose the effect of request to Raft, not the request itself
          8. Related Work
            1. Spanner: Spanner [4, 17] is a SQL system that provides the strongest isolation level, strict serializability [30]. It achieves this by acquiring read locks in all read-write transactions and waiting out the clock uncertainty window (the maximum clock offset between nodes in the cluster) on every commit. CRDB’s transaction protocol is significantly different from Spanner’s; it uses pessimistic write locks, but otherwise it is an optimistic protocol with a read refresh mechanism that increases the commit timestamp of a transaction if it observes a conflicting write within the clock uncertainty window. This approach provides serializable isolation and has lower latency than Spanner’s protocol for workloads with low contention. It may require more transaction retries for highly contended workloads, however, and for this reason future versions of CRDB will include support for pessimistic read locks. Note that Spanner’s protocol is only practical in environments where specialized hardware is available to bound the uncertainty window to a few milliseconds. CRDB’s protocol functions in any public or private cloud.
            2. TiDB, NuoDB vs CRDB: these systems are not optimized for geo-distributed workloads and only support snapshot isolation
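
          A minimal sketch of a hybrid-logical clock, as referenced in 5.1 above (my own toy representation, not CRDB's implementation): the wall component tracks physical time while the logical counter preserves causality when clocks tie or lag.

            import time

            class HLC:
                def __init__(self):
                    self.wall = 0      # physical component (ns) of the last timestamp issued
                    self.logical = 0   # logical counter that breaks ties / preserves causality

                def now(self):
                    # Timestamp for a local event or an outgoing message.
                    pt = time.time_ns()
                    if pt > self.wall:
                        self.wall, self.logical = pt, 0
                    else:
                        self.logical += 1
                    return (self.wall, self.logical)

                def observe(self, remote_wall, remote_logical):
                    # Merge a timestamp received from another node so that causally
                    # later events always get strictly larger (wall, logical) pairs.
                    pt = time.time_ns()
                    if pt > self.wall and pt > remote_wall:
                        self.wall, self.logical = pt, 0
                    elif remote_wall > self.wall:
                        self.wall, self.logical = remote_wall, remote_logical + 1
                    elif remote_wall == self.wall:
                        self.logical = max(self.logical, remote_logical) + 1
                    else:
                        self.logical += 1
                    return (self.wall, self.logical)
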
          1. SQL Client sends a query to your cluster.
          2. Load Balancing routes the request to CockroachDB nodes in your cluster, which will act as a gateway.
          3. Gateway is a CockroachDB node that processes the SQL request and responds to the client.
            1. Generate SQL physical plan. For example, the SQL executor converts INSERT statements into Put() operations.
            2. TxnCoordSender performs a large amount of the accounting and tracking for a transaction, including:

              • Accounts for all keys involved in a transaction. This is used, among other ways, to manage the transaction's state.
              • Packages all key-value operations into a BatchRequest, which are forwarded on to the node's DistSender.
            3. DistSender dismantles the initial BatchRequest by taking each operation and finding which physical machine should receive the request for the range—known as the range's leaseholder. The DistSender then waits to receive acknowledgments for all of its write operations, as well as values for all of its read operations.
          4. Leaseholder is a CockroachDB node responsible for serving reads and coordinating writes of a specific range of keys in your query.
            1. write intents for a key: these are provisional, uncommitted writes that express that some other concurrent transaction plans to write a value to the key.
              1. the new transaction attempts to "push" the write intent's transaction by moving that transaction's timestamp forward (i.e., ahead of this transaction's timestamp); however, this only succeeds if the write intent's transaction has become inactive.
              2. In short it comes down to comparing timestamps: first to arrive executes first, or the new transaction tries to push the earlier transaction's timestamp forward; if it finds it has jumped the queue it aborts itself, otherwise it just waits
            2. The leaseholder of course has the latest data to serve reads, but other nodes may not
          5. Raft leader is a CockroachDB node responsible for maintaining consensus among your CockroachDB replicas.
          6. The gateway node's DistSender aggregates commit acknowledgments from all of the write operations in the BatchRequest, as well as any values from read operations that should be returned to the client.
            1. The DistSender checks whether the timestamp was pushed; if so it performs a read refresh to see if any values it needed have changed, and if they have, it starts over
            2. The transaction commits only if all operations succeed and pass the check. The TxnCoordSender then asynchronously cleans up the write intents created along the way, converting all write intents to fully committed values
          Multi-tenant architecture
          1. Each tenant gets a dedicated SQL pod
          2. Storage pods are shared; EBS is the block storage system

          touch on security, alerting, monitoring, and access. For example, "how do you keep CRDB running?"
          Access: SSO for human, TLS public root certificate for SQL client with password

          1. Multiple disks/SSDs per node: not one node per disk
          2. At least 3 availability zones: spread nodes evenly on each zone
          3. At least 3 regions
          4. Larger nodes are more stable; smaller nodes reduce the impact of a single failure
          5. Storage
            1. Use separate volumes for /datastore and /log
          6. Security
            1. A certificate authority (CA) certificate and key, used to sign all of the other certificates.
            2. A separate certificate and key for each node in your deployment, with the common name node.
            3. A separate certificate and key for each client and user you want to connect to your nodes, with the common name set to the username. The default user is root.

          7. Load Balancing: HAProxy with Keepalived
          8. Clock synchronization
            1. While serializable consistency is maintained regardless of clock skew, skew outside the configured clock offset bounds can result in violations of single-key linearizability between causally dependent transactions. It's therefore important to prevent clocks from drifting too far by running NTP or other clock synchronization software on each node.




          TiDB


          https://github.com/pingcap/docs/blob/7eb32c7363e62bbef787e7bf674924b95bc1af37/transaction-isolation-levels.md TiDB implements Snapshot Isolation (SI) consistency, which it advertises as REPEATABLE-READ

          The main reason TiKV uses an LSM-tree rather than a B-tree as its underlying storage engine is that it is much easier to use caching to improve read performance than to improve write performance.




          Distribute a single file to 10k servers using a DHT

          Consistent hash
          Think about how nodes join/leave: cut the ring into segments; adding a new node adds one more cut point, and the new node takes over the data between that point and the previous point; the next k-1 nodes also need to copy this segment's data (replication factor k)
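
          A minimal consistent-hash ring sketch (illustrative; no virtual nodes): each key is owned by the first node clockwise from its hash, and with replication factor k the next k-1 distinct nodes hold copies, which is why a join only moves one segment plus its replicas.

            import bisect, hashlib

            class ConsistentHashRing:
                def __init__(self, nodes=(), replicas=3):
                    self.replicas = replicas
                    self.ring = []                     # sorted list of (hash, node)
                    for n in nodes:
                        self.add_node(n)

                @staticmethod
                def _hash(value):
                    return int(hashlib.md5(value.encode()).hexdigest(), 16)

                def add_node(self, node):
                    bisect.insort(self.ring, (self._hash(node), node))

                def nodes_for(self, key):
                    # First owner is the node clockwise from the key's hash;
                    # the next replicas-1 distinct nodes also store the key.
                    if not self.ring:
                        return []
                    idx = bisect.bisect(self.ring, (self._hash(key), ""))
                    owners = []
                    for i in range(len(self.ring)):
                        node = self.ring[(idx + i) % len(self.ring)][1]
                        if node not in owners:
                            owners.append(node)
                        if len(owners) == self.replicas:
                            break
                    return owners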



          etcd



          etcd is designed to reliably store infrequently updated data and provide reliable watch queries. etcd exposes previous versions of key-value pairs to support inexpensive snapshots and watch history events (“time travel queries”). A persistent (B+ tree), multi-version, concurrency-control data model is a good fit for these use cases.

          Deletion is adding a tombstone

          Each transaction/ keyspace modification will create a revision. The key of key-value pair is a 3-tuple (major, sub, type). Major is the store revision holding the key. Sub differentiates among keys within the same revision. The value of the key-value pair contains the modification from previous revision, thus one delta from previous revision. The b+tree is ordered by key in lexical byte-order. Ranged lookups over revision deltas are fast; this enables quickly finding modifications from one specific revision to another.
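
          A toy illustration of the revision-keyed MVCC idea above (not etcd's actual code or layout): every modification gets a monotonically increasing revision, entries stay sorted by (revision, sub), and a range scan over revisions cheaply answers "what changed since revision N"; a None value plays the role of a tombstone.

            from bisect import bisect_left, insort

            class MVCCStore:
                def __init__(self):
                    self.rev = 0
                    self.entries = []   # sorted list of ((major_rev, sub), (key, value_or_None))

                def txn(self, ops):
                    # ops: list of (key, value); value None records a delete (tombstone).
                    self.rev += 1
                    for sub, (key, value) in enumerate(ops):
                        insort(self.entries, ((self.rev, sub), (key, value)))
                    return self.rev

                def changes_since(self, rev):
                    # All modifications with revision > rev (watch / "time travel" style query).
                    start = bisect_left(self.entries, ((rev + 1, -1), ("", "")))
                    return self.entries[start:]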


          Kafka

          How is Kafka's durability?


          Prometheus


          Prometheus needs manual management for scalability: https://prometheus.io/docs/introduction/faq/#i-was-told-prometheus-doesnt-scale


          Ideal Architecture

          Service level design

          1. What is the product use scenario, and what needs to be traded off? What would piss off our users?
          2. Do we need storage? Uber driver location does not need storage
          3. How to scale storage? Sharding and replication
            1. how to find the sharded storage? Needs to index them somewhere
          4. Is AppServer stateless? Otherwise, why?
            1. Principle: workers only ask for tasks they can handle; add a scheduler to coordinate task assignment. e.g., one worker can handle Ohio state, but another worker can only handle NYC tasks
            2. Transactions. Cut super large transactions into stages/more services; add a watcher to record transaction progress
            3. Needs locality for high performance? How high is the locality needs? Can we move these localities/business logic into storage layer?
          5. How about join operations between workers? Has to cut it into multiple stages, like Big Data?
          6. Routing: always route to nearest DC; always determine locality as early as possible

          Host level design

          Extremely performant Linux web service:
          1. How to exhaust multi-processor's performance:
            1. multi-processing vs multi-threading: depends on how many resources are shared
            2. Python asyncio: ensure each operation is asynchronous, then spawn 32 processes (see the sketch after this list)
            3. Goroutines: pro for developer velocity, forget about concurrency cost, just do synchronous operations
          2. Threading model: N:1, kernel schedules at process level, process schedules its threads at its own discretion; 1:1, kernel schedules at thread level. Hybrid model
          3. Linux web programming: multi-processing using epoll; the parent process dispatches accepted fds
          4. fork() is dangerous if the parent holds a MySQL connection fd: the child inherits it and can keep it from being closed
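
          One common way to combine asyncio with multiple processes, as mentioned in item 1.2 (a Linux-only sketch; it uses SO_REUSEPORT so the kernel spreads accepted connections across workers, an alternative to the parent-dispatches-fds approach in item 3):

            import asyncio, os, socket
            from multiprocessing import Process

            async def handle(reader, writer):
                await reader.readline()                      # read (and ignore) the request line
                writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
                await writer.drain()
                writer.close()

            def worker(port):
                async def main():
                    # Every worker binds the same port; SO_REUSEPORT makes the kernel
                    # load-balance incoming connections across the processes.
                    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
                    sock.bind(("0.0.0.0", port))
                    server = await asyncio.start_server(handle, sock=sock)
                    async with server:
                        await server.serve_forever()
                asyncio.run(main())

            if __name__ == "__main__":
                procs = [Process(target=worker, args=(8080,)) for _ in range(os.cpu_count() or 4)]
                for p in procs:
                    p.start()
                for p in procs:
                    p.join()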




          Consistency: All clients see the same most recent write or error
          Availability: All reads contain data without error, but may not be most recent
          Partition Tolerance: The system works despite network failures





          Distributed Systems





          ACID


          PolarDB


          RDS Provisioned IOPS is ~ 100k IOPS


          A design philosophy shared by Alibaba Cloud PolarDB and Amazon Aurora is to give up the multi-writer concurrent OLTP support of general-purpose distributed databases and adopt a one-writer, many-readers architecture

          The primary node and the read-only nodes use Active-Active failover, so compute resources are fully utilized; because they use shared storage, they share the same copy of the data



          PolarDB's design introduces several major innovations. First, a purpose-built filesystem is used to read and write the redo log, a specific kind of WAL I/O. Second, database files and redo log files are placed on shared storage, reached over a high-speed network with an efficient protocol, avoiding repeated long-path I/O; compared with the binlog approach this is more elegant

          The primary node and read-only nodes use Active-Active failover to provide a highly available DB service. The DB's data files, redo log, and so on go through a user-space filesystem, are routed by the block-device data-management layer, and are transferred over a high-speed network with RDMA to remote Chunk Servers. DB servers only need to synchronize redo-log-related metadata among themselves. Chunk Server data is kept in multiple replicas to ensure reliability, with consistency guaranteed by the Parallel-Raft protocol.

          Shared-Disk Architecture

          PolarDB adopts a shared-disk architecture, fundamentally because of the compute/storage separation described above. Logically, all DB data lives on chunk storage servers that every DB server can access. Inside the storage service, the data is actually split into chunks so that I/O can be served concurrently by multiple servers


          Physical Replication

          MySQL's binlog records row-level (tuple) data changes. The InnoDB engine layer, to support transactional ACID, also maintains a redo log, which records modifications to physical file pages. As a result, a single MySQL transaction by default requires at least two fsync() calls to persist logs, which directly hurts transaction response time and throughput. MySQL uses group commit to improve throughput under high concurrency, but that cannot completely remove the I/O bottleneck.

          In addition, because a single database instance has limited compute and network bandwidth, a typical way to scale out is to add multiple read-only instances to share the read load. By putting the database files and redo log on shared storage, PolarDB neatly solves the data-replication problem between read-only nodes and the primary. Since the data is shared, adding a read-only node no longer requires a full data copy: all nodes share one copy of the full data and the redo log, and only metadata needs to be synchronized, with basic MVCC support to keep reads consistent. When the primary fails and a failover occurs, switching to a read-only node can complete in under 30 seconds, further strengthening high availability. Moreover, the data lag between read-only nodes and the primary can be reduced to the millisecond level.

          From a concurrency standpoint, binlog replication can currently only be parallelized at the table level, while physical replication works at the data-page level: finer granularity and more efficient parallelism.

          Finally, a benefit of using the redo log for replication is that the binlog can be turned off to reduce its performance impact, unless the binlog is still needed for logical disaster recovery or data migration.

          In general, the lower in the I/O path you work, the easier it is to decouple from upper-layer business logic and state, reducing system complexity. Moreover, this large-file WAL/redo-log I/O pattern is a great fit for the concurrency mechanisms of a distributed filesystem, improving PolarDB's concurrent read performance.

          RDMA over High-Speed Networks

          RDMA generally requires network equipment that supports high-speed connections (switches, NICs, etc.). Applications talk to the NIC driver through a specific programming interface, and data is typically moved with zero-copy techniques between the NIC and remote application memory with high efficiency and low latency, without interrupting the CPU to copy data between kernel space and user space. This greatly reduces performance jitter and raises overall system throughput.

          Snapshot-Based Physical Backup

          A snapshot is a popular backup scheme based on block storage devices. In essence it uses copy-on-write: it records metadata changes of the block device, copies blocks when writes occur, and applies the writes to the newly copied blocks, so that data can later be restored to the snapshot point in time. A snapshot is a typical deferred-processing mechanism driven by time and write load: when a snapshot is created, no data is actually backed up; instead the backup work is spread across the time window in which writes actually happen after the snapshot, making both backup and restore respond quickly. PolarDB combines snapshots with the redo log, making point-in-time recovery of user data more efficient than the traditional approach of a full backup plus incremental binlog data.


          Amazon Aurora


          storage is independent of computing

          The Aurora shared storage architecture makes your data independent from the DB instances in the cluster. For example, you can add a DB instance quickly because Aurora doesn't make a new copy of the table data. Instead, the DB instance connects to the shared volume that already contains all your data. You can remove a DB instance from a cluster without removing any of the underlying data from the cluster. Only when you delete the entire cluster does Aurora remove the data.


          As a result, all Aurora Replicas return the same data for query results with minimal replica lag. This lag is usually much less than 100 milliseconds after the primary instance has written an update. Replica lag varies depending on the rate of database change. That is, during periods where a large amount of write operations occur for the database, you might see an increase in replica lag.

          Since the primary and the replicas use the same storage (the cluster volume), why is there replication lag between them?

          Because Aurora Replicas read from the same cluster volume as the writer DB instance, the ReplicaLag metric has a different meaning for an Aurora PostgreSQL DB cluster. The ReplicaLag metric for an Aurora Replica indicates the lag for the page cache of the Aurora Replica compared to that of the writer DB instance.

          Because each replica uses its own independent page cache rather than reading directly from disk. Conjecture: the writer instance's changes are replicated directly to the reader instance's page cache


          In the unlikely event of a database failure, the page cache remains in memory, which keeps current data pages "warm" in the page cache when the database restarts. This provides a performance gain by bypassing the need for the initial queries to execute read I/O operations to "warm up" the page cache.


          MicroServices


          Coding

          Blocked-keyword filtering with substitution characters {"o": [], "1": ["l"]}: a trie (see the sketch below)
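
          A sketch of the trie-based filter with the substitution map (names are mine; the map below mirrors the note, so a '1' in the input may stand in for 'l'):

            class TrieFilter:
                def __init__(self, words, substitutions):
                    self.subs = substitutions
                    self.root = {}
                    for w in words:
                        node = self.root
                        for ch in w:
                            node = node.setdefault(ch, {})
                        node["$"] = True               # end-of-word marker

                def _variants(self, ch):
                    # The character itself plus anything it may be substituting for.
                    return [ch] + self.subs.get(ch, [])

                def contains_blocked(self, text):
                    text = text.lower()
                    for i in range(len(text)):
                        frontier = [self.root]         # walk the trie starting at position i
                        for ch in text[i:]:
                            next_frontier = []
                            for node in frontier:
                                for v in self._variants(ch):
                                    child = node.get(v)
                                    if child:
                                        if "$" in child:
                                            return True
                                        next_frontier.append(child)
                            if not next_frontier:
                                break
                            frontier = next_frontier
                    return False

            # Example with the substitution map from the note:
            f = TrieFilter(["hello"], {"o": [], "1": ["l"]})
            print(f.contains_blocked("say he11o"))     # True: '1' can stand for 'l'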

          System Design



          Design a URL shortener


          Implement a chat room, with various features