Monday, November 27, 2017

Site Reliability Engineering - How Google Runs Production Systems

Ch10: Practical Alerting from Time-Series Data

Borgmon instances can stream time-series data to one another: an upper tier filters the data from the tier below and aggregates on it. The aggregation hierarchy thus provides metrics at different granularities. (DC -> campus -> global)

Ch19: Load Balancing at the Frontend

Traffic types differ: search requests care about latency, while video requests care about throughput.
  • DNS layer LB
    • Clients don't know the closest IP -> anycast -> a map of all networks' physical locations (but how do you keep it updated?)
    • Intermediate DNS resolvers cache responses (we don't know how many users sit behind each cache) (a low TTL is needed for fast propagation)
    • DNS needs to know each DC's capacity
  • Virtual IP layer LB: LVS with GRE packet encapsulation (the original IP packet inside)

ch20: Load Balancing in the Datacenter

  1. Active-request limit: the client marks a backend as bad if its backlog of active requests is high. But a client issuing many long-running queries will trip the limit quickly even when the backends are healthy
  2. Backend has states: healthy, unresponsive, lame duck (stop sending requests to me)
  3. Subsetting: each client talks to only a limited set of backends, preventing a single "backend killer" client from overwhelming every backend (see the sketch after this list)
  4. Weighted Round Robin on backend-provided load information
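
A minimal sketch of the deterministic subsetting idea from item 3 above (the round-based shuffle and the names here are illustrative assumptions, not Google's exact algorithm): every client in the same "round" shuffles the backend list identically and takes a disjoint slice, so connections stay evenly spread while each client keeps a small, stable subset.

    import random

    def deterministic_subset(backends, client_id, subset_size):
        # Clients are grouped into rounds; each round uses one deterministic shuffle,
        # and every client in the round takes a disjoint slice of that shuffle.
        subset_count = len(backends) // subset_size
        round_id = client_id // subset_count
        shuffled = list(backends)
        random.Random(round_id).shuffle(shuffled)   # same order for the whole round
        subset_id = client_id % subset_count
        start = subset_id * subset_size
        return shuffled[start:start + subset_size]

    # e.g. deterministic_subset(["be%d" % i for i in range(300)], client_id=42, subset_size=30)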

ch21: Handling Overload

  • Client-side adaptive throttling: reject locally with probability P_reject = max(0, (requests - K * accepts) / (requests + 1)) (sketch after this list)
  • Request is tagged with priority
  • When to retry
    • per-request retry budget is 3
    • per-client retry budget is 10% of requests (so total load stays within ~110%, avoiding a 3X retry storm)
    • backend returns "overloaded; don't retry" if its histogram of retry counts shows heavy retrying
  • Put a batching proxy in front if too many clients hold connections to the backends
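
A sketch of the adaptive throttling formula above (a simplified assumption: plain running counters instead of a trailing time window, and K = 2 as a reasonable default):

    import random

    class AdaptiveThrottler:
        # Rejects requests locally with probability
        # max(0, (requests - K * accepts) / (requests + 1)).
        def __init__(self, k=2.0):
            self.k = k
            self.requests = 0
            self.accepts = 0

        def should_reject_locally(self):
            p_reject = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
            return random.random() < p_reject

        def record_result(self, accepted):
            self.requests += 1
            if accepted:
                self.accepts += 1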

ch23: Managing Critical State: Distributed Consensus for Reliability

Whenever you see leader election, critical shared state, or distributed locking, use a distributed consensus system.
Different algorithms: crash-fail or crash-recover? Byzantine or non-Byzantine failures?
The fundamental building block is the replicated state machine (RSM): a system that executes the same set of operations, in the same order, on several processes.
The consensus algorithm deals with agreement on the sequence of operations, and the RSM executes the operations in that order.
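
A toy sketch of the RSM idea (illustrative only; the log stands in for whatever sequence the consensus layer has already agreed on):

    class ReplicatedStateMachine:
        def __init__(self):
            self.state = {}
            self.applied = 0           # index of the next committed entry to apply

        def apply_committed(self, log):
            # Apply, in order, every agreed-upon entry not applied yet.
            for op, key, value in log[self.applied:]:
                if op == "set":
                    self.state[key] = value
                elif op == "delete":
                    self.state.pop(key, None)
                self.applied += 1
            return self.state

    # Two replicas fed the same log converge to the same state:
    log = [("set", "a", 1), ("set", "b", 2), ("delete", "a", None)]
    assert ReplicatedStateMachine().apply_committed(log) == ReplicatedStateMachine().apply_committed(log)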

What can an RSM be used to build?
  • Replicated Datastores and Configuration Stores: the RSM provides the consistency semantics. Other (non-distributed-consensus-based) systems rely on timestamps (see Spanner)
  • Highly Available Processing: Leader election of GFS to ensure only one leader for coordinating workers
  • A barrier: blocking a group of processes from proceeding until some condition is met -> MapReduce's phases and distributed locking
  • Atomic broadcast: for queueing, ensure messages are received reliably and in the same order

Performance

  • Network RTT: use regional proxies to hold persistent TCP/IP connections, and it could also serve for encapsulating sharding and load balancing strategies, as well as discovery of cluster members and leaders
  • Fast Paxos: each client sends its Propose directly to every member
  • Stable Leaders: but leader will be a bottleneck
  • Batching: combine multiple client operations into a single transaction
  • Disk: consider batching writes / combining multiple transaction logs into a single log

ch24: Distributed Periodic Scheduling with Cron

  1. Decouple processes from machines: sending RPC requests instead of launching jobs
  2. Tracking the state of cron jobs:
    1. where
      1. Store data externally like GFS or HDFS
      2. Store internally as part of the cron service (3 replicas); store snapshots locally, do not store logs on GFS (too slow)
    2. what
      1. when it is launched
      2. when it has finished
  3. Use Paxos:
    1. A single leader launches cron jobs (or delegates to another DC scheduler), and only launches a job after the Paxos quorum has acknowledged it (synchronously). Is that slow?
    2. Followers need to track each job's finish time in case the leader dies
  4. Resolving partial failures:
    1. all external operations must be idempotent, or their state must be stored externally and unambiguously
    2. must record the scheduled launch time to prevent double scheduling. Trade-off between risking a double launch and risking a skipped launch (see the sketch below)
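
A sketch of the "record the intended launch before launching" choice (store and launch_rpc are hypothetical stand-ins for a quorum-replicated state store and an idempotent launch RPC); this ordering risks skipping a launch after a crash, while recording only after launching would risk a double launch instead:

    class CronLeader:
        def __init__(self, store, launch_rpc):
            self.store = store            # hypothetical quorum-replicated job state
            self.launch_rpc = launch_rpc  # idempotent RPC that actually starts the job

        def maybe_launch(self, job_name, scheduled_time):
            last_recorded = self.store.get(job_name)   # last launch time recorded via quorum
            if last_recorded is not None and last_recorded >= scheduled_time:
                return                                 # this tick was already claimed
            self.store.put(job_name, scheduled_time)   # synchronously record intent first
            self.launch_rpc(job_name, scheduled_time)  # idempotent: safe to repeat after a crash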

ch25: Data Processing Pipelines

How to implement a periodic pipeline? (latency, throughput, fast startup, even work distribution, thundering-herd issues)

Google Workflow uses model-view-controller pattern
  • Task Master (model): holds all job states (pointers) in RAM; the actual input/output data is stored in a common HDFS
  • Workers(view): stateless
  • Controller: auxiliary system activities: runtime scaling, snapshotting, workcycle state control, rolling back pipeline state
How to guarantee correctness?
  • Worker output through configuration tasks creates barriers on which to predicate work
  • All work committed requires a currently valid lease held by the worker
  • Output files are uniquely named by the workers
  • The client and server validate the Task Master itself by checking a server token on every operation
Failure model:
  • The Task Master stores journals on Spanner, using it as a globally available, globally consistent, low-throughput filesystem
  • Use Chubby to elect writer and persist result in Spanner
  • Globally distributed Workflows employ 2+ local Workflows running in distinct clusters
  • Each task has a "heartbeat" peer; if the peer times out, this task resumes the peer's work instead

Monday, September 11, 2017

Network

  • Network fundamentals series (网络基本功系列:细说网络那些事儿)
  • https://github.com/alex/what-happens-when#dns-lookup (how a browser reaches Google)
  • Proxy Digest Authentication?
  • Why TChannel?
    • Why not HTTP? HTTP connections are uni-directional, can only be used for one concurrent request, and are built for delivering web pages with cache hints. HTTP tightly couples encoding and transport concerns (the path implies both the procedure name and sometimes even parts of the request body), and has a single header namespace with variable widths. It is not ideal for RPC, but it's fantastic for web pages.
    • HTTP2:
    • HTTP2 vs WebSocket
      • WebSocket is bidirectional; HTTP2 is client/server + server push
    • TChannel: RPC, multiplexing, bi-directional message passing, fast forwarding, work shedding with deadlines, cancellation, speculative execution, and communicates more clearly about when and when not to retry.
  • Improve Web Speed: client -> DNS resolver -> DNS -> DNS resolver -> client --req--> Web server
    • Speed = DNS latency + request latency. Which DNS provider has the lowest latency? It can't be measured directly because the DNS resolver delegates the name resolution. Solution: create a special hostname to resolve (avoiding caches); DNS latency = server access time - special-hostname resolve time (see the sketch below)
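
A rough sketch of the special-hostname trick (the probe domain is hypothetical; in practice you would use a unique label under a zone you control so every probe misses the resolver cache):

    import socket, time, uuid

    def resolve_ms(hostname):
        start = time.monotonic()
        try:
            socket.getaddrinfo(hostname, 80)   # goes through the configured resolver
        except socket.gaierror:
            pass                               # NXDOMAIN is fine; we only time the lookup
        return (time.monotonic() - start) * 1000

    # A unique label per probe defeats resolver caching (hypothetical domain):
    probe = uuid.uuid4().hex + ".probe.example.com"
    print("uncached DNS latency: %.1f ms" % resolve_ms(probe))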

DNS resolvers


The DNS server configured on a client (e.g., 8.8.4.4) is usually a recursive resolver: it takes full responsibility for the client's query until it returns a final answer. Between DNS servers themselves, queries are usually iterative.

  1. clear the DNS cache (macOS): sudo killall -HUP mDNSResponder
  2. check DNS client: sudo lsof -i -P | grep LISTEN
  3. http://whoismydns.com/Index.html
  4. dig @127.0.0.1 -p 53 google.com +tcp +trace





TCP

  • Connection Reset: the peer rejects it with an RST to let you know it isn't listening.
  • A successful send() only means the data was copied into the kernel's socket send buffer, not that it was transmitted or received by the peer

Load Balancer

Extra reading

GFW


Thursday, September 7, 2017

Gödel, Escher, Bach

Gödel's incompleteness theorems


  • Two theorems
  • Hilbert's program: its main goal was to provide a secure foundation for all of mathematics. Specifically, this foundation should include:
    • Formalization of all mathematics: all mathematical statements should be written in a single, rigorously formal language and manipulated according to a fixed set of rules.
    • Completeness: we must prove that, after formalization, every true mathematical statement can be proved (by the rules above).
    • Consistency: we must prove that no contradiction can be derived within this formalism and its rules.
    • Conservation: we must prove that any result about "real objects" obtained using "ideal objects" (such as uncountable sets) can also be proved without them.
    • Decidability: there should be an algorithm for deciding whether each formalized statement is true or false.
  • Hilbert's second problem, one of his 23 problems, concerns the consistency of an axiom system: deciding whether the propositions within an axiomatic system are mutually consistent. Hilbert hoped to prove, in a rigorous way, the consistency of any axiomatic system.

        Monday, August 14, 2017

        Linux & OS & Linux Kernel Development & Computer Systems: A Programmer's Perspective

        Unix / Linux systems and internals

        Preparations

        Operating Systems: Know about processes, threads and concurrency issues. Know about locks and mutexes and semaphores and monitors and how they work. Know about deadlock and livelock and how to avoid them. Know what resources a process needs, and a thread needs, and how context switching works, and how it's initiated by the operating system and underlying hardware. Know a little about scheduling. The world is rapidly moving towards multi-core, so know the fundamentals of "modern" concurrency constructs.

        Linux internals
        • Process Execution and/or Threads
        • Memory Usage
        • RAID Levels
        • The kernel and how it interacts with other system components
        • System Calls
        • Signals and Signal Handlers
        • Modern Web Architectures and Webservers
        Preparations:
        • Review userspace / Kernel space boundaries and interactions.
          Examples might include: ioctls, sysctls, context switches.
        • Review troubleshooting tools for system-level performance issues.
        • Review troubleshooting tools for debugging application-level performance issues or bugs.
        file system processing, file properties, user permissions, text parsing



        Linux

        • Process
        • Context switch
          • Process context switch: need to switch the virtual memory mapping, which requires flushing the TLB (expensive)
          • User mode to kernel mode: a mode transition, usually much cheaper than a full process context switch
        • Kernel
          • How the Kernel Manages Your Memory
          • Kernel can preempt a task running in the kernel so long as it does not hold a lock. Because the kernel is SMP-safe, if a lock is not held, the current code is reentrant and capable of being preempted
        • Permissions
          • chmod
            • sticky bit, 1, t: in a directory, only a file's owner (or root) has the final say; others cannot delete it
            • setgid, 2, s: new files in the directory inherit the directory's group ownership
            • suid, 4, s
            • chmod -R a+rX . makes all subdirectories searchable (regular files only retain their existing executable permission)
          • chattr
            • i, immutable
          • filetype
          • umask: mask that determines the default permissions of newly created files
        • TCP/UDP packet size: 64KB, MTU: 1500 B
        • Zero Copy: From kernel to NIC, no user space
        • https://fabiokung.com/2014/03/13/memory-inside-linux-containers/
        • executing machine code in memory: Doable, think JIT runtime and python interpreter
        • HDD vs SSD:
          • An SSD's sequential-read advantage over a spinning disk is not dramatic; SSDs shine on read-heavy and random read/write workloads
          • erase-before-write: an SSD must erase a block before writing to it
          • Because SSDs suffer from write wear, repeatedly erasing and rewriting the same cells (e.g., Oracle redo) both hurts write performance and greatly shortens the drive's life, so a load-balancing algorithm is needed to ensure every cell is used evenly; this is wear leveling -- combined with offline erase
          • Traditional database logging is sequential logging (repeated writes to consecutive locations); this wears an SSD out quickly, so SSDs favor in-place logging: data and log in the same block; log files are still a better fit for HDDs
          • When an SSD is used as a flashcache, is its data valid after a power loss? It is treated as invalid
        • 4K alignment is a technique that aligns the filesystem layout with the disk's physical layer, improving drive lifetime and space efficiency; it is named for aligning each filesystem cluster to 4096-byte physical sectors. Traditional mechanical hard drives generally use 512-byte sectors

        Filesystem

        Linux Kernel Development (《Linux内核设计与实现》)

        • Overview
          • In Linux, each processor is doing exactly one of three things at any given moment
            • In user-space, executing user code in a process
            • In kernel-space, in process context, executing on behalf of a specific process
            • In kernel-space, in interrupt context, not associated with a process, handling an interrupt
              • the top half runs with interrupts disabled
          • Kernel memory is not pageable
        • Process Management
          • Process includes
            • open files
            • pending signals
            • internal kernel data
            • processor state
            • memory address space (multiple memory mappings) page tables
            • data section containing global variables
          • Thread includes
            • unique program counter
            • process stack
            • set of processor registers
          • The kernel stores the list of processes in a circular doubly linked list called the task list.
          • Fork() only overhead: duplication of the parent's page tables & creation of a unique process descriptor for the child
            • ZOMBIE only has kernel stack, thread_info structure, and task_struct structure
            • When a task is ptraced, it is temporarily reparented to the debugging process
          • kernel threads do not have an address space (not normal process)
        • Process Scheduling: low latency (process response time) and high throughput (system utilization)
          • Context switch, the switching from one runnable task to another:
            • calls switch_mm() to switch virtual memory mapping
            • calls switch_to() to switch the processor state: saving/restoring stack information and processor registers and others
          • User preemption can occur:
            • when returning to user-space from a system call
            • when returning to user-space from an interrupt handler
          • Kernel preemption: Linux 2.6 is a fully preemptive kernel; it is possible to preempt a task at any point, so long as the kernel is safe to reschedule.
            • NO LOCK: locks are used as markers of regions of nonpreemptibility
            • scheduler_tick() will check flag need_resched (which is per-process)
          • Kernel preemption can occur:
              • When an interrupt handler exits, before returning to kernel-space
              • When kernel code becomes preemptible again (NO LOCK, preempt_count is 0)
              • If a task in the kernel explicitly calls schedule()
              • If a task in the kernel blocks (which results in calling schedule())
          • System Calls
            • syscall (interrupt vector 128): an exception or trap to enter the kernel
            • syscall must be reentrant
          • Interrupts and Interrupt Handlers
            • Why interrupt? Processor polling incurs overhead; an interrupt is how hardware signals the kernel to get its attention
            • Interrupt vs Exception:
              • interrupt handlers (top halves) are executed by the kernel asynchronously in immediate response to hardware interrupt
              • exceptions occur synchronously with respect to the processor clock, so they are called synchronous interrupts (e.g., divide by zero, a page fault)
            • Special interrupt context: it is called atomic context as code executing in this context is unable to block (Top Halves)
            • Interrupt handlers can form only the first half of any interrupt processing solution due to these limitations:
              • Interrupt handlers run asynchronously and thus interrupt other, potentially important, code, including other interrupt handlers. (need to run fast)
              • Interrupt handlers run with at best the current interrupt level disabled, and at worst all interrupts on the current processor disabled. But disabling interrupts prevents hardware from communicating with the operating system
              • Interrupt handlers do not run in process context; therefore, they cannot block
              • They are often timing-critical
          • Kernel Synchronization
            • With only a single processor, data could be accessed concurrently only if an interrupt occurred or if kernel code explicitly rescheduled and let another task run
            • Causes of concurrency:
              • Interrupts: An interrupt can occur asynchronously at almost any time, interrupting the currently executing code
              • Softirqs and tasklets: The kernel can raise or schedule a softirq or tasklet at almost any time, interrupting the currently executing code
              • Kernel preemption: Because the kernel is preemptive, one task in the kernel can preempt another
              • Sleeping and synchronization with user-space: A task in the kernel can sleep and thus invoke the scheduler, resulting in the running of a new process
              • Symmetrical multiprocessing: Two or more processors can execute kernel code at exactly the same time (per CPU memory)
            • Lock data not code
            • Deadlock
              • Implement lock ordering
              • Prevent starvation. Ask yourself, does this code always finish? If foo does not occur, will bar wait forever?
              • Do not double acquire the same lock
              • Allow only one process to move at a time (prevents livelock)
          • Memory Management
            • A page may be used by page cache, private data, or as a mapping in a process's page table
            • the kernel page structure is associated with physical pages, not virtual pages
            • high memory: physical memory that the architecture cannot permanently map into the kernel address space

          Computer Systems: A Programmer's Perspective (《深入理解计算机系统》)


          Linking
          • Static Linking: ld copies functions from AR file
          • Dynamic Linking:
            • At loading:
              • compile to get partial executable object file from shared object (mark symbols relocatable)
              • execve() runs ld-linux.so, which relocates the target object file and the relocatable shared object libc.so, relocating text and data
            • At runtime: dlopen() loads functions from a shared library directly into the process's memory (part of SHR in top is shared-library pages)
          • 3 kinds of object files:
            • Executable object file
            • Relocatable object file
              • Shared object (shared library)
          Exception handling
          • interrupt (keyboard), trap, fault (divide by zero), abort (machine check)
          • syscall is trap 128
          • User vs kernel mode is controlled by a mode bit in a control register
          • context: general-purpose registers, floating-point registers, PC, user stack, status register, kernel stack, page table, process table, file table
          • Shortcomings of signals:
            • pending signals of the same type are blocked
            • pending signals are not queued: at most one pending signal per type (queue = 1)
            • system calls can be interrupted: read may return EINTR
          • fork() vs execve()
            • fork marks every page of parent and child read-only, and marks each section as private copy-on-write
            • execve: delete the existing user areas -> map private areas (text, data, bss, stack, heap) -> map shared areas (libc.so) -> set the PC
              • execve inherits open file descriptors
          • Data structures for open files
            • descriptor table: per process, points into the file table
            • file table: shared by all processes; holds the current file position, a reference count (how many descriptors point to it), and a pointer into the v-node table
            • v-node table: shared by all processes; file access, size, type


          Linux startup process

          Boot

          1. Power On Self Test 
          2. Boot from CD, USB, HD 
          3. Load the kernel and initramfs into memory
            1. BIOS -> Master Boot Record (first sector) -> GRUB -> active partition -> load the kernel
            2. UEFI (a minimal low-level OS) -> GPT -> /boot/efi/boot.efi -> GRUB/kernel
              1. EFI system partition is based on FAT, contains boot loader or kernel image
            3. Why GRUB has so many stages?
              1. stage 1, boot.img, 446 B, within first sector: to locate stage 2
              2. stage 1.5, core.img, 25 KB, between MBR and first partition: contains a few common filesystem drivers (Not all filesystems)
              3. stage 2, /boot/grub2, The kernels are located in the /boot directory, along with an initial RAM disk image, and device maps of the hard drives
          4. Why can't the BIOS load the kernel directly? It has no filesystem driver for addressing the kernel on disk
          Startup
          1. Kernel loads systemd/init, then mounts filesystems in /etc/fstab
          2. Kernel calls start_kernel() to set up system functions: hardware and memory paging, interrupt handlers, the rest of memory management, device and driver initialization
            1. idle process: power saving mode
            2. scheduler: Completely Fair Scheduler, O(log N)
            3. init/systemd who mounts /etc/fstab
          3. init starts services for the configured runlevel
          4. display manager and login manager -> session manager
          Shutdown
          1. Close down user space functionality
          2. init terminates
          3. kernel shutdown



          Monday, July 17, 2017

          Cassandra

          Cassandra
          Node repair 2.1
          • Frequent data deletions and downed nodes are common causes of data inconsistency. Manual repair: Anti-entropy repair
          • Parallel repair will only repair each token range ONCE. If you are using “nodetool repair -pr” you must run it on EVERY node in EVERY data center; no skipping allowed.

          Thursday, May 11, 2017

          한국어 (Korean)

          처음 뵙(拜)겠습니다 Nice to meet you (first meeting)
          잘 부탁(付托)드리겠습니다 Please take good care of me
          천만(千万) 에요 Not at all / you're welcome
          ? A table for two
          저 미안(未安)합니다만 Excuse me
          메뉴를 보여 주세요 Please show me the menu
          물 좀 더 주세요 Please give me a little more water
          ? It's all right
          조심(操心)하세요 Take care

          옆을 짧게 깎아 주세요 Please trim the sides short

          Learning Korean through Japanese: https://www.zhihu.com/question/19830338/answer/93071137
          Taeyeon's vocal skills: https://www.zhihu.com/question/21424874/answer/115636610

          Saturday, March 18, 2017

          Quantum Mechanics


          • Reading notes on 上帝掷骰子 (a popular book on quantum physics)
          • 走进量子纠缠 PDF (on quantum entanglement)
          • Introduction to quantum mechanics
            • The uncertainty principle
          • The EPR paradox
            • Hidden-variable theory: Einstein's interpretation, because he believed in locality. Bell proposed Bell's inequality intending to support Einstein, but it ended up showing that quantum mechanics is not local
            • Wave-function collapse theory
          • Interpretations of quantum mechanics
            • Einstein
              • Locality. Faster than light? Quantum mechanics has no local realism
                • Would the time reversal of special relativity break causality?
              • Realism: "Does the moon still exist when no one is looking at it?" Realists, following Plato, believe that everything we take as real is only an approximation of true existence; the world our senses perceive is only a projection of reality, not reality itself
              • Determinism
            • Wave-function collapse
              • The many-worlds interpretation holds that the wave function never collapses because of observation
              • Observation causes collapse: as Paul Davies put it, "reality is in the observations, not in the electron"
          • Applications: quantum computing, quantum optics, quantum communication

          Friday, March 10, 2017

          Finance and Economics

          The Impossible Trinity (Trilemma)

          • Free movement of capital (capital mobility)


          Wednesday, March 1, 2017

          Distributed System Design

          Unix / Linux systems and internals
          https://tasteturnpike.blogspot.com/2021/04/database.html
          http://tasteturnpike.blogspot.com/2017/09/network.html
          https://tasteturnpike.blogspot.com/2017/11/site-reliability-engineering-how-google.html

          System Design
          https://puncsky.com/hacking-the-software-engineer-interview/?isAdmin=true
          https://github.com/checkcheckzz/system-design-interview
          http://www.aosabook.org/en/distsys.html

          Companies
          http://highscalability.com/blog/2015/9/14/how-uber-scales-their-real-time-market-platform.html
          https://www.quora.com/What-is-YouTubes-architecture

          Estimation / back-of-the-envelope calculations: http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html

          Design Paradigms




          1. Compare 2PC to Paxos. What is different? What is similar? Will you ever need both together in a system?
          2. Raft
          3. Clock


          CockroachDB: The Resilient Geo-Distributed SQL Database


          1. shared-nothing architecture: all nodes are used for both data storage and computation
          2. single node is layered architecture:
            1. SQL: parser, optimizer, SQL execution engine (convert SQL to low-level read write requests to KV store)
            2. Transactional KV: ensures atomicity of changes spanning multiple KV pairs; isolation guarantee
            3. Distribution: a monolithic logical key space ordered by key. chunk = Range = 64MB. The Distribution layer is responsible for identifying which Ranges should handle which subset of each query, and routes the subsets accordingly.
            4. Replication: consensus-based replication
            5. Storage: RocksDB -> Pebble
          3. Fault Tolerance and High Availability
            1. Replication using Raft: Replicas of a Range form a Raft group. Each replica individually applies command. Usually Raft group leader acts as the leaseholder. All writes go through the leaseholder. Leases for user Ranges are tied to the liveness of the node the leaseholder is on. Heartbeat is every 4.5 seconds.
            2. Membership changes and automatic load rebalancing. Use peer-to-peer gossip protocol to determine node liveness
            3. Replica placement. Each nodes has its own attributes like capability (CPU, RAM) or locality (country, region)
          4. Transactions
            1. Execution at the transaction coordinator: coordinator receives a series of requested KV operations from the SQL layer. Optimization: batch multi-statement SQL transactions to complete with one replication.
              1. Write Pipelining: if no overlapping, operations on different keys can be "pipelined". if operation depends on earlier in-flight operation, execution must wait for earlier in-flight execution
              2. Parallel Commits: naively, a transaction needs at least two sequential rounds of consensus before it can commit (one for its writes, one for the commit record). Optimization: write the transaction record with a staging status and replicate it in parallel with the verification of the outstanding writes
            2. Execution at the leaseholder:
              1. Acquires latches on the keys and dependencies
              2. Write operations are replicated. After consensus is reached, each replica applies the command to local storage engine
              3. Release latches
            3. Transaction record: A metadata which is a special key stores the current disposition of transaction: pending, staging, committed or aborted
            4. Concurrency Control: A total ordering of all transactions in the system, representing a serializable execution. Conflicts may require commit timestamp adjustments: the transaction typically tries to prove that its prior reads remain valid at the new timestamp. If concurrent transactions collide, one is pushed to a later timestamp to stagger them and still obtain a serializable execution. Transactions need not execute in the order they were born; the system respects their physical arrival time: first to arrive, first to execute
              1. Deadlock: CRDB employs a distributed deadlock-detection algorithm to abort one transaction from a cycle of waiters.
              2. Write-read conflicts: the write came first; the read waits for it to complete
              3. Read-write conflicts: the later read arrived and executed first; the write waits and is moved to a timestamp after the read
              4. Write-write conflicts: the later write arrived and executed first; the earlier write is moved to a timestamp after it
            5. Read Refresh: Certain types of conflicts described above require advancing the commit timestamp of a transaction. To maintain serializability, the read timestamp must be advanced to match the commit timestamp
              1. My understanding: under concurrency we do not need to preserve the real order of requests from different users, but we must preserve the order of a single user's transactions: if one user's read-then-write arrives as write-then-read, the read cannot simply be advanced, which triggers a read refresh
              2. Advancing a transaction’s read timestamp from ta to tb > ta is possible if we can prove that none of the data that the transaction read at ta has been updated in the interval (ta,tb ]. If the data has changed, the transaction needs to be restarted. If no results from the transaction have been delivered to the client, CRDB retries the transaction internally. If results have been delivered, the client is informed to discard them and restart the transaction
              3. To determine whether the read timestamp can be advanced, CRDB maintains the set of keys in the transaction’s read set (up to a memory budget) This involves re-scanning the read set and checking whether any MVCC values fall in the given interval. This process is equivalent to detecting the rw-antidependencies that PostgreSQL tracks for its implementation of SSI. Similar to PostgreSQL, our implementation may allow false positives (forcing a transaction to abort when not strictly necessary) to avoid the overhead of maintaining a full dependency graph
            6. Follower Reads: CRDB allows non-leaseholder replicas to serve requests for read-only queries with timestamps sufficiently in the past. Each leaseholder tracks the timestamps of all incoming requests and periodically emits a closed timestamp, the timestamp below which no further writes will be accepted. Every node keeps a record of its latency with all other nodes in the system. When a node in the cluster receives a read request at a sufficiently old timestamp (closed timestamps typically trail current time by ~2 seconds), it forwards the request to the closest node with a replica of the data.
          5. Clock Synchronization
            1. Hybrid-Logical Clocks (see the sketch after this list):
              1. timestamps that are a combination of physical and logical time
              2. maximum allowable offset defaults to 500 ms
              3. HLCs provide causality tracking through their logical component upon each inter-node exchange. The most important of these is a lease disjointness invariant similar to that in Spanner: for each Range, each lease interval is disjoint from every other lease interval. This is enforced on cooperative lease handoff with causality transfer through the HLC and is enforced on non-cooperative lease acquisition through a delay equal to the maximum clock offset between lease intervals
              4. Strictly monotonic timestamp allocation ensures that two causally dependent transactions originating from the same node are given timestamps that reflect their ordering in real time.
            2. Uncertainty Intervals:
              1. CRDB does not support strict serializability because there is no guarantee that the ordering of transactions touching disjoint key sets will match their ordering in real time. In practice, this is not a problem for applications unless there is an external low-latency communication channel between clients that could potentially impact activity on the DBMS.
              2. CRDB satisfies single-key linearizability for reads and writes.
              3. A transaction can carry an earlier timestamp even though it is not earlier in real time: if the transaction sees a future key within its uncertainty interval, it performs an uncertainty restart
            3. Behavior under Clock Skew: If any node exceeds the configured maximum offset by more than 80% compared to a majority of other nodes, it self-terminates
              1. What if a rogue node's clock is off by an hour and it does not self-terminate? Certificates are used to verify a node's identity
            4. https://www.cockroachlabs.com/blog/living-without-atomic-clocks/
              1. The strongest guarantee of consistency is called “external consistency”. To confuse things further, this is what folks interchangeably refer to as “linearizability” or “strict serializability”: for sequentially occurring [T1, T2], T2 executes later yet carries an earlier timestamp, so when looking back at history it appears that T2 executed first (note that this can only happen when the two transactions access a disjoint set of keys)
                1. If T1 and T2 touch the same key, the conflict will certainly be detected within the clock offset
                  1. Spanner chooses to wait 7ms, within which T2 is guaranteed to arrive. Making every operation wait 7ms is impractical; recent research reduces this to about 1ms
                2. If they share no key, CRDB considers this case unimportant
              2. Spanner (TrueTime) provides linearizability, CockroachDB only goes as far as to claim serializability
              3. Spanner always waits after writes, CockroachDB sometimes retries reads.
              4. Although CockroachDB does not treat linearizability as essential, the commit timestamp matters: it must be later than the clocks on all nodes, otherwise the transaction might fail to read already-committed data - an unacceptable breach of consistency
                1. Spanner simply uses the current TrueTime, which guarantees that currently committed transactions completed at least 7ms earlier
                2. CockroachDB makes use of a “causality token”, which is just the maximum timestamp encountered during a transaction. It’s passed from one actor to the next in a causal chain, and serves as a minimum timestamp for successive transactions to guarantee that each has a properly ordered commit timestamp.
                3. When CockroachDB starts a transaction, it chooses a provisional commit timestamp based on the current node’s wall time. It also establishes an upper bound on the selected wall time by adding the maximum clock offset for the cluster: [commit timestamp, commit timestamp + maximum clock offset]. This time interval represents the window of uncertainty. When it encounters a timestamp later than its own (within that window), it performs an uncertainty restart: bumping the provisional commit timestamp just above the timestamp encountered, while keeping the same upper bound -- Transactions reading constantly updated data from many nodes may be forced to restart multiple times, though never for longer than the uncertainty interval, nor more than once per node.
                  1. If it’s unlucky, the read may have to be retried more than once. There is an upper limit on how many retries can occur based on how clocks are synchronized. For NTP, this could be 250ms, so even the most unlucky transactions won’t have to retry for more than 250ms for clock-related reasons.
          6. SQL
            1. Primary index is keyed on primary key. Secondary indexes are keyed on index key.
            2. Vectorized execution engine: data from disk is transposed from row to column format as it is being read from CRDB’s KV layer, and transposed again from column to row format right before it is sent back to the end user. The overhead of this process is minimal.
            3. Schema Changes: schema change becomes a sequence of incremental changes. In this protocol, the addition of a secondary index requires two intermediate schema versions between the initial and final ones to ensure that the index is being updated on writes across the entire cluster before it becomes available for reads.
          7. Lessons Learned
            1. Raft
              1. Reducing Chatter: Too many heartbeats for a deployment of thousands of consensus groups (one per Range)
                1. combine heartbeat messages into one per node (not one per-RPC)
                2. pause heartbeat from Raft groups with no recent write activity
              2. Joint Consensus: rebalancing replicas across regions can lose availability during a single-region outage (the replica count temporarily becomes 2 or 4)
                1. In Joint Consensus, an intermediate configuration exists, but requires instead the quorum of both the old and new majority for writes.
            2. Removal of Snapshot Isolation: hard to implement simply and correctly
              1. The only safe mechanism to enforce strong consistency under snapshot isolation is pessimistic locking, via the explicit locking modifiers FOR SHARE and FOR UPDATE on queries. To guarantee strong consistency across concurrent mixed isolation levels, CRDB would need to introduce pessimistic locking for any row updates, even for SERIALIZABLE transactions
            3. Multi Version: Propose the effect of request to Raft, not the request itself
          8. Related Work
            1. Spanner: Spanner [4, 17] is a SQL system that provides the strongest isolation level, strict serializability [30]. It achieves this by acquiring read locks in all read-write transactions and waiting out the clock uncertainty window (the maximum clock offset between nodes in the cluster) on every commit. CRDB’s transaction protocol is significantly different from Spanner’s; it uses pessimistic write locks, but otherwise it is an optimistic protocol with a read refresh mechanism that increases the commit timestamp of a transaction if it observes a conflicting write within the clock uncertainty window. This approach provides serializable isolation and has lower latency than Spanner’s protocol for workloads with low contention. It may require more transaction retries for highly contended workloads, however, and for this reason future versions of CRDB will include support for pessimistic read locks. Note that Spanner’s protocol is only practical in environments where specialized hardware is available to bound the uncertainty window to a few milliseconds. CRDB’s protocol functions in any public or private cloud.
            2. TiDB, NuoDB vs CRDB: these systems are not optimized for geo-distributed workloads and only support snapshot isolation
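
          A minimal sketch of a hybrid-logical clock, as referenced in 5.1 above (my own toy representation, not CRDB's implementation): the wall component tracks physical time while the logical counter preserves causality when clocks tie or lag.

            import time

            class HLC:
                def __init__(self):
                    self.wall = 0      # physical component (ns) of the last timestamp issued
                    self.logical = 0   # logical counter that breaks ties / preserves causality

                def now(self):
                    # Timestamp for a local event or an outgoing message.
                    pt = time.time_ns()
                    if pt > self.wall:
                        self.wall, self.logical = pt, 0
                    else:
                        self.logical += 1
                    return (self.wall, self.logical)

                def observe(self, remote_wall, remote_logical):
                    # Merge a timestamp received from another node so that causally
                    # later events always get strictly larger (wall, logical) pairs.
                    pt = time.time_ns()
                    if pt > self.wall and pt > remote_wall:
                        self.wall, self.logical = pt, 0
                    elif remote_wall > self.wall:
                        self.wall, self.logical = remote_wall, remote_logical + 1
                    elif remote_wall == self.wall:
                        self.logical = max(self.logical, remote_logical) + 1
                    else:
                        self.logical += 1
                    return (self.wall, self.logical)
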
          1. SQL Client sends a query to your cluster.
          2. Load Balancing routes the request to CockroachDB nodes in your cluster, which will act as a gateway.
          3. Gateway is a CockroachDB node that processes the SQL request and responds to the client.
            1. Generate SQL physical plan. For example, the SQL executor converts INSERT statements into Put() operations.
            2. TxnCoordSender performs a large amount of the accounting and tracking for a transaction, including:

              • Accounts for all keys involved in a transaction. This is used, among other ways, to manage the transaction's state.
              • Packages all key-value operations into a BatchRequest, which are forwarded on to the node's DistSender.
            3. DistSender dismantles the initial BatchRequest by taking each operation and finding which physical machine should receive the request for the range—known as the range's leaseholder. The DistSender then waits to receive acknowledgments for all of its write operations, as well as values for all of its read operations.
          4. Leaseholder is a CockroachDB node responsible for serving reads and coordinating writes of a specific range of keys in your query.
            1. write intents for a key: these are provisional, uncommitted writes that express that some other concurrent transaction plans to write a value to the key.
              1. the new transaction attempts to "push" the write intent's transaction by moving that transaction's timestamp forward (i.e., ahead of this transaction's timestamp); however, this only succeeds if the write intent's transaction has become inactive.
              2. In short it comes down to comparing timestamps: first to arrive executes first, or the new transaction tries to push the earlier transaction's timestamp forward; if it finds it has jumped the queue it aborts itself, otherwise it just waits
            2. The leaseholder of course has the latest data to serve reads, but other nodes may not
          5. Raft leader is a CockroachDB node responsible for maintaining consensus among your CockroachDB replicas.
          6. The gateway node's DistSender aggregates commit acknowledgments from all of the write operations in the BatchRequest, as well as any values from read operations that should be returned to the client.
            1. The DistSender checks whether the timestamp was pushed; if so it performs a read refresh to see if any values it needed have changed, and if they have, it starts over
            2. The transaction commits only if all operations succeed and pass the check. The TxnCoordSender then asynchronously cleans up the write intents created along the way, converting all write intents to fully committed values
          Multi-tenant architecture
          1. Each tenant gets a dedicated SQL pod
          2. Storage pods are shared; EBS is the block storage system

          touch on security, alerting, monitoring, and access. For example, "how do you keep CRDB running?"
          Access: SSO for human, TLS public root certificate for SQL client with password

          1. Multiple disks/SSDs per node: not one node per disk
          2. At least 3 availability zones: spread nodes evenly on each zone
          3. At least 3 regions
          4. Larger nodes are more stable; smaller nodes reduce the impact of a single failure
          5. Storage
            1. Use separate volumes for /datastore and /log
          6. Security
            1. A certificate authority (CA) certificate and key, used to sign all of the other certificates.
            2. A separate certificate and key for each node in your deployment, with the common name node.
            3. A separate certificate and key for each client and user you want to connect to your nodes, with the common name set to the username. The default user is root.

          7. Load Balancing: HAProxy with Keepalived
          8. Clock synchronization
            1. While serializable consistency is maintained regardless of clock skew, skew outside the configured clock offset bounds can result in violations of single-key linearizability between causally dependent transactions. It's therefore important to prevent clocks from drifting too far by running NTP or other clock synchronization software on each node.




          TiDB


          https://github.com/pingcap/docs/blob/7eb32c7363e62bbef787e7bf674924b95bc1af37/transaction-isolation-levels.md TiDB implements Snapshot Isolation (SI) consistency, which it advertises as REPEATABLE-READ

          The main reason TiKV uses an LSM-tree rather than a B-tree as its underlying storage engine is that it is much easier to use caching to improve read performance than to improve write performance.




          Distribute a single file to 10k servers using a DHT

          Consistent hash
          Think about how nodes join/leave: cut the ring into segments; adding a new node adds one more cut point, and the new node takes over the data between that point and the previous point; the next k-1 nodes also need to copy this segment's data (replication factor k)
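
          A minimal consistent-hash ring sketch (illustrative; no virtual nodes): each key is owned by the first node clockwise from its hash, and with replication factor k the next k-1 distinct nodes hold copies, which is why a join only moves one segment plus its replicas.

            import bisect, hashlib

            class ConsistentHashRing:
                def __init__(self, nodes=(), replicas=3):
                    self.replicas = replicas
                    self.ring = []                     # sorted list of (hash, node)
                    for n in nodes:
                        self.add_node(n)

                @staticmethod
                def _hash(value):
                    return int(hashlib.md5(value.encode()).hexdigest(), 16)

                def add_node(self, node):
                    bisect.insort(self.ring, (self._hash(node), node))

                def nodes_for(self, key):
                    # First owner is the node clockwise from the key's hash;
                    # the next replicas-1 distinct nodes also store the key.
                    if not self.ring:
                        return []
                    idx = bisect.bisect(self.ring, (self._hash(key), ""))
                    owners = []
                    for i in range(len(self.ring)):
                        node = self.ring[(idx + i) % len(self.ring)][1]
                        if node not in owners:
                            owners.append(node)
                        if len(owners) == self.replicas:
                            break
                    return owners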



          etcd



          etcd is designed to reliably store infrequently updated data and provide reliable watch queries. etcd exposes previous versions of key-value pairs to support inexpensive snapshots and watch history events (“time travel queries”). A persistent (B+ tree), multi-version, concurrency-control data model is a good fit for these use cases.

          Deletion is adding a tombstone

          Each transaction/ keyspace modification will create a revision. The key of key-value pair is a 3-tuple (major, sub, type). Major is the store revision holding the key. Sub differentiates among keys within the same revision. The value of the key-value pair contains the modification from previous revision, thus one delta from previous revision. The b+tree is ordered by key in lexical byte-order. Ranged lookups over revision deltas are fast; this enables quickly finding modifications from one specific revision to another.
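
          A toy illustration of the revision-keyed MVCC idea above (not etcd's actual code or layout): every modification gets a monotonically increasing revision, entries stay sorted by (revision, sub), and a range scan over revisions cheaply answers "what changed since revision N"; a None value plays the role of a tombstone.

            from bisect import bisect_left, insort

            class MVCCStore:
                def __init__(self):
                    self.rev = 0
                    self.entries = []   # sorted list of ((major_rev, sub), (key, value_or_None))

                def txn(self, ops):
                    # ops: list of (key, value); value None records a delete (tombstone).
                    self.rev += 1
                    for sub, (key, value) in enumerate(ops):
                        insort(self.entries, ((self.rev, sub), (key, value)))
                    return self.rev

                def changes_since(self, rev):
                    # All modifications with revision > rev (watch / "time travel" style query).
                    start = bisect_left(self.entries, ((rev + 1, -1), ("", "")))
                    return self.entries[start:]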


          Kafka

          How is Kafka's durability?


          Prometheus


          Prometheus needs manual management for scalability: https://prometheus.io/docs/introduction/faq/#i-was-told-prometheus-doesnt-scale


          Ideal Architecture

          Service level design

          1. What is the product use scenario, and what needs to be traded off? What would piss off our users?
          2. Do we need storage? Uber driver location does not need storage
          3. How to scale storage? Sharding and replication
            1. how to find the sharded storage? Needs to index them somewhere
          4. Is AppServer stateless? Otherwise, why?
            1. Principle: workers only ask for tasks they can handle; add a scheduler to coordinate task assignment. e.g., one worker can handle Ohio state, but another worker can only handle NYC tasks
            2. Transactions. Cut super large transactions into stages/more services; add a watcher to record transaction progress
            3. Needs locality for high performance? How high is the locality needs? Can we move these localities/business logic into storage layer?
          5. How about join operations between workers? Has to cut it into multiple stages, like Big Data?
          6. Routing: always route to nearest DC; always determine locality as early as possible

          Host level design

          Extremely performant Linux web service:
          1. How to exhaust multi-processor's performance:
            1. multi-processing vs multi-threading: depends on how many resources are shared
            2. Python asyncio: ensure each operation is asynchronous, then spawn 32 processes (see the sketch after this list)
            3. Goroutines: pro for developer velocity, forget about concurrency cost, just do synchronous operations
          2. Threading model: N:1, kernel schedules at process level, process schedules its threads at its own discretion; 1:1, kernel schedules at thread level. Hybrid model
          3. Linux web programming: multi-processing using epoll; the parent process dispatches accepted fds
          4. fork() is dangerous if the parent holds a MySQL connection fd: the child inherits it and can keep it from being closed
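
          One common way to combine asyncio with multiple processes, as mentioned in item 1.2 (a Linux-only sketch; it uses SO_REUSEPORT so the kernel spreads accepted connections across workers, an alternative to the parent-dispatches-fds approach in item 3):

            import asyncio, os, socket
            from multiprocessing import Process

            async def handle(reader, writer):
                await reader.readline()                      # read (and ignore) the request line
                writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
                await writer.drain()
                writer.close()

            def worker(port):
                async def main():
                    # Every worker binds the same port; SO_REUSEPORT makes the kernel
                    # load-balance incoming connections across the processes.
                    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
                    sock.bind(("0.0.0.0", port))
                    server = await asyncio.start_server(handle, sock=sock)
                    async with server:
                        await server.serve_forever()
                asyncio.run(main())

            if __name__ == "__main__":
                procs = [Process(target=worker, args=(8080,)) for _ in range(os.cpu_count() or 4)]
                for p in procs:
                    p.start()
                for p in procs:
                    p.join()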




          Consistency: All clients see the same most recent write or error
          Availability: All reads contain data without error, but may not be most recent
          Partition Tolerance: The system works despite network failures





          Distributed Systems





          ACID


          PolarDB


          RDS Provisioned IOPS is ~ 100k IOPS


          A design philosophy shared by Alibaba Cloud PolarDB and Amazon Aurora is to give up the multi-writer concurrent OLTP support of general-purpose distributed databases and adopt a one-writer, many-readers architecture

          The primary node and the read-only nodes use Active-Active failover, so compute resources are fully utilized; because they use shared storage, they share the same copy of the data



          PolarDB's design introduces several major innovations. First, a purpose-built filesystem is used to read and write the redo log, a specific kind of WAL I/O. Second, database files and redo log files are placed on shared storage, reached over a high-speed network with an efficient protocol, avoiding repeated long-path I/O; compared with the binlog approach this is more elegant

          The primary node and read-only nodes use Active-Active failover to provide a highly available DB service. The DB's data files, redo log, and so on go through a user-space filesystem, are routed by the block-device data-management layer, and are transferred over a high-speed network with RDMA to remote Chunk Servers. DB servers only need to synchronize redo-log-related metadata among themselves. Chunk Server data is kept in multiple replicas to ensure reliability, with consistency guaranteed by the Parallel-Raft protocol.

          Shared-Disk Architecture

          PolarDB adopts a shared-disk architecture, fundamentally because of the compute/storage separation described above. Logically, all DB data lives on chunk storage servers that every DB server can access. Inside the storage service, the data is actually split into chunks so that I/O can be served concurrently by multiple servers


          Physical Replication

          MySQL's binlog records row-level (tuple) data changes. The InnoDB engine layer, to support transactional ACID, also maintains a redo log, which records modifications to physical file pages. As a result, a single MySQL transaction by default requires at least two fsync() calls to persist logs, which directly hurts transaction response time and throughput. MySQL uses group commit to improve throughput under high concurrency, but that cannot completely remove the I/O bottleneck.

          In addition, because a single database instance has limited compute and network bandwidth, a typical way to scale out is to add multiple read-only instances to share the read load. By putting the database files and redo log on shared storage, PolarDB neatly solves the data-replication problem between read-only nodes and the primary. Since the data is shared, adding a read-only node no longer requires a full data copy: all nodes share one copy of the full data and the redo log, and only metadata needs to be synchronized, with basic MVCC support to keep reads consistent. When the primary fails and a failover occurs, switching to a read-only node can complete in under 30 seconds, further strengthening high availability. Moreover, the data lag between read-only nodes and the primary can be reduced to the millisecond level.

          From a concurrency standpoint, binlog replication can currently only be parallelized at the table level, while physical replication works at the data-page level: finer granularity and more efficient parallelism.

          Finally, a benefit of using the redo log for replication is that the binlog can be turned off to reduce its performance impact, unless the binlog is still needed for logical disaster recovery or data migration.

          In general, the lower in the I/O path you work, the easier it is to decouple from upper-layer business logic and state, reducing system complexity. Moreover, this large-file WAL/redo-log I/O pattern is a great fit for the concurrency mechanisms of a distributed filesystem, improving PolarDB's concurrent read performance.

          RDMA over High-Speed Networks

          RDMA generally requires network equipment that supports high-speed connections (switches, NICs, etc.). Applications talk to the NIC driver through a specific programming interface, and data is typically moved with zero-copy techniques between the NIC and remote application memory with high efficiency and low latency, without interrupting the CPU to copy data between kernel space and user space. This greatly reduces performance jitter and raises overall system throughput.

          Snapshot-Based Physical Backup

          A snapshot is a popular backup scheme based on block storage devices. In essence it uses copy-on-write: it records metadata changes of the block device, copies blocks when writes occur, and applies the writes to the newly copied blocks, so that data can later be restored to the snapshot point in time. A snapshot is a typical deferred-processing mechanism driven by time and write load: when a snapshot is created, no data is actually backed up; instead the backup work is spread across the time window in which writes actually happen after the snapshot, making both backup and restore respond quickly. PolarDB combines snapshots with the redo log, making point-in-time recovery of user data more efficient than the traditional approach of a full backup plus incremental binlog data.


          Amazon Aurora


          storage is independent of computing

          The Aurora shared storage architecture makes your data independent from the DB instances in the cluster. For example, you can add a DB instance quickly because Aurora doesn't make a new copy of the table data. Instead, the DB instance connects to the shared volume that already contains all your data. You can remove a DB instance from a cluster without removing any of the underlying data from the cluster. Only when you delete the entire cluster does Aurora remove the data.


          As a result, all Aurora Replicas return the same data for query results with minimal replica lag. This lag is usually much less than 100 milliseconds after the primary instance has written an update. Replica lag varies depending on the rate of database change. That is, during periods where a large amount of write operations occur for the database, you might see an increase in replica lag.

          Since the primary and the replicas use the same storage (the cluster volume), why is there replication lag between them?

          Because Aurora Replicas read from the same cluster volume as the writer DB instance, the ReplicaLag metric has a different meaning for an Aurora PostgreSQL DB cluster. The ReplicaLag metric for an Aurora Replica indicates the lag for the page cache of the Aurora Replica compared to that of the writer DB instance.

          Because each replica uses its own independent page cache rather than reading directly from disk. Conjecture: the writer instance's changes are replicated directly to the reader instance's page cache


          In the unlikely event of a database failure, the page cache remains in memory, which keeps current data pages "warm" in the page cache when the database restarts. This provides a performance gain by bypassing the need for the initial queries to execute read I/O operations to "warm up" the page cache.


          MicroServices


          Coding

          Blocked-keyword filtering with substitution characters {"o": [], "1": ["l"]}: a trie (see the sketch below)
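
          A sketch of the trie-based filter with the substitution map (names are mine; the map below mirrors the note, so a '1' in the input may stand in for 'l'):

            class TrieFilter:
                def __init__(self, words, substitutions):
                    self.subs = substitutions
                    self.root = {}
                    for w in words:
                        node = self.root
                        for ch in w:
                            node = node.setdefault(ch, {})
                        node["$"] = True               # end-of-word marker

                def _variants(self, ch):
                    # The character itself plus anything it may be substituting for.
                    return [ch] + self.subs.get(ch, [])

                def contains_blocked(self, text):
                    text = text.lower()
                    for i in range(len(text)):
                        frontier = [self.root]         # walk the trie starting at position i
                        for ch in text[i:]:
                            next_frontier = []
                            for node in frontier:
                                for v in self._variants(ch):
                                    child = node.get(v)
                                    if child:
                                        if "$" in child:
                                            return True
                                        next_frontier.append(child)
                            if not next_frontier:
                                break
                            frontier = next_frontier
                    return False

            # Example with the substitution map from the note:
            f = TrieFilter(["hello"], {"o": [], "1": ["l"]})
            print(f.contains_blocked("say he11o"))     # True: '1' can stand for 'l'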

          System Design



          Design a URL shortener


          Implement a chat room, with various features