Wednesday, April 21, 2021

Database and Application

https://tasteturnpike.blogspot.com/2017/03/sre-knowledge.html


https://www.alibabacloud.com/blog/what-are-the-differences-and-functions-of-the-redo-log-undo-log-and-binlog-in-mysql_598035

Redo: ensures the durability of transactions. Dirty pages may not yet have been written to disk at the moment of a failure; when the MySQL service restarts, committed changes are replayed from the redo log to achieve durability.

Undo: it stores the version of the data from before the transaction, which can be used for rollback. It also provides non-locking reads under Multi-Version Concurrency Control (MVCC).

Binlog: 
  1. It is used for replication: in master-slave replication, the slave fetches the binlog from the master and replays it to achieve master-slave synchronization.
  2. It is used for the database point-in-time restore.
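The recovery role of the redo log can be illustrated with a toy sketch. This is not MySQL's actual log format: real InnoDB redoes all logged changes and then rolls back uncommitted work via the undo log, whereas this simplified version replays committed transactions only.

```python
# A toy redo log: an append-only list of records. "WRITE" records carry
# (txid, page, value); a "COMMIT" record marks a txid as durable. On
# restart, recovery replays the writes of committed transactions only.
# (Real InnoDB redoes everything, then undoes uncommitted work via the
# undo log; this sketch collapses the two steps.)

def recover(redo_log):
    """Rebuild page state from the redo log after a crash."""
    committed = {rec[1] for rec in redo_log if rec[0] == "COMMIT"}
    pages = {}
    for rec in redo_log:
        if rec[0] == "WRITE":
            _, txid, page, value = rec
            if txid in committed:
                pages[page] = value   # redo only durable changes
    return pages

log = [
    ("WRITE", "tx1", "page_a", 10),
    ("COMMIT", "tx1"),
    ("WRITE", "tx2", "page_b", 20),   # tx2 never committed: not redone
]
recovered = recover(log)
print(recovered)   # {'page_a': 10}
```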



Postgres


  1. Postgres Transaction Isolation
    1. isolation phenomena
      1. dirty read: a transaction reads data written by a concurrent uncommitted transaction.
      2. nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).
      3. phantom read: a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.
      4. serialization anomaly: the result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.

    2. read uncommitted
    3. read committed
      1. Because Read Committed mode starts each command with a new snapshot that includes all transactions committed up to that instant, subsequent commands in the same transaction will see the effects of the committed concurrent transaction in any case. The point at issue above is whether or not a single command sees an absolutely consistent view of the database.
    4. repeatable read: sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions
      1. Create a snapshot for the transaction to guarantee consistent reads. If another transaction modifies the same row in the meantime, this transaction must retry.
      2. Applications using this level must be prepared to retry transactions due to serialization failures.
    5. serializable: This level emulates serial transaction execution for all committed transactions; as if transactions had been executed one after another, serially, rather than concurrently. However, like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures. In fact, this isolation level works exactly the same as Repeatable Read except that it monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions. This monitoring does not introduce any blocking beyond that present in repeatable read, but there is some overhead to the monitoring, and detection of the conditions which could cause a serialization anomaly will trigger a serialization failure.
      1. serialization failure: concurrent transactions affect one another, so different execution orders would produce different results
      2. predicate locking (these locks show up in the pg_locks system view with a mode of SIReadLock): detects whether a write would affect concurrent transactions. In PostgreSQL these locks do not cause any blocking and therefore cannot play any part in causing a deadlock. They are used to identify and flag dependencies among concurrent Serializable transactions which in certain combinations can lead to serialization anomalies. In contrast, a Read Committed or Repeatable Read transaction which wants to ensure data consistency may need to take out a lock on an entire table, which could block other users attempting to use that table, or it may use SELECT FOR UPDATE or SELECT FOR SHARE, which not only can block other transactions but cause disk access.
        1. Serializable predicate locking performs better than explicit locks
      3. PostgreSQL's Serializable transaction isolation level only allows concurrent transactions to commit if it can prove there is a serial order of execution that would produce the same effect. If unique-constraint violations can be detected ahead of time, check for them before entering the transaction.
      4. Optimizations
        1. Control the number of active connections, using a connection pool if needed. This is always an important performance consideration, but it can be particularly important in a busy system using Serializable transactions.
        2. Eliminate explicit locks, SELECT FOR UPDATE, and SELECT FOR SHARE where no longer needed due to the protections automatically provided by Serializable transactions.
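Both Repeatable Read and Serializable require the application to retry transactions that fail with a serialization error. A minimal sketch of that retry pattern, where SerializationFailure stands in for the driver's real exception class (for example psycopg2.errors.SerializationFailure, SQLSTATE 40001, in an actual application):

```python
# Hypothetical stand-in for the database driver's serialization error.
class SerializationFailure(Exception):
    pass

def run_with_retry(txn_fn, max_attempts=5):
    """Run txn_fn; on a serialization failure, roll back and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_fn()              # would wrap BEGIN ... COMMIT
        except SerializationFailure:
            if attempt == max_attempts:
                raise                    # give up after repeated conflicts

attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) < 3:                # first two tries conflict
        raise SerializationFailure("could not serialize access")
    return "committed"

result = run_with_retry(flaky_txn)
print(result)   # committed
```

The key design point is that the whole transaction body, not a single statement, is re-executed, since the failed attempt's snapshot is discarded on rollback.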
  2. https://zhuanlan.zhihu.com/p/54979396 Snapshot Isolation综述 
  3. Linearizability, serializability, transaction isolation and consistency models
  4. The most common isolation level implemented with MVCC is snapshot isolation
    1. MVCC introduces the challenge of how to remove versions that become obsolete and will never be read. In some cases, a process periodically sweeps through and deletes the obsolete versions. A stop-the-world variant traverses a whole table and rewrites it with the last version of each data item; in PostgreSQL, routine VACUUM removes obsolete row versions concurrently, while VACUUM FULL rewrites the whole table.
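The version-list-plus-sweep idea can be sketched in a few lines. This is a toy model, not PostgreSQL's tuple layout: each key keeps a list of (xid, value) versions, a snapshot read sees the newest version with xid at or below its snapshot, and vacuum() drops versions that no active snapshot can still see.

```python
class MVCCStore:
    """Toy MVCC store: key -> list of (xid, value) versions, ascending xid."""

    def __init__(self):
        self.versions = {}

    def write(self, key, value, xid):
        self.versions.setdefault(key, []).append((xid, value))

    def read(self, key, snapshot_xid):
        """Return the newest version visible to a snapshot taken at snapshot_xid."""
        visible = [v for x, v in self.versions.get(key, []) if x <= snapshot_xid]
        return visible[-1] if visible else None

    def vacuum(self, oldest_active_xid):
        """Drop versions that no active snapshot can still see."""
        for key, vs in self.versions.items():
            keep_from = 0
            for i, (x, _) in enumerate(vs):
                if x <= oldest_active_xid:
                    keep_from = i        # last version the oldest snapshot sees
            self.versions[key] = vs[keep_from:]

db = MVCCStore()
db.write("k", "v1", xid=1)
db.write("k", "v2", xid=5)
old_read = db.read("k", snapshot_xid=3)  # "v1": this snapshot predates xid 5
db.vacuum(oldest_active_xid=6)           # no snapshot can still see "v1"
print(old_read, db.versions["k"])
```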

Postgres DB vacuum and query conflict

https://www.postgresql.org/docs/9.2/hot-standby.html#HOT-STANDBY-CONFLICT

https://www.cybertec-postgresql.com/en/what-hot_standby_feedback-in-postgresql-really-does/ With "hot_standby_feedback" the standby periodically informs the primary about the oldest transaction still running on the standby. Since the primary then knows about old transactions on the standby, it can make VACUUM keep the corresponding rows until the standbys are done.
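A sketch of the relevant settings on the standby side (the parameter names are real PostgreSQL GUCs; a working streaming-replication setup between primary and standby is assumed):

```
# postgresql.conf on the standby (assumes streaming replication is configured)
hot_standby = on
hot_standby_feedback = on   # report the standby's oldest running xid to the primary
```

The trade-off is that feedback delays VACUUM on the primary, so long-running standby queries can cause table bloat there.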



LevelDB RocksDB



  1. The main challenge is that flash cells can only be erased block-wise and written page-wise. To write new data to a page, it must be physically empty; if it is not, the page's content has to be erased first. However, it is not possible to erase a single page, only all pages that are part of one block. Because the block sizes of an SSD are fixed (for example, 512 KB or 1024 KB up to 4 MB), a block that contains only a single page with 4 KB of data still occupies the full 512 KB.
    1. An SSD spreads writes across its flash chips (wear leveling) so that cells do not wear out prematurely and performance does not degrade
  2. A delete tombstone is carried down through successive compactions until it reaches the bottom level, and only then is the key physically removed
  3. Recently written keys are in L0, while older data is compacted down to lower levels. To speed up lookups, a bloom filter is used to determine whether a key exists in the database at all
  4. L0: overlapping keys, sorted by flush time. files are sorted based on the time they are flushed. Their key range (as defined by FileMetaData.smallest and FileMetaData.largest) are mostly overlapped with each other. So it needs to look up every L0 file.
  5. L1+: non-overlapping keys, sorted by key
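The lookup behavior described above can be sketched as follows. This is a toy model, not RocksDB's implementation: every L0 file may contain the key (ranges overlap), so each is checked newest-first, while at lower levels at most one non-overlapping file matches; the "bloom filter" here is just a set of hashes standing in for the real probabilistic structure.

```python
class SSTFile:
    """An immutable sorted file with a toy per-file 'bloom filter'."""

    def __init__(self, data):
        self.data = dict(data)
        self.bloom = {hash(k) % 1024 for k in self.data}   # toy filter

    def maybe_contains(self, key):
        # may false-positive, never false-negative
        return hash(key) % 1024 in self.bloom

    def get(self, key):
        return self.data.get(key)

TOMBSTONE = "<tombstone>"

def lookup(key, l0_files, lower_levels):
    # L0: key ranges overlap, so check every file, newest first.
    for f in reversed(l0_files):
        if f.maybe_contains(key):
            v = f.get(key)
            if v is not None:
                return None if v == TOMBSTONE else v
    # L1+: files are non-overlapping; a real engine binary-searches by
    # key range, here each level is simply scanned for the one match.
    for level in lower_levels:
        for f in level:
            if f.maybe_contains(key):
                v = f.get(key)
                if v is not None:
                    return None if v == TOMBSTONE else v
    return None

l0 = [SSTFile({"a": "old", "b": "1"}), SSTFile({"a": TOMBSTONE})]  # newest last
deleted = lookup("a", l0, [])   # None: the tombstone hides the older value
live = lookup("b", l0, [])      # "1"
print(deleted, live)
```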


Lock

How to implement mutex?
  • One way is using Test-and-set (spinlock)
  • Futexes have the desirable property that they do not require a kernel system call in the common cases of locking or unlocking an uncontended mutex. In these cases, the user-mode code successfully uses an atomic compare and swap (CAS)
Test-and-set

  • Test-and-set: an instruction, supported at machine level (CPU instruction support), that writes 1 (set) to a memory location and returns its old value as a single atomic (i.e., non-interruptible) operation; the caller only obtains the lock if the test on the returned old value passes
    • while (test_and_set(lock) == 1); # The calling process obtains the lock if the old value was 0 otherwise while-loop spins waiting to acquire the lock. This is called a spinlock.
  • Test and test-and-set does not spin on test_and_set(); instead it spins on a plain read of the shared lock variable, calling test_and_set() only when the lock appears free
  • Performance: while processor P1 holds the lock, every other processor waiting for it keeps initiating bus transactions in repeated attempts to acquire it. This significantly increases the bus traffic of test-and-set, slowing down all other traffic from cache and coherence misses, and therefore the overall critical section, since the bus is saturated by failed lock-acquisition attempts. Test-and-test-and-set is an improvement over test-and-set because it does not initiate lock-acquisition bus transactions continuously.
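The two variants can be sketched together. Note the big caveat: Python has no CAS instruction, so the "atomic" test_and_set below is simulated with an internal threading.Lock; on real hardware it is a single instruction, and the inner plain-read loop is what keeps spinning traffic in the local cache.

```python
import sys
import threading

sys.setswitchinterval(0.0005)  # switch threads often so the demo finishes fast

class SpinLock:
    def __init__(self):
        self._value = 0
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        # atomically: old = value; value = 1; return old
        with self._atomic:
            old, self._value = self._value, 1
            return old

    def acquire(self):
        while True:
            while self._value == 1:        # "test": spin on a plain read,
                pass                       # which would stay in local cache
            if self.test_and_set() == 0:   # "test-and-set": try to grab it
                return

    def release(self):
        self._value = 0

lock = SpinLock()
counter = 0

def worker():
    global counter
    for _ in range(100):
        lock.acquire()
        counter += 1                       # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 400
```

Dropping the inner `while self._value == 1` loop turns this back into plain test-and-set, which on real hardware generates a bus transaction on every failed attempt.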
Spinlock
  • Pro: it avoids the overhead of operating system process rescheduling or context switching, so spinlocks are efficient if threads are likely to be blocked only for short periods. For this reason, some multithreaded synchronization mechanisms use user-mode spinlocks (or derivatives such as lightweight read-write locks) instead of synchronization objects that switch into kernel mode; the time cost differs by roughly three orders of magnitude
  • Con: spinlocks should not be used on a single-core, single-threaded CPU, where a spinning thread can prevent the lock holder from running and hang the system; spinning also wastes CPU time



Nginx



On this four‑core server, the NGINX master process creates four worker processes and a couple of cache helper processes which manage the on‑disk content cache.

Nginx worker processes share the same listening socket: if accept_mutex is enabled, worker processes accept new connections in turn.


With the SO_REUSEPORT option enabled, there are multiple socket listeners for each IP address and port combination, one for each worker process.
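The SO_REUSEPORT pattern can be sketched outside of NGINX: each worker creates its own listening socket bound to the same address and port, and the kernel load-balances incoming connections among them (Linux 3.9+; availability of the option varies by platform).

```python
import socket

def make_listener(host, port):
    """One listening socket per worker, all bound to the same port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(128)
    return s

a = make_listener("127.0.0.1", 0)       # kernel picks a free port
port = a.getsockname()[1]
b = make_listener("127.0.0.1", port)    # would raise EADDRINUSE without SO_REUSEPORT
same_port = a.getsockname()[1] == b.getsockname()[1]
print(same_port)   # True
a.close()
b.close()
```

With this layout no accept mutex is needed, since each worker accepts only from its own socket's queue.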




Rsync: incremental backup

  1. Compare directory differences: check the files in the subtree
  2. Decide whether a file needs updating: check file metadata (mtime, size)
  3. Transfer only the needed data: f_new is scanned with a sliding window, computing a rolling hash at each offset to find blocks the other side already has
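Step 3 relies on a weak rolling checksum in the style of rsync's Adler-32-like sum: the checksum of a window can be updated in O(1) when the window slides by one byte, which is what makes scanning a file at every offset cheap. A minimal sketch:

```python
# M is the modulus of the weak checksum (16 bits, as in rsync).
M = 1 << 16

def weak_sum(block):
    """Weak checksum of a block: (s1, s2), rsync style."""
    s1 = sum(block) % M
    s2 = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return s1, s2

def roll(s1, s2, out_byte, in_byte, block_len):
    """Slide the window one byte: drop out_byte, append in_byte, in O(1)."""
    s1 = (s1 - out_byte + in_byte) % M
    s2 = (s2 - block_len * out_byte + s1) % M
    return s1, s2

data = b"the quick brown fox"
n = 4
s1, s2 = weak_sum(data[0:n])
for i in range(1, len(data) - n + 1):
    s1, s2 = roll(s1, s2, data[i - 1], data[i + n - 1], n)
    assert (s1, s2) == weak_sum(data[i:i + n])   # rolled == recomputed
print("rolling checksum matches at every offset")
```

In real rsync, a match on this weak sum is confirmed with a strong hash (MD5/MD4) before a block is treated as already present on the other side.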




SAP

Availability

The standby server checks each HADR member to determine whether it is eligible for promotion: https://help.sap.com/docs/SAP_ASE/efe56ad3cad0467d837c8ff1ac6ba75c/a6c69a21bc2b1014adda8a01ba6488fc.html -- However, a network partition among HADR members can prevent all standby servers from being promoted.

Scalability

Table partitioning is the solution.


