Thursday, November 11, 2021

Kubernetes

How to expose service

  1. Ingress
    1. Ingress Controller (edge proxy, runs as Pods): for an Ingress resource to work, the cluster must have an ingress controller running. Ingress controllers are not started automatically with a cluster. (A client-go sketch of creating an Ingress follows this list.)
      1. AWS: an Ingress is fulfilled by an Application Load Balancer (L7 HTTP), while a Service of Type=LoadBalancer is fulfilled by a Network Load Balancer (L4). The AWS ingress controller (AWS Load Balancer Controller) runs as a k8s Deployment of Pods.
      2. NGINX: the controller also runs as Pods. On AWS, a Network Load Balancer (NLB) is used to expose the NGINX Ingress Controller behind a Service of Type=LoadBalancer.
  2. Use Service.Type=LoadBalancer
    1. The big downside is that each service you expose with a LoadBalancer will get its own IP address, and you have to pay for a LoadBalancer per exposed service, which can get expensive!
  3. Use Service.Type=NodePort
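
As a concrete illustration of the Ingress path, here is a minimal Go sketch using client-go that creates an Ingress object in a cluster whose ingress controller is already running; the kubeconfig path, host, and backend service name are hypothetical.

package main

import (
	"context"
	"fmt"

	netv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig (hypothetical path) and build a clientset.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/home/me/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	pathType := netv1.PathTypePrefix
	ing := &netv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-ingress"},
		Spec: netv1.IngressSpec{
			Rules: []netv1.IngressRule{{
				Host: "demo.example.com", // hypothetical host
				IngressRuleValue: netv1.IngressRuleValue{
					HTTP: &netv1.HTTPIngressRuleValue{
						Paths: []netv1.HTTPIngressPath{{
							Path:     "/",
							PathType: &pathType,
							// Route to an existing ClusterIP service (hypothetical name/port).
							Backend: netv1.IngressBackend{
								Service: &netv1.IngressServiceBackend{
									Name: "demo-svc",
									Port: netv1.ServiceBackendPort{Number: 80},
								},
							},
						}},
					},
				},
			}},
		},
	}

	// The Ingress object only declares routing; a running ingress controller
	// (AWS Load Balancer Controller, NGINX, ...) must reconcile it into a real proxy/LB.
	created, err := clientset.NetworkingV1().Ingresses("default").
		Create(context.TODO(), ing, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created ingress:", created.Name)
}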



Networking


CNI primarily works at L3/L4, whereas a service mesh works at L7.

There are lots of different kinds of CNI plugins, but the two main ones are:

  • Network plugins, which are responsible for connecting pods to the network
  • IPAM (IP Address Management) plugins, which are responsible for allocating pod IP addresses

Service Mesh


Service mesh implementation: before the sidecar proxy container and the application container start, an init container runs first. The init container sets up iptables rules (the default traffic-interception method in Istio; BPF, IPVS, etc. can also be used) to intercept traffic entering the pod and redirect it to the Envoy sidecar proxy. All TCP traffic is intercepted by the sidecar (Envoy currently only supports TCP traffic); traffic of other protocols passes through as originally requested.
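
A rough Go sketch of what such an init container does, shelling out to iptables to install the redirect rules. The ports 15006 (inbound) and 15001 (outbound) and the proxy UID 1337 match Istio's defaults, but the chain names here are simplified assumptions rather than Istio's actual ones.

package main

import (
	"fmt"
	"os/exec"
)

// run executes one iptables command and aborts on failure.
func run(args ...string) {
	out, err := exec.Command("iptables", args...).CombinedOutput()
	if err != nil {
		panic(fmt.Sprintf("iptables %v failed: %v\n%s", args, err, out))
	}
}

func main() {
	// Custom NAT chains that hold the redirect rules (names are simplified stand-ins).
	run("-t", "nat", "-N", "PROXY_INBOUND")
	run("-t", "nat", "-N", "PROXY_OUTBOUND")

	// All inbound TCP traffic entering the pod is redirected to the sidecar's
	// inbound listener (Istio uses port 15006).
	run("-t", "nat", "-A", "PREROUTING", "-p", "tcp", "-j", "PROXY_INBOUND")
	run("-t", "nat", "-A", "PROXY_INBOUND", "-p", "tcp", "-j", "REDIRECT", "--to-ports", "15006")

	// All outbound TCP traffic from the application is redirected to the sidecar's
	// outbound listener (Istio uses port 15001), except traffic from the proxy
	// itself, which is skipped by matching its UID so it does not loop back.
	run("-t", "nat", "-A", "OUTPUT", "-p", "tcp", "-j", "PROXY_OUTBOUND")
	run("-t", "nat", "-A", "PROXY_OUTBOUND", "-m", "owner", "--uid-owner", "1337", "-j", "RETURN")
	run("-t", "nat", "-A", "PROXY_OUTBOUND", "-p", "tcp", "-j", "REDIRECT", "--to-ports", "15001")

	// Non-TCP traffic is not matched by these rules and flows through unchanged.
}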




Comparison: sidecar proxy vs per node vs per service account per node vs shared remote proxy with micro proxy: https://www.solo.io/blog/ebpf-for-service-mesh/
  1. consider resource overhead / feature isolation / security granularity / upgrade impact
  2. For Linkerd: Per-host proxies are significantly worse than sidecars https://buoyant.io/2022/06/07/ebpf-sidecars-and-the-future-of-the-service-mesh 

kube-proxy


kube-proxy is responsible for updating the iptables rules on each node of the cluster. https://betterprogramming.pub/k8s-a-closer-look-at-kube-proxy-372c4e8b090
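
For intuition about what those rules look like, here is a small Go helper (assuming it runs on a node where iptables-save is available) that dumps the NAT chains kube-proxy programs in iptables mode: traffic to a ClusterIP matches a KUBE-SERVICES rule, jumps to a per-service KUBE-SVC-* chain that picks an endpoint, and ends in a KUBE-SEP-* chain doing the DNAT to podIP:port.

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Dump the NAT table; kube-proxy's rules live in chains prefixed with "KUBE-".
	out, err := exec.Command("iptables-save", "-t", "nat").Output()
	if err != nil {
		panic(err)
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		line := scanner.Text()
		// KUBE-SERVICES: entry chain matching ClusterIP:port
		// KUBE-SVC-*:    per-service chain, picks one endpoint at random
		// KUBE-SEP-*:    per-endpoint chain, DNATs to podIP:port
		if strings.Contains(line, "KUBE-SERVICES") ||
			strings.Contains(line, "KUBE-SVC-") ||
			strings.Contains(line, "KUBE-SEP-") {
			fmt.Println(line)
		}
	}
}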



eBPF & io_uring


eBPF is a virtual machine embedded within the Linux kernel. It allows small programs to be loaded into the kernel and attached to hooks, which are triggered when some event occurs. This allows the behaviour of the kernel to be (sometimes heavily) customised. While the eBPF virtual machine is the same for each type of hook, the capabilities of the hooks vary considerably. Since loading programs into the kernel could be dangerous, the kernel runs all programs through a very strict static verifier; the verifier sandboxes the program, ensuring it can only access allowed parts of memory and that it terminates quickly. https://projectcalico.docs.tigera.io/about/about-ebpf

io_uring supports linking operations, but there is no way to generically pass the result of one system call to the next. With a simple BPF program, the application can tell the kernel how the result of open is to be passed to read, including the error handling; the program then allocates its own buffers and keeps reading until the entire file is consumed and finally closed. This way we can checksum, compress, or search an entire file with a single system call.

Routing

  1. Cloudflare --proxied--> AWS Route 53 --> ELB (Ingress-managed HA cloud load balancer) --> EC2 instances (Target Group nodes) --> Ingress Controller Pods (can be a Deployment or a DaemonSet) --> actual backend pods
    1. Don't use a DaemonSet when the cluster is very large: it adds extra burden because each DaemonSet pod needs its own connection to the k8s API server



Wednesday, November 3, 2021

Data Engineering Data Lake

 


Some good readings regarding Data Lake.

Airflow

Airflow serves as an orchestration tool; the whole data flow:

  1. Trigger Airbyte to land data from third-party sources into S3
  2. Trigger a Spark (EMR) ETL job from S3 into the data lake bronze layer
  3. Trigger a dbt job from the bronze layer to the silver layer (Redshift)
  4. Trigger a Jupyter notebook holding the data-analysis logic via the Papermill operator


GDPR

For GDPR, we should support a mechanism to delete records in the raw S3 / bronze / silver / gold layers. A primary key is important; alternatively, use periodic compaction to overwrite changes.


How to evaluate data lake tools

  1. LakeFS
    1. https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
    2. https://lakefs.io/hive-metastore-why-its-still-here-and-what-can-replace-it/
    3. Data Lakes: The Definitive Guide | LakeFS
      A data lake is a system of technologies that allow for the querying of data in file or blob objects.
  2. https://www.upsolver.com/blog/getting-data-lake-etl-right-6-guidelines-evaluating-tools
    - ETL/ELT transformation engine
    - GDPR record deletion
    - Object time travel / data mutation
    - ACID transactions
    - Streaming and batching
  3. https://www.slideshare.net/databricks/a-thorough-comparison-of-delta-lake-iceberg-and-hudi
  4. https://blog.csdn.net/younger_china/article/details/125926533 Data Lake 09: an in-depth comparison of the open-source frameworks Delta Lake, Hudi, and Iceberg
    1. https://www.infoq.cn/article/fjebconxd2sz9wloykfo
    2. https://eric-sun.medium.com/rescue-to-distributed-file-system-2dd8abd5d80d Delta Lake > Hudi > Iceberg

Data Versioning

lakeFS deletion regarding GDPR
https://medium.com/datamindedbe/what-is-lakefs-a-critical-survey-edce708a9b8e
 
https://lakefs.io/new-in-lakefs-data-retention-policies/

GDPR deletion request: Crypto shredding: How it can solve modern data retention challenges:

  1. 100B key per user
  2. MemoryDB to hold all keys in memory (see the crypto-shredding sketch below)
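
A minimal Go sketch of the crypto-shredding idea, assuming AES-256-GCM and a plain in-memory map standing in for MemoryDB: every user's records are encrypted with a per-user key, and a GDPR deletion request is served by deleting that key, which makes the user's data in every layer unreadable without rewriting the lake.

package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// keyStore stands in for MemoryDB: user ID -> 256-bit data key.
var keyStore = map[string][]byte{}

func keyFor(userID string) []byte {
	if k, ok := keyStore[userID]; ok {
		return k
	}
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	keyStore[userID] = k
	return k
}

// encrypt seals a record with the user's key using AES-256-GCM.
func encrypt(userID string, plaintext []byte) []byte {
	block, _ := aes.NewCipher(keyFor(userID))
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	return gcm.Seal(nonce, nonce, plaintext, nil)
}

// decrypt fails once the user's key has been shredded.
func decrypt(userID string, blob []byte) ([]byte, error) {
	k, ok := keyStore[userID]
	if !ok {
		return nil, fmt.Errorf("key for %s shredded: data unrecoverable", userID)
	}
	block, _ := aes.NewCipher(k)
	gcm, _ := cipher.NewGCM(block)
	nonce, ct := blob[:gcm.NonceSize()], blob[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	blob := encrypt("user-42", []byte("email=alice@example.com"))

	// GDPR deletion request: drop the key instead of rewriting S3/bronze/silver/gold.
	delete(keyStore, "user-42")

	if _, err := decrypt("user-42", blob); err != nil {
		fmt.Println(err) // the record is now cryptographically erased
	}
}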

AWS Datalake Solution





Data Mesh 


Architectural failure modes:

  • Centralized and monolithic: building an experimental data pipeline is slow; let users build it themselves
  • Coupled pipeline decomposition: building a new dataset depends on other teams
  • Siloed and hyper-specialized ownership: data engineers don't care about the data itself


The next enterprise data platform architecture is in the convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.


The key to building the data infrastructure as a platform is (a) to not include any domain specific concepts or business logic, keeping it domain agnostic, and (b) make sure the platform hides all the underlying complexity and provides the data infrastructure components in a self-service manner. There is a long list of capabilities that a self-serve data infrastructure as a platform provides to its users, a domain's data engineers. Here are a few of them:

  • Scalable polyglot big data storage
  • Encryption for data at rest and in motion
  • Data product versioning
  • Data product schema
  • Data product de-identification
  • Unified data access control and logging
  • Data pipeline implementation and orchestration
  • Data product discovery, catalog registration and publishing
  • Data governance and standardization
  • Data product lineage
  • Data product monitoring/alerting/log
  • Data product quality metrics (collection and sharing)
  • In memory data caching
  • Federated identity management
  • Compute and data locality

A success criteria for self-serve data infrastructure is lowering the 'lead time to create a new data product' on the infrastructure.

This paradigm shift requires a new set of governing principles accompanied with a new language:

  • serving over ingesting
  • discovering and using over extracting and loading
  • Publishing events as streams over flowing data around via centralized pipelines
  • Ecosystem of data products over centralized data platform




Four underpinning principles that any data mesh implementation embodies to achieve the promise of scale, while delivering the quality and integrity guarantees needed to make data usable: 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance.


Domain ownership

For example, the teams who manage ‘podcasts’, while providing APIs for releasing podcasts, should also be responsible for providing historical data that represents ‘released podcasts’ over time with other facts such as ‘listenership’ over time.


Data as a product

Each domain will include data product developer roles, responsible for building, maintaining and serving the domain's data products. Data product developers will be working alongside other developers in the domain. Each domain team may serve one or multiple data products. It’s also possible to form new teams to serve data products that don’t naturally fit into an existing operational domain.


Self-serve data platform

My personal hope is that we start seeing a convergence of operational and data infrastructure where it makes sense. For example, perhaps running Spark on the same orchestration system, e.g. Kubernetes.


Federated computational governance

Striking a balance between what shall be standardized globally, implemented and enforced by the platform for all domains and their data products, and what shall be left to the domains to decide, is an art.

They need to comply with the modeling of quality and the specification of SLOs based on a global standard, defined by the global federated governance team, and automated by the platform.






DDD Hexagonal






Application core = business logic

Domain Layer: the objects in this layer contain the data, and the logic to manipulate that data, that is specific to the domain itself. This logic is independent of the business processes that trigger it, and these objects are completely unaware of the Application Layer.

Examples of components can be Authentication, Authorization, Billing, User, Review or Account, but they are always related to the domain. Bounded contexts like Authorization and/or Authentication should be seen as external tools for which we create an adapter and hide behind some kind of port.

The goal, as always, is to have a codebase that is loosely coupled and highly cohesive.
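
A small Go sketch of the ports-and-adapters shape described above: the application core owns a port (interface) and is unaware of the concrete Authentication tool, which is wrapped by an adapter. All names are illustrative.

package main

import "fmt"

// ---- Application core (domain + application layers) ----

// Authenticator is a port owned by the core; the core never sees the concrete tool.
type Authenticator interface {
	Verify(token string) (userID string, err error)
}

// ReviewService is application-layer logic that depends only on the port.
type ReviewService struct {
	auth Authenticator
}

func (s ReviewService) PostReview(token, text string) error {
	userID, err := s.auth.Verify(token)
	if err != nil {
		return fmt.Errorf("not authenticated: %w", err)
	}
	fmt.Printf("review by %s: %q\n", userID, text) // domain logic would go here
	return nil
}

// ---- Adapter (infrastructure) ----

// OAuthAdapter hides a hypothetical external identity provider behind the port.
type OAuthAdapter struct{ issuer string }

func (a OAuthAdapter) Verify(token string) (string, error) {
	// Real code would call the provider; here we accept any non-empty token.
	if token == "" {
		return "", fmt.Errorf("empty token from issuer %s", a.issuer)
	}
	return "user-123", nil
}

func main() {
	svc := ReviewService{auth: OAuthAdapter{issuer: "https://auth.example.com"}}
	_ = svc.PostReview("some-token", "great podcast")
}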

Trend



https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-bi-data-trends-2022-en.pdf A competitor can become a partner, a partner can become a customer, and a customer can become a competitor. The solution is not to wall off but to lean in to a new form of competitive edge: generative relationships with mutually beneficial outcomes. Your only option is to become more “interwoven,” creating a trusted ecosystem built on clear rules of engagement.





Data Sharing Is a Business Necessity to Accelerate Digital Business: Gartner predicts that by 2023, organizations that promote data sharing will outperform their peers on most business value metrics. The traditional “don’t share data unless” mindset should be replaced with “must share data unless.”

Wednesday, April 21, 2021

Database and Application

https://tasteturnpike.blogspot.com/2017/03/sre-knowledge.html


https://www.alibabacloud.com/blog/what-are-the-differences-and-functions-of-the-redo-log-undo-log-and-binlog-in-mysql_598035

Redo: ensures the durability of transactions; it protects against dirty pages not yet having been written to disk at the point in time of the failure. When the MySQL service is restarted, changes are redone according to the redo log to achieve transaction durability.

Undo: it stores a version of the data from before the transaction, which can be used for rollback. At the same time, it provides reads under Multi-Version Concurrency Control (MVCC), i.e. reads without locking.

Binlog: 
  1. It is used for replication. In master-slave replication, the slave database replays the binlog stored in the master database to achieve master-slave synchronization.
  2. It is used for database point-in-time recovery.



Postgres


  1. Postgres Transaction Isolation
    1. dirty read

      A transaction reads data written by a concurrent uncommitted transaction.

      nonrepeatable read

      A transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).

      phantom read

      A transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.

      serialization anomaly

      The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.

    2. read uncommitted
    3. read committed
      1. Because Read Committed mode starts each command with a new snapshot that includes all transactions committed up to that instant, subsequent commands in the same transaction will see the effects of the committed concurrent transaction in any case. The point at issue above is whether or not a single command sees an absolutely consistent view of the database.
    4. repeatable read: sees data committed before the transaction began; it never sees either uncommitted data or changes committed during transaction execution by concurrent transactions
      1. Creates a snapshot for the transaction to ensure consistent reads. But if another transaction has modified the row in the meantime, the transaction must retry.
      2. Applications using this level must be prepared to retry transactions due to serialization failures (see the retry sketch after this list).
    5. serializable: This level emulates serial transaction execution for all committed transactions; as if transactions had been executed one after another, serially, rather than concurrently. However, like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures. In fact, this isolation level works exactly the same as Repeatable Read except that it monitors for conditions which could make execution of a concurrent set of serializable transactions behave in a manner inconsistent with all possible serial (one at a time) executions of those transactions. This monitoring does not introduce any blocking beyond that present in repeatable read, but there is some overhead to the monitoring, and detection of the conditions which could cause a serialization anomaly will trigger a serialization failure.
      1. serialization failure: concurrent transactions affect each other, and different execution orders would produce different results
      2. predicate locking (these show up in the pg_locks system view with a mode of SIReadLock): detects whether a write would affect a concurrent transaction. In PostgreSQL these locks do not cause any blocking and therefore can not play any part in causing a deadlock. They are used to identify and flag dependencies among concurrent Serializable transactions which in certain combinations can lead to serialization anomalies. In contrast, a Read Committed or Repeatable Read transaction which wants to ensure data consistency may need to take out a lock on an entire table, which could block other users attempting to use that table, or it may use SELECT FOR UPDATE or SELECT FOR SHARE which not only can block other transactions but cause disk access.
        1. Serializable predicate locking performs better than explicit locks
      3. PostgreSQL's Serializable transaction isolation level only allows concurrent transactions to commit if it can prove there is a serial order of execution that would produce the same effect. If unique-constraint violations can be checked in advance, prefer detecting them before the transaction.
      4. Optimizations
        1. Control the number of active connections, using a connection pool if needed. This is always an important performance consideration, but it can be particularly important in a busy system using Serializable transactions.
        2. Eliminate explicit locks, SELECT FOR UPDATE, and SELECT FOR SHARE where no longer needed due to the protections automatically provided by Serializable transactions.
  2. https://zhuanlan.zhihu.com/p/54979396 A survey of Snapshot Isolation
  3. Linearizability, serializability, transaction isolation and consistency models
  4. The most common isolation level implemented with MVCC is snapshot isolation
    1. MVCC introduces the challenge of how to remove versions that become obsolete and will never be read. In some cases, a process to periodically sweep through and delete the obsolete versions is implemented. This is often a stop-the-world process that traverses a whole table and rewrites it with the last version of each data item. PostgreSQL can use this approach with its VACUUM FREEZE process
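
A hedged Go sketch (database/sql plus the lib/pq driver) of the retry loop that Serializable clients need: PostgreSQL reports serialization failures as SQLSTATE 40001, and the application simply re-runs the whole transaction. The DSN and the UPDATE statement are made-up examples.

package main

import (
	"context"
	"database/sql"
	"errors"
	"log"

	"github.com/lib/pq" // Postgres driver; also exposes pq.Error for SQLSTATE checks
)

// withSerializableRetry runs fn inside a SERIALIZABLE transaction and retries
// on serialization failures (SQLSTATE 40001), as the docs above require.
func withSerializableRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	for attempt := 0; attempt < 5; attempt++ {
		tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			err = tx.Commit()
		} else {
			tx.Rollback()
		}
		var pqErr *pq.Error
		if errors.As(err, &pqErr) && pqErr.Code == "40001" {
			continue // serialization failure: retry the whole transaction
		}
		return err
	}
	return errors.New("gave up after repeated serialization failures")
}

func main() {
	db, err := sql.Open("postgres", "postgres://app:secret@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	err = withSerializableRetry(context.Background(), db, func(tx *sql.Tx) error {
		// Hypothetical read-modify-write that two concurrent clients might race on.
		_, err := tx.Exec(`UPDATE accounts SET balance = balance - 10 WHERE id = 1`)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}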

Postgres DB vacuum and query conflict

https://www.postgresql.org/docs/9.2/hot-standby.html#HOT-STANDBY-CONFLICT

https://www.cybertec-postgresql.com/en/what-hot_standby_feedback-in-postgresql-really-does/ With "hot_standby_feedback" we can teach the standby to periodically inform the primary about the oldest transaction running on the standby. If the primary knows about old transactions on the standby, it can make VACUUM keep rows until the standbys are done.



LevelDB RocksDB



  1. The main challenge is that flash cells can only be erased block-wise and written page-wise. To write new data to a page, it must be physically empty; if it is not, the content of the page has to be erased first. However, it is not possible to erase a single page, only all pages that are part of one block. Because the block sizes of an SSD are fixed (for example, 512 KB or 1024 KB, up to 4 MB), a block that contains a single page holding only 4 KB of data still occupies the full 512 KB.
    1. SSDs need to spread writes across the flash chips (wear leveling) to prevent uneven wear from degrading performance and endurance
  2. A delete tombstone keeps being compacted down until it reaches the bottommost level; only then is the key actually removed
  3. Commonly used (recently written) keys live in L0; everything else is compacted to the next level down. To speed up lookups, a bloom filter is used to determine whether a key might exist in the database
  4. L0: overlapping keys, sorted by flush time. Files are sorted based on the time they are flushed. Their key ranges (as defined by FileMetaData.smallest and FileMetaData.largest) mostly overlap with each other, so a lookup needs to check every L0 file
  5. L1+: non-overlapping keys, sorted by key (see the lookup sketch after this list)
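
A simplified Go sketch of the read path implied by the last two points: every L0 file is checked from newest to oldest because their key ranges overlap, while in L1+ a binary search by key range finds the single candidate file; a bloom-filter check can skip files that certainly lack the key. The types are illustrative, not RocksDB's.

package main

import (
	"fmt"
	"sort"
)

// sstFile is a toy stand-in for an SSTable's metadata plus contents.
type sstFile struct {
	smallest, largest string            // key range (FileMetaData.smallest/largest)
	mayContain        func(string) bool // bloom filter: false => key definitely absent
	data              map[string]string
}

func (f sstFile) get(key string) (string, bool) {
	if key < f.smallest || key > f.largest || !f.mayContain(key) {
		return "", false
	}
	v, ok := f.data[key]
	return v, ok
}

// get searches L0 newest-first (overlapping ranges), then each sorted level L1+.
func get(l0 []sstFile, levels [][]sstFile, key string) (string, bool) {
	// L0: files are ordered by flush time; every file may contain the key.
	for i := len(l0) - 1; i >= 0; i-- {
		if v, ok := l0[i].get(key); ok {
			return v, ok
		}
	}
	// L1+: non-overlapping, sorted by key; binary-search for the one candidate file.
	for _, level := range levels {
		i := sort.Search(len(level), func(i int) bool { return level[i].largest >= key })
		if i < len(level) {
			if v, ok := level[i].get(key); ok {
				return v, ok
			}
		}
	}
	return "", false
}

func main() {
	always := func(string) bool { return true } // stand-in bloom filter
	l0 := []sstFile{{"a", "z", always, map[string]string{"k": "old"}},
		{"a", "z", always, map[string]string{"k": "new"}}}
	l1 := [][]sstFile{{{"a", "m", always, map[string]string{"b": "1"}},
		{"n", "z", always, map[string]string{"q": "2"}}}}
	fmt.Println(get(l0, l1, "k")) // "new" wins: the newest L0 file is checked first
	fmt.Println(get(l0, l1, "q"))
}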


Lock

How to implement mutex?
  • One way is using Test-and-set (spinlock)
  • Futexes have the desirable property that they do not require a kernel system call in the common cases of locking or unlocking an uncontended mutex. In these cases, the user-mode code successfully uses an atomic compare and swap (CAS)
Test-and-set

  • Test-and-set: writes 1 (set) to a memory location and returns its old value as a single atomic (i.e., non-interruptible) operation; the caller only obtains the lock if the test of the old value passes. Supported at the machine level (a CPU instruction).
    • while (test_and_set(lock) == 1); # the calling process obtains the lock if the old value was 0; otherwise the while-loop spins waiting to acquire the lock. This is called a spinlock.
  • Test and test-and-set does not spin on test_and_set(); it spins on an ordinary read, checking whether the shared lock variable seems free, and only then attempts the atomic test_and_set()
  • Performance: when processor P1 has obtained a lock and processor P2 is also waiting for the lock, P2 will keep incurring bus transactions in attempts to acquire the lock. When a processor has obtained a lock, all other processors which also wish to obtain the same lock keep trying to obtain it by initiating bus transactions repeatedly until they get hold of it. This increases the bus traffic requirement of test-and-set significantly, slows down all other traffic from cache and coherence misses, and slows down the critical section overall, since the traffic is saturated by failed lock-acquisition attempts. Test-and-test-and-set is an improvement over TSL since it does not initiate lock-acquisition requests continuously (a Go sketch of both variants follows).
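
A Go sketch of the two variants using sync/atomic, with compare-and-swap playing the role of the test-and-set instruction: the TTAS lock spins on a plain load and only attempts the atomic read-modify-write when the lock looks free, which avoids hammering the cache line / bus.

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// tasLock: spin directly on the atomic test-and-set (CAS 0 -> 1).
type tasLock struct{ state int32 }

func (l *tasLock) Lock() {
	for !atomic.CompareAndSwapInt32(&l.state, 0, 1) {
		// every failed attempt is still an atomic RMW -> cache-line/bus traffic
	}
}
func (l *tasLock) Unlock() { atomic.StoreInt32(&l.state, 0) }

// ttasLock: spin on a cheap read first, only then try the atomic operation.
type ttasLock struct{ state int32 }

func (l *ttasLock) Lock() {
	for {
		for atomic.LoadInt32(&l.state) == 1 {
			runtime.Gosched() // back off; a user-space spinlock must not hog the core
		}
		if atomic.CompareAndSwapInt32(&l.state, 0, 1) {
			return
		}
	}
}
func (l *ttasLock) Unlock() { atomic.StoreInt32(&l.state, 0) }

func main() {
	var lock ttasLock
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				lock.Lock()
				counter++ // protected critical section
				lock.Unlock()
			}
		}()
	}
	wg.Wait()
	println("counter:", counter) // 8000
}
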
Spinlock
  • Pro: it avoids overhead from operating-system process rescheduling or context switching, so spinlocks are efficient if threads are likely to be blocked for only short periods. For this reason, some multithreaded synchronization mechanisms avoid kernel-mode synchronization objects and instead synchronize with user-mode spinlocks or their derivatives (such as lightweight reader-writer locks); the time cost differs by about three orders of magnitude.
  • Con: a single-core, single-threaded CPU is not suited to spinlocks (the machine can appear to hang while spinning), and CPU time is wasted while spinning.



Nginx



On a four-core server, the NGINX master process creates four worker processes and a couple of cache helper processes which manage the on-disk content cache.

Nginx worker processes share the same listening socket: if accept_mutex is enabled, worker processes will accept new connections in turn.


With the SO_REUSEPORT option enabled, there are multiple socket listeners for each IP address and port combination, one for each worker process.
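
A hedged Go sketch of the SO_REUSEPORT pattern (using golang.org/x/sys/unix): each "worker" (a goroutine here, a worker process in NGINX) opens its own listener on the same IP:port, and the kernel spreads new connections across them. The port is arbitrary.

package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListen opens a TCP listener with SO_REUSEPORT set before bind,
// so several listeners can share the same address:port.
func reusePortListen(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	// Four independent listeners on the same port, one per "worker".
	for i := 0; i < 4; i++ {
		ln, err := reusePortListen("127.0.0.1:8080")
		if err != nil {
			panic(err)
		}
		worker := i
		go http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintf(w, "served by worker %d\n", worker)
		}))
	}
	select {} // the kernel load-balances new connections across the listeners
}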




Rsync: incremental backup

  1. Compare directory differences: check the files in the subtree
  2. Decide whether a file needs updating: check file metadata (mtime, size)
  3. Transfer only the needed data: compute a rolling hash over the new file (f_new) with a sliding window to find blocks that already exist on the other side (see the sketch after this list)
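
A small Go sketch of the sliding-window rolling checksum (Adler-32-style, like rsync's weak hash): advancing the window one byte updates two running sums in O(1) instead of rehashing the whole block. Block size and sample data are arbitrary.

package main

import "fmt"

const blockSize = 4

// weakSum computes the two-part rolling checksum of a block (rsync-style).
func weakSum(block []byte) (a, b uint32) {
	n := uint32(len(block))
	for i, c := range block {
		a += uint32(c)
		b += (n - uint32(i)) * uint32(c)
	}
	return a, b
}

// roll slides the window one byte to the right: drop `out`, add `in`, in O(1).
func roll(a, b uint32, out, in byte, n uint32) (uint32, uint32) {
	a = a - uint32(out) + uint32(in)
	b = b - n*uint32(out) + a
	return a, b
}

func main() {
	data := []byte("hello world, hello rsync")

	// Checksum of the first window, then slide across f_new one byte at a time.
	a, b := weakSum(data[:blockSize])
	for i := blockSize; i < len(data); i++ {
		// In real rsync the (a,b) pair is looked up in a table of the receiver's
		// block checksums; on a hit, a strong hash (e.g. MD5) confirms the match.
		fmt.Printf("window [%d,%d): a=%d b=%d\n", i-blockSize, i, a, b)
		a, b = roll(a, b, data[i-blockSize], data[i], blockSize)
	}
}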




SAP

Availability

The standby server will check each HADR member to determine whether it is eligible for promotion: https://help.sap.com/docs/SAP_ASE/efe56ad3cad0467d837c8ff1ac6ba75c/a6c69a21bc2b1014adda8a01ba6488fc.html -- However, a network partition among HADR members will prevent all standby servers from being promoted.

Scalability

Table partitioning is the solution.