Ingress

Ingress Controller (Edge Proxy, Pods): make Ingress resource to work, the cluster must have an ingress controller running. Ingress controllers are not started automatically with a cluster.

AWS: Ingress = Application Load Balancers (L7 HTTP), Service-LoadBalancer = Network Load Balancers (L4). AWS Ingress Controller is a k8s Deployment of Pod
Nginx: It is also pods. In AWS we use a Network load balancer (NLB) to expose the NGINX Ingress controller behind a Service of Type=LoadBalancer.

Use Service.Type=LoadBalancer

The big downside is that each service you expose with a LoadBalancer will get its own IP address, and you have to pay for a LoadBalancer per exposed service, which can get expensive!

Use Service.Type=NodePort

https://www.getambassador.io/learn/kubernetes-ingress/

https://stackoverflow.com/questions/45079988/ingress-vs-load-balancer

https://medium.com/google-cloud/kubernetes-nodeport-vs-loadbalancer-vs-ingress-when-should-i-use-what-922f010849e0

Networking

CNI primarily works at L4 layer whereas service mesh works at L7 layer.

There are lots of different kinds of CNI plugins, but the two main ones are:

Network plugins, which are responsible for connecting pods to the network
IPAM (IP Address Management) plugins, which are responsible for allocating pod IP addresses

Service Mesh

Service Mesh implementation: Before the sidecar proxy container and application container are started, the Init container started firstly. The Init container is used to set iptables (the default traffic interception method in Istio, and can also use BPF, IPVS, etc.) to Intercept traffic entering the pod to Envoy sidecar Proxy. All TCP traffic (Envoy currently only supports TCP traffic) will be Intercepted by sidecar, and traffic from other protocols will be requested as originally

https://jimmysong.io/en/blog/understanding-how-envoy-sidecar-intercept-and-route-traffic-in-istio-service-mesh/

Comparison: sidecar proxy vs per node vs per service account per node vs shared remote proxy with micro proxy: https://www.solo.io/blog/ebpf-for-service-mesh/

consider resource overhead / feature isolation / security granularity / upgrade impact
For Linkerd: Per-host proxies are significantly worse than sidecars https://buoyant.io/2022/06/07/ebpf-sidecars-and-the-future-of-the-service-mesh

kube-proxy

kube-proxy is responsible for updating the iptables rules on each node of the cluster. https://betterprogramming.pub/k8s-a-closer-look-at-kube-proxy-372c4e8b090

https://www.tkng.io/services/clusterip/dataplane/iptables/

eBPF & io_uring

eBPF is a virtual machine embedded within the Linux kernel. It allows small programs to be loaded into the kernel, and attached to hooks, which are triggered when some event occurs. This allows the behaviour of the kernel to be (sometimes heavily) customised. While the eBPF virtual machine is the same for each type of hook, the capabilities of the hooks vary considerably. Since loading programs into the kernel could be dangerous; the kernel runs all programs through a very strict static verifier; the verifier sandboxes the program, ensuring it can only access allowed parts of memory and ensuring that it must terminate quickly. https://projectcalico.docs.tigera.io/about/about-ebpf

io_uring supports linking operations, but there is no way to generically pass the result of one system call to the next. With a simple bpf program, the application can tell the kernel how the result of open is to be passed to read — including the error handling, which then allocates its own buffers and keeps reading until the entire file is consumed and finally closed: we can checksum, compress, or search an entire file with a single system call.

https://www.scylladb.com/2020/05/05/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

Routing

Cloudflare --proxied--> AWS Route 53 --> ELB (Ingress-managed HA Cloud LoadBalancer) --> EC2 instances (Target Group nodes) --> Ingress Controller Pods (can be deployment or DaemonSet) -> Actual backend pods

Don't use DaemonSet when cluster size is too big - extra burden as each DaemonSet need to connect to k8s API

Some good readings regarding Data Lake.

Airflow

Airflow serves as an orchestration tool, the whole data flow:

Trigger Airbyte for data from 3rd party to S3
Trigger Spark (EMR) job from ETL S3 to data lake bronze layer
Trigger DBT job from bronze layer to silver layer (Redshift)
Trigger Jupyter notebook script with Papermill operator that has data analysis logic

GDPR

For GDPR, we should support a mechanism to delete record in raw S3 / bronze / silver / gold layer. Primary key is important, or use period compaction to override changes.

How to evaluate data lake tools

LakeFS

https://www.upsolver.com/blog/getting-data-lake-etl-right-6-guidelines-evaluating-tools
- ETL/ELT transformation engine
- GPDR deletion records
- Object time travel/Data mutation
- Transaction ACID
- Streaming Batching
https://www.slideshare.net/databricks/a-thorough-comparison-of-delta-lake-iceberg-and-hudi
https://blog.csdn.net/younger_china/article/details/125926533 数据湖09：开源框架DeltaLake、Hudi、Iceberg深度对比

https://www.infoq.cn/article/fjebconxd2sz9wloykfo
https://eric-sun.medium.com/rescue-to-distributed-file-system-2dd8abd5d80d Delta Lake > Hudi > Iceberg

Data Versioning

lakeFS deletion regarding GDPR
https://medium.com/datamindedbe/what-is-lakefs-a-critical-survey-edce708a9b8e
https://lakefs.io/new-in-lakefs-data-retention-policies/

GDPR deletion request: Crypto shredding: How it can solve modern data retention challenges:

100B key per user
MemoryDB to hold all keys in memory

AWS Datalake Solution

CENTRALIZE GOVERNANCE FOR YOUR DATA LAKE USING AWS LAKE FORMATION WHILE ENABLING A MODERN DATA ARCHITECTURE WITH AMAZON REDSHIFT SPECTRUM

Building a Data Lake on AWS with Apache Airflow: https://youtu.be/RqjmC8iZEUo?t=320 https://github.com/garystafford/tickit-data-lake-demo

Data Mesh

https://martinfowler.com/articles/data-monolith-to-mesh.html

Architectural failure modes:

Centralized and monolithic - build experimental data pipeline is slow, let users builds it

Coupled pipeline decomposition - build new dataset depends on other teams

Siloed and hyper-specialized ownership - data engineer doesn't care about data

The next enterprise data platform architecture is in the convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.

The key to building the data infrastructure as a platform is (a) to not include any domain specific concepts or business logic, keeping it domain agnostic, and (b) make sure the platform hides all the underlying complexity and provides the data infrastructure components in a self-service manner. There is a long list of capabilities that a self-serve data infrastructure as a platform provides to its users, a domain's data engineers. Here are a few of them:

Scalable polyglot big data storage
Encryption for data at rest and in motion
Data product versioning
Data product schema
Data product de-identification
Unified data access control and logging
Data pipeline implementation and orchestration
Data product discovery, catalog registration and publishing
Data governance and standardization
Data product lineage
Data product monitoring/alerting/log
Data product quality metrics (collection and sharing)
In memory data caching
Federated identity management
Compute and data locality

A success criteria for self-serve data infrastructure is lowering the 'lead time to create a new data product' on the infrastructure.

This paradigm shift requires a new set of governing principles accompanied with a new language:

serving over ingesting
discovering and using over extracting and loading
Publishing events as streams over flowing data around via centralized pipelines
Ecosystem of data products over centralized data platform

https://martinfowler.com/articles/data-mesh-principles.html

four underpinning principles that any data mesh implementation embodies to achieve the promise of scale, while delivering quality and integrity guarantees needed to make data usable : 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance.

Domain ownership

For example, the teams who manage ‘podcasts’, while providing APIs for releasing podcasts, should also be responsible for providing historical data that represents ‘released podcasts’ over time with other facts such as ‘listenership’ over time.

Data as a product

Each domain will include data product developer roles, responsible for building, maintaining and serving the domain's data products. Data product developers will be working alongside other developers in the domain. Each domain team may serve one or multiple data products. It’s also possible to form new teams to serve data products that don’t naturally fit into an existing operational domain.

Self-serve data platform

My personal hope is that we start seeing a convergence of operational and data infrastructure where it makes sense. For example, perhaps running Spark on the same orchestration system, e.g. Kubernetes.

Federated computational governance

striking a balance between what shall be standardized globally, implemented and enforced by the platform for all domains and their data products, and what shall be left to the domains to decide, is an art.

they need to comply with the modeling of quality and specification of SLOs based on a global standard, defined by the global federated governance team, and automated by the platform.

DDD Hexagonal

https://herbertograca.com/2017/11/16/explicit-architecture-01-ddd-hexagonal-onion-clean-cqrs-how-i-put-it-all-together/#application-core-organisation

Application core = business logic

Domain Layer. The objects in this layer contain the data and the logic to manipulate that data, that is specific to the Domain itself and it’s independent of the business processes that trigger that logic, they are independent and completely unaware of the Application Layer

Examples of components can be Authentication, Authorization, Billing, User, Review or Account, but they are always related to the domain. Bounded contexts like Authorization and/or Authentication should be seen as external tools for which we create an adapter and hide behind some kind of port.

The goal, as always, is to have a codebase that is loosely coupled and high cohesive

Trend

https://www.gartner.com/smarterwithgartner/gartner-top-10-data-and-analytics-trends-for-2021

https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-bi-data-trends-2022-en.pdf A competitor can become a partner, a partner can become a customer, and a customer can become a competitor. The solution is not to wall off but to lean in to a new form of competitive edge: generative relationships with mutually beneficial outcomes. Your only option is to become more “interwoven,” creating a trusted ecosystem built on clear rules of engagement.

Data Sharing Is a Business Necessity to Accelerate Digital Business: Gartner predicts that by 2023, organizations that promote data sharing will outperform their peers on most business value metrics. The traditional “don’t share data unless” mindset should be replaced with “must share data unless.”

Altruism

2021年11月11日星期四

Kubernetes