Some good readings on Data Lakes.
Airflow
Airflow serves as the orchestration tool for the whole data flow (see the DAG sketch after this list):
- Trigger Airbyte to sync data from third-party sources into S3
- Trigger a Spark (EMR) ETL job from raw S3 into the data lake bronze layer
- Trigger a dbt job from the bronze layer into the silver layer (Redshift)
- Trigger a Jupyter notebook, via the Papermill operator, that holds the data analysis logic
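A minimal sketch of such a DAG, assuming Airflow 2 with the Airbyte, Amazon, dbt Cloud, and Papermill provider packages installed; the connection ids, EMR cluster id, dbt job id, and S3 paths are all hypothetical placeholders:

```python
import pendulum
from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="lake_pipeline",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule="@daily",
    catchup=False,
) as dag:
    # 1. Airbyte syncs third-party data into raw S3.
    ingest = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",
        connection_id="<airbyte-connection-uuid>",
    )

    # 2. Spark on EMR transforms raw S3 into the bronze layer.
    bronze = EmrAddStepsOperator(
        task_id="spark_bronze",
        job_flow_id="<emr-cluster-id>",
        steps=[{
            "Name": "raw_to_bronze",
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://lake/jobs/raw_to_bronze.py"],
            },
        }],
    )

    # 3. dbt builds the silver layer in Redshift.
    silver = DbtCloudRunJobOperator(
        task_id="dbt_silver",
        dbt_cloud_conn_id="dbt_cloud_default",
        job_id=12345,
    )

    # 4. Papermill runs the analysis notebook against the fresh data.
    analysis = PapermillOperator(
        task_id="notebook_analysis",
        input_nb="s3://lake/notebooks/analysis.ipynb",
        output_nb="s3://lake/notebooks/out/analysis-{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )

    ingest >> bronze >> silver >> analysis
```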
GDPR
For GDPR, we should support a mechanism to delete records in the raw S3 / bronze / silver / gold layers. A primary key is important, or use periodic compaction to overwrite changes (see the sketch below).
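A minimal sketch of a cross-layer delete, assuming the layers are stored as Delta Lake tables keyed by user_id; the table paths are hypothetical, and Hudi and Iceberg offer similar row-level deletes:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths; in practice each layer may use a different table/key.
LAYER_PATHS = [
    "s3://lake/raw/events",
    "s3://lake/bronze/events",
    "s3://lake/silver/users",
    "s3://lake/gold/user_metrics",
]

def gdpr_delete_user(user_id: str) -> None:
    for path in LAYER_PATHS:
        # Logical delete keyed on the primary key; run VACUUM afterwards so
        # the old data files containing the records are physically removed.
        DeltaTable.forPath(spark, path).delete(f"user_id = '{user_id}'")
```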
How to evaluate data lake tools
- LakeFS
- https://lakefs.io/hudi-iceberg-and-delta-lake-data-lake-table-formats-compared/
- https://lakefs.io/hive-metastore-why-its-still-here-and-what-can-replace-it/
- Data Lakes: The Definitive Guide (lakefs.io)
- https://www.upsolver.com/blog/getting-data-lake-etl-right-6-guidelines-evaluating-tools
- ETL/ELT transformation engine
- GDPR record deletion
- Object time travel / data mutation
- ACID transactions
- Streaming and batching
- https://www.slideshare.net/databricks/a-thorough-comparison-of-delta-lake-iceberg-and-hudi
- https://blog.csdn.net/younger_china/article/details/125926533 (Data Lake 09: an in-depth comparison of the open-source frameworks Delta Lake, Hudi, and Iceberg)
Data Versioning
lakeFS deletion for GDPR
https://medium.com/datamindedbe/what-is-lakefs-a-critical-survey-edce708a9b8e
https://lakefs.io/new-in-lakefs-data-retention-policies/
GDPR deletion requests via crypto shredding (from "Crypto shredding: how it can solve modern data retention challenges"):
- 100B key per user
- MemoryDB holds all keys in memory; dropping a user's key makes their data unrecoverable (see the sketch after this list)
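A minimal sketch of the crypto-shredding idea, assuming Python's cryptography library; an in-process dict stands in for the MemoryDB key store:

```python
from cryptography.fernet import Fernet

# In production the key store would be MemoryDB / a KMS; a dict stands in here.
key_store: dict[str, bytes] = {}

def encrypt_for_user(user_id: str, plaintext: bytes) -> bytes:
    # Every user gets their own key; all of their records are encrypted with it.
    key = key_store.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(plaintext)

def decrypt_for_user(user_id: str, ciphertext: bytes) -> bytes:
    return Fernet(key_store[user_id]).decrypt(ciphertext)

def gdpr_delete(user_id: str) -> None:
    # Crypto shredding: dropping the key renders all of the user's ciphertext
    # permanently unreadable, even in immutable raw/bronze layers.
    key_store.pop(user_id, None)
```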
AWS Data Lake Solution
Data Mesh
The key to building the data infrastructure as a platform is (a) to not include any domain specific concepts or business logic, keeping it domain agnostic, and (b) make sure the platform hides all the underlying complexity and provides the data infrastructure components in a self-service manner. There is a long list of capabilities that a self-serve data infrastructure as a platform provides to its users, a domain's data engineers. Here are a few of them:
- Scalable polyglot big data storage
- Encryption for data at rest and in motion
- Data product versioning
- Data product schema
- Data product de-identification
- Unified data access control and logging
- Data pipeline implementation and orchestration
- Data product discovery, catalog registration and publishing
- Data governance and standardization
- Data product lineage
- Data product monitoring/alerting/logging
- Data product quality metrics (collection and sharing)
- In memory data caching
- Federated identity management
- Compute and data locality
A success criterion for self-serve data infrastructure is lowering the 'lead time to create a new data product' on the infrastructure.
This paradigm shift requires a new set of governing principles accompanied with a new language:
- Serving over ingesting
- Discovering and using over extracting and loading
- Publishing events as streams over flowing data around via centralized pipelines
- Ecosystem of data products over centralized data platform
There are four underpinning principles that any data mesh implementation embodies to achieve the promise of scale while delivering the quality and integrity guarantees needed to make data usable: 1) domain-oriented decentralized data ownership and architecture, 2) data as a product, 3) self-serve data infrastructure as a platform, and 4) federated computational governance.
My personal hope is that we start seeing a convergence of operational and data infrastructure where it makes sense. For example, perhaps running Spark on the same orchestration system, e.g. Kubernetes.
Federated computational governance
Striking a balance between what shall be standardized globally (implemented and enforced by the platform for all domains and their data products) and what shall be left to the domains to decide is an art.
Data products need to comply with quality modeling and SLO specification based on a global standard, defined by the global federated governance team and automated by the platform.
DDD / Hexagonal Architecture
Examples of components are Authentication, Authorization, Billing, User, Review, or Account; they are always related to the domain. Bounded contexts like Authorization and/or Authentication should be treated as external tools: we create an adapter for them and hide them behind some kind of port (see the sketch below).
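A minimal port/adapter sketch in Python; `ExternalSsoClient` and the method names are hypothetical stand-ins for a third-party auth SDK:

```python
from abc import ABC, abstractmethod

# Port: the domain-side interface; the domain depends only on this.
class AuthenticationPort(ABC):
    @abstractmethod
    def authenticate(self, token: str) -> str:
        """Return the authenticated user id, or raise on failure."""

# Adapter: hides the external auth tool behind the port.
class SsoAuthenticationAdapter(AuthenticationPort):
    def __init__(self, client):  # client: a hypothetical ExternalSsoClient
        self._client = client

    def authenticate(self, token: str) -> str:
        return self._client.verify(token)["user_id"]

# Domain service depends on the port, never on the SSO SDK directly,
# so the external tool can be swapped without touching domain logic.
class ReviewService:
    def __init__(self, auth: AuthenticationPort):
        self._auth = auth

    def post_review(self, token: str, text: str) -> None:
        user_id = self._auth.authenticate(token)
        print(f"review by {user_id}: {text}")
```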