Ingestion
- Ingestion = moving data from source systems into storage
- frequency
- batch vs streaming
- batch : slower = daily or hourly
- streaming : faster = seconds to sub-seconds. real-time
- micro-batch : combination of batch and streaming
Batch ingestion
- convenient
- higher latency (daily or hourly, not real-time)
- more forgiving
- Type
- ETL : Extract → Transform → Load
- traditional data warehouse : clean first → put into the DW
- why does ETL need cleaning? the DW is expensive!
- most common
- ELT : Extract → Load → Transform
- new trend
- store raw data with little cleaning
- for a data lakehouse : cheaper than a DW
- see the sketch below contrasting the two
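A minimal sketch contrasting the two flows; `extract()`, `clean()`, and the list-based "warehouse"/"lakehouse" are hypothetical stand-ins, not a real stack:

```python
def extract():
    # Stand-in for pulling raw rows from a source system.
    return [{"user_id": 1, "amount": " 42.50 "}, {"user_id": 2, "amount": "7.00"}]

def clean(row):
    # The Transform step: normalize types before the data is used.
    return {"user_id": row["user_id"], "amount": float(row["amount"].strip())}

def etl(warehouse):
    # ETL: transform first, load only clean rows into the (expensive) DW.
    warehouse.extend(clean(r) for r in extract())

def elt(lakehouse):
    # ELT: load raw rows as-is into cheap storage; transform later, on demand.
    lakehouse.extend(extract())

warehouse, lakehouse = [], []
etl(warehouse)
elt(lakehouse)
print(warehouse)   # cleaned rows
print(lakehouse)   # raw rows, still to be transformed
```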
Streaming ingestion
- error handling by dead-letter queue : reroute failed events (see the sketch after this list)
- challenge
- strict latency
- late-arriving data
- delivered out of order
- duplicate delivery : the same event may arrive more than once
- retention period
- too short → lost data
- too long → slow performance
- message size
- scalability
- ensure auto-scaling based on workload
- buffer for fluctuating event volumes
- schema change
- structured data = fixed schema : if you add a column → breaks the schema
- communicate with upstream teams; use validation and tests to catch schema changes
- schema migration is not perfect
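A minimal dead-letter-queue sketch in Python; the `process()` validation and the in-memory list are hypothetical stand-ins for a real broker and DLQ topic:

```python
import json

dead_letter_queue = []  # stand-in for a real DLQ topic/queue

def process(event):
    # Validation stands in for checking events against the expected schema.
    if "user_id" not in event:
        raise ValueError("missing user_id")
    return event

def handle(raw):
    try:
        return process(json.loads(raw))
    except Exception as exc:
        # Reroute the failed event instead of blocking the stream;
        # it can be inspected and replayed after the upstream fix.
        dead_letter_queue.append({"raw": raw, "error": str(exc)})
        return None

handle('{"user_id": 1}')      # processed normally
handle('{"name": "no id"}')   # rerouted to the DLQ
print(dead_letter_queue)
```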
How to ingest data
- DB connection : connect directly to the database
- row-based
- CDC (Change Data Capture) : capture the change log (before and after images); see the sketch below
- other : APIs, managed connectors, streaming
- API : a lot of manual work to connect, error-prone
- Managed connectors : pre-built APIs, still need some custom code
- Streaming : real-time data
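A sketch of row-based CDC, assuming the source database exposes a change log with before/after images (the log format here is made up):

```python
change_log = [
    {"op": "insert", "before": None, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "update", "before": {"id": 1, "email": "a@x.com"},
                     "after":  {"id": 1, "email": "a@y.com"}},
    {"op": "delete", "before": {"id": 1, "email": "a@y.com"}, "after": None},
]

replica = {}  # target table rebuilt purely from the captured changes

for change in change_log:
    if change["op"] == "delete":
        replica.pop(change["before"]["id"], None)
    else:
        row = change["after"]
        replica[row["id"]] = row  # insert or update

print(replica)  # {} -- all three changes replayed; the row ended up deleted
```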
Transformation
- Transformation : change the shape of data to fit business needs
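A tiny example of what "changing shape for business" can mean; the field names are hypothetical:

```python
# Reshape raw events into the form the business reports on.
raw = [
    {"ts": "2024-05-01T10:00:00", "amount_cents": 4250, "cur": "USD"},
    {"ts": "2024-05-01T11:30:00", "amount_cents": 700, "cur": "USD"},
]

def to_business_row(event):
    return {
        "date": event["ts"][:10],               # keep only the day
        "amount": event["amount_cents"] / 100,  # cents -> dollars
        "currency": event["cur"],
    }

print([to_business_row(e) for e in raw])
```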
Query
- asking questions about data
- retrieving data from storage & manipulating it
- type
- DDL : definition. CREATE, DROP, ALTER, etc.
- DML : manipulation. SELECT, INSERT, UPDATE, DELETE
- DCL : control. GRANT, etc.
- process
- query issued
- parsing : validate query
- query planning : how to get answer from the storage system
- optimization : big differences in performance and cost
- How to
- reduce the search space : always filter/prune first
- use the EXPLAIN SQL statement for complex queries (example below)
- pre-join frequently joined tables : joining multiple tables is expensive
- vacuum dead records : unused records take up space & interfere with the optimizer
- leverage caching
- cache expensive queries → improve average performance
- but may return stale (out-of-date) results
- execution : results come back
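An example of reading a query plan, using SQLite's EXPLAIN QUERY PLAN as a stand-in for your warehouse's EXPLAIN; the table and index are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount REAL)")
con.execute("CREATE INDEX idx_orders_user ON orders (user_id)")

# Filtering on an indexed column lets the planner search the index
# instead of scanning the whole table.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42"):
    print(row)  # e.g. (..., 'SEARCH orders USING INDEX idx_orders_user (user_id=?)')
```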
- query streaming data
- window : group events into tiny batches (sketch below)
- enrichment : streaming data + other data
- stream-to-stream join : combine multiple streams
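A minimal tumbling-window sketch: group events into 60-second buckets, then aggregate each bucket (timestamps and values are made up):

```python
from collections import defaultdict

events = [(100, 5), (130, 3), (95, 2), (185, 7)]  # (epoch_seconds, value), possibly out of order

windows = defaultdict(list)
for ts, value in events:
    windows[ts // 60 * 60].append(value)  # bucket by 60s window start

for start in sorted(windows):
    print(f"window [{start}, {start + 60}): sum={sum(windows[start])}")
```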
Data modeling
- data model : simplified version of the real world
- purpose : understanding business data & consistent definitions
- need
- communicate with data consumers
- model the data at the lowest granularity possible
- e.g.) don't aggregate first; create the raw data table first
- step
- conceptual : visualize data entities with an ER diagram (ERD)
- logical : visualize data relationships with an ER diagram (ERD)
- physical : visualize data entities, relationships, and FK/PK with an ER diagram (ERD)
- normalization
- denormalized tables : convenient (no need to search another table) but duplicated
- if one of the duplicated rows fails to update → corrupted data (see the sketch below)
- normalized tables : no duplicates, separate data tables
- DRY : "Don't Repeat Yourself"
- create normalized tables first, then denormalize
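A sketch of the update anomaly these notes warn about; the tables are hypothetical:

```python
# Denormalized: the user's email is duplicated across order rows.
denormalized_orders = [
    {"order_id": 1, "user_id": 7, "email": "a@x.com"},
    {"order_id": 2, "user_id": 7, "email": "a@x.com"},
]
# An update that fails halfway touches only the first row...
denormalized_orders[0]["email"] = "a@y.com"
# ...so user 7 now has two different emails -> corrupted data.

# Normalized (DRY): each fact lives in exactly one place.
users = {7: {"email": "a@x.com"}}
orders = [{"order_id": 1, "user_id": 7}, {"order_id": 2, "user_id": 7}]
users[7]["email"] = "a@y.com"  # one update, nothing can drift
```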
- technique
- Inmon : highly normalized data
- Kimball : denormalized data
- flexibility
- faster interaction
- duplicates data using a star schema (fact tables and dimension tables)
- data vault : 3 entities
- hubs : core business concepts
- links : relationships between hubs
- satellites : store descriptive information about hubs
- wide tables : join many related tables into a single, highly denormalized table
- more columns than before = wide
- the denormalized (wide table) pattern is becoming popular!
- less rigid (strict) than other patterns
- faster analytical queries : because of pre-joining (sketch below)
- normalized patterns were invented when storage was expensive & tightly coupled with compute → data had to be modeled carefully
- these days storage is cheap → denormalized form for analysis
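A sketch of pre-joining a star schema into a wide table, using SQLite; the table and column names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_user (user_id INT PRIMARY KEY, country TEXT);
CREATE TABLE fact_sales (sale_id INT, user_id INT, amount REAL);
INSERT INTO dim_user VALUES (1, 'KR'), (2, 'US');
INSERT INTO fact_sales VALUES (10, 1, 42.5), (11, 2, 7.0);

-- Denormalized wide table: more columns, no join at query time.
CREATE TABLE wide_sales AS
SELECT s.sale_id, s.amount, u.user_id, u.country
FROM fact_sales s JOIN dim_user u USING (user_id);
""")

# Analytical query hits one pre-joined table instead of fact + dimension.
print(con.execute("SELECT country, SUM(amount) FROM wide_sales GROUP BY country").fetchall())
```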
- orchestration : it takes a lot of queries to transform raw data into target data → complex
- e.g.) Airflow, Dagster
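A minimal Airflow-style DAG sketch (assumes a recent Airflow 2.x install; the DAG id, task ids, and lambda bodies are hypothetical placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("raw_to_target", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # Dependencies encode the ordering of the transformation queries.
    extract >> transform >> load
```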
- SQL vs Code
- when queries are easy to express in SQL → SQL
- when queries are complex → Python or PySpark
- Views and materializations
- views : virtual tables, not stored in physical storage
- re-computed every time you query the view
- data is always fresh
- materialization : store the query result in physical storage (example below)
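An example of the fresh-vs-stale trade-off in SQLite, where CREATE TABLE AS stands in for a materialized view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (amount REAL);
INSERT INTO sales VALUES (1.0), (2.0);
CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales;   -- virtual, recomputed
CREATE TABLE m_total AS SELECT SUM(amount) AS total FROM sales;  -- materialized snapshot
""")

con.execute("INSERT INTO sales VALUES (10.0)")
print(con.execute("SELECT total FROM v_total").fetchall())  # [(13.0,)] always fresh
print(con.execute("SELECT total FROM m_total").fetchall())  # [(3.0,)]  stale until refreshed
```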
- Events vs States
- event : immutable, doesn't change once recorded
- state : mutable, changes over time
- Streaming vs batch
- streaming : a continuous stream of immutable events
- requires more computing power & is more costly
- low-latency, real-time
- batch : a window over this continuous stream of events
- sorted and aggregated
Serving
- the final step of the data pipeline
- analytics : BI
- which metrics are useful : put yourself in data consumer’s shoes
- Trust is very hard to gain but easy to lose
- monitoring metrics
- communicate with data consumers
- ML/AI
- scalability : ML/AI dataset can be very large
- local notebook won’t be enough → cloud based notebook
- feature stores : for classic ML
- object storage
- data lakehouse
- ML vs DL

| | ML | DL |
| --- | --- | --- |
| data | tabular / structured | unstructured |
| features | limited | large |
| dataset size | smaller | larger |
| compute | CPU | GPU |
- Reverse ETL : storage → source system
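A reverse-ETL sketch: read a served metric from storage and push it back into a source system. The CRM endpoint and score table are hypothetical, and the actual request is left disabled:

```python
import json
import sqlite3
import urllib.request

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user_scores (user_id INT, churn_risk REAL);
INSERT INTO user_scores VALUES (1, 0.82), (2, 0.10);
""")

for user_id, risk in con.execute("SELECT user_id, churn_risk FROM user_scores"):
    payload = json.dumps({"user_id": user_id, "churn_risk": risk}).encode()
    req = urllib.request.Request(
        "https://crm.example.com/api/users",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # urllib.request.urlopen(req)  # disabled: no real endpoint in this sketch
    print("would send:", payload)
```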
Reference
- Kimball and Inmon DW models: https://bennyaustin.com/2010/05/02/kimball-and-inmon-dw-models/