Ingestion
- Ingestion = moving data from source systems into storage
- frequency
- batch vs streaming
- batch : slower = daily or hourly
- streaming : faster = seconds to sub-seconds. real-time
- micro-batch : combination of batch and streaming
Batch ingestion
- convenient
- higher latency (daily or hourly, not real-time)
- more forgiving
- Type
- ETL : Extract → Transform → Load
- traditional data warehouse : clean first → put into the DW
- why does ETL need cleaning? the DW is expensive!
- most common
- ELT : Extract → Load → Transform
- new trend
- store raw data with little cleaning
- for a data lakehouse : cheaper than a DW
- see the sketch below contrasting the two
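A minimal sketch contrasting the two flows; `extract()`, `clean()`, and the list-based "warehouse"/"lakehouse" are hypothetical stand-ins, not a real stack:

```python
def extract():
    # Stand-in for pulling raw rows from a source system.
    return [{"user_id": 1, "amount": " 42.50 "}, {"user_id": 2, "amount": "7.00"}]

def clean(row):
    # The Transform step: normalize types before the data is used.
    return {"user_id": row["user_id"], "amount": float(row["amount"].strip())}

def etl(warehouse):
    # ETL: transform first, load only clean rows into the (expensive) DW.
    warehouse.extend(clean(r) for r in extract())

def elt(lakehouse):
    # ELT: load raw rows as-is into cheap storage; transform later, on demand.
    lakehouse.extend(extract())

warehouse, lakehouse = [], []
etl(warehouse)
elt(lakehouse)
print(warehouse)   # cleaned rows
print(lakehouse)   # raw rows, still to be transformed
```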
Streaming ingestion
- error handling by dead-letter queue : reroute failed events (see the sketch after this list)
- challenge
- strict latency
- late-arriving data
- delivered out of order
- duplicate delivery : the same event may arrive more than once
- retention period
- too short → lost data
- too long → slow performance
- message size
- scalability
- ensure auto-scaling based on workload
- buffer for fluctuating event volumes
- schema change
- structured data = fixed schema : if you add a column → breaks the schema
- communicate with upstream teams; use validation and tests to catch schema changes
- schema migration is not perfect
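A minimal dead-letter-queue sketch in Python; the `process()` validation and the in-memory list are hypothetical stand-ins for a real broker and DLQ topic:

```python
import json

dead_letter_queue = []  # stand-in for a real DLQ topic/queue

def process(event):
    # Validation stands in for checking events against the expected schema.
    if "user_id" not in event:
        raise ValueError("missing user_id")
    return event

def handle(raw):
    try:
        return process(json.loads(raw))
    except Exception as exc:
        # Reroute the failed event instead of blocking the stream;
        # it can be inspected and replayed after the upstream fix.
        dead_letter_queue.append({"raw": raw, "error": str(exc)})
        return None

handle('{"user_id": 1}')      # processed normally
handle('{"name": "no id"}')   # rerouted to the DLQ
print(dead_letter_queue)
```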
How to ingest data
- DB connection : connect directly to the database
- row-based
- CDC (Change Data Capture) : capture the change log (before and after images); see the sketch below
- other : APIs, managed connectors, streaming
- API : a lot of manual work to connect, error-prone
- Managed connectors : pre-built APIs, still need some custom code
- Streaming : real-time data
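A sketch of row-based CDC, assuming the source database exposes a change log with before/after images (the log format here is made up):

```python
change_log = [
    {"op": "insert", "before": None, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "update", "before": {"id": 1, "email": "a@x.com"},
                     "after":  {"id": 1, "email": "a@y.com"}},
    {"op": "delete", "before": {"id": 1, "email": "a@y.com"}, "after": None},
]

replica = {}  # target table rebuilt purely from the captured changes

for change in change_log:
    if change["op"] == "delete":
        replica.pop(change["before"]["id"], None)
    else:
        row = change["after"]
        replica[row["id"]] = row  # insert or update

print(replica)  # {} -- all three changes replayed; the row ended up deleted
```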
Transformation
- Transformation : change the shape of data to fit business needs
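A tiny example of what "changing shape for business" can mean; the field names are hypothetical:

```python
# Reshape raw events into the form the business reports on.
raw = [
    {"ts": "2024-05-01T10:00:00", "amount_cents": 4250, "cur": "USD"},
    {"ts": "2024-05-01T11:30:00", "amount_cents": 700, "cur": "USD"},
]

def to_business_row(event):
    return {
        "date": event["ts"][:10],               # keep only the day
        "amount": event["amount_cents"] / 100,  # cents -> dollars
        "currency": event["cur"],
    }

print([to_business_row(e) for e in raw])
```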
Query
- asking questions about data
- retrieving data from storage & manipulating it
- type
- DDL : definition. CREATE, DROP, ALTER, etc.
- DML : manipulation. SELECT, INSERT, UPDATE, DELETE
- DCL : control. GRANT, etc.
- process
- query issued
- parsing : validate query
- query planning : how to get answer from the storage system
- optimization : big differences in performance and cost
- How to
- reduce the search space : always filter/prune first
- use the EXPLAIN SQL statement for complex queries (example below)
- pre-join frequently joined tables : joining multiple tables is expensive
- vacuum dead records : unused records take up space & interfere with the optimizer
- leverage caching
- cache expensive queries → improve average performance
- but may return stale (out-of-date) results
- execution : results come back
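An example of reading a query plan, using SQLite's EXPLAIN QUERY PLAN as a stand-in for your warehouse's EXPLAIN; the table and index are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, amount REAL)")
con.execute("CREATE INDEX idx_orders_user ON orders (user_id)")

# Filtering on an indexed column lets the planner search the index
# instead of scanning the whole table.
for row in con.execute("EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42"):
    print(row)  # e.g. (..., 'SEARCH orders USING INDEX idx_orders_user (user_id=?)')
```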
- query streaming data
- window : group events into tiny batches (sketch below)
- enrichment : streaming data + other data
- stream-to-stream join : combine multiple streams
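A minimal tumbling-window sketch: group events into 60-second buckets, then aggregate each bucket (timestamps and values are made up):

```python
from collections import defaultdict

events = [(100, 5), (130, 3), (95, 2), (185, 7)]  # (epoch_seconds, value), possibly out of order

windows = defaultdict(list)
for ts, value in events:
    windows[ts // 60 * 60].append(value)  # bucket by 60s window start

for start in sorted(windows):
    print(f"window [{start}, {start + 60}): sum={sum(windows[start])}")
```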
Data modeling
- data model : simplified version of the real world
- purpose : understanding business data & consistent definitions
- need
- communicate with data consumers
- model the data at the lowest granularity possible
- e.g.) don't aggregate first; create the raw data table first
- step
- conceptual : visualize data entities with an ER diagram (ERD)
- logical : visualize data relationships with an ER diagram (ERD)
- physical : visualize data entities, relationships, and FK/PK with an ER diagram (ERD)
- normalization
- denormalized tables : convenient (no need to search another table) but duplicated
- if one of the duplicated rows fails to update → corrupted data (see the sketch below)
- normalized tables : no duplicates, separate data tables
- DRY : "Don't Repeat Yourself"
- create normalized tables first, then denormalize
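A sketch of the update anomaly these notes warn about; the tables are hypothetical:

```python
# Denormalized: the user's email is duplicated across order rows.
denormalized_orders = [
    {"order_id": 1, "user_id": 7, "email": "a@x.com"},
    {"order_id": 2, "user_id": 7, "email": "a@x.com"},
]
# An update that fails halfway touches only the first row...
denormalized_orders[0]["email"] = "a@y.com"
# ...so user 7 now has two different emails -> corrupted data.

# Normalized (DRY): each fact lives in exactly one place.
users = {7: {"email": "a@x.com"}}
orders = [{"order_id": 1, "user_id": 7}, {"order_id": 2, "user_id": 7}]
users[7]["email"] = "a@y.com"  # one update, nothing can drift
```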
- technique
- Inmon : highly normalized data
- Kimball : denormalized data
- flexibility
- faster interaction
- duplicates data using a star schema (fact tables and dimension tables)
- data vault : 3 entities
- hubs : core business concepts
- links : relationships between hubs
- satellites : store descriptive information about hubs
- wide tables : join many related tables into a single, highly denormalized table
- more columns than before = wide
- the denormalized (wide table) pattern is becoming popular!
- less rigid (strict) than other patterns
- faster analytical queries : because of pre-joining (sketch below)
- normalized patterns were invented when storage was expensive & tightly coupled with compute → data had to be modeled carefully
- these days storage is cheap → denormalized form for analysis
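A sketch of pre-joining a star schema into a wide table, using SQLite; the table and column names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_user (user_id INT PRIMARY KEY, country TEXT);
CREATE TABLE fact_sales (sale_id INT, user_id INT, amount REAL);
INSERT INTO dim_user VALUES (1, 'KR'), (2, 'US');
INSERT INTO fact_sales VALUES (10, 1, 42.5), (11, 2, 7.0);

-- Denormalized wide table: more columns, no join at query time.
CREATE TABLE wide_sales AS
SELECT s.sale_id, s.amount, u.user_id, u.country
FROM fact_sales s JOIN dim_user u USING (user_id);
""")

# Analytical query hits one pre-joined table instead of fact + dimension.
print(con.execute("SELECT country, SUM(amount) FROM wide_sales GROUP BY country").fetchall())
```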
- orchestration : it takes a lot of queries to transform raw data into target data → complex
- e.g.) Airflow, Dagster
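A minimal Airflow-style DAG sketch (assumes a recent Airflow 2.x install; the DAG id, task ids, and lambda bodies are hypothetical placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG("raw_to_target", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    # Dependencies encode the ordering of the transformation queries.
    extract >> transform >> load
```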
- SQL vs Code
- when queries are easy to express in SQL → SQL
- when queries are complex → Python or PySpark
- Views and materializations
- views : virtual tables, not stored in physical storage
- re-computed every time you query the view
- data is always fresh
- materialization : store the query result in physical storage (example below)
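An example of the fresh-vs-stale trade-off in SQLite, where CREATE TABLE AS stands in for a materialized view:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (amount REAL);
INSERT INTO sales VALUES (1.0), (2.0);
CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales;   -- virtual, recomputed
CREATE TABLE m_total AS SELECT SUM(amount) AS total FROM sales;  -- materialized snapshot
""")

con.execute("INSERT INTO sales VALUES (10.0)")
print(con.execute("SELECT total FROM v_total").fetchall())  # [(13.0,)] always fresh
print(con.execute("SELECT total FROM m_total").fetchall())  # [(3.0,)]  stale until refreshed
```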
- Events vs States
- event : immutable, doesn't change once recorded
- state : mutable, changes over time
- Streaming vs batch
- streaming : a continuous stream of immutable events
- requires more computing power & is more costly
- low-latency, real-time
- batch : a window over this continuous stream of events
- sorted and aggregated
Serving
- the final step of the data pipeline
- analytics : BI
- which metrics are useful : put yourself in data consumer’s shoes
- Trust is very hard to gain but easy to lose
- monitoring metrics
- communicate with data consumers
- ML/AI
- scalability : ML/AI dataset can be very large
- local notebook won’t be enough → cloud based notebook
- feature stores : for classic ML
- object storage
- data lakehouse
- ML vs DL

| | ML | DL |
| --- | --- | --- |
| data | tabular / structured | unstructured |
| features | limited | large |
| dataset size | smaller | larger |
| compute | CPU | GPU |
- Reverse ETL : storage → source system
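A reverse-ETL sketch: read a served metric from storage and push it back into a source system. The CRM endpoint and score table are hypothetical, and the actual request is left disabled:

```python
import json
import sqlite3
import urllib.request

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE user_scores (user_id INT, churn_risk REAL);
INSERT INTO user_scores VALUES (1, 0.82), (2, 0.10);
""")

for user_id, risk in con.execute("SELECT user_id, churn_risk FROM user_scores"):
    payload = json.dumps({"user_id": user_id, "churn_risk": risk}).encode()
    req = urllib.request.Request(
        "https://crm.example.com/api/users",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # urllib.request.urlopen(req)  # disabled: no real endpoint in this sketch
    print("would send:", payload)
```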
Reference
- Kimball and Inmon DW models: https://bennyaustin.com/2010/05/02/kimball-and-inmon-dw-models/