Data/Data Engineering

[Udemy] Data Engineering 101: The Beginner's Guide - End-to-end data pipeline in-depth(2)

sennysideup 2025. 1. 18. 11:20
반응형

Ingestion

  • Ingestion = moving or ingesting data
  • frequency
    • batch vs streaming
      • batch : slower = daily or hourly
      • streaming : faster = seconds to sub-seconds. real-time
    • micro-batch : combination of batch and streaming

Batch ingestion

  • convenient
  • less latency
  • more forgiving 
  • Type
    • ETL : Extract → Transform → Load
      • traditional data warehouse : clean → put DW
        • why ETL needs cleaning? DW is expensive!
      • most common
    • ELT : Extract → Load → Transform
      • new trend
      • store raw data with little cleaning
      • for data lakehouse : cheaper than DW

Streaming ingestion

  • error handling by dead-letter queue : reroute failed events
  • challenge
    • strict latency
    • late-arriving data
    • delivered out of order
    • multiple delivery : more than once
    • retention period
      • too short → lost data
      • too long → slow performance
    • message size
  • scalability
    • make sure auto scaling based on workload
    • buffer for fluctuating event volumes
  • schema change
    • structured data = fixed schema : if you add column → break schema
    • communicate with upstream teams, validation and test to catch schema changes
      • schema migration is not perfect

How to ingest data

  1. DB connection : connect directly to database
    • row-based
  2. CDC(Change Data Capture) : capture the log(before and after)
  3. other : APIs, Managed connectors, Streaming
    • API : lot of manual work to connect, frequently error(error-prone)
    • Managed connectors : pre-built APIs, need some custom code
    • Streaming : real-time data

 

Transformation

  • Transformation : change shape of data for business

Query

  • asking question about data
  • retrieving the data from storage & manipulating that data
  • type
    • DDL : definition. create, drop, alter etc
    • DML : manpulation. select, insert, update, delete
    • DCL : control. grant etc
  • process
    1. query issued
    2. parsing : validate query
    3. query planning : how to get answer from the storage system
    4. optimization : differenct in performance in cost
      • How to
        • reduce the search space : always filter, prune first
        • use explain SQL statement : for complex query
        • pre-join frequently joined tables : multiple tables are expensive
        • vaccum dead record : unused record = take up more space & interface with optimizer
        • leverage caching
          • cache the expensive query → improve average performance
          • return stale(out-date) result
    5. execution : result back
  • query streaming data

    • window : tiny batches by grouping events
    • enrichment : steaming data + other data
    • stream-to-stream joining : combine multiple stream

data modeling

  • data model : simplified version of the real world
  • purpose : understanding business data & consistent definitions
    • need
      • communicate with data consumers
      • model the data at the lowest granularity possible
        • e.g) not aggreate data first, create raw data table first
  • step
    1. conceptual : visualize data entities by ER diagram(ERD)
    2. logical : visualize data relationships by ER diagram(ERD)
    3. physical : visulize data entities, relationships, FK/PK by ER diagram(ERD)
  • normalization
    • denormalized tables?
      • convenient(no need to search another table) but duplicated
        • create normalized table first, then denormalized data
      • if one of the row failed to update & duplicated → corrupted data
    • no duplicates : seperate the data tables
    • DRY “Don’t Repeat Yourself”
  • technique

    • inmon : highly normalized data
    • kimball : denormalized data
      • flexibility
      • faster interaction
      • duplicate data using star schema(fact tables and dimension tables)
    • data vault : 3 entities
      • hubs : core business concept
      • links : relationships between hubs and satelites
      • satelites : store information about hubs
    • wide tables: join many related tables into a single highly denormalized table
      • many columns than before = wide
      • denormalized table pattern(wide table) is becoming popular!
        • less rigid(strict) than other patterns
        • faster analytical query : because of pre-joining
        • normalized patterns
          • invented when storage was expensive & tightly coupled with compute → has to be modeled carefully
          • but these days, storage is cheap → denormalized form for analysis
  • orchestrate : a lot of queries to transform raw data into target data → complex
    • e.g) airflow, dexter
  • SQL vs Code
    • when quries are easier to do in SQL → SQL
    • when quries are complex → Python or Pyspark
  • Views and materializations
    • Views : virtual tables. not stored in physical storage
      • re-computed when you need view
      • data is always fresh
    • materialization : store the query in physical storage
  • Events vs States
    • Event : Immutable. unchange
    • State : Mutable. change
  • Streaming vs batch
    • streaming : continuous stream of immutable events
      • require more computing power & is more costly
      • low-latency, real-time
    • batch : window over this continuous stream of events
      • sorted and aggregated

 

Serving

  • final step for data pipeline
  • analytics : BI
    • which metrics are useful : put yourself in data consumer’s shoes
    • Trust is very hard to gain but easy to lose
      • monitoring metrics
      • communicate data consumers
  • ML/AI
    • scalability : ML/AI dataset can be very large
      • local notebook won’t be enough → cloud based notebook
    • feature stores : for classic ML
    • object storage
    • data lakehouse

    ML DL
    tabular data, structured data unstructured data
    limited feature large feature
    smaller larger
    CPU GPU
  • Reverse ETL : storage → source system

 

Reference

* https://bennyaustin.com/2010/05/02/kimball-and-inmon-dw-models/