
[Udemy] Data Engineering 101: The Beginner's Guide - End-to-end data pipeline in-depth(1)

sennysideup 2025. 1. 12. 14:51

Generation of source data

  • structured / unstructured : they differ in how data is stored and searched
    • structured data : tabular, 2-dimensional (rows and columns)
      • use SQL
      • BI, classical ML
    • unstructured data : files
      • use Deep Learning(Neural Networks)
  • database : choosing the wrong database leads to poor performance
    • RDBMS : Relational
      • transactional data, tabular format
      • relation between tables
      • inflexible, strict, normalized
      • single machine → performance is limited
        • to scale, the machine must be replaced with a bigger one
    • NoSQL : Non-relational
      • flexible
      • distributed → scale horizontally
        • scale horizontally (scaling out) : increasing the capacity of a system by adding machines
        • scale vertically (scaling up) : increasing the capacity of a system by adding capability to the existing machine
      • multiple machines : add more machines to scale further
      • non relational
    • key-value store : like a dictionary in Python, non-relational
      • memory-based : data is gone when the DB shuts down
      • persisted
        • single table = no need to perform any table joins
        • handles high volume and high concurrency
  • third-party system : connected by API
    • flow : event producers → event broker → event consumers
    • challenge
      • moving data between microservices, e.g. cart service → payment service
      • event ordering : an event that occurred first can arrive later
      • delivery guarantee : exactly-once vs at-least-once delivery
        • exactly-once delivery : difficult to achieve
        • at-least-once delivery : data can be duplicated
      • idempotency : running an operation again gives the same result as running it once
        • insert rows : not idempotent
        • merge rows : idempotent
        • at-least-once delivery + idempotency → duplicates are eliminated
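
The insert vs merge point can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration, not how any particular event broker works: the event list, the table names and the use of `INSERT OR REPLACE` as the "merge" are all assumptions made for the example.

```python
import sqlite3

# Hypothetical events delivered at-least-once: event 2 arrives twice.
events = [(1, "cart"), (2, "payment"), (2, "payment"), (3, "shipping")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plain_insert (event_id INTEGER, service TEXT)")
conn.execute("CREATE TABLE merged (event_id INTEGER PRIMARY KEY, service TEXT)")

for event_id, service in events:
    # Plain insert: not idempotent, every redelivery adds another row.
    conn.execute("INSERT INTO plain_insert VALUES (?, ?)", (event_id, service))
    # Merge keyed on event_id: idempotent, a redelivery replaces the same row.
    conn.execute("INSERT OR REPLACE INTO merged VALUES (?, ?)", (event_id, service))

insert_count = conn.execute("SELECT COUNT(*) FROM plain_insert").fetchone()[0]
merge_count = conn.execute("SELECT COUNT(*) FROM merged").fetchone()[0]
print(insert_count, merge_count)  # 4 3
```

The plain table keeps the duplicate, while the merged table ends up with one row per event no matter how many times an event is redelivered.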

Storage

  • hardware
    • HDD : traditional disk drive
      • cheaper than SSD, cloud
      • slower than SSD, cloud
    • SSD : newer, faster drive (e.g. in laptops)
      • faster than HDD
      • more expensive than HDD
    • Memory (RAM) : laptop = drive (HDD, SSD) + memory
      • faster than SSD
      • temporary data
    • networking : connects multiple storage devices
      • if one storage device fails, recover from the remaining ones
      • improve performance by accessing the data in parallel
  • serialization : data → byte stream, so it can be processed (stored, sent) easily
    • serialize on sending, de-serialize on receiving
    • type
      • row-based : XML, JSON, CSV
        • fast lookup of individual rows
      • column-based : Parquet, ORC, Arrow
        • fast aggregation by column
    • cost : time & computing resources
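
A minimal sketch of the serialize-on-sending / de-serialize-on-receiving round trip, using JSON (one of the row-based formats above) from the Python standard library. The records are made up for illustration.

```python
import json

# Hypothetical records to send to another system.
rows = [
    {"id": 1, "item": "book", "price": 12.5},
    {"id": 2, "item": "pen", "price": 1.5},
]

# Sender side: data -> byte stream.
payload = json.dumps(rows).encode("utf-8")

# Receiver side: byte stream -> data.
received = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, received == rows)  # bytes True
```

Both steps cost time and compute, which is the cost noted in the list above.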
  • compression and caching
    • compression
      • saves storage space
      • faster queries and data transfer
    • caching
      • move data to faster storage
      • faster storage : smaller and more expensive → can't move all data to it
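
Both ideas can be tried with the Python standard library. This is a rough sketch under assumed inputs: the repeated log line and the `read_record` function are hypothetical, and `maxsize=2` just stands in for "faster storage is small".

```python
import zlib
from functools import lru_cache

# Compression: repetitive data shrinks a lot, and decompression restores it exactly.
raw = b"event_type=page_view;" * 1000
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)


# Caching: keep recent results in memory; the cache is deliberately tiny
# because fast storage is limited and cannot hold everything.
@lru_cache(maxsize=2)
def read_record(key: str) -> str:
    # Stand-in for a slow read from disk or over the network.
    return f"value-for-{key}"


read_record("a")  # miss: goes to the slow source
read_record("a")  # hit: served from the cache
read_record("b")  # miss
info = read_record.cache_info()
print(len(compressed) < len(raw), info.hits, info.misses)  # True 1 2
```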
  • distributed storage : = horizontally scaled storage
    • bigger data : needs horizontal scaling
      • vertical scaling : needs a bigger machine
      • horizontal scaling : needs more machines + a network between them
        • + distributed compute : store, retrieve, process data faster
    • limits of parallelization : not everything can be done in parallel
      • strong consistency : single server
        • new data → written once
        • ask for the same row → always the same result
        • sacrifices performance
      • eventual consistency : multiple servers
        • new data → written multiple times (once per server)
        • if one server fails to write → asking for the same row can return different results = no consistency
        • the failed server retries until it succeeds (→ slow) → the data becomes consistent eventually
        • good tradeoff : consistency ↔ performance
    • ACID / BASE
      • ACID : Atomicity, Consistency, Isolation, Durability
        • single machine, transactional DB
        • strong consistency
      • BASE : Basically Available, Soft-state, Eventual consistency
        • distributed DB
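
To make the "A" in ACID concrete, here is a small sketch with `sqlite3` (a single-machine transactional DB). The `accounts` table, the names and the `transfer` helper are hypothetical; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(amount: int) -> bool:
    """Move money from alice to bob as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'bob'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'alice'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(30)       # succeeds: alice 70, bob 30
failed = transfer(500)  # second UPDATE violates the CHECK, so both are rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

After the failed transfer, bob's balance is back to what it was before the transaction started, even though his row had already been updated inside it.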
  • Type
    • file : tree structure
      • fast data retrieval
      • limited scalability
    • block : HDD, SSD
      • fast access
      • transactional DB
    • object : most important for data engineers
      • stores data of all shapes and sizes
      • vs file storage
        • scale : file = limited, object = no limit → ideal for data engineering
        • structure : file = tree, object = flat (no nesting)
        • latency : file = fast, object = slow (need to search all objects)
        • updates : file = mutable (modified in place), object = immutable (replaced in its entirety)
    • cache : RAM (memory)
      • faster than hard drive
      • temporary data → can be lost
      • e.g. Redis
    • streaming
      • often ephemeral
      • persisted by storing in object storage
      • buffering : a massive amount of data arriving at the same time → need a buffer to avoid overloading the system
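
The buffering idea can be sketched with a plain in-memory queue. This is a toy model, not a real streaming system: the burst of ten events and the `BATCH_SIZE` of 3 (how much the downstream system can take per step) are assumptions for the example.

```python
from collections import deque

# A burst of events arrives all at once and is parked in the buffer.
burst = [f"event-{i}" for i in range(10)]
buffer = deque(burst)

BATCH_SIZE = 3  # hypothetical capacity of the downstream system per step

# The consumer drains the buffer at its own pace instead of being overloaded.
batches = []
while buffer:
    batch = [buffer.popleft() for _ in range(min(BATCH_SIZE, len(buffer)))]
    batches.append(batch)

print([len(batch) for batch in batches])  # [3, 3, 3, 1]
```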
  • OLTP(row based DB) vs OLAP(columnar DB)

    • finding a specific row : easy in OLTP, inefficient in OLAP
    • aggregating data : easy in OLAP
    • OLTP : online transaction processing
      • SSD : fast but expensive. speed is important
    • OLAP : online analytical processing
      • HDD : slow but cheap. volume is important
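
Why row lookups suit row-based layouts and aggregations suit columnar ones can be seen with plain Python structures. This is a simplified model of the two layouts, with made-up order data, not how a real database stores bytes on disk.

```python
# Row-based layout (OLTP style): each record is kept together.
rows = [
    {"order_id": 1, "customer": "kim", "amount": 30},
    {"order_id": 2, "customer": "lee", "amount": 50},
    {"order_id": 3, "customer": "kim", "amount": 20},
]
rows_by_id = {row["order_id"]: row for row in rows}

# Column-based layout (OLAP style): each column is kept together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Finding a specific row: one lookup returns the whole record in the row layout.
order_2 = rows_by_id[2]

# In the columnar layout the same row has to be stitched together from every column.
position = columns["order_id"].index(2)
rebuilt = {name: values[position] for name, values in columns.items()}

# Aggregating: the columnar layout only has to read the one column it needs.
total_amount = sum(columns["amount"])

print(order_2 == rebuilt, total_amount)  # True 100
```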
  • Data warehouse, Data lake, Data lakehouse
    • DW : used for business analytics
      • OLTP → OLAP : for aggregation
      • hard to manage unstructured data like logs and free text
      • costly → store well-organized data, not raw data
    • Data lake
      • inexpensive
      • flexible → tempting to dump all raw data → data swamps
        • data swamp : a data lake that has deteriorated and become unmanageable due to a lack of proper data management
      • stores unstructured data
    • Data lakehouse
      • OLAP
      • combine advantages of Data Warehouse + Data lake
  • separation of compute from storage
    • storage & compute on the same machine
      • no networking → better performance
      • fast, low-latency disk reads
      • high bandwidth
        • bandwidth : the maximum amount of data transmitted over an internet connection in a given amount of time
    • separation : comes at the cost of performance
      • storage and compute scale independently
      • serverless pattern : no need to run a server 24/7, only when you need it
  • Data Storage lifecycle
    • hot : accessed often → high cost, fast retrieval
    • warm : accessed infrequently → inexpensive
    • cold : rarely accessed → cheap
      • accessing it more often → large penalty
      • e.g. Amazon S3 Glacier

 

 

 
