[Udemy] Data Engineering 101: The Beginner's Guide - End-to-end data pipeline in-depth(1)

sennysideup 2025. 1. 12. 14:51

Generation of source data

structured / unstructured : they differ in how they are stored and searched
- structured data : tabular, 2-dimensional (rows and columns)
  - queried with SQL
  - used for BI, classical ML
- unstructured data : files
  - typically processed with Deep Learning (Neural Networks)
database : choosing the wrong database leads to performance problems
- RDBMS : Relational
  - transactional data, tabular format
  - relations between tables
  - inflexible, strict, normalized
  - single machine → performance is limited
    - to scale, the machine has to be replaced with a bigger one
- NoSQL : Non-relational
  - flexible
  - distributed → scales horizontally
    - Scaling horizontally (scaling out) : increasing the capacity of a system by adding machines
    - Scaling vertically (scaling up) : increasing the capacity of a system by adding capability to the machine
  - multi-machine : when more capacity is needed, add more machines to scale further
- key-value store : like a dictionary in Python, non-relational
  - memory-based : data is gone when the DB shuts down
  - persisted : data survives a shutdown
    - single table = no table joins needed
    - handles high volume and high concurrency
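
The difference between a memory-based and a persisted key-value store can be sketched with the standard library only. `PersistedStore` and the file name `kv.json` are hypothetical names made up for this illustration; a real key-value database works very differently internally.

```python
import json
import os
import tempfile

# Memory-based store: a plain dict; its contents vanish when the process exits.
memory_store = {}
memory_store["user:1"] = {"name": "Ada"}


class PersistedStore:
    """Hypothetical key-value store that writes every change to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)

    def get(self, key):
        return self.data.get(key)


path = os.path.join(tempfile.mkdtemp(), "kv.json")
store = PersistedStore(path)
store.put("user:1", {"name": "Ada"})

# Simulate a restart: a new instance reloads the data from disk.
restarted = PersistedStore(path)
print(restarted.get("user:1"))  # {'name': 'Ada'}
```

Every lookup goes through a single key, so no join between tables is involved.
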
third-party system : connected through an API
- flow : event producers → event broker → event consumers
- challenges
  - moving data from one microservice to another, e.g. cart service → payment service
  - ordering of data : an event that occurred first can arrive later
  - delivery semantics : exactly-once vs at-least-once delivery
    - exactly-once delivery : difficult to achieve
    - at-least-once delivery : data can be duplicated
  - Idempotency : running the same operation repeatedly produces the same result
    - inserting rows : not idempotent
    - merging rows : idempotent
    - at-least-once delivery + idempotency → duplicates are eliminated
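
A minimal sketch of why insert is not idempotent while merge is. The event fields (`event_id`, `item`, `qty`) and both helper functions are assumptions made for this example, not part of the course material.

```python
# Events delivered at-least-once: event 2 arrives twice.
events = [
    {"event_id": 1, "item": "book", "qty": 1},
    {"event_id": 2, "item": "pen", "qty": 3},
    {"event_id": 2, "item": "pen", "qty": 3},  # duplicate delivery
]


def insert_rows(table, batch):
    """Plain insert: not idempotent, duplicates pile up."""
    table.extend(batch)
    return table


def merge_rows(table, batch):
    """Merge keyed on event_id: idempotent."""
    for event in batch:
        table[event["event_id"]] = event
    return table


inserted = insert_rows([], events)
merged = merge_rows({}, events)
merge_rows(merged, events)  # replaying the whole batch changes nothing

print(len(inserted))  # 3
print(len(merged))    # 2
```

The merge needs a stable key per event; here `event_id` plays that role.
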

Storage

hardware
- HDD : traditional disk drive
  - cheaper than SSD, cloud
  - slower than SSD, cloud

- SSD : newer, faster drive, common in laptops
  - faster than HDD
  - more expensive than HDD
- Memory(RAM) : a laptop = drive (HDD or SSD) + memory
  - faster than SSD
  - temporary data
- networking : connects storage devices
  - if one storage device fails, its data can be recovered from the rest
  - improves performance by accessing the data in parallel
serialization : data → byte streams, to make it easy to process
- serialize on sending, deserialize on receiving
- types
  - row-based : XML, JSON, CSV
    - fast lookup of individual rows
  - column-based : Parquet, ORC, Arrow
    - efficient aggregation by column
- cost : time & computing resources
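
The round trip and the two layouts can be sketched as follows. JSON is used for both layouts only to stay within the standard library; real columnar formats such as Parquet are binary and are not shown here. The sample records are made up.

```python
import json

rows = [
    {"id": 1, "city": "Seoul", "amount": 10},
    {"id": 2, "city": "Busan", "amount": 25},
    {"id": 3, "city": "Seoul", "amount": 5},
]

# Row-based: one JSON document per record, values of a record stay together.
row_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# Column-based: values of each column are stored together.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_bytes = json.dumps(columns).encode("utf-8")

# Deserialize on the receiving side.
decoded_rows = [json.loads(line) for line in row_bytes.decode("utf-8").split("\n")]
decoded_cols = json.loads(col_bytes.decode("utf-8"))

# Row layout: one record is read as a unit.
print(decoded_rows[1])              # {'id': 2, 'city': 'Busan', 'amount': 25}
# Column layout: an aggregate touches only one column.
print(sum(decoded_cols["amount"]))  # 40
```
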
compression and caching
- compression
  - save storage
  - faster query, transport
- caching
  - move data to faster storage
  - faster storage : smaller and more expensive → can’t move all data to faster storage
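
Both ideas can be tried with the standard library: `zlib` for compression and `functools.lru_cache` as a small, size-limited cache. `read_from_slow_storage` is a hypothetical stand-in for a disk or network read, and the cache size of 2 is an arbitrary choice for the example.

```python
import zlib
from functools import lru_cache

# Compression: repetitive data shrinks a lot, saving storage and transfer time.
raw = b"2025-01-12,click,homepage\n" * 1000
compressed = zlib.compress(raw)
assert zlib.decompress(compressed) == raw  # lossless round trip
print(len(raw), len(compressed))

slow_reads = 0


def read_from_slow_storage(key):
    """Stand-in for a disk or network read (hypothetical)."""
    global slow_reads
    slow_reads += 1
    return f"value-for-{key}"


# Caching: fast memory is small, so only a few hot keys are kept.
@lru_cache(maxsize=2)
def cached_read(key):
    return read_from_slow_storage(key)


for key in ["a", "a", "b", "a", "c", "a"]:
    cached_read(key)

print(slow_reads)  # 3 slow reads instead of 6
```
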
distributed storage : = horizontally scaled storage
- bigger data : needs horizontal scaling
  - vertical scaling : needs a bigger machine
  - horizontal scaling : needs more machines + a network between them
    - + distributed compute : store, retrieve, process data faster
- limits of parallelization : not everything can be done in parallel
  - strong consistency : single server
    - new data → written once
    - ask for the same row → same result
    - sacrifices performance
  - eventual consistency : multiple servers
    - new data → written multiple times
    - if one server fails to write → asking for the same row can return different results = no consistency → the failed server keeps retrying until it succeeds (→ slow) → the servers eventually become consistent
    - tradeoff : consistency ↔ performance (speed)
- ACID / BASE
  - ACID : Atomicity, Consistency, Isolation, Durability
    - single machine, transactional DB
    - strong consistency
  - BASE : Basically Available, Soft state, Eventual consistency
    - distributed DB
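
The eventual consistency scenario above can be simulated in a few lines. `Replica`, `replicated_write`, and `retry_pending` are hypothetical names for this sketch; real distributed databases use far more elaborate replication protocols.

```python
class Replica:
    """Hypothetical storage server holding a copy of the data."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.online = True

    def write(self, key, value):
        if not self.online:
            return False
        self.rows[key] = value
        return True


replicas = [Replica("a"), Replica("b"), Replica("c")]
replicas[2].online = False  # one server misses the write
pending = []                # writes that still have to be retried


def replicated_write(key, value):
    for replica in replicas:
        if not replica.write(key, value):
            pending.append((replica, key, value))


def retry_pending():
    still_failing = []
    for replica, key, value in pending:
        if not replica.write(key, value):
            still_failing.append((replica, key, value))
    pending[:] = still_failing


replicated_write("row-1", "new")
before = {r.name: r.rows.get("row-1") for r in replicas}

replicas[2].online = True  # the failed server comes back
retry_pending()
after = {r.name: r.rows.get("row-1") for r in replicas}

print(before)  # {'a': 'new', 'b': 'new', 'c': None}
print(after)   # {'a': 'new', 'b': 'new', 'c': 'new'}
```

Between the write and the retry, a reader hitting server `c` sees a different result than a reader hitting `a` or `b`.
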

Type

file : tree structure
- fast data retrieval
- limited scalability
block : HDD, SSD
- fast access
- transactional DB

object : most important to Data Engineers
- stores data of all kinds of shapes and sizes
- vs file storage

| | file | object |
| --- | --- | --- |
| scale | limited | no limit → ideal for data engineering |
| structure | tree | flat (no nesting) |
| latency | fast | slow (need to search all objects) |
| updates | mutable - modified in place | immutable - replaced in its entirety |
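
A toy model of the object column, assuming a dict as the backing store. `ObjectStore` and its methods are hypothetical and only mimic the flat namespace and whole-object replacement; they are not the API of any real object storage service.

```python
class ObjectStore:
    """Hypothetical in-memory object store with a flat key namespace."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data: bytes):
        # Objects are immutable: a put replaces the whole object.
        self._objects[key] = bytes(data)

    def get(self, key) -> bytes:
        return self._objects[key]

    def list(self, prefix=""):
        # No directory tree: listing scans every key for the prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
store.put("logs/2025/01/12.json", b'{"event": "click"}')
store.put("logs/2025/01/13.json", b'{"event": "view"}')
store.put("images/cat.png", b"\x89PNG")

# "Updating" means uploading a full replacement under the same key.
store.put("logs/2025/01/12.json", b'{"event": "click", "fixed": true}')

print(store.list("logs/"))  # ['logs/2025/01/12.json', 'logs/2025/01/13.json']
```

The slashes in the keys are just characters in a name; nothing is nested.
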

cache : RAM, memory
- faster than a hard drive
- temporary data → data can be lost
- Redis etc.
streaming
- often ephemeral
- persisted by storing in object storage
- buffering : massive amounts of data arriving at the same time → a buffer is needed to avoid overloading the system
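
Buffering can be sketched with a queue that absorbs a burst while the consumer drains it at its own pace. `BATCH_SIZE` and the function names are assumptions for this example.

```python
from collections import deque

buffer = deque()
BATCH_SIZE = 3  # how many events the consumer can handle per tick (assumed)


def produce(events):
    """Producers append to the buffer without waiting for the consumer."""
    buffer.extend(events)


def consume_tick():
    """The consumer takes at most BATCH_SIZE events per tick."""
    batch = []
    while buffer and len(batch) < BATCH_SIZE:
        batch.append(buffer.popleft())
    return batch


produce(range(8))  # a burst of 8 events arrives at once
processed = []
while buffer:
    processed.append(consume_tick())

print(processed)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```
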

OLTP(row-based DB) vs OLAP(columnar DB)
- finding a specific row : easy in OLTP, inefficient in OLAP
- aggregating data : easy in OLAP
- OLTP : online transaction processing
  - SSD : fast but expensive. speed is important
- OLAP : online analytical processing
  - HDD : slow but cheap. volume is important
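
The two query shapes can be compared using `sqlite3` from the standard library. SQLite is a row-based store, so this only illustrates what an OLTP-style and an OLAP-style query look like, not the performance difference of a columnar engine. The `orders` table and its data are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "kim", 10.0), (2, "lee", 25.0), (3, "kim", 5.0)],
)

# OLTP-style: fetch one specific row by its key.
one_order = conn.execute(
    "SELECT customer, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style: aggregate over many rows, touching only a few columns.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(one_order)  # ('lee', 25.0)
print(totals)     # [('kim', 15.0), ('lee', 25.0)]
conn.close()
```
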
Data warehouse, Data lake, Data lakehouse
- DW : used in Business Analytics
  - OLTP → OLAP : data is moved for aggregation
  - hard to manage unstructured data like logs, free text
  - costly → stores well-organized data, not raw data
- Data lake
  - inexpensive
  - flexible → dumping all raw data → risk of data swamps
    - Data swamp : a data lake that has deteriorated and become unmanageable due to a lack of proper data management
  - stores unstructured data
- Data lakehouse
  - OLAP
  - combines the advantages of Data Warehouse + Data lake
separation of compute from storage
- storage & compute in the same machine
  - no networking → improved performance
  - fast, low-latency disk reads
  - high bandwidth
    - bandwidth : the maximum amount of data transmitted over an internet connection in a given amount of time
- separation : comes at the cost of performance
  - storage and compute scale independently
  - serverless pattern : no need to run a server 24/7, only when you need it
Data Storage lifecycle
- hot : accessed often → high cost, fast retrieval
- warm : accessed infrequently → inexpensive
- cold : rarely accessed → cheap
  - accessing it more often than planned → large penalty
  - Amazon S3 Glacier etc.
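
A lifecycle rule can be sketched as a function of the time since the last access. The thresholds (30 and 180 days) and the function name are assumptions chosen for illustration; real services let you configure their own rules.

```python
from datetime import date

# Assumed thresholds, in days since the last access.
WARM_AFTER_DAYS = 30
COLD_AFTER_DAYS = 180


def storage_tier(last_accessed: date, today: date) -> str:
    """Pick a tier based on how long the data has been idle."""
    idle_days = (today - last_accessed).days
    if idle_days >= COLD_AFTER_DAYS:
        return "cold"
    if idle_days >= WARM_AFTER_DAYS:
        return "warm"
    return "hot"


today = date(2025, 1, 12)
print(storage_tier(date(2025, 1, 10), today))  # hot
print(storage_tier(date(2024, 11, 1), today))  # warm
print(storage_tier(date(2024, 1, 1), today))   # cold
```
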

License: Attribution, NonCommercial, NoDerivatives
