
[Udemy] Data Engineering 101: The Beginner's Guide - Data Pipeline Architecture (1)

sennysideup 2025. 1. 29. 12:59

data architecture

  • what is good data architecture
    • performance : using computing and storage resources efficiently
      • trade-off between performance and complexity
    • scalability : data volumes fluctuate
      • an upstream system failure can cause a sudden increase in data volume
      • scaling up/down should be automatic : scaling down can save a lot of money
    • reliability : the system stays available & avoids failure
      • Automate as much as possible → reduce human errors
    • expect failure : there is no 100% reliable system
      • plan for alerts, notifications, and a recovery plan
    • security : follow security principles
      • the most important thing
    • modularity : connect systems via APIs (external systems)
    • cost effectiveness : direct and indirect costs (e.g. time)
      • be aware of the trade-off (direct cost vs. operational complexity)
    • flexibility : extensibility of architecture
  • architectures that solved the limitations of Hadoop : the groundwork for the modern data stack
    • lambda : separates streaming and batch workloads → combines them in the serving layer (see the sketch after this list)
      • streaming layer : sends data directly to the serving layer
      • batch layer : sends data to the data warehouse
      • two code bases → difficult to manage
    • kappa : a single streaming layer → solves lambda architecture's problem
      • difficult & expensive → not widely adopted
  • modern data stack : from monolithic, proprietary tools → to cloud-based, easy-to-use tools
    • goal : reduce complexity and increase modularity → fewer problems, lower cost & a more flexible architecture
  • factors you should consider
    • speed : make decisions fast, and be able to reverse a decision if required
      • modular tools, not a monolithic system
      • open source tools : flexible, but you still pay for the cloud infrastructure to run them
    • size of team : small team → simple system
      • complexity is the enemy of speed
      • keep it simple but modular and flexible → can upgrade quickly
    • integration : does the tool play nicely with others?
    • pick a central tool for the pipeline
      • choose other tools that integrate with the central tool
        • watch out for vendor lock-in : don't get stuck with a tool
          • vendor lock-in : the cost of changing vendors is so high that you become stuck with the original vendor
            • cost of changing vendor
              • financial
              • human resource
              • business shut down risk
            • Why you should worry about vendor lock-in
              • low quality of vendor service
              • complete shut down of the vendor
              • massive increase in costs
              • changes to the product that make it unable to meet business requirements
            • how to avoid vendor lock-in
              • carefully review the cloud service before signing a contract
              • ensure that you can migrate your data easily
              • back up all data
              • use a multi-cloud or hybrid cloud approach
    • benchmarks
      • metrics are often highly cherry-picked
      • evaluate tools along multiple dimensions

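A minimal Python sketch of the lambda pattern above (all names and data are illustrative, not from the course): the same page-view count is computed once in the batch layer and once in the speed layer, and the serving layer merges the two views. Maintaining that duplicated logic in two code bases is exactly what kappa removes by keeping a single streaming layer.

```python
# Sketch of the lambda pattern: one metric, two code paths, merged at serving.
def batch_view(warehouse_rows):
    # batch layer: complete but slow (recomputed e.g. nightly from the warehouse)
    counts = {}
    for row in warehouse_rows:
        counts[row["page"]] = counts.get(row["page"], 0) + 1
    return counts


def streaming_view(recent_events):
    # speed layer: fast, but only covers events since the last batch run
    counts = {}
    for event in recent_events:
        counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts


def serving_layer(batch, streaming):
    # merge both views; note the duplicated counting logic above
    merged = dict(batch)
    for page, n in streaming.items():
        merged[page] = merged.get(page, 0) + n
    return merged


print(serving_layer(
    batch_view([{"page": "/home"}, {"page": "/pricing"}]),
    streaming_view([{"page": "/home"}]),
))  # {'/home': 2, '/pricing': 1}
```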
 

BI stack

  • start with the technology that is hardest to change


  • storage : the hardest system to change
    • data warehouse
      • traditional storage : growth of unstructured data → data lake
      • Amazon Redshift : if pipeline resides in AWS
      • Snowflake : easy to use, deep integrations
        • but cannot ingest all types of data
        • data lock-in risk
        • no local version
        • good fit for a small team with a small data volume
    • data lake : easily becomes a data swamp
    • data lakehouse : data lake + data warehouse
      • Apache Iceberg + Trino (compute engine) → can replace Snowflake (see the Trino sketch after this list)
      • able to turn object storage into a data warehouse
      • Apache Iceberg : large data volumes, long-term storage
        • ACID compliance
        • OLAP data store : ideal for analytical use cases
        • open table format : fully open source and open standard
        • large data volume
        • cost efficient : just an object storage
    • S3 : object storage. keep all raw data
      • cheaper than snowflake
      • can ingest all types of data
  • DuckDB : improves development iteration speed (see the DuckDB sketch after this list)

    • if data fits on a local machine or can be sampled
    • on-disk : the traditional way
    • OLTP : persistence is the purpose
      • row-based database
    • in-memory : temporary data store
    • OLAP : columnar database
  • orchestrator
    • Airflow
      • difficult to run in a local environment
      • have to test against production → production risk, slower iteration
      • complex to deploy and manage
    • Dagster
      • local environment → faster iteration, safe testing
      • cloud-native → simple deployment and management
      • declarative : many winning technologies are declarative (see the asset sketch after this list)
        | Imperative  | Declarative     |
        | ----------- | --------------- |
        | how         | what            |
        | exact steps | desired outcome |
  • ingestion
    • few source systems → Python code to connect to the systems via API (see the ingestion sketch after this list)
      • Why not just Python?
        • a lot of sources → hard to write custom code for each
        • API changes → need to change the code
      • so, use connector tools
    • connector tools : no custom code needed to connect to APIs
      • UI-based / code based
      • open source solution : able to fork the connectors and edit
      • library format : no infrastructure requirement
  • transformation and visualization
    • compute engine (query engine) : used to conform the data
      • Snowflake : automatic scaling, write SQL
      • Trino : no storage component
        • connected to storage system
        • easy to set up and use
        • distributed query engine
      • Spark : when data is large
        • SQL, Python code
        • distributed query engine
      • dbt : not a compute engine → needs to be connected to a compute engine
        • powerful for SQL-based transformations
    • visualization
      • hire people who already use the tool → visualization tools are not easy to rip out

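Trino sketch (for the lakehouse bullets above). A rough, hedged example of turning object storage into warehouse-style tables with Apache Iceberg and Trino, using the trino Python client. The coordinator host, catalog, schema, and table names are assumptions, and it presumes an Iceberg catalog is already configured against S3.

```python
# Sketch only: querying an Iceberg table on object storage through Trino.
# Assumes a Trino cluster with an Iceberg catalog named "iceberg" already
# configured against S3; host/schema/table names are made up.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical Trino coordinator
    port=8080,
    user="analytics",
    catalog="iceberg",
    schema="raw",
)
cur = conn.cursor()

# Create an Iceberg table backed by Parquet files in object storage
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        page VARCHAR,
        event_time TIMESTAMP(6)
    )
    WITH (format = 'PARQUET')
""")
cur.fetchall()  # make sure the DDL has finished

# The same object storage now behaves like a warehouse table
cur.execute("SELECT page, count(*) FROM page_views GROUP BY page")
for page, views in cur.fetchall():
    print(page, views)
```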
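DuckDB sketch (for the development-iteration bullet above). A minimal example, assuming a small sample of warehouse data has been pulled down to a local Parquet file named events.parquet: an in-memory connection acts as a temporary OLAP store, while passing a file path would give an on-disk, persistent database.

```python
# Sketch: DuckDB as a local, in-memory OLAP engine for fast iteration.
# Assumes a small sample of the data exists locally as events.parquet.
import duckdb

# In-memory connection = temporary store; pass a file path instead
# (e.g. duckdb.connect("dev.duckdb")) for an on-disk, persistent database.
con = duckdb.connect(":memory:")

# Query the Parquet sample directly with SQL, no server needed
result = con.execute("""
    SELECT page, count(*) AS views
    FROM 'events.parquet'
    GROUP BY page
    ORDER BY views DESC
""").fetchall()

print(result)
```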
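Dagster asset sketch (for the declarative bullet above). A minimal, illustrative example of the declarative style: you declare what data assets should exist, and Dagster derives the execution order from their dependencies. Asset names and logic are invented.

```python
# Sketch: declarative assets in Dagster. You declare *what* should exist
# (the assets) rather than *how* to schedule each step; Dagster derives the
# execution order from the dependencies. Names are illustrative.
from dagster import asset, Definitions


@asset
def raw_orders() -> list[dict]:
    # pretend this pulls data from a source system
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}]


@asset
def daily_revenue(raw_orders: list[dict]) -> float:
    # depends on raw_orders simply by naming it as a parameter
    return sum(row["amount"] for row in raw_orders)


defs = Definitions(assets=[raw_orders, daily_revenue])
```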
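Ingestion sketch (for the "Python code via API" bullet above). A hand-rolled example of why custom ingestion code becomes hard to maintain with many sources: the endpoint, pagination scheme, and fields below are hypothetical, and every upstream API change would require editing code like this, which is the case connector tools are meant to handle.

```python
# Sketch: hand-written ingestion from one REST API into a local JSONL file.
# Works fine for a few sources; with many sources, each needs code like this,
# and every upstream API change means editing it. Endpoint/params are made up.
import json
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical source system


def ingest_orders(out_path: str = "orders.jsonl") -> int:
    page, written = 1, 0
    with open(out_path, "w") as f:
        while True:
            resp = requests.get(API_URL, params={"page": page}, timeout=30)
            resp.raise_for_status()
            rows = resp.json().get("results", [])
            if not rows:
                break
            for row in rows:
                f.write(json.dumps(row) + "\n")
                written += 1
            page += 1
    return written


if __name__ == "__main__":
    print(f"ingested {ingest_orders()} rows")
```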
 

Streaming stack

  • when : continuous, unbounded (= never-ending) events ↔ batch
    • e.g. click data, IoT data, etc.
  • latency
    • hours : streaming not needed
    • seconds : streaming needed
  • Event Ingestion
    • Kafka : distributed system (see the producer sketch after this list)
      • scales to a large number of events
      • but expensive
  • Event Processing
    • Apache Flink : stream processing engine, distributed compute engine
      • needs to be connected to data storage
      • transforms continuous, unbounded streaming events
      • but can also handle batch data : an engine that handles streaming can also handle batch
      • challenge : ordering (see the windowing sketch after this list)
        • trade-off between latency and completeness
      • at the end : connect to an OLAP database

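Producer sketch (for the Kafka bullet above). A minimal example of sending click events into Kafka with the kafka-python client; the broker address and topic name are placeholders, and a real setup would also configure keys, acks, and error handling.

```python
# Sketch: sending click events into Kafka with the kafka-python library.
# Broker address and topic are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each click is one unbounded event; producers keep sending indefinitely
producer.send("click-events", {"user_id": 42, "page": "/pricing"})
producer.flush()
```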
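Windowing sketch (for the ordering challenge above). A plain-Python illustration of the latency vs. completeness trade-off: events arrive out of order, and a window can only be finalized after waiting some extra time for stragglers; all numbers are made up.

```python
# Sketch: tumbling 10-second windows over out-of-order events.
# Waiting longer (ALLOWED_LATENESS) makes counts more complete but adds
# latency before a window's result can be emitted. Purely illustrative.
WINDOW_SECONDS = 10
ALLOWED_LATENESS = 5  # extra seconds to wait for stragglers

# (event_time, arrival_time) pairs; the second event arrives very late
events = [(3, 4), (8, 21), (12, 13), (14, 15)]


def window_of(event_time: int) -> int:
    return (event_time // WINDOW_SECONDS) * WINDOW_SECONDS


counts: dict[int, int] = {}
for event_time, arrival_time in events:
    w = window_of(event_time)
    # the window [w, w+10) is closed once arrival time passes w + 10 + lateness
    if arrival_time > w + WINDOW_SECONDS + ALLOWED_LATENESS:
        print(f"dropped late event at t={event_time}")
        continue
    counts[w] = counts.get(w, 0) + 1

print(counts)  # {0: 1, 10: 2} with these settings: one event arrived too late
```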
 

 
