
[Udemy] Data Engineering 101: The Beginner's Guide - Data Pipeline Architecture (1)

sennysideup 2025. 1. 29. 12:59

data architecture

  • what is good data architecture
    • performance : using computing and storage resources efficiently
      • trade-off between performance and complexity
    • scalability : data volumes fluctuate
      • an upstream system failure can cause a sudden increase in data volume
      • scaling up/down should be automatic : scaling down can save a lot of money
    • reliability : the system stays available & avoids failure
      • Automate as much as possible → reduce human errors
    • expect failure : there is no 100% reliable system
      • plan for alerts, notifications, and a recovery plan
    • security : follow security principles
      • the most important thing
    • modularity : connect systems via APIs (external systems)
    • cost effectiveness : direct and indirect costs (e.g. time)
      • be aware of the trade-off (direct cost vs. operational complexity)
    • flexibility : extensibility of architecture
  • architectures that solved the limitations of Hadoop : the groundwork for the modern data stack
    • lambda : separates streaming and batch workloads → combines them in the serving layer (see the sketch after this list)
      • streaming layer : sends data directly to the serving layer
      • batch layer : sends data to the data warehouse
      • two code bases → difficult to manage
    • kappa : a single streaming layer → solves lambda architecture's problem
      • difficult & expensive → not widely adopted
  • modern data stack : from monolithic, proprietary tools → to cloud-based, easy-to-use tools
    • goal : reduce complexity and increase modularity → fewer problems, lower cost & a more flexible architecture
  • factors you should consider
    • speed : make decisions fast, and be able to reverse a decision if required
      • modular tools, not a monolithic system
      • open source tools : flexible, but you still pay for the cloud infrastructure to run them
    • size of team : small team → simple system
      • complexity is the enemy of speed
      • keep it simple but modular and flexible → can upgrade quickly
    • integration : does the tool play nicely with others?
    • pick a central tool for the pipeline
      • choose other tools that integrate with the central tool
        • watch out for vendor lock-in : don't get stuck with a tool
          • vendor lock-in : the cost of changing vendors is so high that you become stuck with the original vendor
            • cost of changing vendor
              • financial
              • human resource
              • business shut down risk
            • Why you should worry about vendor lock-in
              • low quality of vendor service
              • complete shut down of the vendor
              • massive increase in costs
              • changes to the product that make it unable to meet business requirements
            • how to avoid vendor lock-in
              • carefully review the cloud service before signing a contract
              • ensure that you can migrate your data easily
              • back up all data
              • use a multi-cloud or hybrid cloud approach
    • benchmarks
      • metrics are often highly cherry-picked
      • evaluate tools along multiple dimensions

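A minimal Python sketch of the lambda pattern above (all names and data are illustrative, not from the course): the same page-view count is computed once in the batch layer and once in the speed layer, and the serving layer merges the two views. Maintaining that duplicated logic in two code bases is exactly what kappa removes by keeping a single streaming layer.

```python
# Sketch of the lambda pattern: one metric, two code paths, merged at serving.
def batch_view(warehouse_rows):
    # batch layer: complete but slow (recomputed e.g. nightly from the warehouse)
    counts = {}
    for row in warehouse_rows:
        counts[row["page"]] = counts.get(row["page"], 0) + 1
    return counts


def streaming_view(recent_events):
    # speed layer: fast, but only covers events since the last batch run
    counts = {}
    for event in recent_events:
        counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts


def serving_layer(batch, streaming):
    # merge both views; note the duplicated counting logic above
    merged = dict(batch)
    for page, n in streaming.items():
        merged[page] = merged.get(page, 0) + n
    return merged


print(serving_layer(
    batch_view([{"page": "/home"}, {"page": "/pricing"}]),
    streaming_view([{"page": "/home"}]),
))  # {'/home': 2, '/pricing': 1}
```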
 

BI stack

  • start with the technology that is hardest to change


  • storage : the hardest system to change
    • data warehouse
      • traditional storage : growth of unstructured data → data lake
      • Amazon Redshift : if pipeline resides in AWS
      • Snowflake : easy to use, deep integrations
        • but cannot ingest all types of data
        • data lock-in risk
        • no local version
        • good fit for a small team with a small data volume
    • data lake : easily becomes a data swamp
    • data lakehouse : data lake + data warehouse
      • Apache Iceberg + Trino (compute engine) → can replace Snowflake (see the Trino sketch after this list)
      • able to turn object storage into a data warehouse
      • Apache Iceberg : large data volumes, long-term storage
        • ACID compliance
        • OLAP data store : ideal for analytical use cases
        • open table format : fully open source and open standard
        • large data volume
        • cost efficient : just an object storage
    • S3 : object storage. keep all raw data
      • cheaper than snowflake
      • can ingest all types of data
  • DuckDB : improves development iteration speed (see the DuckDB sketch after this list)

    • if data fits on a local machine or can be sampled
    • on-disk : the traditional way
    • OLTP : persistence is the purpose
      • row-based database
    • in-memory : temporary data store
    • OLAP : columnar database
  • orchestrator
    • Airflow
      • difficult to run in a local environment
      • have to test against production → production risk, slower iteration
      • complex to deploy and manage
    • Dagster
      • local environment → faster iteration, safe testing
      • cloud-native → simple deployment and management
      • declarative : many winning technologies are declarative (see the asset sketch after this list)
        | Imperative  | Declarative     |
        | ----------- | --------------- |
        | how         | what            |
        | exact steps | desired outcome |
  • ingestion
    • few source systems → Python code to connect to the systems via API (see the ingestion sketch after this list)
      • Why not just Python?
        • a lot of sources → hard to write custom code for each
        • API changes → need to change the code
      • so, use connector tools
    • connector tools : no custom code needed to connect to APIs
      • UI-based / code based
      • open source solution : able to fork the connectors and edit
      • library format : no infrastructure requirement
  • transformation and visualization
    • compute engine (query engine) : used to conform the data
      • Snowflake : automatic scaling, write SQL
      • Trino : no storage component
        • connected to storage system
        • easy to set up and use
        • distributed query engine
      • Spark : when data is large
        • SQL, Python code
        • distributed query engine
      • dbt : not a compute engine → needs to be connected to a compute engine
        • powerful for SQL-based transformations
    • visualization
      • hire people who already use the tool → visualization tools are not easy to rip out

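Trino sketch (for the lakehouse bullets above). A rough, hedged example of turning object storage into warehouse-style tables with Apache Iceberg and Trino, using the trino Python client. The coordinator host, catalog, schema, and table names are assumptions, and it presumes an Iceberg catalog is already configured against S3.

```python
# Sketch only: querying an Iceberg table on object storage through Trino.
# Assumes a Trino cluster with an Iceberg catalog named "iceberg" already
# configured against S3; host/schema/table names are made up.
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",  # hypothetical Trino coordinator
    port=8080,
    user="analytics",
    catalog="iceberg",
    schema="raw",
)
cur = conn.cursor()

# Create an Iceberg table backed by Parquet files in object storage
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        page VARCHAR,
        event_time TIMESTAMP(6)
    )
    WITH (format = 'PARQUET')
""")
cur.fetchall()  # make sure the DDL has finished

# The same object storage now behaves like a warehouse table
cur.execute("SELECT page, count(*) FROM page_views GROUP BY page")
for page, views in cur.fetchall():
    print(page, views)
```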
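DuckDB sketch (for the development-iteration bullet above). A minimal example, assuming a small sample of warehouse data has been pulled down to a local Parquet file named events.parquet: an in-memory connection acts as a temporary OLAP store, while passing a file path would give an on-disk, persistent database.

```python
# Sketch: DuckDB as a local, in-memory OLAP engine for fast iteration.
# Assumes a small sample of the data exists locally as events.parquet.
import duckdb

# In-memory connection = temporary store; pass a file path instead
# (e.g. duckdb.connect("dev.duckdb")) for an on-disk, persistent database.
con = duckdb.connect(":memory:")

# Query the Parquet sample directly with SQL, no server needed
result = con.execute("""
    SELECT page, count(*) AS views
    FROM 'events.parquet'
    GROUP BY page
    ORDER BY views DESC
""").fetchall()

print(result)
```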
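Dagster asset sketch (for the declarative bullet above). A minimal, illustrative example of the declarative style: you declare what data assets should exist, and Dagster derives the execution order from their dependencies. Asset names and logic are invented.

```python
# Sketch: declarative assets in Dagster. You declare *what* should exist
# (the assets) rather than *how* to schedule each step; Dagster derives the
# execution order from the dependencies. Names are illustrative.
from dagster import asset, Definitions


@asset
def raw_orders() -> list[dict]:
    # pretend this pulls data from a source system
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 35.5}]


@asset
def daily_revenue(raw_orders: list[dict]) -> float:
    # depends on raw_orders simply by naming it as a parameter
    return sum(row["amount"] for row in raw_orders)


defs = Definitions(assets=[raw_orders, daily_revenue])
```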
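Ingestion sketch (for the "Python code via API" bullet above). A hand-rolled example of why custom ingestion code becomes hard to maintain with many sources: the endpoint, pagination scheme, and fields below are hypothetical, and every upstream API change would require editing code like this, which is the case connector tools are meant to handle.

```python
# Sketch: hand-written ingestion from one REST API into a local JSONL file.
# Works fine for a few sources; with many sources, each needs code like this,
# and every upstream API change means editing it. Endpoint/params are made up.
import json
import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical source system


def ingest_orders(out_path: str = "orders.jsonl") -> int:
    page, written = 1, 0
    with open(out_path, "w") as f:
        while True:
            resp = requests.get(API_URL, params={"page": page}, timeout=30)
            resp.raise_for_status()
            rows = resp.json().get("results", [])
            if not rows:
                break
            for row in rows:
                f.write(json.dumps(row) + "\n")
                written += 1
            page += 1
    return written


if __name__ == "__main__":
    print(f"ingested {ingest_orders()} rows")
```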
 

Streaming stack

  • when : continuous, unbounded (= never-ending) events ↔ batch
    • e.g. click data, IoT data, etc.
  • latency
    • hours : streaming not needed
    • seconds : streaming needed
  • Event Ingestion
    • Kafka : distributed system (see the producer sketch after this list)
      • scales to a large number of events
      • but expensive
  • Event Processing
    • Apache Flink : stream processing engine, distributed compute engine
      • needs to be connected to data storage
      • transforms continuous, unbounded streaming events
      • but can also handle batch data : an engine that handles streaming can also handle batch
      • challenge : ordering (see the windowing sketch after this list)
        • trade-off between latency and completeness
      • at the end : connect to an OLAP database

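Producer sketch (for the Kafka bullet above). A minimal example of sending click events into Kafka with the kafka-python client; the broker address and topic name are placeholders, and a real setup would also configure keys, acks, and error handling.

```python
# Sketch: sending click events into Kafka with the kafka-python library.
# Broker address and topic are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each click is one unbounded event; producers keep sending indefinitely
producer.send("click-events", {"user_id": 42, "page": "/pricing"})
producer.flush()
```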
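Windowing sketch (for the ordering challenge above). A plain-Python illustration of the latency vs. completeness trade-off: events arrive out of order, and a window can only be finalized after waiting some extra time for stragglers; all numbers are made up.

```python
# Sketch: tumbling 10-second windows over out-of-order events.
# Waiting longer (ALLOWED_LATENESS) makes counts more complete but adds
# latency before a window's result can be emitted. Purely illustrative.
WINDOW_SECONDS = 10
ALLOWED_LATENESS = 5  # extra seconds to wait for stragglers

# (event_time, arrival_time) pairs; the second event arrives very late
events = [(3, 4), (8, 21), (12, 13), (14, 15)]


def window_of(event_time: int) -> int:
    return (event_time // WINDOW_SECONDS) * WINDOW_SECONDS


counts: dict[int, int] = {}
for event_time, arrival_time in events:
    w = window_of(event_time)
    # the window [w, w+10) is closed once arrival time passes w + 10 + lateness
    if arrival_time > w + WINDOW_SECONDS + ALLOWED_LATENESS:
        print(f"dropped late event at t={event_time}")
        continue
    counts[w] = counts.get(w, 0) + 1

print(counts)  # {0: 1, 10: 2} with these settings: one event arrived too late
```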
 

 
