
[Udemy] Data Engineering 101: The Beginner's Guide - Data Pipeline Architecture (2) ~ Trend

sennysideup 2025. 2. 4. 09:47

ML Stack

  • AI / ML / DL
    • AI : machines with human-like intelligence
      • traditional programming : we write the software (= many functions) ourselves
      • AI : we don't build the function ourselves
        • we give the answers (labels, expected output)
        • training → inference
          • training : needs a lot of data and compute
    • ML : a field within AI; learns from data rather than hand-written rules
      • structured, tabular data
      • small data, less compute required
      • a downstream consumer of the data engineering pipeline
    • DL : a type of ML, loosely based on the human brain
      • unstructured data
      • large data, large compute required

  • feature store : a DB with all the features of the data
    • feature = column
  • model training
    • exploratory work : inspect sample data, look into feature store
      • requires the most compute
      • tools
        • Jupyter notebooks : easy access to the lakehouse and query engine
        • Ray : distributed training
    • experiment tracking : an experiment = a hyperparameter-tuning run, etc.
      • automatically saves the hyperparameters and the experiment results
      • tools : MLflow, Weights & Biases (see the sketch after this list)
    • data lineage / data versioning : keeping track of data origin
      • model performance depends on data
      • tools : Dagster (orchestrator, lineage), Iceberg (table format, snapshot-based versioning)
  • model serving
    • inference
      • online inference : one input, one inference call, one prediction
      • batch inference : many inputs, one inference call, many predictions
        • saves a lot of resources
  • model monitoring
    • for drift and re-training
      • drift : the data patterns shift dramatically but the model doesn't, so it needs re-training
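
As a small illustration of the experiment-tracking step above, here is a minimal sketch using MLflow and scikit-learn (both assumed installed via pip); the experiment name, toy dataset, and hyperparameter values are illustrative, not from the course.

```python
# Minimal experiment-tracking sketch: each hyperparameter setting becomes one MLflow run,
# and the hyperparameters plus the resulting metric are recorded so runs can be compared later.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy tabular dataset standing in for features pulled from a feature store.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

for n_estimators in (50, 100, 200):
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n_estimators)
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_metric("accuracy", acc)  # result stored next to its hyperparameters
```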

 

Deep Learning Stack

  • labeling : often we have inputs without labels
    • labels are created by humans
  • no feature store
    • a feature store is useful for tabular data, not the unstructured data DL works with
  • model training
    • same as the ML pipeline : access the data from a notebook
    • difference : requires a lot of compute → use GPUs
  • inference
    • the model is so large that a GPU is needed not only for training but also for inference
    • to serve with low latency, both the data and the model are so large that GPUs are needed for training and inference (see the sketch after this list)
  • model monitoring : also watch for drift
  • if using an open-source model or fine-tuning a model → data engineering is still needed
    • be aware of RAG and vector DBs
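
To make the GPU-inference point above concrete, here is a minimal sketch assuming PyTorch is installed; the tiny model stands in for a large deep-learning model, and all names are illustrative.

```python
# Minimal GPU-inference sketch: move the model and the input batch to the GPU (if one
# is available), then run a forward pass without gradients to keep latency and memory low.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-in model; a real DL model would be far larger, which is why a GPU is needed.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model = model.to(device).eval()  # weights on the GPU, inference mode

batch = torch.randn(32, 512, device=device)  # inputs must live on the same device
with torch.no_grad():  # no gradients needed at inference time
    predictions = model(batch)

print(predictions.shape)  # torch.Size([32, 10])
```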

What is RAG?

  • RAG (Retrieval-Augmented Generation) : enhancing LLM performance, particularly generation quality and accuracy, by integrating an external knowledge DB
  • problems of LLMs
    • bias : may contain stereotypes and discriminatory expressions
    • hallucination : answers are not always correct, and the model may have been trained on false information
    • limited context understanding : struggles with long sentences or complex contexts
    • lack of consistency : answers may vary even for the same question
    • ethical issues : potential for misuse and difficulty in determining responsibility
  • how RAG improves LLMs (see the sketch after this list)
    • utilizing external knowledge
      • connects LLM to a vast knowledge DB
      • Retrieves relevant information from the database based on the query
    • generation based on evidence
      • uses DB's results as supporting evidence
      • specifies the source of the answer
    • enhanced context understanding
      • gains background and context information from external knowledge DB
      • generates answers based on inference ability rather than simple pattern matching
  • benefits of RAG
    • cost-effective : reduces the cost of adding new data to the LLM
    • up-to-date information : provides answers based on the latest research and news
    • strengthened user trust : enhances trust by specifying the source of the answer
    • enhanced developer control : allows developers to test and modify the model more effectively
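
The sketch below shows the bare retrieve-then-generate flow described above; it is plain Python with a toy word-overlap retriever, so the document set, the scoring, and the final LLM call are all stand-ins for a real embedding model, vector DB, and LLM API.

```python
# Bare-bones RAG flow: retrieve relevant documents, then build an augmented prompt
# that grounds the answer in external knowledge. Everything here is a toy stand-in.

documents = [
    "RAG retrieves documents from an external knowledge base.",
    "A feature store holds tabular features for ML models.",
    "Data drift means the input data distribution has shifted.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy relevance score: word overlap between query and document.
    # A real system would rank by embedding similarity in a vector DB.
    q_words = set(query.lower().split())
    scores = [len(q_words & set(doc.lower().split())) for doc in documents]
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return [documents[i] for i in top]

def build_prompt(query: str) -> str:
    # Augment the prompt with retrieved evidence so the LLM can answer from it
    # and cite its sources instead of relying only on what it memorised.
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does RAG do with a knowledge base?"))
# A real pipeline would now send this prompt to the LLM to generate the final answer.
```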

 

Other Considerations

  • data validation (see the sketch after this list)
    • type : int, null…
    • constraints : limit min/max values
    • code : is the code correct?
  • testing
    • unit testing : testing the smallest functional unit of code
      • verifying that the code block functions as the developer intended
    • statistical testing : catches issues caused by data changes
      • data can change even when the code doesn't
  • observability
    • monitoring, alerts, notifications : for issues that cannot be handled automatically
    • upstream schema changes : upstream data changes its schema → the pipeline breaks
    • detailed logs and stack traces : to identify the source of an error
    • always expect failures
    • observe the infrastructure, not just the data pipeline
      • massive volumes of data → scaling issues for the infra
  • CI/CD : automation of the deployment process
    • automate the CI/CD process : a manual process is very error-prone
      • for convenience and robustness
    • automation = more frequent releases
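
As a small illustration of the data-validation checks listed above (types, nulls, min/max constraints), here is a minimal sketch using pandas; the table, column names, and bounds are made up for the example.

```python
# Minimal data-validation sketch: check column types, nulls, and min/max constraints
# before letting a batch of data continue down the pipeline.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    errors = []

    # Type check: quantity should be an integer column.
    if not pd.api.types.is_integer_dtype(df["quantity"]):
        errors.append("quantity is not an integer column")

    # Null check: order_id must never be missing.
    if df["order_id"].isna().any():
        errors.append("order_id contains nulls")

    # Constraint check: price must stay within an expected min/max range.
    if not df["price"].between(0, 10_000).all():
        errors.append("price outside the allowed range [0, 10000]")

    return errors

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "quantity": [2, 5, 1],
    "price": [19.99, 250.0, 12_500.0],  # the last row violates the price constraint
})
print(validate_orders(df))  # ['price outside the allowed range [0, 10000]']
```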

 

Wrap Up

Trend

  • learning from software engineering best practices
    • local dev environment
    • modularization of services
    • automated CI/CD
    • aim for best practices even though there are differences between data engineering and software engineering
  • Better tooling integration
    • huge number of tools = a lot of glue code to connect them
      • glue code : code for connecting different software components or modules
      • writing glue code takes time away from the core work
    • Sidetrek : an integration layer for open-source tools
      • scaffolds an end-to-end data pipeline with a single CLI command, using open-source tools with zero setup
    • merging of application, data, and AI
      • application, data, and AI teams are siloed (separated)
        • why? the teams' skill sets are different
        • need to blend more
    • more AI
      • think about the hype cycle : AI is just getting started
      • hype cycle : how market expectations of a technology change over time

        1. technology trigger : a technology gains interest, but no commercial products exist yet
        2. the peak of inflated expectations : some companies attempt to use the technology, but most are just observing
        3. trough of disillusionment : most companies fail to adopt the technology, and only the surviving ones continue investing
        4. slope of enlightenment : the market starts to understand the technology, leading to more companies investing in it
        5. plateau of productivity : the technology establishes itself in the market, and clear evaluation standards emerge

 

 
