
[Udemy] Data Engineering 101: The Beginner's Guide - Data Pipeline Architecture (2) ~ Trend

sennysideup 2025. 2. 4. 09:47

ML Stack

  • AI / ML / DL
    • AI : machines with human-like intelligence
      • traditional programming : we write the software (= many functions) ourselves
      • AI : we don't build the function ourselves
        • we give the answers (labels, expected output)
        • training → inference
          • training : needs a lot of data and compute
    • ML : a field within AI; learns from data rather than hand-written rules
      • structured, tabular data
      • small data, less compute required
      • a downstream consumer of the data engineering pipeline
    • DL : a type of ML, loosely based on the human brain
      • unstructured data
      • large data, large compute required

  • feature store : a DB with all the features of the data
    • feature = column
  • model training
    • exploratory work : inspect sample data, look into feature store
      • requires the most compute
      • tools
        • Jupyter notebooks : easy access to the lakehouse and query engine
        • Ray : distributed training
    • experiment tracking : an experiment = a hyperparameter-tuning run, etc.
      • automatically saves the hyperparameters and the experiment results
      • tools : MLflow, Weights & Biases (see the sketch after this list)
    • data lineage / data versioning : keeping track of data origin
      • model performance depends on data
      • tools : Dagster (orchestrator, lineage), Iceberg (table format, snapshot-based versioning)
  • model serving
    • inference
      • online inference : one input, one inference call, one prediction
      • batch inference : many inputs, one inference call, many predictions
        • saves a lot of resources
  • model monitoring
    • for drift and re-training
      • drift : the data patterns shift dramatically but the model doesn't, so it needs re-training
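
As a small illustration of the experiment-tracking step above, here is a minimal sketch using MLflow and scikit-learn (both assumed installed via pip); the experiment name, toy dataset, and hyperparameter values are illustrative, not from the course.

```python
# Minimal experiment-tracking sketch: each hyperparameter setting becomes one MLflow run,
# and the hyperparameters plus the resulting metric are recorded so runs can be compared later.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy tabular dataset standing in for features pulled from a feature store.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

for n_estimators in (50, 100, 200):
    with mlflow.start_run():
        mlflow.log_param("n_estimators", n_estimators)
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        mlflow.log_metric("accuracy", acc)  # result stored next to its hyperparameters
```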

 

Deep Learning Stack

  • labeling : often we have inputs without labels
    • labels are created by humans
  • no feature store
    • a feature store is useful for tabular data, not the unstructured data DL works with
  • model training
    • same as the ML pipeline : access the data from a notebook
    • difference : requires a lot of compute → use GPUs
  • inference
    • the model is so large that a GPU is needed not only for training but also for inference
    • to serve with low latency, both the data and the model are so large that GPUs are needed for training and inference (see the sketch after this list)
  • model monitoring : also watch for drift
  • if using an open-source model or fine-tuning a model → data engineering is still needed
    • be aware of RAG and vector DBs
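
To make the GPU-inference point above concrete, here is a minimal sketch assuming PyTorch is installed; the tiny model stands in for a large deep-learning model, and all names are illustrative.

```python
# Minimal GPU-inference sketch: move the model and the input batch to the GPU (if one
# is available), then run a forward pass without gradients to keep latency and memory low.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-in model; a real DL model would be far larger, which is why a GPU is needed.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model = model.to(device).eval()  # weights on the GPU, inference mode

batch = torch.randn(32, 512, device=device)  # inputs must live on the same device
with torch.no_grad():  # no gradients needed at inference time
    predictions = model(batch)

print(predictions.shape)  # torch.Size([32, 10])
```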

What is RAG?

  • RAG (Retrieval-Augmented Generation) : enhancing LLM performance, particularly generation quality and accuracy, by integrating an external knowledge DB
  • problems of LLMs
    • bias : may contain stereotypes and discriminatory expressions
    • hallucination : answers are not always correct, and the model may have been trained on false information
    • limited context understanding : struggles with long sentences or complex contexts
    • lack of consistency : answers may vary even for the same question
    • ethical issues : potential for misuse and difficulty in determining responsibility
  • how RAG improves LLMs (see the sketch after this list)
    • utilizing external knowledge
      • connects LLM to a vast knowledge DB
      • Retrieves relevant information from the database based on the query
    • generation based on evidence
      • uses DB's results as supporting evidence
      • specifies the source of the answer
    • enhanced context understanding
      • gains background and context information from external knowledge DB
      • generates answers based on inference ability rather than simple pattern matching
  • benefits of RAG
    • cost-effective : reduces the cost of adding new data to the LLM
    • up-to-date information : provides answers based on the latest research and news
    • strengthened user trust : enhances trust by specifying the source of the answer
    • enhanced developer control : allows developers to test and modify the model more effectively
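
The sketch below shows the bare retrieve-then-generate flow described above; it is plain Python with a toy word-overlap retriever, so the document set, the scoring, and the final LLM call are all stand-ins for a real embedding model, vector DB, and LLM API.

```python
# Bare-bones RAG flow: retrieve relevant documents, then build an augmented prompt
# that grounds the answer in external knowledge. Everything here is a toy stand-in.

documents = [
    "RAG retrieves documents from an external knowledge base.",
    "A feature store holds tabular features for ML models.",
    "Data drift means the input data distribution has shifted.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy relevance score: word overlap between query and document.
    # A real system would rank by embedding similarity in a vector DB.
    q_words = set(query.lower().split())
    scores = [len(q_words & set(doc.lower().split())) for doc in documents]
    top = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return [documents[i] for i in top]

def build_prompt(query: str) -> str:
    # Augment the prompt with retrieved evidence so the LLM can answer from it
    # and cite its sources instead of relying only on what it memorised.
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What does RAG do with a knowledge base?"))
# A real pipeline would now send this prompt to the LLM to generate the final answer.
```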

 

Other Considerations

  • data validation (see the sketch after this list)
    • type : int, null…
    • constraints : limit min/max values
    • code : is the code correct?
  • testing
    • unit testing : testing the smallest functional unit of code
      • verifying that the code block functions as the developer intended
    • statistical testing : catches issues caused by data changes
      • data can change even when the code doesn't
  • observability
    • monitoring, alerts, notifications : for issues that cannot be handled automatically
    • upstream schema changes : upstream data changes its schema → the pipeline breaks
    • detailed logs and stack traces : to identify the source of an error
    • always expect failures
    • observe the infrastructure, not just the data pipeline
      • massive volumes of data → scaling issues for the infra
  • CI/CD : automation of the deployment process
    • automate the CI/CD process : a manual process is very error-prone
      • for convenience and robustness
    • automation = more frequent releases
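
As a small illustration of the data-validation checks listed above (types, nulls, min/max constraints), here is a minimal sketch using pandas; the table, column names, and bounds are made up for the example.

```python
# Minimal data-validation sketch: check column types, nulls, and min/max constraints
# before letting a batch of data continue down the pipeline.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    errors = []

    # Type check: quantity should be an integer column.
    if not pd.api.types.is_integer_dtype(df["quantity"]):
        errors.append("quantity is not an integer column")

    # Null check: order_id must never be missing.
    if df["order_id"].isna().any():
        errors.append("order_id contains nulls")

    # Constraint check: price must stay within an expected min/max range.
    if not df["price"].between(0, 10_000).all():
        errors.append("price outside the allowed range [0, 10000]")

    return errors

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "quantity": [2, 5, 1],
    "price": [19.99, 250.0, 12_500.0],  # the last row violates the price constraint
})
print(validate_orders(df))  # ['price outside the allowed range [0, 10000]']
```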

 

Wrap Up

Trend

  • learning from software engineering best practices
    • local dev environment
    • modularization of services
    • automated CI/CD
    • aim for best practices even though there are differences between data engineering and software engineering
  • Better tooling integration
    • huge number of tools = a lot of glue code to connect them
      • glue code : code for connecting different software components or modules
      • writing glue code takes time away from the core work
    • Sidetrek : an integration layer for open-source tools
      • scaffolds an end-to-end data pipeline with a single CLI command, using open-source tools with zero setup
    • merging of application, data, and AI
      • application, data, and AI teams are siloed (separated)
        • why? the teams' skill sets are different
        • need to blend more
    • more AI
      • think about the hype cycle : AI is just getting started
      • hype cycle : how market expectations of a technology change over time

        1. technology trigger : a technology gains interest, but no commercial products exist yet
        2. the peak of inflated expectations : some companies attempt to use the technology, but most are just observing
        3. trough of disillusionment : most companies fail to adopt the technology, and only the surviving ones continue investing
        4. slope of enlightenment : the market starts to understand the technology, leading to more companies investing in it
        5. plateau of productivity : the technology establishes itself in the market, and clear evaluation standards emerge

 

 
