ML stack
- AI / ML / DL
- AI : machines with human-like intelligence
- programming : using software (= many functions)
- AI : we don't build the functions ourselves
- instead, we give the answers (labels, expected output)
- training → inference
- training : needs a lot of data and compute
- ML : field within AI; learns from data rather than hand-written rules
- structured, tabular data
- small data, less compute required
- downstream consumer of the data engineering pipeline
- DL : type of ML, loosely based on the human brain
- unstructured data
- large data, large compute required
- feature store : DB holding all the features of the data
- feature = column
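A minimal sketch of reading features from a feature store, using Feast as an example; the repo path, feature view name, and entity key are made up for illustration.

```python
# Sketch: fetch online features for one entity from a Feast feature store.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a feature repo defined elsewhere

features = store.get_online_features(
    features=[
        "user_stats:avg_order_value",   # feature = column in a feature view
        "user_stats:orders_last_30d",
    ],
    entity_rows=[{"user_id": 1001}],
).to_dict()

print(features)
```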
- model training
- exploratory work : inspect sample data, look into feature store
- requires the most compute
- tools
- Jupyter notebook : easy access to the lakehouse / query engine
- Ray : distributed training
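A minimal sketch of how Ray spreads work across workers; the per-shard "training" function is a stand-in, not a real training loop.

```python
# Sketch: distribute work across Ray workers.
import ray

ray.init()

@ray.remote
def train_shard(shard):
    # placeholder "training": just compute a statistic over the shard
    return sum(shard) / len(shard)

shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
results = ray.get([train_shard.remote(s) for s in shards])
print(results)
```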
- Experiment tracking : experiment = hyperparameter tuning, etc.
- automatically saves hyperparameters and experiment results
- tools : MLflow, Weights & Biases (see the sketch below)
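A minimal sketch of experiment tracking with MLflow; the parameter and metric values are made up.

```python
# Sketch: log hyperparameters and results for one experiment run with MLflow.
import mlflow

with mlflow.start_run(run_name="lr-tuning"):
    mlflow.log_param("learning_rate", 0.01)   # hyperparameter
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("val_accuracy", 0.93)   # experiment result
```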
- data lineage / data versioning : keeping track of data origin
- model performance depends on data
- use orchestrators and table formats : Dagster (asset lineage), Iceberg (table snapshots for data versioning)
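A minimal sketch of asset-based lineage in Dagster: downstream assets declare their upstream inputs, so the origin of the data is tracked. The asset names are hypothetical.

```python
# Sketch: Dagster assets; the dependency of features on raw_orders is the lineage.
from dagster import asset

@asset
def raw_orders():
    # upstream data (origin)
    return [{"order_id": 1, "amount": 9.99}]

@asset
def features(raw_orders):
    # downstream asset: Dagster records that it depends on raw_orders
    return [{"order_id": r["order_id"], "amount_bucket": int(r["amount"] // 10)}
            for r in raw_orders]
```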
- model serving
- inference
- online inference : one input, one model inference, one prediction
- batch inference : many inputs, one model invocation, many predictions
- saves a lot of resources
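A minimal sketch contrasting online and batch inference with scikit-learn; the model and data are toy stand-ins.

```python
# Sketch: online inference predicts one input at a time; batch inference
# runs one call over many inputs, which amortizes overhead.
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(np.random.rand(100, 3), np.random.randint(0, 2, 100))

# online inference: one input → one prediction
single = model.predict(np.random.rand(1, 3))

# batch inference: many inputs → many predictions in one call
batch = model.predict(np.random.rand(10_000, 3))
print(single.shape, batch.shape)
```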
- model monitoring
- for drift and re-training
- drift : data patterns shift dramatically, but the model doesn't → performance degrades
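A minimal sketch of a drift check: compare a feature's distribution in serving data against the training data with a two-sample KS test from SciPy; the data and the threshold are made up.

```python
# Sketch: detect feature drift by comparing training vs. serving distributions.
import numpy as np
from scipy.stats import ks_2samp

train_feature = np.random.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = np.random.normal(loc=0.8, scale=1.0, size=5_000)  # shifted → drift

result = ks_2samp(train_feature, serving_feature)
if result.pvalue < 0.01:  # arbitrary threshold
    print("drift detected → consider re-training")
```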
Deep Learning Stack
- labeling : often we have inputs without labels
- labels are created by humans
- No feature store
- feature store is useful for tabular data
- model training
- same as the ML pipeline : access data through notebooks
- difference : requires a lot of compute → use GPUs
- inference
- the model is so large → GPUs are needed not only for training but also for inference
- to serve with low latency : the data & model are so large → GPUs are needed for both training and inference
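A minimal sketch of GPU inference with PyTorch; the tiny linear model stands in for a large DL model.

```python
# Sketch: move the model and inputs to the GPU (when available) for inference.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 10).to(device)  # stand-in for a large model
model.eval()

with torch.no_grad():                        # no gradients needed at inference time
    x = torch.randn(32, 512, device=device)
    preds = model(x).argmax(dim=1)
print(preds.shape)
```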
- model monitoring : also drift
- if using an open-source model or fine-tuning a model → data engineering is needed
- be aware of RAG and vector DBs
What is RAG
- RAG (Retrieval-Augmented Generation) : enhancing LLM performance, particularly generation quality and accuracy, by integrating an external knowledge DB
- Problems with LLMs
- bias : May contain stereotypes and discriminatory expressions
- hallucination : Answers are not always correct and may be trained on false information
- limitations of understanding context : Struggles with long sentences or complex contexts
- lack of consistency : Answers may vary even for the same question
- ethical issue : Potential for misuse and difficulty in determining responsibility
- How RAG improves LLMs
- utilizing external knowledge
- connects LLM to a vast knowledge DB
- Retrieves relevant information from the database based on the query
- generation based on evidence
- uses DB's results as supporting evidence
- specifies the source of the answer
- enhanced context understanding
- gains background and context information from external knowledge DB
- generates answers based on inference ability rather than simple pattern matching
- benefits of RAG
- cost-effective : reduces the cost of adding new data to the LLM
- latest information : provides answers based on latest research and news
- strengthened user trust : enhances trust by specifying the source of the answer
- enhanced developer control : allows developers to test and modify the model more effectively
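A minimal sketch of the RAG flow: retrieve relevant passages, then build a grounded prompt for the LLM. Retrieval here is a toy keyword-overlap score; a real system would use embeddings and a vector DB, and the documents and query are made up.

```python
# Sketch: retrieve supporting passages and inject them into the prompt.
documents = [
    "Iceberg tables support snapshots, which enable data versioning.",
    "Batch inference runs one model invocation over many inputs.",
    "MLflow tracks hyperparameters and metrics for each experiment run.",
]

def retrieve(query, docs, k=2):
    # toy relevance score: count of query words appearing in each document
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

query = "How does MLflow track experiments?"
context = retrieve(query, documents)

# retrieved passages become the evidence the LLM must answer from
prompt = (
    "Answer using only the context below.\n\n"
    "Context:\n" + "\n".join(context) +
    f"\n\nQuestion: {query}"
)
print(prompt)
```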
Other Considerations
- data validations
- type : int, null…
- constraints : min/max value limits
- code : is the code correct?
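A minimal sketch of row-level type and constraint checks; the column names and limits are made up.

```python
# Sketch: validate one row against type and range constraints.
def validate_row(row: dict) -> list:
    errors = []
    if row.get("user_id") is None:
        errors.append("user_id must not be null")
    if not isinstance(row.get("age"), int):
        errors.append("age must be an int")
    elif not (0 <= row["age"] <= 120):
        errors.append("age out of range [0, 120]")
    return errors

print(validate_row({"user_id": 1, "age": 34}))       # []
print(validate_row({"user_id": None, "age": 999}))   # two errors
```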
- testing
- unit testing : testing the smallest functional unit of code
- verifying that the code block functions as intended by the developer
- statistical testing : catches issues caused by data changes
- data can change without code change
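A minimal sketch of a unit test plus a simple statistical test with pytest; the function under test and the thresholds are hypothetical.

```python
# Sketch: pytest-style tests for a small transformation and for data statistics.
import statistics

def clean_amount(raw: str) -> float:
    return round(float(raw.replace("$", "").replace(",", "")), 2)

def test_clean_amount():
    # unit test: smallest functional unit behaves as the developer intended
    assert clean_amount("$1,234.567") == 1234.57

def test_amount_distribution():
    # statistical test: data can change without any code change,
    # so assert the batch mean stays within an expected range
    batch = [9.99, 12.50, 11.00, 10.25]
    assert 5.0 < statistics.mean(batch) < 50.0
```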
- observability
- monitoring, alerts, notifications : for issues that cannot be handled automatically
- upstream schema changes : upstream data changes its schema → breaks the pipeline
- detailed logs and stack traces : identify source of error
- always expect failures
- observe infra, not just data pipeline
- massive volumes of data → scaling issues for infra
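A minimal sketch of detailed logging around a pipeline step: a hypothetical schema check fails and the full stack trace is recorded so the source of the error can be identified.

```python
# Sketch: log failures with stack traces instead of letting them pass silently.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def load_upstream(batch):
    # hypothetical guard against an upstream schema change
    expected = {"id", "amount", "created_at"}
    missing = expected - set(batch[0].keys())
    if missing:
        raise ValueError(f"upstream schema changed, missing columns: {missing}")
    return batch

try:
    load_upstream([{"id": 1, "amount": 9.99}])
except Exception:
    # logger.exception records the full stack trace for debugging
    logger.exception("pipeline step failed")
```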
- CI/CD : automation of deployment process
- automate the CI/CD process : manual processes are very error-prone
- for convenience and robustness
- automation enables more frequent releases
Wrap Up
Trend
- learning from software engineering best practices
- local dev environment
- Modularization of services
- Automated CI/CD
- aiming for best practices even though there are differences between data engineering and software engineering
- Better tooling integration
- huge number of tools = a lot of glue code to connect tools
- glue code : code for connecting different software components or modules
- glue code requirements → take time away from the core work
- Sidetrek : integration layer for open-source tools
- scaffolds an end-to-end data pipeline with a single CLI command, using open-source tools with zero setup
- merging between application, data and AI
- application, data, and AI teams : siloed (separated)
- why? the teams' skill sets are different
- need to blend more
- more AI
- Think about hype cycle : AI is just getting started
- hype cycle : how market expectations about a technology change over time
- technology trigger : the technology gains interest but no commercial products exist yet
- the peak of inflated expectations : some companies attempt to use the technology, but most are just observing
- trough of disillusionment : most companies fail to adopt the technology, and only the surviving ones continue investing
- slope of enlightenment : the market starts to understand the technology, leading to more companies investing in it
- plateau of productivity : the technology establishes itself in the market, and clear evaluation standards emerge
Reference
- https://aws.amazon.com/ko/what-is/retrieval-augmented-generation/
- https://aws.amazon.com/ko/what-is/unit-testing/
- https://subokim.wordpress.com/2017/12/21/gartner-hype-cycle/
- https://www.wearedevelopers.com/dictionary/glue-code