data architecture
- what is good data architecture
- performance : using computing and storage resources efficiently
- trade-off between performance and complexity
- scalability : data volumes fluctuate
- upstream system failure → sudden increase in data volume
- scale up/down should be automatic : scaling down can save a lot of money
- reliability : system stays available & avoids failure
- Automate as much as possible → reduce human errors
- expect failure : no system is 100% reliable
- plan for alerts, notifications, and recovery
- security : follow security principles
- the most important thing
- modularity : connect systems via APIs (treat each as an external system)
- cost effectiveness : direct costs, indirect costs (time)
- be aware of the trade-off (direct cost vs. operational complexity)
- flexibility : extensibility of architecture
- architectures that solve the limitations of Hadoop : groundwork for the modern data stack
- lambda : separates streaming and batch workloads → combine in serving
- streaming : sends data directly to the serving layer
- batch : sends data to the data warehouse
- two code bases → difficult to manage (toy sketch below)
- kappa : a single streaming layer → solves lambda's two-code-base problem
- difficult & expensive → not widely adopted
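To make the two-code-base pain concrete, a toy Python sketch (all names hypothetical) of how the lambda serving layer merges the batch and streaming views at query time:

```python
# toy illustration: the serving layer combines a precomputed batch view
# with an incremental streaming view
batch_view = {"user_42": 100}  # rebuilt periodically by the batch layer
speed_view = {"user_42": 3}    # updated continuously by the streaming layer

def serve_click_count(user: str) -> int:
    # both layers compute "click count" independently and must stay in
    # sync -- exactly the maintenance burden lambda is criticized for
    return batch_view.get(user, 0) + speed_view.get(user, 0)

print(serve_click_count("user_42"))  # 103
```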
- modern data stack : from monolithic, proprietary tools → cloud-based, easy-to-use tools
- goal : reduce complexity and increase modularity → reduce problems, cost & more flexible architecture
- factors you should consider
- speed : make decisions fast, and be able to reverse a decision if required
- modular tools, not monolithic system
- open source tools : flexible, but you still pay for the cloud they run on
- size of team : small team → simple system
- complexity is the enemy of speed
- simple but modular and flexible → can upgrade quickly
- integration : does the tool play nicely with others?
- pick one central tool for the pipeline
- choose other tools that integrate with the central tool
- watch out for vendor lock-in : don't get stuck with tools
- vendor lock-in : the cost of changing vendors is so high that you become stuck with the original vendor
- cost of changing vendor
- financial
- human resources
- business shutdown risk
- Why you should worry about vendor lock-in
- low quality of vendor service
- complete shut down of the vendor
- massive increase in costs
- changes to the product that make it unable to meet business requirements
- how to avoid vendor lock-in
- carefully review the cloud service before signing a contract
- ensure that you can migrate your data easily
- back up all data
- use a multi-cloud or hybrid cloud approach
- benchmarks
- metrics : often highly cherry-picked
- evaluate tools on multiple dimensions
BI stack
- start with the technologies that are harder to change
- storage : the hardest system to change
- data warehouse
- traditional storage : as unstructured data grows → data lake
- Amazon Redshift : if pipeline resides in AWS
- Snowflake : easy to use, deep integrations
- but cannot ingest all types of data
- data lock-in risk
- no local version
- fits a small team with small data volumes
- data lake : easily becomes a data swamp
- data lakehouse : data lake + data warehouse
- Apache Iceberg + Trino (compute engine) → can replace Snowflake
- able to turn object storage into a data warehouse
- Apache Iceberg : for large data volumes, long term (pyiceberg sketch below)
- ACID compliance
- OLAP data store : ideal for analytical use cases
- open table format : fully open source and open standard
- large data volume
- cost efficient : just an object storage
- S3 : object storage, keeps all raw data
- cheaper than Snowflake
- can ingest all types of data
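A minimal sketch of the lakehouse idea with the pyiceberg library — the catalog endpoints and table name here are assumptions, adjust to your setup:

```python
from pyiceberg.catalog import load_catalog

# hypothetical REST catalog over S3-compatible object storage
catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",          # Iceberg REST catalog
        "s3.endpoint": "http://localhost:9000",  # object storage endpoint
    },
)

# read an Iceberg table straight off object storage -- no warehouse needed
table = catalog.load_table("analytics.events")
df = table.scan().to_pandas()
```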
- DuckDB : improves development iteration speed (sketch below)
- if the data fits on a local machine, or can be sampled down to fit
- on-disk : the traditional way
- OLTP : persistence is the purpose
- row-based database
- in-memory : temporary data store
- OLAP : columnar database
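A quick local-iteration sketch with DuckDB (the sample.csv file is hypothetical):

```python
import duckdb

# duckdb.connect() is in-memory by default: a columnar OLAP store
# that's ideal for fast iteration on local or sampled data
con = duckdb.connect()
con.execute("CREATE TABLE events AS SELECT * FROM read_csv_auto('sample.csv')")
print(con.execute(
    "SELECT action, count(*) AS n FROM events GROUP BY action ORDER BY n DESC"
).fetchall())
```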
- orchestrator
- airflow
- difficult to run in a local environment
- have to test against production → production risk, slower iteration
- complex to deploy and manage
- dagster
- local environment → faster iteration, safe testing
- cloud-native → simple deployment and management
- declarative : the approach behind many winning technologies (Dagster sketch below)
| Imperative | Declarative |
| --- | --- |
| how | what |
| exact steps | desired outcome |
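A minimal sketch of the declarative style using Dagster's asset API (asset names are made up): you declare *what* should exist and how assets depend on each other; Dagster works out how and when to materialize them.

```python
import dagster as dg

@dg.asset
def raw_events():
    # declare the asset; Dagster handles scheduling and materialization
    return [{"user": "a", "action": "click"}, {"user": "b", "action": "view"}]

@dg.asset
def daily_clicks(raw_events):
    # the dependency is declared just by naming the upstream asset as a parameter
    return sum(1 for e in raw_events if e["action"] == "click")

defs = dg.Definitions(assets=[raw_events, daily_clicks])
```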
- ingestion
- few source systems → Python code to connect to each system via its API
- Why not Python?
- many sources → hard to write and maintain custom code
- API changes → code must change too
- so, use connector tools (sketch below)
- connector tools : no custom code needed to connect to each API
- UI-based / code based
- open source solution : able to fork the connectors and edit
- library format : no infrastructure requirement
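A sketch of both sides of the argument: a hand-rolled connector against a hypothetical REST API, then the same ingestion through dlt, one example of a library-format connector tool (no infrastructure, just pip-installable).

```python
import dlt
import requests

# hypothetical REST source: with dozens of sources, maintaining code like
# this (auth, pagination, schema drift after every API change) is the real cost
def fetch_orders():
    resp = requests.get("https://api.example.com/orders")
    resp.raise_for_status()
    yield from resp.json()

# a library-format connector tool: schema inference and loading are handled,
# and it runs anywhere Python runs
pipeline = dlt.pipeline(pipeline_name="orders", destination="duckdb", dataset_name="raw")
pipeline.run(fetch_orders(), table_name="orders")
```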
- transformation and visualization
- compute engine (query engine) : for conforming the data
- Snowflake : automatic scaling, just write SQL
- Trino : no storage component (client sketch below)
- connects to an existing storage system
- easy to set up and use
- distributed query engine
- Spark : when data is large
- SQL, Python code
- distributed query engine
- dbt : not a compute engine → must be connected to one
- makes SQL transformations much more powerful
- visualization
- you hire people who already use the tool → not easy to rip it out later
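A minimal sketch of querying through Trino's Python client — host, catalog, and table names are assumptions. Trino itself stores nothing; it only computes over whatever storage the catalog points at.

```python
import trino

conn = trino.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="iceberg", schema="analytics",  # e.g. Iceberg tables on S3
)
cur = conn.cursor()
cur.execute("SELECT action, count(*) FROM events GROUP BY action")
print(cur.fetchall())
```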
Streaming stack
- when : continuous, unbounded (= never-ending) events, as opposed to batch
- e.g. : click data, IoT data, etc.
- latency
- latency of hours acceptable : no need for streaming
- latency of seconds required : streaming needed
- Event Ingestion
- Kafka : distributed system
- scales to a large number of events (producer sketch below)
- but expensive
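A minimal producer sketch with the kafka-python client (broker address and topic name are assumptions):

```python
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# each click event becomes one message on the topic
producer.send("click-events", {"user_id": 42, "page": "/home"})
producer.flush()  # block until the event is actually delivered
```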
- Event Processing
- Apache Flink : streaming processing engine, distributed compute engine
- needs to be connected to data storage
- transforms continuous, unbounded streaming events
- but also handles batch data : batch is treated as a bounded stream (sketch below)
- challenge : ordering
- latency and completeness tradeoff
- at the end : connect to an OLAP database
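A minimal PyFlink sketch of keyed stream processing. Here a bounded collection stands in for an unbounded source, which also illustrates the batch-as-bounded-stream point: the same operators work on both.

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# hypothetical click events; in production this would be a Kafka source
events = env.from_collection([("user_1", 1), ("user_2", 1), ("user_1", 1)])

counts = (
    events
    .key_by(lambda e: e[0])                    # partition by user
    .reduce(lambda a, b: (a[0], a[1] + b[1]))  # running count per user
)
counts.print()  # in practice: sink to an OLAP database instead
env.execute("click_counts")
```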