Generation of source data
- structured / unstructured : differ in how they are stored and searched
- structured data : tabular, 2-dimensional (rows and columns)
- use SQL
- BI, classical ML
- unstructured data : files
- use Deep Learning(Neural Networks)
- database : choosing the wrong database hurts performance
- RDBMS : Relational
- transactional data, tabular format
- relation between tables
- inflexible, strict, normalized
- single machine → performance is limited
- to grow, the machine must be replaced with a bigger one (scale vertically)
- NoSQL : Non-relational
- flexible
- distributed → scale horizontally
- Scale horizontally(Scaling out) : increasing the capacity of a system by adding machines
- Scale Vertically(Scaling up) : increasing the capacity of a system by adding capability to the machine
- multiple machines : add more machines → scale further
- non relational
- key-value store : like dictionary in python, non-relational
- memory-based : shut down DB → data is gone
- persisted
- single table = don’t need to perform any table join
- handle high volume and high concurrency
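The key-value notes above can be sketched with Python's standard library. This is only an illustration under stated assumptions, not how any particular product works: a plain `dict` stands in for a memory-based store, a single-table SQLite file stands in for a persisted one, and the names `open_store`, `kv_put`, `kv_get`, `demo` and the key `cart:42` are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open a persisted key-value store: one table, so no joins are needed."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
    return conn


def kv_put(conn, key, value):
    # INSERT OR REPLACE overwrites the value if the key already exists.
    conn.execute("INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value))
    conn.commit()


def kv_get(conn, key):
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def demo():
    # Memory-based: the data lives only in this process.
    memory_store = {"cart:42": "3 items"}
    memory_store = {}  # simulate a shutdown and restart: the data is gone

    # Persisted: the data is written to a file and survives a restart.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kv.db")
        conn = open_store(path)
        kv_put(conn, "cart:42", "3 items")
        conn.close()  # simulate a shutdown

        conn = open_store(path)  # simulate a restart
        survived = kv_get(conn, "cart:42")
        conn.close()

    return memory_store.get("cart:42"), survived


print(demo())
```

The first element of the result is what the memory-based store returns after the simulated restart, the second is what the persisted store returns.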
- third-party system : connected by API
- flow : event producers → event broker → event consumers
- challenge
- moving data between microservices, e.g. cart service → payment service
- ordering : an event that occurred first can arrive later
- delivery guarantee : exactly-once vs at-least-once delivery
- exactly-once delivery : difficult
- at-least-once : data can be duplicated
- Idempotency : running the operation repeatedly gives the same result as running it once
- insert rows : not idempotent
- merge rows : idempotent
- at-least-once delivery + idempotency → duplicates are eliminated
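A minimal sketch of why insert is not idempotent and merge is, under at-least-once delivery. The event shape and the `event_id` field are assumptions made for illustration; a real pipeline would use whatever unique key its events carry.

```python
def insert_rows(table, events):
    """Append every delivered event: a redelivered event becomes a duplicate row."""
    for event in events:
        table.append(event)
    return table


def merge_rows(table, events):
    """Upsert by event_id: a redelivered event overwrites the same row."""
    for event in events:
        table[event["event_id"]] = event
    return table


# At-least-once delivery: event "e1" is delivered twice.
delivered = [
    {"event_id": "e1", "item": "book"},
    {"event_id": "e2", "item": "pen"},
    {"event_id": "e1", "item": "book"},
]

inserted = insert_rows([], delivered)
merged = merge_rows({}, delivered)
merged_again = merge_rows(dict(merged), delivered)  # replay the whole batch

print(len(inserted))           # 3 rows, one of them a duplicate
print(len(merged))             # 2 rows
print(merged_again == merged)  # True: replaying changes nothing
```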
Storage
- hardware
- HDD : traditional disk drive
- cheaper than SSD, cloud
- slower than SSD, cloud
- SSD : laptop. newer, faster drive
- faster than HDD
- more expensive than HDD
- Memory(RAM) : laptop = hardware(HDD, SSD) + memory
- faster than SSD
- temporary data
- networking : connects storage devices to each other
- if one storage device fails, recover from the remaining ones
- improve performance by accessing the data in parallel
- serialization : data → byte stream, so it can be stored and transferred easily
- serialize on sending, de-serialize on receiving
- type
- row-based : xml, json, csv
- fast lookup of individual rows
- column-based : Parquet, ORC, Arrow
- fast aggregation by column
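The two layouts can be sketched with plain JSON. JSON is used here only to show the shape of the data, and the sample rows and the helper names `lookup_row` and `sum_column` are made up for illustration. Real columnar formats such as Parquet go further and store each column in its own chunk, so a reader can fetch only the columns it needs.

```python
import json

rows = [
    {"id": 1, "city": "Seoul", "amount": 30},
    {"id": 2, "city": "Busan", "amount": 20},
    {"id": 3, "city": "Seoul", "amount": 50},
]

# Row-based: one JSON record per line, each line holds one complete row.
row_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# Column-based: one list per column, so the values of a column sit together.
columns = {name: [r[name] for r in rows] for name in rows[0]}
column_bytes = json.dumps(columns).encode("utf-8")


def lookup_row(data, line_number):
    """Row layout: decoding a single line yields every field of that row."""
    return json.loads(data.decode("utf-8").split("\n")[line_number])


def sum_column(data, name):
    """Column layout: the values to aggregate are already one contiguous list."""
    return sum(json.loads(data.decode("utf-8"))[name])


print(lookup_row(row_bytes, 1))            # the full row with id 2
print(sum_column(column_bytes, "amount"))  # 100
```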
- cost : time & computing resources
- compression and caching
- compression
- save storage
- faster query, transport
- caching
- move data to faster storage
- faster storage : smaller and more expensive → not all data can be moved there
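Both ideas can be tried with the standard library. This is a sketch under assumptions: `zlib` stands in for whatever codec a storage system actually uses, `functools.lru_cache` stands in for a cache in faster storage, and `read_from_slow_storage` is a hypothetical placeholder for a slow disk or network read.

```python
import zlib
from functools import lru_cache

# Compression: repetitive data shrinks a lot, and decompression restores it exactly.
raw = b"2024-01-01,Seoul,30\n" * 1000
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)
print(len(raw), len(compressed), restored == raw)

# Caching: keep recently used values in a small, fast store.
stats = {"slow_reads": 0}


def read_from_slow_storage(key):
    stats["slow_reads"] += 1  # count how often the slow path is taken
    return f"value-for-{key}"


@lru_cache(maxsize=2)  # the cache is small, so it cannot hold all the data
def cached_read(key):
    return read_from_slow_storage(key)


for key in ["a", "a", "b", "a"]:
    cached_read(key)

print(stats["slow_reads"])  # 2: only the first read of "a" and of "b" was slow
```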
- distributed storage : storage that scales horizontally
- bigger data : need horizontal scaling
- vertical scaling : need bigger machine
- horizontal scaling : needs more machines + a network between them
- +distributed compute : store, retrieve, process data faster
- limits of parallelization : not every task can be done in parallel
- strong consistency : single server
- new data → write once
- ask for same row → same result
- sacrifices performance
- eventual consistency : multiple servers
- new data → written to every server
- if one server fails to write → asking for the same row can return different results (inconsistent) → the failed server retries until it succeeds (takes time) → eventually consistent
- tradeoff : consistency ↔ performance
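The failed-write scenario above can be modeled with a toy cluster. This is a deliberately simplified sketch, not how a real distributed database replicates data: the `Cluster` class, the `failing` parameter used to inject a failure, and the row key are all hypothetical.

```python
class Cluster:
    """Toy model: every write goes to all servers, failed writes are retried later."""

    def __init__(self, server_count):
        self.servers = [{} for _ in range(server_count)]
        self.retry_queue = []  # (server_index, key, value) for writes that failed

    def write(self, key, value, failing=()):
        for index, server in enumerate(self.servers):
            if index in failing:
                # This server missed the write, remember it for a later retry.
                self.retry_queue.append((index, key, value))
            else:
                server[key] = value

    def read(self, server_index, key):
        return self.servers[server_index].get(key)

    def retry_failed_writes(self):
        for index, key, value in self.retry_queue:
            self.servers[index][key] = value
        self.retry_queue.clear()


cluster = Cluster(server_count=3)
cluster.write("row:1", "v1", failing={2})  # server 2 fails to write

before = [cluster.read(i, "row:1") for i in range(3)]
cluster.retry_failed_writes()
after = [cluster.read(i, "row:1") for i in range(3)]

print(before)  # ['v1', 'v1', None]: same row, different answers
print(after)   # ['v1', 'v1', 'v1']: eventually consistent
```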
- ACID / BASE
- ACID : Atomicity, Consistency, Isolation, Durability
- single machine, transactional DB
- strong consistency
- BASE : Basically Available, Soft-state, Eventual consistency
- distributed DB
- Type
- file : tree structure
- fast data retrieval
- limited scalability
- block : HDD, SSD
- fast access
- transactional DB
- object : most important to Data Engineer
- all kinds of shape and size
- VS file storage
- scale : file = limited / object = no limit → ideal for data engineering
- structure : file = tree / object = flat (no nesting)
- latency : file = fast / object = slow (need to search all objects)
- updates : file = mutable (modified in place) / object = immutable (replaced in its entirety)
- cache : ram, memory
- faster than hard drive
- temporary data → data loss
- Redis etc
- streaming
- often ephemeral
- persisted by storing in object storage
- buffering : massive amounts of data arriving at the same time → need a buffer to keep the system from being overloaded
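A small simulation of the buffering idea, assuming a consumer that can handle only a fixed number of events per tick. The function name `run_with_buffer`, the burst sizes, and the capacity are made-up values for illustration.

```python
from collections import deque


def run_with_buffer(bursts, capacity_per_tick):
    """Queue incoming bursts and drain at most capacity_per_tick events per tick."""
    buffer = deque()
    processed = []
    peak = 0  # largest number of events waiting at any moment

    for burst in bursts:
        buffer.extend(burst)  # the burst lands in the buffer, not on the consumer
        peak = max(peak, len(buffer))
        for _ in range(min(capacity_per_tick, len(buffer))):
            processed.append(buffer.popleft())

    # Drain whatever is still waiting after the last burst.
    while buffer:
        for _ in range(min(capacity_per_tick, len(buffer))):
            processed.append(buffer.popleft())

    return processed, peak


bursts = [list(range(8)), [], [8, 9], []]  # 8 events at once, then a trickle
processed, peak = run_with_buffer(bursts, capacity_per_tick=3)

print(processed)  # every event is handled, in arrival order
print(peak)       # 8: the buffer absorbed the spike
```

The consumer never sees more than three events per tick, even though eight arrive at once.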
- OLTP(row based DB) vs OLAP(columnar DB)
- finding specific row : easy in OLTP, inefficient in OLAP
- aggregate data : easy in OLAP
- OLTP : online transaction processing
- SSD : fast but expensive. speed is important
- OLAP : online analytical processing
- HDD : slow but cheap. volumes are important
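The two workloads differ mainly in the shape of their queries, which can be shown with an in-memory SQLite table. SQLite is a row-based engine, so this sketch illustrates only the query patterns, not the storage layout of an OLAP system. The `orders` table and its values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (id, city, amount) VALUES (?, ?, ?)",
    [(1, "Seoul", 30), (2, "Busan", 20), (3, "Seoul", 50)],
)

# OLTP-style query: find one specific row by its key.
one_order = conn.execute(
    "SELECT id, city, amount FROM orders WHERE id = ?", (2,)
).fetchone()

# OLAP-style query: aggregate one column across many rows.
totals = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()

conn.close()

print(one_order)  # (2, 'Busan', 20)
print(totals)     # [('Busan', 20), ('Seoul', 80)]
```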
- Data warehouse, Data lakehouse
- DW : use in Business Analytics
- data moves from OLTP → OLAP for aggregation
- hard to manage unstructured data like logs and free text
- costly → stores well-organized data, not raw data
- Data lake
- inexpensive
- flexible → dumping all raw data → data swamps
- Data swamp : a data lake that has deteriorated and become unmanageable due to a lack of proper data management
- store unstructured data
- Data lakehouse
- OLAP
- combine advantages of Data Warehouse + Data lake
- separation of compute from storage
- storage & compute in same machine
- no networking → improve performance
- fast, low latency disk read
- high bandwidth
- bandwidth : the maximum amount of data that can be transmitted over a connection in a given amount of time
- separation : comes at the cost of performance (data has to travel over the network)
- storage and compute can be scaled independently
- serverless pattern : no need to run a server 24/7, run compute only when you need it
- Data Storage lifecycle
- hot : accessed often → high cost, fast retrieval
- warm : accessed infrequently → inexpensive
- cold : rarely accessed → cheap
- accessing it more often than expected → large penalty (retrieval fees, slow access)
- Amazon S3 Glacier etc
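A lifecycle rule can be sketched as a function that picks a tier from how recently the data was accessed. The thresholds of 30 and 180 days and the name `choose_tier` are assumptions for illustration only; real cloud lifecycle policies are configured per bucket and use their own rules.

```python
def choose_tier(days_since_last_access, warm_after=30, cold_after=180):
    """Pick a storage tier from access recency (thresholds are hypothetical)."""
    if days_since_last_access < warm_after:
        return "hot"   # accessed often: pay more for fast retrieval
    if days_since_last_access < cold_after:
        return "warm"  # accessed infrequently: cheaper, slower
    return "cold"      # rarely accessed: cheapest, retrieval carries a penalty


for days in (1, 45, 400):
    print(days, choose_tier(days))
```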