Definition
Memory Hierarchy is the structured arrangement of memory from fastest but smallest (registers and CPU caches) to slowest but largest (disk and archival storage). Each level trades off access speed, capacity, and cost. Understanding the hierarchy is essential for system performance because memory access time dominates many operations.
Memory Hierarchy Levels
Level 0: Registers
What: Ultra-fast on-chip memory inside CPU
Size: 32-512 bytes (a handful of values)
Access Time: < 1 nanosecond (essentially 0)
Cost: Extremely expensive per byte
Use: Hold current values, arithmetic operations
CPU Register (1 nanosecond)
│ Instruction: x = y + z
│ Register holds y and z simultaneously
└ Execute in 1 nanosecond
Level 1: L1 Cache
What: Small cache on CPU core
Size: 32-128 KB per core
Access Time: 4-5 nanoseconds
Cost: Very expensive
Hit Rate: 95-98% (most accesses find their data here)
Use: Store frequently accessed data/instructions
Accessing L1 Cache:
CPU needs data
Check L1 (few nanoseconds)
Found! Use it (very fast)
Level 2: L2 Cache
What: Larger cache per core or shared
Size: 256 KB - 1 MB per core
Access Time: 10-20 nanoseconds
Cost: Expensive
Hit Rate: 90-95%
Use: Secondary cache for L1 overflow
CPU needs data
Check L1 → Miss
Check L2 → Found! (20 nanoseconds)
Fetch to L1 for next use
Level 3: L3 Cache
What: Larger cache shared across cores
Size: 2-16 MB
Access Time: 40-75 nanoseconds
Cost: Moderately expensive
Hit Rate: 80-90%
Use: Data shared between cores
Level 4: Main Memory (RAM)
What: Primary working memory
Size: 4 GB - 128 GB (modern systems)
Access Time: 100-300 nanoseconds (roughly 20-60x slower than L1!)
Cost: Affordable
Technology: DRAM (Dynamic RAM)
Refresh: Must be refreshed every few milliseconds
Typical access:
CPU needs data
Check L1, L2, L3 → All miss
Fetch from RAM (200 nanoseconds)
Load into cache for future
This 200ns delay adds up!
Level 5: Secondary Storage (Disk)
What: Persistent storage
Size: 256 GB - 8 TB (modern systems)
Access Time: 1-10 milliseconds (tens of thousands of times slower than RAM!)
Cost: Cheap per byte
Technology: HDD (spinning disk) or SSD (flash memory)
Persistence: Survives power off
Disk access: 1 millisecond
= 1,000,000 nanoseconds
= 200,000x slower than L1 cache (5 ns)!
One disk access ≈ executing ~1,000,000 CPU instructions
Level 6: Offline Storage (Tape/Archival)
What: Archival backup storage
Access Time: Seconds to minutes
Cost: Very cheap per byte
Use: Long-term backup, cold storage
Memory Hierarchy Trade-offs
Hierarchy Level | Speed | Capacity | Cost/Byte
─────────────────────────────────────────────
Registers | ▓▓▓▓▓ | ▓ | ▓▓▓▓▓
L1 Cache | ▓▓▓▓ | ▓▓ | ▓▓▓▓
L2 Cache | ▓▓▓ | ▓▓▓ | ▓▓▓
L3 Cache | ▓▓ | ▓▓▓▓ | ▓▓
RAM | ▓ | ▓▓▓▓▓ | ▓
Disk | ░ | ▓▓▓▓▓▓▓ | ░░░
Key Insight: fast memory is small and expensive; large memory is cheap and slow
Why the Hierarchy Works
Locality of Reference
Programs don’t access memory randomly - they exhibit patterns!
Temporal Locality:
Recently accessed data likely accessed again soon
Example:
for (int i = 0; i < 1000000; i++) {
    sum += array[i]; // Accesses sum repeatedly
}
sum is accessed 1 million times but stays in a register or L1 cache!
Spatial Locality:
Data near recently accessed data likely accessed soon
Example:
for (int i = 0; i < 1000; i++) {
    printf("%d ", array[i]); // Sequential access
}
Fetches array[0], gets array[0-15] in cache
Next access array[1] already in cache!
Consequence
Most accesses hit cache (fast); few accesses miss to RAM (slow).
Observed access time ≈ weighted average
= (hit_rate × cache_time) + (miss_rate × memory_time)
= (0.95 × 5ns) + (0.05 × 200ns)
= 4.75ns + 10ns
= 14.75ns (much closer to cache than memory!)
Cache Parameters
Cache Size
Larger cache:
- ✓ Higher hit rate
- ❌ Slower access time (need to search larger cache)
- ❌ More expensive
- ❌ More power consumption
- ❌ Heat generation
Trade-off: typical sizes are 64 KB L1, 256 KB L2, 8 MB L3
Cache Associativity
Direct Mapped (1-way):
- One place in cache for each memory location
- Fast lookup
- High conflict misses (two locations map to same cache spot)
N-Way Associative:
- N places in cache for each memory location
- Slower lookup (check N spots)
- Fewer conflict misses
Fully Associative:
- Any memory location can go anywhere
- Slowest lookup (check all spots)
- Fewest conflict misses
Replacement Policies
When the cache is full and new data must be loaded:
LRU (Least Recently Used):
- Replace data not used for longest time
- Works well due to temporal locality
- Most common
FIFO (First In First Out):
- Replace oldest data
- Simpler to implement
- Less effective
Random:
- Random replacement
- Very simple
- Acceptable performance
Memory Access Performance
Cache Hit vs Miss
L1 Cache Hit: 4-5 ns
- Data found in L1
- Immediate use
L1 Miss, L2 Hit: 20-30 ns
- Miss L1, hit L2
- Fetch L2 to L1
L2 Miss, L3 Hit: 50-80 ns
- Miss L1 and L2, hit L3
L3 Miss, RAM Hit: 200-300 ns
- Miss all caches, hit RAM
- Very expensive!
RAM Miss (Page Fault): 1-10 ms
- Must fetch from disk
- Catastrophic (1,000,000x slower than L1!)
Performance Impact
Fast program:
- Mostly L1 hits (95%)
- Occasional L2 hits (4%)
- Rare memory accesses (1%)
- Average: 0.95×5 + 0.04×20 + 0.01×200 = 7.55ns
Slow program:
- Many memory accesses (20%)
- Bad cache utilization
- Average: 0.80×5 + 0.20×200 = 44ns
- Nearly 6x slower!
Optimization Strategies
1. Improve Temporal Locality
Reuse data to keep in cache:
Bad (poor temporal locality):
for each operation:
    sweep the entire array applying just that operation
(earlier elements are evicted before the next sweep revisits them)
Good (reuse in cache):
for each element:
    apply all operations in one pass
(keeps data warm in cache)
2. Improve Spatial Locality
Access data sequentially:
Bad (random access):
for (i in random_order) {
process(array[i])
}
Good (sequential):
for (i = 0 to n) {
process(array[i])
}
3. Reduce Working Set
Keep active data < cache size:
Process 1MB of data repeatedly:
If L3 cache = 8MB → Fits! High hit rate
If L3 cache = 512KB → Doesn't fit → More misses
Real-World Example
Processing Large File
Scenario: Read 1GB file, process each line
Bad Approach:
for each byte in file:
read byte from disk
process byte
Up to 1 billion separate reads, each potentially costing milliseconds!
Total: hours
Better Approach:
while not EOF:
read 1MB block into memory (1 disk access!)
process each byte in memory (all in cache!)
Total: 1000 disk reads instead of 1 billion!
Much faster!
Summary
The memory hierarchy, from registers (fast) to disk (slow), exploits locality to achieve an average access time close to that of the cache: L1/L2/L3 caches hide RAM latency, and RAM hides disk latency. Locality of reference is the key insight; programs that repeatedly access recently used, nearby data perform best. Modern systems have six or more levels of memory, and using them effectively is crucial for performance. For many memory-bound workloads, minimizing cache misses and disk accesses matters as much as algorithmic optimization.