Memory Hierarchy

Definition

Memory Hierarchy is the structured arrangement of memory from fastest but smallest (CPU registers and cache) to slowest but largest (disk storage). Each level differs in access speed, capacity, and cost per byte. Understanding the hierarchy is essential for system performance, because memory access time dominates many operations.

Memory Hierarchy Levels

Level 0: Registers

What: Ultra-fast on-chip memory inside CPU

Size: 32-512 bytes (a handful of values)

Access Time: < 1 nanosecond (essentially 0)

Cost: Extremely expensive per byte

Use: Hold current values, arithmetic operations

CPU Register (1 nanosecond)
│ Instruction: x = y + z
│ Register holds y and z simultaneously
└ Execute in 1 nanosecond

Level 1: L1 Cache

What: Small cache on CPU core

Size: 32-128 KB per core

Access Time: 4-5 nanoseconds

Cost: Very expensive

Hit Rate: 95-98% (most instructions find data here)

Use: Store frequently accessed data/instructions

Accessing L1 Cache:
CPU needs data
Check L1 (few nanoseconds)
Found! Use it (very fast)

Level 2: L2 Cache

What: Larger cache per core or shared

Size: 256 KB - 1 MB per core

Access Time: 10-20 nanoseconds

Cost: Expensive

Hit Rate: 90-95%

Use: Secondary cache for L1 overflow

CPU needs data
Check L1 → Miss
Check L2 → Found! (20 nanoseconds)
Fetch to L1 for next use

Level 3: L3 Cache

What: Larger cache shared across cores

Size: 2-16 MB

Access Time: 40-75 nanoseconds

Cost: Moderately expensive

Hit Rate: 80-90%

Use: Data shared between cores

Level 4: Main Memory (RAM)

What: Primary working memory

Size: 4 GB - 128 GB (modern systems)

Access Time: 100-300 nanoseconds (20-60x slower than L1!)

Cost: Affordable

Technology: DRAM (Dynamic RAM)

Refresh: Must be refreshed every few milliseconds

Typical access:
CPU needs data
Check L1, L2, L3 → All miss
Fetch from RAM (200 nanoseconds)
Load into cache for future

This 200ns delay adds up!

Level 5: Secondary Storage (Disk)

What: Persistent storage

Size: 256 GB - 8 TB (modern systems)

Access Time: 1-10 milliseconds (thousands to tens of thousands of times slower than RAM!)

Cost: Cheap per byte

Technology: HDD (spinning disk) or SSD (flash memory)

Persistence: Survives power off

SSD access: 1 millisecond
= 1,000,000 nanoseconds
= 200,000x slower than a 5ns L1 hit!

One disk access ≈ time enough to execute millions of CPU instructions

Level 6: Offline Storage (Tape/Archival)

What: Archival backup storage

Access Time: Seconds to minutes

Cost: Very cheap per byte

Use: Long-term backup, cold storage

Memory Hierarchy Trade-offs

Hierarchy Level | Speed  | Capacity | Cost/Byte
─────────────────────────────────────────────
Registers       | ▓▓▓▓▓ | ▓        | ▓▓▓▓▓
L1 Cache        | ▓▓▓▓  | ▓▓       | ▓▓▓▓
L2 Cache        | ▓▓▓   | ▓▓▓      | ▓▓▓
L3 Cache        | ▓▓    | ▓▓▓▓     | ▓▓
RAM             | ▓     | ▓▓▓▓▓    | ▓
Disk            | ░     | ▓▓▓▓▓▓▓  | ░░░

Key Insight: Fast memory is small and expensive; large memory is cheap and slow

Why Hierarchy Works?

Locality of Reference

Programs don’t access memory randomly - they exhibit patterns!

Temporal Locality:

Recently accessed data likely accessed again soon

Example:
for (int i = 0; i < 1000000; i++) {
    sum += array[i];  // Accesses sum repeatedly
}
sum is accessed 1 million times, but it stays in a register or L1 cache the whole time!

Spatial Locality:

Data near recently accessed data likely accessed soon

Example:
for (int i = 0; i < 1000; i++) {
    printf("%d ", array[i]);  // Sequential access
}
Fetching array[0] pulls in a whole 64-byte cache line: array[0] through array[15]
Next access, array[1], is already in cache!

Consequence

Most accesses hit the cache (fast); only a few miss and go to RAM (slow)

Observed access time ≈ weighted average
= (hit_rate × cache_time) + (miss_rate × memory_time)
= (0.95 × 5ns) + (0.05 × 200ns)
= 4.75ns + 10ns
= 14.75ns (much closer to cache than memory!)

Cache Parameters

Cache Size

Larger cache:

  • ✓ Higher hit rate
  • ❌ Slower access time (need to search larger cache)
  • ❌ More expensive
  • ❌ More power consumption
  • ❌ Heat generation

Trade-off: Typical 64KB L1, 256KB L2, 8MB L3

Cache Associativity

Direct Mapped (1-way):

  • One place in cache for each memory location
  • Fast lookup
  • High conflict misses (two locations map to same cache spot)

N-Way Associative:

  • N places in cache for each memory location
  • Slower lookup (check N spots)
  • Fewer conflict misses

Fully Associative:

  • Any memory location can go anywhere
  • Slowest lookup (check all spots)
  • Fewest conflict misses

Replacement Policies

When cache full and need new data:

LRU (Least Recently Used):

  • Replace data not used for longest time
  • Works well due to temporal locality
  • Most common

FIFO (First In First Out):

  • Replace oldest data
  • Simpler to implement
  • Less effective

Random:

  • Random replacement
  • Very simple
  • Acceptable performance

Memory Access Performance

Cache Hit vs Miss

L1 Cache Hit: 4-5 ns

  • Data found in L1
  • Immediate use

L1 Miss, L2 Hit: 20-30 ns

  • Miss L1, hit L2
  • Fetch L2 to L1

L2 Miss, L3 Hit: 50-80 ns

  • Miss L1 and L2, hit L3

L3 Miss, RAM Hit: 200-300 ns

  • Miss all caches, hit RAM
  • Very expensive!

RAM Miss (Page Fault): 1-10 ms

  • Must fetch from disk
  • Catastrophic (1,000,000x slower than L1!)

Performance Impact

Fast program:
- Mostly L1 hits (95%)
- Occasional L2 hits (4%)
- Rare memory accesses (1%)
- Average: 0.95×5 + 0.04×20 + 0.01×200 = 7.55ns

Slow program:
- Many memory accesses (20%)
- Bad cache utilization
- Average: 0.80×5 + 0.20×200 = 44ns
- Nearly 6x slower!

Optimization Strategies

1. Improve Temporal Locality

Reuse data to keep in cache:

Bad (poor temporal locality):
for each of several passes:
    touch every element of a large array
(each pass evicts the previous pass's data before it is reused)

Good (reuse while in cache):
for each element:
    do the work of all passes on that element at once
    (keeps data warm in cache)

2. Improve Spatial Locality

Access data sequentially:

Bad (random access):
for (i in random_order) {
    process(array[i])
}

Good (sequential):
for (i = 0 to n) {
    process(array[i])
}

3. Reduce Working Set

Keep active data < cache size:

Process 1MB of data repeatedly:
If L3 cache = 8MB → Fits! High hit rate
If L3 cache = 512KB → Doesn't fit → More misses

Real-World Example

Processing Large File

Scenario: Read 1GB file, process each line

Bad Approach:

for each byte in file:
    read byte from disk
    process byte

1 billion disk reads! Each is 1-10ms!
Total: days (10^9 reads × 1ms is already a million seconds)

Better Approach:

while not EOF:
    read 1MB block into memory (1 disk access!)
    process each byte in memory (all in cache!)

Total: 1000 disk reads instead of 1 billion!
Much faster!

Summary

The memory hierarchy, from registers (fast) to disk (slow), exploits locality to achieve an average access time close to the cache's. L1/L2/L3 caches hide RAM latency; RAM hides disk latency. Understanding the hierarchy is crucial for performance optimization. Locality of reference is the key: programs that reuse recently accessed data and touch nearby data perform best. Modern systems have six or more levels of memory, and how effectively a program uses them determines its performance. For many workloads, minimizing cache misses and disk accesses matters as much as algorithmic optimization.