File Organization

Definition

File Organization: How files are organized and structured on disk for efficient access and storage. Methods differ in access patterns, storage efficiency, and implementation complexity.

File Organization Methods

1. Sequential Organization

Concept

Records stored one after another:

File: employee.txt
Record 1: [Name: Alice, Age: 30, Salary: 50000]
Record 2: [Name: Bob, Age: 35, Salary: 60000]
Record 3: [Name: Carol, Age: 28, Salary: 55000]

Stored sequentially on disk:
Block 100: Alice record
Block 101: Bob record
Block 102: Carol record

Access Pattern

Reading record 2:

Start at beginning
Skip record 1 (read it but don't use)
Read record 2

Insertion:

New record: Eve (between Carol and Dave)
Option 1: Insert in place (must shift all after)
Option 2: Append at end, sort later

Advantages

✓ Simple implementation
✓ Efficient for sequential processing (read all records)
✓ Minimal overhead
✓ Good for tapes (sequential media)

Disadvantages

❌ Slow for direct access (must read from start)
❌ Slow insertion/deletion (must shift records)
❌ Wasteful for sparse data

Use Cases

Batch processing: Read all payroll records monthly
Log files: Append-only, read sequentially
Archive: Write once, read many times

2. Direct (Relative) Organization

Concept

Records stored at calculated positions:

Record key → Calculate position → Disk location

Example:
Record ID 5 → Position = 5
Record ID 10 → Position = 10
Record ID 15 → Position = 15

Retrieve record 10:
Position 10 × record_size = byte offset
Read directly!

Access Pattern

Random access:

Retrieve record 100:
Position = 100
Byte offset = 100 × record_size (say 256)
= 25,600 bytes
Seek to block, offset, read
Direct access!

Advantages

✓ Fast direct access
✓ O(1) access time (calculate and read)
✓ Good for known ID ranges

Disadvantages

❌ Requires key-to-position mapping
❌ Wasteful if keys sparse (gaps in numbering)
❌ Insertion/deletion complex

Use Cases

Disk arrays: RAID, where blocks numbered
Fixed-size records: Same size per record
Direct address space: Know all possible keys

3. Indexed Sequential Organization

Concept

Sequential organization with index:

Index:
Alice → Block 100
Bob → Block 101
Carol → Block 102
Dave → Block 103

Retrieve Bob:
Lookup index → Block 101
Read block directly!

Structure

Master File (Sequential):
Block 100: [Alice record]
Block 101: [Bob record]
Block 102: [Carol record]
Block 103: [Dave record]

Index File:
"Alice" → Block 100
"Bob" → Block 101
"Carol" → Block 102
"Dave" → Block 103

Access Pattern

Find Carol:

1. Search index for "Carol" → Block 102
2. Read block 102 directly
3. Access time: O(log n) for index search

Advantages

✓ Fast direct access (via index)
✓ Sequential access still possible (master file)
✓ Index supports binary search
✓ Can have multiple indexes (different keys)

Disadvantages

❌ Index maintenance on insert/delete
❌ Extra storage for index
❌ Index must fit in memory (or it’s slow)

Use Cases

Databases: SQL indexes
File systems: Inode tables
Library: Catalog system
Large sequential files with random lookups

4. Hashed Organization

Concept

Hash function maps record to disk location:

Hash function: H(key) = disk_block

Example:
H("Alice") = "Alice" % 100 = Block 45
H("Bob") = "Bob" % 100 = Block 50
H("Carol") = "Carol" % 100 = Block 45 (collision!)

Records stored at hashed block (with collision handling)

Hash Function

Common: Modulo division
H(key) = key % table_size

Example:
H(123) = 123 % 100 = 23

Properties:
- Fast O(1) computation
- Distributes evenly (with good function)
- Deterministic (same key, same hash)

Collision Handling

Chaining:

Block 45: [Alice] → [Carol] → [Emma] → null

Multiple records in same block (list)

Open Addressing:

Block 45: Alice (occupied)
Block 46: Carol (next block)
Block 47: Emma (next block)

Probe sequence on collision

Access Pattern

Find Bob:

1. Hash("Bob") = 50
2. Read block 50
3. Search block (if chaining)
O(1) average, O(n) worst case

Advantages

✓ Very fast average access O(1)
✓ Simple implementation
✓ No complex index structure

Disadvantages

❌ Worst-case O(n) if many collisions
❌ Cannot do range queries (hash destroys order)
❌ Load factor affects performance
❌ Difficult to reorganize when file grows

Use Cases

In-memory hash tables (most common use)
Database hash indexes
Symbol tables (compiler)
Caching systems

Comparison

Feature	Sequential	Direct	Indexed	Hashed
Random Access	Slow	Very Fast	Fast	Fast (avg)
Sequential Access	Fast	Medium	Fast	Impossible
Space Efficiency	Good	Poor (gaps)	Medium	Medium
Insert/Delete	Slow	Complex	Medium	Medium
Range Queries	Easy	Easy	Easy	Impossible
Implementation	Simple	Simple	Complex	Medium
Memory Overhead	None	Possible gaps	Index size	Hash table

Real-World Examples

Example 1: Sequential File (Log)

Scenario: Server access log

File: /var/log/access.log
Organization: Sequential

Record:
2024-01-15 10:30:45 user1 GET /index.html 200
2024-01-15 10:30:46 user2 GET /about.html 404
2024-01-15 10:30:47 user1 GET /api/data 200

Processing:
Count total requests: Read entire file sequentially (fast!)
Find request by IP: Search entire file (slow)
Add new request: Append to end (fast)

Best for: Logs, append-only data, sequential analysis

Example 2: Indexed File (Database Table)

Scenario: Employee database

File: employees.db
Organization: Indexed Sequential

Records:
ID 100: Alice, salary 50000
ID 101: Bob, salary 60000
ID 102: Carol, salary 55000

Index:
ID 100 → Block 50
ID 101 → Block 51
ID 102 → Block 52

Query: SELECT * FROM employees WHERE ID = 101
1. Search index → Block 51
2. Read block 51 → Get Bob record
3. Fast! (no need to scan entire file)

Query: SELECT * FROM employees WHERE salary > 50000
1. Scan all records (salary not indexed)
2. Check each record
3. Slow (but necessary for non-indexed fields)

Example 3: Hashed File (Cache)

Scenario: Web cache

File: Cache entries
Organization: Hashed

URL: "example.com/page1"
Hash: "example.com/page1" % buckets = Bucket 45
Store: Bucket 45 → Cache entry

Retrieve:
1. Hash URL → Bucket 45
2. Retrieve entry directly
3. Very fast! O(1)

Problem:
Range queries: "All URLs starting with example.com"
Cannot do with hash! (hash destroys order)

Solution: Use indexed file for range queries

Hybrid Approaches

Indexed Hashed

Index on hash buckets:
Hash groups → Index → Sequential access within group

Benefits:
- O(1) hash lookup
- Sequential access within group
- Reasonable for range queries

Used in: Advanced database systems

Multi-Level Indexing

Primary index: Sequential
Secondary indexes: Hashed on different keys

Example:
Employee database indexed by:
- ID (primary index)
- Name (secondary index)
- Department (secondary index)

Lookup by ID: Fast (primary)
Lookup by name: Fast (secondary)
Lookup by salary range: Scan needed

Choosing Organization

Decision Factors

Mostly sequential access? → Sequential
Quick lookups needed? → Indexed or Hashed
Must support range queries? → Indexed (not Hashed)
Known fixed size? → Direct
Variable size, many insertions? → Indexed Sequential
Performance critical? → Hashed (if no range needed)
Memory abundant? → Indexed (more indexes)

Performance Comparison

File size: 1 million records
Search for one record:

Sequential: 500,000 average reads (5 seconds @ 10μs/read)
Direct: 1 read (10 microseconds) [if dense]
Indexed: 20 reads (log 1M) (200 microseconds)
Hashed: 1 read (10 microseconds)

Hashed and Direct fastest, but can't range query!
Indexed good compromise

Summary

Sequential organization simple, good for batch processing but slow random access. Direct/relative organization fast if keys dense, otherwise wasteful. Indexed sequential combines benefits: fast lookups via index, sequential access possible. Hashed organization fastest for exact match lookups but cannot handle range queries or maintain order. Choice depends on access patterns and query requirements. Real systems often use multiple organization methods (indexes) on same data for different query types.

File Organization

Definition