Definition
File Organization: How files are organized and structured on disk for efficient access and storage. Methods differ in access patterns, storage efficiency, and implementation complexity.
File Organization Methods
1. Sequential Organization
Concept
Records stored one after another:
File: employee.txt
Record 1: [Name: Alice, Age: 30, Salary: 50000]
Record 2: [Name: Bob, Age: 35, Salary: 60000]
Record 3: [Name: Carol, Age: 28, Salary: 55000]
Stored sequentially on disk:
Block 100: Alice record
Block 101: Bob record
Block 102: Carol record
Access Pattern
Reading record 2:
Start at beginning
Skip record 1 (read it but don't use)
Read record 2
Insertion:
New record: Eve (between Carol and Dave)
Option 1: Insert in place (must shift all after)
Option 2: Append at end, sort later
Advantages
- ✓ Simple implementation
- ✓ Efficient for sequential processing (read all records)
- ✓ Minimal overhead
- ✓ Good for tapes (sequential media)
Disadvantages
- ❌ Slow for direct access (must read from start)
- ❌ Slow insertion/deletion (must shift records)
- ❌ Wasteful for sparse data
Use Cases
Batch processing: Read all payroll records monthly
Log files: Append-only, read sequentially
Archive: Write once, read many times
2. Direct (Relative) Organization
Concept
Records stored at calculated positions:
Record key → Calculate position → Disk location
Example:
Record ID 5 → Position = 5
Record ID 10 → Position = 10
Record ID 15 → Position = 15
Retrieve record 10:
Position 10 × record_size = byte offset
Read directly!
Access Pattern
Random access:
Retrieve record 100:
Position = 100
Byte offset = 100 × record_size (say 256)
= 25,600 bytes
Seek to block, offset, read
Direct access!
Advantages
- ✓ Fast direct access
- ✓ O(1) access time (calculate and read)
- ✓ Good for known ID ranges
Disadvantages
- ❌ Requires key-to-position mapping
- ❌ Wasteful if keys sparse (gaps in numbering)
- ❌ Insertion/deletion complex
Use Cases
Disk arrays: RAID, where blocks numbered
Fixed-size records: Same size per record
Direct address space: Know all possible keys
3. Indexed Sequential Organization
Concept
Sequential organization with index:
Index:
Alice → Block 100
Bob → Block 101
Carol → Block 102
Dave → Block 103
Retrieve Bob:
Lookup index → Block 101
Read block directly!
Structure
Master File (Sequential):
Block 100: [Alice record]
Block 101: [Bob record]
Block 102: [Carol record]
Block 103: [Dave record]
Index File:
"Alice" → Block 100
"Bob" → Block 101
"Carol" → Block 102
"Dave" → Block 103
Access Pattern
Find Carol:
1. Search index for "Carol" → Block 102
2. Read block 102 directly
3. Access time: O(log n) for index search
Advantages
- ✓ Fast direct access (via index)
- ✓ Sequential access still possible (master file)
- ✓ Index supports binary search
- ✓ Can have multiple indexes (different keys)
Disadvantages
- ❌ Index maintenance on insert/delete
- ❌ Extra storage for index
- ❌ Index must fit in memory (or it’s slow)
Use Cases
Databases: SQL indexes
File systems: Inode tables
Library: Catalog system
Large sequential files with random lookups
4. Hashed Organization
Concept
Hash function maps record to disk location:
Hash function: H(key) = disk_block
Example:
H("Alice") = "Alice" % 100 = Block 45
H("Bob") = "Bob" % 100 = Block 50
H("Carol") = "Carol" % 100 = Block 45 (collision!)
Records stored at hashed block (with collision handling)
Hash Function
Common: Modulo division
H(key) = key % table_size
Example:
H(123) = 123 % 100 = 23
Properties:
- Fast O(1) computation
- Distributes evenly (with good function)
- Deterministic (same key, same hash)
Collision Handling
Chaining:
Block 45: [Alice] → [Carol] → [Emma] → null
Multiple records in same block (list)
Open Addressing:
Block 45: Alice (occupied)
Block 46: Carol (next block)
Block 47: Emma (next block)
Probe sequence on collision
Access Pattern
Find Bob:
1. Hash("Bob") = 50
2. Read block 50
3. Search block (if chaining)
O(1) average, O(n) worst case
Advantages
- ✓ Very fast average access O(1)
- ✓ Simple implementation
- ✓ No complex index structure
Disadvantages
- ❌ Worst-case O(n) if many collisions
- ❌ Cannot do range queries (hash destroys order)
- ❌ Load factor affects performance
- ❌ Difficult to reorganize when file grows
Use Cases
In-memory hash tables (most common use)
Database hash indexes
Symbol tables (compiler)
Caching systems
Comparison
| Feature | Sequential | Direct | Indexed | Hashed |
|---|---|---|---|---|
| Random Access | Slow | Very Fast | Fast | Fast (avg) |
| Sequential Access | Fast | Medium | Fast | Impossible |
| Space Efficiency | Good | Poor (gaps) | Medium | Medium |
| Insert/Delete | Slow | Complex | Medium | Medium |
| Range Queries | Easy | Easy | Easy | Impossible |
| Implementation | Simple | Simple | Complex | Medium |
| Memory Overhead | None | Possible gaps | Index size | Hash table |
Real-World Examples
Example 1: Sequential File (Log)
Scenario: Server access log
File: /var/log/access.log
Organization: Sequential
Record:
2024-01-15 10:30:45 user1 GET /index.html 200
2024-01-15 10:30:46 user2 GET /about.html 404
2024-01-15 10:30:47 user1 GET /api/data 200
Processing:
Count total requests: Read entire file sequentially (fast!)
Find request by IP: Search entire file (slow)
Add new request: Append to end (fast)
Best for: Logs, append-only data, sequential analysis
Example 2: Indexed File (Database Table)
Scenario: Employee database
File: employees.db
Organization: Indexed Sequential
Records:
ID 100: Alice, salary 50000
ID 101: Bob, salary 60000
ID 102: Carol, salary 55000
Index:
ID 100 → Block 50
ID 101 → Block 51
ID 102 → Block 52
Query: SELECT * FROM employees WHERE ID = 101
1. Search index → Block 51
2. Read block 51 → Get Bob record
3. Fast! (no need to scan entire file)
Query: SELECT * FROM employees WHERE salary > 50000
1. Scan all records (salary not indexed)
2. Check each record
3. Slow (but necessary for non-indexed fields)
Example 3: Hashed File (Cache)
Scenario: Web cache
File: Cache entries
Organization: Hashed
URL: "example.com/page1"
Hash: "example.com/page1" % buckets = Bucket 45
Store: Bucket 45 → Cache entry
Retrieve:
1. Hash URL → Bucket 45
2. Retrieve entry directly
3. Very fast! O(1)
Problem:
Range queries: "All URLs starting with example.com"
Cannot do with hash! (hash destroys order)
Solution: Use indexed file for range queries
Hybrid Approaches
Indexed Hashed
Index on hash buckets:
Hash groups → Index → Sequential access within group
Benefits:
- O(1) hash lookup
- Sequential access within group
- Reasonable for range queries
Used in: Advanced database systems
Multi-Level Indexing
Primary index: Sequential
Secondary indexes: Hashed on different keys
Example:
Employee database indexed by:
- ID (primary index)
- Name (secondary index)
- Department (secondary index)
Lookup by ID: Fast (primary)
Lookup by name: Fast (secondary)
Lookup by salary range: Scan needed
Choosing Organization
Decision Factors
Mostly sequential access? → Sequential
Quick lookups needed? → Indexed or Hashed
Must support range queries? → Indexed (not Hashed)
Known fixed size? → Direct
Variable size, many insertions? → Indexed Sequential
Performance critical? → Hashed (if no range needed)
Memory abundant? → Indexed (more indexes)
Performance Comparison
File size: 1 million records
Search for one record:
Sequential: 500,000 average reads (5 seconds @ 10μs/read)
Direct: 1 read (10 microseconds) [if dense]
Indexed: 20 reads (log 1M) (200 microseconds)
Hashed: 1 read (10 microseconds)
Hashed and Direct fastest, but can't range query!
Indexed good compromise
Summary
Sequential organization simple, good for batch processing but slow random access. Direct/relative organization fast if keys dense, otherwise wasteful. Indexed sequential combines benefits: fast lookups via index, sequential access possible. Hashed organization fastest for exact match lookups but cannot handle range queries or maintain order. Choice depends on access patterns and query requirements. Real systems often use multiple organization methods (indexes) on same data for different query types.