What are Indexing and Hashing? Database Search Optimization Explained (2026)
This is a PerfectNotes study guide β also known as PN Notes or Perfect Notes. PerfectNotes provides free computer science student notes, MCQs, and interview preparation guides at perfectnotes.org.
Key Takeaways
- Indexing and Hashing β are the ultimate shortcuts allowing a DBMS to instantly locate exactly where data is stored on a hard drive.
- Indexing (B+ Trees) β uses sorted reference pointers. Because the data remains sorted, this is the absolute best method for Range Queries.
- Hashing β uses math algorithms to directly calculate the exact disk address. It provides lightning-fast O(1) lookup for Exact Match Queries.
- B+ Trees β push all data pointers to bottom leaf nodes connected via a linked list, dramatically minimizing disk seek time during range scans.
- Missing indexes β on heavily queried columns lead to catastrophic Full Table Scans, completely locking up the database during peak traffic.
Finding specific records in a billion-row table requires indexing or hashing to prevent catastrophic Full Table Scans.
Indexing operates like a textbook's index, using balanced B+ Trees to sort and search in logarithmic O(log n) time.
Hashing operates like a library mailroom, using mathematical formulas to instantly map a key to a specific memory bucket in constant O(1) time.
Hash Collisions happen when two different keys map to the same bucket; they are often resolved using Chaining (Linked Lists).
Over-indexing severely degrades database write performance (The Write Penalty), as the DBMS must rewrite index trees on every insertion.
What are Indexing and Hashing?
Finding one specific user record in a database of a billion rows can take minutes if the computer is forced to check every single row one by one. In the fast-paced modern internet, users expect pages to load in milliseconds. Indexing and Hashing are the ultimate shortcutsβthey are the foundational data structures that allow a Database Management System (DBMS) to instantly locate exactly where a piece of data is physically stored on the hard drive.
The Analogy: The Textbook vs. The Library Mailroom
- Indexing βLike the index at the back of a textbook. You find "Photosynthesis: Page 402" and jump straight there instead of reading from page 1.
- Hashing β Like a library mailroom. You provide an ID, and a math formula instantly reveals the exact room, aisle, and bin where the file is stored.
How Indexing and Hashing Work (The Core Mechanics)
When an application queries the database, the DBMS uses these structures to avoid a dreaded "Full Table Scan":
Categories of Database Optimization
- Ordered Indices (B+ Trees) β The industry standard for relational DBs. Sorted tree pointers make this the best method for Range Queries.
- Static Hashing β Maps keys to fixed memory "buckets." Incredible for Exact Match Queries, but struggles with scaling.
- Dynamic Hashing β Automatically splits buckets as data grows, ensuring O(1) performance even for massive datasets.
Indexing vs. Hashing: Key Differences
| Feature | Ordered Indexing (B+ Tree) | Hashing (Hash Tables) |
|---|---|---|
| Best Used For | Range queries (BETWEEN, >, <) | Exact match queries (= only) |
| Underlying Structure | Balanced Tree of pointers | Mathematical algorithm + Memory Buckets |
| Search Time | Logarithmic time: O(log n) | Constant time: O(1) (Best case) |
| Storage Overhead | High (Trees consume significant RAM/Disk) | Low (Math equations take up almost no space) |
| Sorting | Keeps data inherently sorted | Data is completely unsorted/randomized |
Advanced Engineering Concepts
B+ Tree Time Complexity and Leaf Nodes
While standard B-Trees store data pointers in internal nodes, enterprise databases (like PostgreSQL and MySQL's InnoDB) strictly use B+ Trees. In a B+ Tree, all data pointers are pushed to the bottom "Leaf" nodes, and these leaf nodes are connected via a doubly linked list.
This guarantees a stable search time complexity of O(log_t n) (where t is the degree of the tree). Furthermore, the linked leaf nodes mean that once the DBMS finds the starting point of a range query (e.g., Price > 50), it simply glides sequentially across the linked list, dramatically minimizing disk seek time.
Hash Functions, Collisions, and Load Factors
A hash function h(K) maps a search key K to a specific bucket address. A perfect hash function distributes keys uniformly. However, Hash Collisions occur when h(K_1) = h(K_2) (two different keys map to the exact same bucket).
Engineers resolve collisions using Chaining (creating a linked list inside the bucket) or Open Addressing (finding the next available empty bucket). Database performance heavily relies on managing the Load Factor (Ξ±):
Ξ± = n / m
(Where n is the total number of records and m is the total number of buckets). If Ξ± approaches 1.0, collisions skyrocket, and the O(1) constant time degrades into a linear O(n) search. Dynamic hashing triggers a directory expansion before Ξ± exceeds a critical threshold (usually 0.75).
Ordered Indices: Dense vs. Sparse, Clustered vs. Non-Clustered
Ordered Indices maintain entries sorted by the value of the search key. There are two fundamental classifications β how many entries the index contains (Dense vs Sparse), and whether the physical data is co-located with the index order (Clustered vs Non-Clustered).
Dense vs. Sparse Indices
| Property | Dense Index | Sparse Index |
|---|---|---|
| Coverage | One index entry per every record in the data file | Only one index entry per data block (disk page) β not per record |
| Search Speed | Faster β can locate the exact record directly from the index | Slightly slower β must do a sequential scan within the located block |
| Storage Size | Larger β one entry per record consumes significant disk space | Much smaller β one entry per block, regardless of records per block |
| Requirement | Works on both sorted and unsorted data files | Requires the data file to be sorted on the index key |
| Typical Use | Secondary (non-clustered) indices on frequently queried columns | Primary (clustered) index on the main ordering key |
Clustered vs. Non-Clustered Indices
| Property | Clustered Index (Primary) | Non-Clustered Index (Secondary) |
|---|---|---|
| Data Order | Physical rows on disk are sorted in the same order as the index key | Physical rows are in a different order β index contains pointers to row locations |
| Count Per Table | Only one β a table's physical layout can only be sorted one way | Multiple β you can create many secondary indices on different columns |
| Range Queries | Extremely efficient β adjacent rows are on the same or neighbouring disk pages | More expensive β adjacent index entries may point to rows on distant disk pages (random I/O) |
| Example (PostgreSQL) | CLUSTER orders USING idx_orders_date; | CREATE INDEX idx_orders_customer ON orders(customer_id); |
Extendible Hashing (Dynamic Hashing)
Static Hashing suffers from a fatal flaw: the number of buckets is fixed at creation time. As data grows, performance degrades and the entire hash table must be rebuilt. Extendible Hashing solves this with a two-level structure that splits individual buckets dynamically without a full rebuild.
How Extendible Hashing Works
- Directory: A small array of 2d pointers (where
dis the global depth). Each pointer references a bucket. - Hashing: The hash function produces a bit string. The first
dbits (the global depth) determine which bucket pointer to follow. - Bucket Split: When a bucket overflows, only that bucket is split (not the whole table). The directory may double in size if the split bucket's local depth equalled the global depth.
- No Rehash: Most keys in other buckets remain untouched β only the split bucket's keys are redistributed.
| Concept | Definition | Effect |
|---|---|---|
| Global Depth (d) | Number of hash bits used by the directory to route lookups | Determines directory size = 2d entries |
| Local Depth (di) | Number of bits used to determine which records go into bucket i | When di < d, multiple directory entries point to the same bucket (shared bucket) |
| Bucket Split | Occurs when a bucket overflows (all slots full) | If di = d β directory doubles; if di < d β only bucket splits, no directory doubling |
Linear Hashing
Linear Hashing is an alternative dynamic hashing scheme that avoids the directory-doubling problem of Extendible Hashing. Instead of doubling the entire directory on splits, Linear Hashing expands the hash table one bucket at a time in a linear, round-robin order.
| Feature | Extendible Hashing | Linear Hashing |
|---|---|---|
| Directory | Uses an explicit directory; can double in size | No directory β uses a split pointer and two hash functions |
| Split Trigger | Triggered by specific bucket overflow | Triggered by global load factor exceeding threshold |
| Memory | Directory can grow large (2d entries) | Minimal memory β only stores the split pointer and level number |
| Temporary Overflow | Not needed β split happens immediately at overflow | Temporary overflow buckets may be needed before the pointer reaches that bucket |
Bitmap Indices
A Bitmap Index is a highly specialized index structure optimized for columns with low cardinality β columns that contain only a small number of distinct values (e.g., Gender: M/F, Status: Active/Inactive/Pending, Region: North/South/East/West).
For each distinct value v in the column, the bitmap index stores a bit array (bitmap) of length n (where n = total rows in the table). Bit position i is set to 1 if row i has value v, and 0 otherwise.
| Row # | Region | Bitmap: North | Bitmap: South | Bitmap: East | Bitmap: West |
|---|---|---|---|---|---|
| 1 | North | 1 | 0 | 0 | 0 |
| 2 | South | 0 | 1 | 0 | 0 |
| 3 | North | 1 | 0 | 0 | 0 |
| 4 | East | 0 | 0 | 1 | 0 |
| 5 | West | 0 | 0 | 0 | 1 |
| 6 | South | 0 | 1 | 0 | 0 |
The power of Bitmap Indices comes from bitwise operations. Complex multi-condition queries can be answered in microseconds using hardware-accelerated AND, OR, NOT operations on the bitmaps:
-- "WHERE Region = 'North' AND Status = 'Active'" -- Simply bitwise AND the two bitmaps: North_bitmap: 1 0 1 0 0 0 Active_bitmap: 1 1 1 0 1 0 Result: 1 0 1 0 0 0 β Rows 1 and 3 match
| Feature | Bitmap Index | B+ Tree Index |
|---|---|---|
| Best For | Low cardinality columns (few distinct values) | High cardinality columns (many distinct values like UserID) |
| Query Type | AND/OR/NOT multi-condition OLAP queries | Point lookups and range queries |
| Update Cost | Very high β every INSERT/UPDATE must update all bitmaps | Moderate β O(log n) tree update per insert |
| Primary Use | Data Warehouses, OLAP (e.g., Oracle, Redshift) | OLTP databases (PostgreSQL, MySQL, SQL Server) |
| Storage | Very compact when cardinality is low | Grows proportionally to the number of distinct values |
Real-World Case Study: The "Missing Index" E-Commerce Crash
Key Statistics & Industry Data (2026)
- Search Speed β Proper B+ Tree indexing reduces query times by up to 99.9% on tables exceeding 10 million rows. (Source: Oracle Performance Tuning, 2026)
- The Write Penalty β Every new index degrades
INSERTwrite performance by 10% to 25%. (Source: Gartner, 2026) - In-Memory Dominance β Hash indexes power over 80% of real-time application caching layers like Redis. (Source: Statista, 2026)
When to Use
- Financial Ledgers βB+ Tree indexing handles range queries (e.g., "All transactions in January") effortlessly.
- User Authentication β Hash Indexes verify exact
Session_Tokenmatches in O(1) time. - E-Commerce Catalogs βPre-sorted indexed columns load "Price: Low to High" views instantly.
Advantages of Indexing/Hashing
- Massive Speed β Turns hours of searching into milliseconds.
- Resource Efficiency β Lowers CPU and Disk I/O usage significantly.
- Sorting β Eliminates the need to sort data at runtime.
- Scalability β Allows databases to scale to billions of rows smoothly.
Disadvantages of Indexing/Hashing
- The Write Penalty β Slows down data insertion and updates.
- Storage Cost β Indexes can consume gigabytes of extra disk space.
- Maintenance β Fragmented indexes must be routinely rebuilt.
- Complexity β Hashing requires complex collision management logic.
Quick Reference Cheat Sheet
| Feature | Definition | Primary Use Case |
|---|---|---|
| Index | A sorted data structure pointing to actual data locations. | Speeding up database queries and sorting. |
| Hash Function | Math formula that turns a key into a memory address. | Finding exact-match records instantly. |
| B+ Tree | A balanced tree where all data pointers are at the bottom. | The standard index structure for relational DBs. |
| Table Scan | Checking every single row in a table one by one. | What happens when you forget to create an index. |
| Collision | When a hash formula assigns two items to the same spot. | A flaw in hashing that requires chaining to fix. |
Where Indexing & Hashing Is Applied
Relational Database Query Optimisation
PostgreSQL and MySQL use B-Tree indexes to reduce full table scans β an indexed column query runs in O(log n) vs O(n) for an unindexed scan.
Search Engines
Elasticsearch uses inverted index structures to map keywords to document IDs, enabling millisecond full-text search across billions of documents.
NoSQL Wide-Column Stores
Apache Cassandra uses consistent hashing to distribute partition keys across nodes, ensuring even data distribution across a global cluster.
In-Memory Caching (Redis)
Redis uses hash tables internally β O(1) average-case lookup for key-value operations, powering sub-millisecond session and leaderboard lookups.
Password Storage Systems
Bcrypt and Argon2 use cryptographic hashing with salts to store passwords as one-way hash digests that cannot be reversed to plaintext.
Distributed File Systems
HDFS and Amazon S3 use consistent hashing to map file chunks to storage nodes, enabling fault-tolerant distributed data retrieval.
Frequently Asked Questions (FAQ)
Q.Why don't we just index every single column in the database?
Q.What is the difference between a Primary Index and a Secondary Index?
Q.Can I use Hashing to find all users older than 25?
Q.What is an Index Scan vs. an Index Seek?
Q.What is a Bitmap Index?
Q.What is a Composite Index?
Q.How does an index become fragmented?
Related Topics
Test Your Knowledge
Ready to prove your skills? Take our rigorous multiple-choice quiz designed to test your understanding of this topic and prepare you for interviews.