

1. Types of Indexes

Indexes are essential for enhancing query performance by allowing quick lookups of data. The choice of an index type depends on the data distribution, query patterns, and storage considerations. Below, we discuss different categories:

A. Clustered vs. Non-Clustered Indexes

  • Clustered Indexes

    • Definition: A clustered index determines the physical order in which data is stored in the table. There can be only one clustered index per table because the data rows themselves are stored in this order.
    • Use Case: Ideal for range queries or when the data is frequently accessed in sorted order.
    • Example: In an orders table, having a clustered index on an order date can improve performance for queries that retrieve orders within a specific date range.
    -- Example: Creating a clustered index on OrderDate in SQL Server
    CREATE CLUSTERED INDEX IDX_Orders_OrderDate
    ON Orders (OrderDate);
    
  • Non-Clustered Indexes

    • Definition: A non-clustered index is a separate data structure that references the table's rows. The rows themselves stay in their existing order; the index stores the key values together with pointers to the matching rows.
    • Use Case: Useful when you need to support multiple indexing strategies for different queries, since you can have many non-clustered indexes on one table.
    • Example: Creating a non-clustered index on a customer's last name to improve search performance based on customer names.
    -- Example: Creating a non-clustered index on LastName
    CREATE NONCLUSTERED INDEX IDX_Customers_LastName
    ON Customers (LastName);
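
The difference between the two structures can be illustrated without a database. The following is a minimal Python sketch, not how any engine is actually implemented: the toy rows, the `orders_between` and `orders_for` helpers, and the single combined table are all assumptions made for illustration. The sorted list plays the role of a clustered index, and the dictionary plays the role of a non-clustered index that points back into it.

```python
from bisect import bisect_left, bisect_right

# Toy "table" kept sorted by order_date, the way a clustered index keeps
# rows in key order. Each row is (order_date, order_id, last_name).
rows = sorted([
    ("2023-01-05", 101, "Nguyen"),
    ("2023-02-11", 102, "Smith"),
    ("2023-02-20", 103, "Garcia"),
    ("2023-03-02", 104, "Smith"),
])
order_dates = [row[0] for row in rows]  # the clustered key column


def orders_between(start, end):
    """Range query on the clustered key: two binary searches, one contiguous slice."""
    lo = bisect_left(order_dates, start)
    hi = bisect_right(order_dates, end)
    return rows[lo:hi]


# A non-clustered index is a separate structure: key -> positions of the rows.
last_name_index = {}
for position, (_, _, last_name) in enumerate(rows):
    last_name_index.setdefault(last_name, []).append(position)


def orders_for(last_name):
    """Look up the key in the secondary index, then follow the pointers to the rows."""
    return [rows[p] for p in last_name_index.get(last_name, [])]


print(orders_between("2023-02-01", "2023-02-28"))  # orders 102 and 103
print(orders_for("Smith"))                         # orders 102 and 104
```

Because the rows can only be sorted one way, only one such ordering exists per table, while any number of dictionary-style secondary structures can be added.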
    

B. Unique, Composite, and Full-Text Indexes

  • Unique Indexes

    • Definition: Enforces uniqueness on the indexed column(s) to guarantee that no duplicate values exist.
    • Use Case: Ideal for columns that require unique entries such as email addresses or user IDs.
    • Example:
      -- Example: Creating a unique index on Email
      CREATE UNIQUE INDEX IDX_Users_Email
      ON Users (Email);
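
To see the constraint in action, here is a small sketch that uses Python's built-in sqlite3 module rather than SQL Server, so the behavior shown is SQLite's. The `try_insert` helper and the sample address are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (UserId INTEGER PRIMARY KEY, Email TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX IDX_Users_Email ON Users (Email)")


def try_insert(email):
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO Users (Email) VALUES (?)", (email,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("ana@example.com"))  # True: first occurrence of the value
print(try_insert("ana@example.com"))  # False: duplicate rejected by the index
```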
      
  • Composite Indexes

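    • Definition: An index that covers multiple columns. It is useful when queries frequently filter on several columns at once.

    • Considerations: The order of columns in a composite index is critical. The index can be searched efficiently only by its leading column(s), so put the columns that appear in the most query filters first. An index on (LastName, FirstName) helps a query that filters on LastName alone, but not one that filters only on FirstName.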

    • Example: If many queries filter by both LastName and FirstName, a composite index can be applied.

      -- Example: Creating a composite index on LastName and FirstName
      CREATE INDEX IDX_Customers_Name
      ON Customers (LastName, FirstName);
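
The effect of column order can be checked by asking a database for its query plan. The sketch below uses Python's sqlite3 module, so it shows SQLite's planner rather than SQL Server's, and the extra Email column and the `plan` helper are assumptions added for the demonstration. The same leading-column principle applies to B-tree indexes in general.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    "CustomerId INTEGER PRIMARY KEY, LastName TEXT, FirstName TEXT, Email TEXT)"
)
conn.execute("CREATE INDEX IDX_Customers_Name ON Customers (LastName, FirstName)")


def plan(sql):
    """Return the human-readable steps of SQLite's query plan for `sql`."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Filters that include the leading column can search the composite index.
leading = plan("SELECT * FROM Customers WHERE LastName = 'Smith'")
both = plan("SELECT * FROM Customers WHERE LastName = 'Smith' AND FirstName = 'Ann'")
# A filter on FirstName alone skips the leading column, so the table is scanned.
trailing = plan("SELECT * FROM Customers WHERE FirstName = 'Ann'")

for label, steps in (("leading", leading), ("both", both), ("trailing", trailing)):
    print(label, steps)
```

The first two plans mention IDX_Customers_Name; the third does not.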
      
  • Full-Text Indexes

    • Definition: Used for advanced text searching, full-text indexes enable search capabilities on large text columns.
    • Use Case: Particularly useful for applications that require keyword searches within large bodies of text, such as product descriptions or document repositories.
    • Example:
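      -- Example: Creating a full-text index on a Documents table (SQL Server).
      -- Assumes a default full-text catalog exists and that PK_Documents is a
      -- unique, single-column, non-nullable index on the table.
      CREATE FULLTEXT INDEX ON Documents (Content)
      KEY INDEX PK_Documents;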
      

C. When and How to Choose the Right Index Type

  • Consider Query Patterns:

    • If queries involve range scans, a clustered index might provide significant benefits.
    • For equality searches on non-primary key columns, non-clustered indexes are usually preferable.
  • Data Distribution & Cardinality:

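    • If the indexed column has high cardinality (i.e., many unique values), an index can narrow a lookup to very few rows, so it tends to pay off.
    • Low-cardinality columns (for example, a status flag with only a handful of values) rarely benefit from an index on their own, because each lookup still returns a large share of the table.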
  • Maintenance Overhead:

    • Every index incurs storage and update overhead. It's essential to balance the performance gains against additional processing during inserts, updates, or deletes.
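
One rough way to compare candidate columns is to measure selectivity, the ratio of distinct values to total rows. The following is a minimal Python sketch with made-up sample data; the `selectivity` helper is hypothetical, and a real database would derive this from its own statistics.

```python
def selectivity(values):
    """Fraction of distinct values: 1.0 means every value is unique."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Made-up sample columns for 1,000 rows.
emails = [f"user{i}@example.com" for i in range(1000)]  # all unique
is_active = [i % 2 == 0 for i in range(1000)]           # only two values

print(selectivity(emails))     # 1.0   -> strong index candidate
print(selectivity(is_active))  # 0.002 -> weak index candidate on its own
```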

2. Evaluating Query Execution Plans

An execution plan outlines the steps executed by the database engine to run a query. Analyzing these plans is key to determining and optimizing the effectiveness of indexing strategies.

A. Reading and Interpreting Execution Plans

  • Understanding Plan Operators:

    • Operators such as scans, seeks, joins, and sorts indicate how the query retrieves and organizes data.
    • Index Seek vs. Index Scan:
      • Index Seek: Indicates that the query optimizer is using an index effectively to locate rows.
      • Index Scan: The engine reads all or a large part of the index. This may indicate that the query could benefit from a more selective index because it reads more rows than necessary.
  • Cost Estimation:

    • Execution plans often include cost percentages. These are the optimizer's estimates of each operator's share of the total query cost. Operators with higher costs should be the primary focus of tuning efforts.
  • Example:

    • In SQL Server Management Studio (SSMS), you can click on the "Display Estimated Execution Plan" button to visually inspect the plan. Look for any warnings, such as missing indexes or expensive sorts.
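
The same before-and-after comparison can be reproduced in a few lines. This sketch uses Python's sqlite3 module, so the plan text is SQLite's: it labels an index lookup as SEARCH and a full pass as SCAN, which correspond roughly to a seek and a scan. The table layout and the `plan_steps` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, OrderDate TEXT, CustomerId INTEGER)"
)
QUERY = "SELECT * FROM Orders WHERE OrderDate > '2023-01-01'"


def plan_steps(sql):
    """Return the detail column of each row of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


before = plan_steps(QUERY)  # no usable index yet: the whole table is scanned
conn.execute("CREATE INDEX IDX_Orders_OrderDate ON Orders (OrderDate)")
after = plan_steps(QUERY)   # the planner can now search the new index

print("before:", before)
print("after: ", after)
```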

B. Tools and Techniques for Performance Tuning

  • Database-Specific Tools:

    • SQL Server: SQL Server Management Studio’s Execution Plan Viewer.
    • Oracle: Oracle’s SQL Developer and Automatic Workload Repository (AWR) reports.
    • MySQL: EXPLAIN shows the estimated plan; EXPLAIN ANALYZE (MySQL 8.0.18 and later) runs the query and reports actual timings.
  • Techniques:

    • Index Tuning: Analyze missing index recommendations provided by the optimizer.
    • Query Rewriting: Modify queries to be more index-friendly, such as avoiding functions on indexed columns.
    • Use of Hints: Provide index hints where necessary, though these should be used sparingly and only after careful analysis.
  • Example:

    -- Using EXPLAIN in MySQL
    EXPLAIN SELECT * FROM Orders WHERE OrderDate > '2023-01-01';
    
    • In the output, the type column shows how the table is accessed: range or ref means an index is used, while ALL means a full table scan. The key column names the index that was chosen.
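
The query-rewriting advice above (avoid functions on indexed columns) can be checked the same way. The sketch below again uses Python's sqlite3 module, so it reflects SQLite's planner; the table, the index, and the `plan_steps` helper are assumptions. Both queries select orders from 2023, but only the second leaves the indexed column bare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, OrderDate TEXT, CustomerId INTEGER)"
)
conn.execute("CREATE INDEX IDX_Orders_OrderDate ON Orders (OrderDate)")


def plan_steps(sql):
    """Return the detail column of each row of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Wrapping the column in a function hides it from the index.
wrapped = plan_steps(
    "SELECT * FROM Orders WHERE substr(OrderDate, 1, 4) = '2023'"
)
# An equivalent range on the bare column lets the planner use the index.
rewritten = plan_steps(
    "SELECT * FROM Orders "
    "WHERE OrderDate >= '2023-01-01' AND OrderDate < '2024-01-01'"
)

print("wrapped:  ", wrapped)
print("rewritten:", rewritten)
```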

3. Partitioning Strategies

Partitioning involves dividing a table into smaller, more manageable pieces without changing the logical structure. This technique can significantly boost performance and ease database maintenance.

A. Horizontal Partitioning

  • Definition:
    Horizontal partitioning (also known as sharding in certain contexts) distributes rows of a table across multiple physical segments (tables or servers) based on a partitioning key.

  • Benefits:

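    • Improves performance by reducing the amount of data scanned: when a query filters on the partitioning key, the engine can skip partitions that cannot contain matching rows.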
    • Allows for more parallel processing.
  • Design Considerations:

    • Choose a partition key that evenly distributes data.
    • Consider how the data is most often queried (e.g., by date ranges, geographic regions).
  • Example:

    • An orders table can be partitioned by year. This reduces the data scanned for queries restricted to a particular year.
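    -- SQL Server example using a partition function and scheme.
    -- Three boundary values with RANGE RIGHT produce four partitions;
    -- each boundary date belongs to the partition on its right.
    CREATE PARTITION FUNCTION pfOrderDate(datetime)
    AS RANGE RIGHT FOR VALUES ('2020-01-01', '2021-01-01', '2022-01-01');
    
    -- The four filegroups named here must already exist in the database.
    CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate
    TO ([PRIMARY], [SECONDARY], [TERTIARY], [QUATERNARY]);
    
    -- A unique index on a partitioned table must include the partitioning
    -- column, so OrderDate is part of the primary key.
    CREATE TABLE Orders (
        OrderId INT NOT NULL,
        OrderDate datetime NOT NULL,
        CustomerId INT,
        ...
        CONSTRAINT PK_Orders PRIMARY KEY (OrderId, OrderDate)
    ) ON psOrderDate(OrderDate);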
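
To make the boundary rule concrete, the following Python sketch mirrors how a RANGE RIGHT function with the three boundary dates above assigns a row to one of four partitions. The `partition_number` helper is hypothetical and only imitates the mapping; it is not SQL Server code.

```python
from bisect import bisect_right
from datetime import date

# Same boundary values as the partition function in the example.
BOUNDARIES = [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)]


def partition_number(order_date):
    """1-based partition for RANGE RIGHT boundaries.

    With RANGE RIGHT, a boundary value belongs to the partition on its right,
    so a date equal to a boundary counts as being past that boundary.
    """
    return bisect_right(BOUNDARIES, order_date) + 1


print(partition_number(date(2019, 6, 30)))  # 1: before the first boundary
print(partition_number(date(2020, 1, 1)))   # 2: a boundary value goes right
print(partition_number(date(2021, 7, 15)))  # 3
print(partition_number(date(2022, 1, 1)))   # 4
```

A query restricted to 2021 therefore only needs to read partition 3.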
    

B. Vertical Partitioning

  • Definition:
    Vertical partitioning divides a table by columns. Critical, frequently accessed columns are stored separately from infrequently accessed or sensitive columns.

  • Benefits:

    • Improves performance by reducing I/O when only a subset of columns is required.
    • Enhances security by isolating sensitive information.
  • Design Considerations:

    • Determine which columns are most often queried together.
    • Ensure that key columns necessary for joins and integrity are maintained.
  • Example:

    • Consider an employee table that contains both HR-sensitive columns and frequently accessed data for daily operations. The table can be partitioned vertically into an "Employee_Public" table and an "Employee_Sensitive" table.
    -- Employee tables vertical partition example
    -- Employee_Public: Contains non-sensitive, frequently accessed columns.
    CREATE TABLE Employee_Public (
        EmployeeId INT PRIMARY KEY,
        FirstName VARCHAR(50),
        LastName VARCHAR(50),
        Department VARCHAR(50)
    );
    
    -- Employee_Sensitive: Contains sensitive columns.
    CREATE TABLE Employee_Sensitive (
        EmployeeId INT PRIMARY KEY,
        SocialSecurityNumber CHAR(11),
        Salary DECIMAL(10,2),
        FOREIGN KEY (EmployeeId) REFERENCES Employee_Public(EmployeeId)
    );
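
The split can be tried end to end with Python's sqlite3 module. This is a sketch under assumptions: SQLite type names replace the SQL Server ones, the sample employee is invented, and the Employee_Full view is a hypothetical addition showing how an authorized caller could reassemble the full row by joining on the shared key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE Employee_Public (
    EmployeeId INTEGER PRIMARY KEY,
    FirstName TEXT, LastName TEXT, Department TEXT
);
CREATE TABLE Employee_Sensitive (
    EmployeeId INTEGER PRIMARY KEY,
    SocialSecurityNumber TEXT, Salary NUMERIC,
    FOREIGN KEY (EmployeeId) REFERENCES Employee_Public(EmployeeId)
);
-- Hypothetical view that reassembles the full row for authorized callers.
CREATE VIEW Employee_Full AS
SELECT p.EmployeeId, p.FirstName, p.LastName, p.Department,
       s.SocialSecurityNumber, s.Salary
FROM Employee_Public AS p
LEFT JOIN Employee_Sensitive AS s ON s.EmployeeId = p.EmployeeId;

INSERT INTO Employee_Public VALUES (1, 'Ada', 'Lovelace', 'Engineering');
INSERT INTO Employee_Sensitive VALUES (1, '000-00-0000', 95000);
""")

# Day-to-day queries read only the narrow public table.
directory = conn.execute(
    "SELECT FirstName, Department FROM Employee_Public"
).fetchall()
# The sensitive columns are reached only through the join.
full_row = conn.execute(
    "SELECT * FROM Employee_Full WHERE EmployeeId = 1"
).fetchone()

print(directory)
print(full_row)
```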
    

4. Sharding Fundamentals

Sharding is a partitioning method that distributes data across multiple servers. Though it’s often associated with NoSQL environments, sharding can also be applied to traditional relational databases to improve scalability.

A. Definition and Benefits of Sharding

  • Definition:
    Sharding is the process of horizontally partitioning data across multiple databases or servers (called shards), each operating as an independent, self-contained database.

  • Benefits:

    • Scale Out: Distributes load across multiple servers, improving overall performance.
    • High Availability: A failure is isolated to a single shard, so the remaining shards keep serving their data.
    • Reduced Contention: By working on smaller datasets, each shard can process queries more efficiently.

B. Basic Design Principles and Common Approaches

  • Shard Key Selection:

    • The shard key is critical for ensuring even data distribution. Choose a key that balances load evenly across shards.
    • Example: For a user database, a shard key could be based on the geographical region or a hash function on the user ID.
  • Common Sharding Approaches:

    • Range-based Sharding: Data is split based on a continuous range of values.
      • Example: Shard by order date ranges.
    • Hash-based Sharding: A hash function is applied to the shard key to determine the shard.
      • Example: Using a modulo operation on a user ID.
    • Directory-based Sharding: Maintains a lookup table to map specific data ranges or keys to the correct shard.
      • Example: A mapping table that directs queries based on user region.
  • Example:

    • A simple hash-based sharding implementation might be:
    -- T-SQL sketch of modulo-based shard selection
    DECLARE @UserId INT = 12345;
    DECLARE @ShardId INT = @UserId % 4;  -- Assuming we have 4 shards (numbered 0-3)
    SELECT @ShardId AS ShardId;          -- Returns 1, because 12345 % 4 = 1
    
    -- The calculation above determines which shard holds the user's data.
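
The three approaches listed above can be compared side by side as routing functions. The following Python sketch is illustrative only: the shard count, the date boundaries, the region directory, and all function names are assumptions. The hash-based router uses SHA-256 instead of a bare modulo so that non-numeric keys also work.

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4  # assumed shard count, numbered 0-3


def hash_shard(user_id):
    """Hash-based: a stable hash of the key, reduced modulo the shard count."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Range-based: shard 0 holds dates before 2021, shard 1 holds 2021, and so on.
RANGE_BOUNDARIES = ["2021-01-01", "2022-01-01", "2023-01-01"]


def range_shard(order_date):
    """Range-based: ISO-formatted dates compare correctly as plain strings."""
    return bisect_right(RANGE_BOUNDARIES, order_date)


# Directory-based: an explicit lookup table maps each key to its shard.
REGION_DIRECTORY = {"EU": 0, "US": 1, "APAC": 2}


def directory_shard(region):
    """Directory-based: raises KeyError for a region that has no mapping."""
    return REGION_DIRECTORY[region]


print(hash_shard(12345))
print(range_shard("2021-06-15"))  # 1
print(directory_shard("US"))      # 1
```

Hash-based routing spreads keys evenly but scatters ranges, range-based routing keeps neighbors together but can create hot shards, and a directory is the most flexible at the cost of an extra lookup.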
    
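  • Considerations in an RDBMS Environment:

    • Data Consistency: A single database guarantees ACID properties only within itself. Transactions that span shards need an additional mechanism, such as two-phase commit, to stay consistent.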
    • Infrastructure Complexity: Sharding adds complexity in routing queries, maintaining data consistency, and managing transactions across multiple nodes.
    • Cross-Shard Queries: Plan for how to handle queries that might need to aggregate or join data across shards.
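
A common way to handle cross-shard aggregates is scatter-gather: send the same query to every shard and combine the partial results in the application. The sketch below simulates four shards with in-memory SQLite databases; the Orders layout, the modulo routing, and all helper names are assumptions for illustration.

```python
import sqlite3

NUM_SHARDS = 4
# Each in-memory database stands in for an independent shard server.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for shard in shards:
    shard.execute(
        "CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, UserId INTEGER, Amount REAL)"
    )


def shard_for(user_id):
    """Route by shard key (modulo on the user ID)."""
    return shards[user_id % NUM_SHARDS]


def insert_order(order_id, user_id, amount):
    with shard_for(user_id) as conn:  # commit on success
        conn.execute("INSERT INTO Orders VALUES (?, ?, ?)", (order_id, user_id, amount))


def total_for_user(user_id):
    """Single-shard query: the shard key is known, so only one shard is asked."""
    row = shard_for(user_id).execute(
        "SELECT COALESCE(SUM(Amount), 0) FROM Orders WHERE UserId = ?", (user_id,)
    ).fetchone()
    return row[0]


def total_all_users():
    """Cross-shard query: scatter to every shard, then gather the partial sums."""
    return sum(
        conn.execute("SELECT COALESCE(SUM(Amount), 0) FROM Orders").fetchone()[0]
        for conn in shards
    )


sample = [(1, 10.0), (2, 20.0), (5, 5.0), (7, 2.5)]  # (user_id, amount)
for order_id, (user_id, amount) in enumerate(sample, start=1):
    insert_order(order_id, user_id, amount)

print(total_for_user(1))  # 10.0, answered by one shard
print(total_all_users())  # 37.5, combined from all four shards
```

Sums and counts combine easily this way; joins and sorted results across shards need more work, which is why the shard key should keep data that is queried together on the same shard.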
Last modified: Friday, 11 April 2025, 10:33 AM