Introduction:
Generally ETL performance testing is confirmation test to ensure that an ETL ‘system’ can handle the load of multiple users and transaction. For any project this is primarily ensuring that the ‘system’ can easily manage the throughput of millions of transactions.
You can improve your application performance by optimizing the queries you use. The following sections outline techniques you can use to optimize query performance.
Improve Indexes:
- Creating useful indexes is one of the most important ways to achieve better query performance. Useful indexes help you find data with fewer disk I/O operations and less system resource usage.
- To create useful indexes, you must understand how the data is used, the types of queries and the frequencies they run, and how the query processor can use indexes to find your data quickly.
- When you choose what indexes to create, examine your critical queries, the performance of which will affect the user’s experience most. Create indexes to specifically aid these queries. After adding an index, rerun the query to see if performance is improved. If it is not, remove the index.
- As with most performance optimization techniques, there are tradeoffs. For example, with more indexes, SELECT queries will potentially run faster. However, DML (INSERT, UPDATE, and DELETE) operations will slow down significantly because more indexes must be maintained with each operation. Therefore, if your queries are mostly SELECT statements, more indexes can be helpful. If your application performs many DML operations, you should be conservative with the number of indexes you create.
- We recommend that you always create indexes on primary keys. It is frequently useful to also create indexes on foreign keys. This is because primary keys and foreign keys are frequently used to join tables. Indexes on these keys let the optimizer consider more efficient index join algorithms. If your query joins tables by using other columns, it is frequently helpful to create indexes on those columns for the same reason.
- When primary key and foreign key constraints are created, SQL Server Compact 3.5 automatically creates indexes for them and takes advantage of them when optimizing queries. Remember to keep primary keys and foreign keys small. Joins run faster this way.
- Indexes can be used to speed up the evaluation of certain types of filter clauses. Although all filter clauses reduce the final result set of a query, some can also help reduce the amount of data that must be scanned.
- A search argument (SARG) limits a search because it specifies an exact match, a range of values, or a conjunction of two or more items joined by AND. It has one of the following forms:
- Response time is the time it takes for a query to return the first record. Total time is the time it takes for the query to return all records. For an interactive application, response time is important because it is the perceived time for the user to receive visual affirmation that a query is being processed. For a batch application, total time reflects the overall throughput. You have to determine what the performance criteria are for your application and queries, and then design accordingly.
- Suppose the query returns 100 records and is used to populate a list with the first five records. In this case, you are not concerned with how long it takes to return all 100 records. Instead, you want the query to return the first few records quickly, so that you can populate the list.
- Many query operations can be performed without having to store intermediate results. These operations are said to be pipelined. Examples of pipelined operations are projections, selections, and joins. Queries implemented with these operations can return results immediately. Other operations, such as SORT and GROUP-BY, require using all their input before returning results to their parent operations. These operations are said to require materialization. Queries implemented with these operations typically have an initial delay because of materialization. After this initial delay, they typically return records very quickly.
- Queries with response time requirements should avoid materialization. For example, using an index to implement ORDER-BY, yields better response time than using sorting. The following section describes this in more detail.
- The ORDER-BY, GROUP-BY, and DISTINCT operations are all types of sorting. The SQL Server Compact 3.5 query processor implements sorting in two ways. If records are already sorted by an index, the processor needs to use only the index. Otherwise, the processor has to use a temporary work table to sort the records first. Such preliminary sorting can cause significant initial delays on devices with lower power CPUs and limited memory, and should be avoided if response time is important.
- In the context of multiple-column indexes, for ORDER-BY or
GROUP-BY to consider a particular index, the ORDER-BY or GROUP-BY
columns must match the prefix set of index columns with the exact order.
For example, the index CREATE INDEX Emp_Name ON Employees (“Last
Name” ASC, “First Name” ASC) can help optimize the following
queries:
- .. ORDER BY / GROUP BY “Last Name” …
- … ORDER BY / GROUP BY “Last Name”, “First Name” …
- … ORDER BY / GROUP BY “First Name” …
- … ORDER BY / GROUP BY “First Name”, “Last Name” …
- … DISTINCT “Last Name”, “First Name” …
- … DISTINCT “First Name”, “Last Name” …
- … DISTINCT “First Name” …
- … DISTINCT “Last Name” …
Sometimes you can rewrite a subquery to use JOIN and achieve better performance. The advantage of creating a JOIN is that you can evaluate tables in a different order from that defined by the query. The advantage of using a subquery is that it is frequently not necessary to scan all rows from the subquery to evaluate the subquery expression. For example, an EXISTS subquery can return TRUE upon seeing the first qualifying row.
Example:
To determine all the orders that have at least one item with a 25 percent discount or more, you can use the following EXISTS subquery:
SELECT “Order ID” FROM Orders O
WHERE EXISTS (SELECT “Order ID”
FROM “Order Details” OD
WHERE O.”Order ID” = OD.”Order ID”
AND Discount >= 0.50)
You can rewrite this by using JOIN:
SELECT DISTINCT O.”Order ID” FROM Orders O INNER JOIN “Order Details”
OD ON O.”Order ID” = OD.”Order ID” WHERE Discount >= 0.50
Limit Using Outer JOINs
OUTER JOINs are treated differently from INNER JOINs in the optimizer. It does not try to rearrange the join order of OUTER JOIN tables as it does to INNER JOIN tables. The outer table (the left table in LEFT OUTER JOIN and the right table in RIGHT OUTER JOIN) is accessed first, followed by the inner table. This fixed join order could lead to execution plans that are less than optimal.
Use Parameterized Queries:
- If your application runs a series of queries that are only different in some constants, you can improve performance by using a parameterized query. For example, to return orders by different customers, you can run the following query:
- SELECT “Customer ID” FROM Orders WHERE “Order ID” = ?
- Parameterized queries yield better performance by compiling the query only once and executing the compiled plan multiple times. Programmatically, you must hold on to the command object that contains the cached query plan. Destroying the previous command object and creating a new one destroys the cached plan. This requires the query to be re-compiled. If you must run several parameterized queries in interleaved manner, you can create several command objects, each caching the execution plan for a parameterized query. This way, you effectively avoid re-compilations for all of them.
1. Avoid Cartesian products
2. Avoid full table scans on large tables
3. Use SQL standards and conventions to reduce parsing
4. Lack of indexes on columns contained in the WHERE clause
5. Avoid joining too many tables
6. Monitor V$SESSION_LONGOPS to detect long running operations
7. Use hints as appropriate
8. Use the SHARED_CURSOR parameter
9. Use the Rule-based optimizer if I is better than the Cost-based optimizer
10. Avoid unnecessary sorting
11. Monitor index browning (due to deletions; rebuild as necessary)
12. Use compound indexes with care (Do not repeat columns)
13. Monitor query statistics
14. Use different tablespaces for tables and indexes (as a general rule; this is old-school somewhat, but the main point is reduce I/O contention)
15. Use table partitioning (and local indexes) when appropriate (partitioning is an extra cost feature)
16. Use literals in the WHERE clause (use bind variables)
17. Keep statistics up to date
Conclusion
ETL projects today are designed for correct functionality and adequate performance, i.e., to complete within a time window. However, the task of optimizing ETL designs is left to the experience and intuition of the ETL designers. In addition, ETL designs face additional objectives beyond performance.
0 comments:
Post a Comment