Introduction:
Comprehensive testing of a data warehouse at every point throughout
the ETL (extract, transform, and load) process is becoming increasingly
important as more data is being collected and used for strategic
decision-making. Data warehouse or ETL testing is often initiated as a
result of mergers and acquisitions, compliance and regulations, data
consolidation, and the increased reliance on data-driven decision making
(use of Business Intelligence tools, etc.). ETL testing is commonly
implemented either manually or with the help of a tool (functional
testing tool, ETL tool, proprietary utilities). Let us understand some
of the basic ETL concepts.
BI / Data Warehousing testing projects can broadly be divided into ETL (Extract – Transform – Load) testing and the subsequent report testing.
Extract, Transform, Load is the process that enables businesses to consolidate their data while moving it from place to place, i.e. moving data from source systems into the data warehouse. The data can arrive from any source:
Extract - Extracting the data from numerous heterogeneous source systems.
Transform - Applying the business logic, as specified by the business, to the data derived from the sources.
Load - Loading the data into the final warehouse after completing the above two processes. The ETL part of the testing mainly deals with how, when, from where, and what data we carry into our data warehouse, from which the final reports are supposed to be generated. Thus, ETL testing spans each and every stage of data flow in the warehouse, from the source databases to the final target warehouse.
Star Schema
The star schema is perhaps the simplest data warehouse schema.
It is called a star schema because the entity-relationship diagram of
this schema resembles a star, with points radiating from a central
table. The center of the star consists of a large fact table and the
points of the star are the dimension tables.
A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of dimension
tables. Each dimension table is joined to the fact table using a
primary key to foreign key join, but the dimension tables are not joined
to each other. The cost-based optimizer recognizes star queries
and generates efficient execution plans for them. A typical fact table
contains keys and measures. For example, in the sample schema, the fact table sales contains the measures quantity sold, amount, and average, and the keys time key, item key, branch key, and location key. The dimension tables are time, branch, item, and location.
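As a minimal sketch, a star query of this kind might look like the following; the schema is built in an in-memory SQLite database purely for illustration, is simplified to three dimensions, and all table and column names are hypothetical rather than taken from any particular warehouse:

import sqlite3

# Illustrative star query: the fact table joins to each dimension on a
# foreign key, but the dimension tables never join to each other.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales_fact   (time_key INTEGER, item_key INTEGER,
                           location_key INTEGER, quantity_sold INTEGER,
                           amount REAL);
INSERT INTO time_dim     VALUES (1, '2024-01-01');
INSERT INTO item_dim     VALUES (10, 'Widget');
INSERT INTO location_dim VALUES (100, 'Berlin');
INSERT INTO sales_fact   VALUES (1, 10, 100, 5, 49.50);
""")

star_query = """
SELECT t.calendar_date, i.item_name, l.city,
       SUM(f.quantity_sold) AS quantity_sold, SUM(f.amount) AS amount
FROM sales_fact f
JOIN time_dim     t ON f.time_key     = t.time_key
JOIN item_dim     i ON f.item_key     = i.item_key
JOIN location_dim l ON f.location_key = l.location_key
GROUP BY t.calendar_date, i.item_name, l.city;
"""
for row in cur.execute(star_query):
    print(row)   # ('2024-01-01', 'Widget', 'Berlin', 5, 49.5)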
Snowflake Schema
The snowflake schema is a more complex data warehouse model
than a star schema, and is a type of star schema. It is called a
snowflake schema because the diagram of the schema resembles
a snowflake. Snowflake schemas normalize dimensions to eliminate
redundancy. That is, the dimension data has been grouped into multiple
tables instead of one large table.
For example, a location dimension table in a star schema might be normalized into a location table and a city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance.
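As a rough sketch of what that normalization means in practice, the following illustrative Python snippet (all column names invented) splits a denormalized location dimension into separate location and city tables, leaving only a foreign key behind:

# Hypothetical, denormalized location dimension as it might appear in a
# star schema: the city attributes repeat for every location row.
star_location_dim = [
    {"location_key": 1, "street": "1 Main St", "city": "Berlin", "country": "DE"},
    {"location_key": 2, "street": "9 High St", "city": "Berlin", "country": "DE"},
    {"location_key": 3, "street": "5 Lake Rd", "city": "Pune",   "country": "IN"},
]

# Snowflaking: move the repeating city attributes into their own table and
# keep only a foreign key (city_key) on the location table.
city_dim, city_keys = [], {}
for row in star_location_dim:
    key = (row["city"], row["country"])
    if key not in city_keys:
        city_keys[key] = len(city_keys) + 1
        city_dim.append({"city_key": city_keys[key],
                         "city": row["city"], "country": row["country"]})

location_dim = [{"location_key": r["location_key"], "street": r["street"],
                 "city_key": city_keys[(r["city"], r["country"])]}
                for r in star_location_dim]

print(city_dim)      # each distinct (city, country) is stored only once
print(location_dim)  # locations now reference city_dim via city_key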
When to use star schema and snowflake schema?
When we refer to Star and Snowflake Schemas, we are talking about a dimensional model for a Data Warehouse or a Datamart. The Star schema model gets its name from the design's appearance: there is one central fact table surrounded by many dimension tables. The relationship between the fact and dimension tables is created by a PK -> FK relationship, and the keys are generally surrogate keys rather than the natural or business keys of the dimension tables. All data for any given dimension is stored in the one dimension table. Thus, the design of the model could potentially look like a star. On the other hand, the Snowflake schema model breaks the dimension data into multiple tables for the purpose of making the data more easily understood or for reducing the width of the dimension table. An example of this type of schema might be a dimension with Product data at multiple levels. Each level in the Product hierarchy might have multiple attributes that are meaningful only to that level. Thus, one would break the single dimension table into multiple tables in a hierarchical fashion, with the highest level tied to the fact table. Each table in the dimension hierarchy would be tied to the level above by a natural or business key, and the highest level would be tied to the fact table by a surrogate key. As you can imagine, the appearance of this schema design could resemble a snowflake.
Types of Dimension Tables
Type 1: This is a straightforward refresh. The fields are constantly overwritten and history is not kept for the column. For example, should a description change for a product number, the old value will be overwritten by the new value.
Type 2: This is known as a slowly changing dimension, as history can be kept. The column(s) in which history is captured have to be defined. In our example of the product description changing for a product number, if the slowly changing attribute captured is the product description, a new row of data will be created showing the new product description. The old description will still be contained in the old row.
Type 3: This is also a slowly changing dimension. However, instead of a new row, in the example, the old product description will be moved to an "old value" column in the dimension, while the new description will overwrite the existing column. In addition, a date stamp column exists to record when the value was updated. Although there is no full history here, the value immediately prior to the update is captured. No new rows are created; only the current and previous values of the slowly changing attribute are kept.
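A minimal sketch of the three types, using a hypothetical product dimension held as plain Python dictionaries (the column names and date handling are illustrative only, not a prescribed design):

from datetime import date

# Type 1: overwrite the description in place; no history is kept.
def scd_type1(row, new_desc):
    row["description"] = new_desc
    return row

# Type 2: expire the current row and add a new row, keeping full history.
def scd_type2(rows, new_desc):
    current = rows[-1]
    current["end_date"] = date.today()          # close the old version
    rows.append({"product_no": current["product_no"],
                 "description": new_desc,
                 "start_date": date.today(), "end_date": None})
    return rows

# Type 3: keep only the previous value in an extra column, plus a date stamp.
def scd_type3(row, new_desc):
    row["old_description"] = row["description"]
    row["description"] = new_desc
    row["updated_on"] = date.today()
    return row

row = {"product_no": "P-100", "description": "Blue widget"}
print(scd_type1(dict(row), "Navy widget"))
print(scd_type2([{**row, "start_date": date(2024, 1, 1), "end_date": None}],
                "Navy widget"))
print(scd_type3(dict(row), "Navy widget"))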
Types of fact tables:
Transactional: Most facts will fall into this
category. The transactional fact will capture transactional data such as
sales lines or stock movement lines. The measures for these facts can
be summed together.
Snapshot: A snapshot fact will capture the current data at a point in time, typically for a day. For example, all the current stock positions, where items are and in which branch, can be captured at the end of a working day. Snapshot fact measures can be summed within a single snapshot day, but cannot be summed across snapshot days, as the result would be incorrect.
Accumulative: An accumulative snapshot will sum data up for an attribute, and is not based on time. For example, to get the accumulative sales quantity for a sale of a particular product, the value for that row is recalculated each night, giving an "accumulative" value.
Key hit-points in ETL testing are:
There are several levels of testing that can be performed during data warehouse testing, and they should be defined as part of the testing strategy in the different phases (Component, Assembly, Product) of testing. Some examples include:
1. Constraint Testing: During constraint testing,
the objective is to validate unique constraints, primary keys, foreign
keys, indexes, and relationships. The test script should include
these validation points. Some ETL processes can be developed to validate
constraints during the loading of the warehouse. If the decision is
made to add constraint validation to the ETL process, the ETL code must
validate all business rules and relational data requirements. If automation is used, it should be ensured that the setup is done correctly and is maintained as requirements change, for effective testing. An alternative to automation is to use manual queries: queries are written to cover all test scenarios and executed manually.
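As a sketch of the manual-query approach, a tester might run checks such as the following; the tables, keys, and the in-memory SQLite connection are stand-ins for the real warehouse:

import sqlite3

conn = sqlite3.connect(":memory:")       # stand-in for the warehouse connection
cur = conn.cursor()
cur.executescript("""
CREATE TABLE item_dim  (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact(sale_id INTEGER, item_key INTEGER, amount REAL);
INSERT INTO item_dim   VALUES (1, 'Widget');
INSERT INTO sales_fact VALUES (100, 1, 10.0),
                              (101, 99, 5.0);   -- 99 has no parent row: orphan
""")

# Orphaned foreign keys: fact rows whose item_key has no matching dimension row.
orphans = cur.execute("""
    SELECT f.sale_id, f.item_key
    FROM sales_fact f
    LEFT JOIN item_dim d ON f.item_key = d.item_key
    WHERE d.item_key IS NULL
""").fetchall()

# Duplicate keys in the dimension (this check should return nothing).
dupes = cur.execute("""
    SELECT item_key, COUNT(*) FROM item_dim
    GROUP BY item_key HAVING COUNT(*) > 1
""").fetchall()

print("orphaned fact rows:", orphans)       # [(101, 99)] -> reported as a defect
print("duplicate dimension keys:", dupes)   # [] -> passes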
2. Source to Target Counts: The objective of the
count test scripts is to determine if the record counts in the source
match the record counts in the target. Some ETL processes are capable
of capturing record count information such as records read, records
written, records in error, etc. If the ETL process used can capture that
level of detail and create a list of the counts, allow it to do so.
This will save time during the validation process. It is always a good
practice to use queries to double check the source to target counts.
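A rough sketch of such a double check, assuming the counts have already been captured (in practice they would come from SELECT COUNT(*) queries against source and target plus the ETL job's own audit log):

# Minimal reconciliation sketch; the counts are hard-coded for illustration.
def reconcile(source_count, target_count, records_read,
              records_written, records_rejected):
    problems = []
    if records_read != records_written + records_rejected:
        problems.append("ETL audit counts do not add up")
    if source_count != records_read:
        problems.append("source count differs from records read")
    if target_count != records_written:
        problems.append("target count differs from records written")
    return problems

# Example: one record was rejected, so source and target legitimately differ.
print(reconcile(source_count=1000, target_count=999, records_read=1000,
                records_written=999, records_rejected=1))   # -> []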
3. Source to Target Data Validation: No ETL process
is smart enough to perform source to target field-to-field validation.
This piece of the testing cycle is the most labor intensive and requires
the most thorough analysis of the data. There are a variety of tests that can be performed during source to target validation; the points that follow describe those considered best practice.
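One way to automate part of the field-to-field comparison is a set difference between the transformed source rows and the target rows; the sketch below uses invented column names and an invented transformation rule:

# Compare transformed source rows against target rows field by field.
# expected_transform() stands in for the mapping rules from the spec; here it
# is a hypothetical rule that trims/uppercases names and converts amounts to cents.
def expected_transform(src_row):
    return (src_row["customer_id"],
            src_row["name"].strip().upper(),
            round(src_row["amount"] * 100))

source_rows = [{"customer_id": 1, "name": " alice ", "amount": 12.5},
               {"customer_id": 2, "name": "bob",     "amount": 7.0}]
target_rows = [(1, "ALICE", 1250),
               (2, "BOB",   799)]          # deliberately wrong: should be 700

expected = {expected_transform(r) for r in source_rows}
actual = set(target_rows)

print("missing from target:", expected - actual)   # rows the load dropped or garbled
print("unexpected in target:", actual - expected)  # rows not explained by the source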
4. Transformation and Business Rules: Tests to verify all possible outcomes of the transformation rules, default values, and straight moves, as specified in the Business Specification document. As a special mention, boundary conditions must be tested for the business rules.
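Boundary conditions lend themselves to table-driven checks. The sketch below tests a hypothetical discount rule; the threshold, default value, and expected results are invented for illustration:

# Hypothetical business rule: orders of 1000.00 or more get a 10% discount,
# anything below gets none; a missing amount defaults to 0.00.
def discount(amount):
    amount = amount if amount is not None else 0.0   # default-value rule
    return round(amount * 0.10, 2) if amount >= 1000.0 else 0.0

# Boundary and default-value cases taken straight from the (hypothetical) spec.
cases = [
    (None,    0.0),    # default value
    (999.99,  0.0),    # just below the boundary
    (1000.00, 100.0),  # exactly on the boundary
    (1000.01, 100.0),  # just above the boundary
]
for amount, expected in cases:
    assert discount(amount) == expected, (amount, discount(amount), expected)
print("all boundary cases passed")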
5. Batch Sequence & Dependency Testing: ETLs in a DW are essentially a set of processes that execute in a particular sequence. Dependencies do exist among the various processes, and maintaining them is critical to the integrity of the data. Executing the sequences in the wrong order might result in inaccurate data in the warehouse. The testing process must include at least two iterations of the end-to-end execution of the whole batch sequence, and data must be checked for its integrity during this testing. The most common errors caused by an incorrect sequence are referential integrity failures, incorrect end-dating (if applicable), and rejected records.
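A small sketch of a dependency check: given declared job dependencies and the order in which a batch actually ran, report any job that executed before its prerequisites (the job names are hypothetical):

# Declared dependencies: each job lists the jobs that must finish before it.
dependencies = {
    "load_dimensions":  ["extract_sources"],
    "load_facts":       ["load_dimensions"],
    "end_date_history": ["load_facts"],
}

def check_sequence(executed_order, dependencies):
    """Return dependency violations for an observed batch run order."""
    position = {job: i for i, job in enumerate(executed_order)}
    violations = []
    for job, prereqs in dependencies.items():
        if job not in position:
            continue                       # job did not run in this batch
        for prereq in prereqs:
            if prereq not in position:
                violations.append(f"{job} ran but {prereq} never ran")
            elif position[prereq] > position[job]:
                violations.append(f"{job} ran before its prerequisite {prereq}")
    return violations

# Facts were loaded before the dimensions here: a classic source of
# referential-integrity failures in the warehouse.
bad_run = ["extract_sources", "load_facts", "load_dimensions", "end_date_history"]
print(check_sequence(bad_run, dependencies))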
6. Job Restart Testing: In a real production environment, ETL jobs/processes fail for a number of reasons (for example, database-related failures or connectivity failures). Jobs can fail when only partly executed. A good design always allows the jobs to be restarted from the point of failure. Although this is more of a design suggestion/approach, it is recommended that every ETL job is built and tested for restart capability.
7. Error Handling: A script might fail during data validation, yet process validation may still confirm that the ETL process itself is working. During process validation the testing team will work to identify additional data cleansing needs, as well as identify consistent error patterns that could possibly be diverted by modifying the ETL code. It is the responsibility of the validation team to identify any and all records that seem suspect. Once a record has been both data and process validated and the script has passed, the ETL process is functioning correctly. Conversely, if suspect records identified and documented during data validation are not supported through process validation, the ETL process is not functioning correctly.
8. Views: Views created on the tables should be
tested to ensure the attributes mentioned in the views are correct and
the data loaded in the target table matches what is being reflected in
the views.
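A rough sketch of such a view check, using an in-memory SQLite database and an invented view definition; the data returned through the view is reconciled against the base table it is built on:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE sales_fact (sale_id INTEGER, region TEXT, amount REAL);
INSERT INTO sales_fact VALUES (1, 'EU', 10.0), (2, 'EU', 5.0), (3, 'US', 7.5);
-- Hypothetical reporting view limited to one region.
CREATE VIEW eu_sales_v AS SELECT sale_id, amount FROM sales_fact WHERE region = 'EU';
""")

via_view = cur.execute(
    "SELECT sale_id, amount FROM eu_sales_v ORDER BY sale_id").fetchall()
via_table = cur.execute(
    "SELECT sale_id, amount FROM sales_fact WHERE region = 'EU' ORDER BY sale_id"
).fetchall()

# The view must expose exactly the attributes and rows implied by its definition.
assert via_view == via_table, (via_view, via_table)
print("view matches base table:", via_view)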
9. Sampling: Sampling involves creating predictions from a representative portion of the data that is to be loaded into the target table; these predictions are then compared with the actual results obtained from the loaded data during Business Analyst testing, and the comparison is verified to ensure that the predictions match the data loaded into the target table.
10. Process Testing: The testing of intermediate
files and processes to ensure the final outcome is valid and that
performance meets the system/business need.
11. Duplicate Testing: Duplicate testing must be performed at each stage of the ETL process and on the final target table. This testing involves checks for duplicate rows and also checks for multiple rows with the same primary key, neither of which can be allowed.
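A rough sketch of both checks against rows already loaded, assuming (purely for illustration) that the first column is the primary key:

from collections import Counter

# Rows as loaded into the target table; the first column is the primary key.
rows = [
    (1, "Widget",   10.0),
    (2, "Gadget",    5.0),
    (2, "Gadget",    5.0),   # fully duplicated row
    (3, "Widget",   10.0),
    (3, "Sprocket",  9.0),   # same primary key, different attributes
]

full_duplicates = [r for r, n in Counter(rows).items() if n > 1]
pk_duplicates = [pk for pk, n in Counter(r[0] for r in rows).items() if n > 1]

print("duplicate rows:", full_duplicates)        # [(2, 'Gadget', 5.0)]
print("duplicate primary keys:", pk_duplicates)  # [2, 3]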
12. Performance: It is the most important aspect
after data validation. Performance testing should check if the ETL
process is completing within the load window.
13. Volume: Verify that the system can process the maximum expected quantity of data for a given cycle in the time expected.
14. Connectivity Tests: As the name suggests, this involves testing the upstream and downstream interfaces and intra-DW connectivity. It is suggested that the testing represents the exact transactions between these interfaces. For example, if the design approach is to extract files from the source system, we should actually test extracting a file out of the system and not just the connectivity.
15. Negative Testing: Negative testing checks whether the application fails where it should fail, using invalid inputs and out-of-boundary scenarios, and verifies the behavior of the application in those cases.
16. Operational Readiness Testing (ORT): This is the
final phase of testing which focuses on verifying the deployment of
software and the operational readiness of the application. The
main areas of testing in this phase include:
Deployment Test
1. Tests the deployment of the solution
2. Tests overall technical deployment “checklist” and timeframes
3. Tests the security aspects of the system including user authentication and
authorization, and user-access levels.
Conclusion
Evolving needs of the business and changes in the source systems will
drive continuous change in the data warehouse schema and the data being
loaded. Hence, it is necessary that development and testing processes are clearly defined and supported by impact analysis and strong alignment between development, operations, and the business.