Building Data Quality Tests with CoginitiScript
Learn how to build a comprehensive data quality testing suite using CoginitiScript's SQL-based testing framework. This tutorial guides you through creating tests that validate data integrity, completeness, and consistency in a realistic e-commerce database.
What You'll Learn:
- Write SQL-based data quality tests using CoginitiScript's assertion model
- Validate schema, uniqueness, completeness, and data integrity
- Organize and run test suites effectively
- Debug test failures and interpret results
Prerequisites:
- Basic SQL knowledge (SELECT, WHERE, GROUP BY, aggregations)
- Familiarity with CoginitiScript basics
- Access to a data warehouse with sample data
- Understanding of common data quality issues
Understanding CoginitiScript Tests
CoginitiScript uses an assertion-based testing model where:
- Test passes: Query returns zero rows
- Test fails: Query returns one or more rows (the failing records)
This approach provides immediate diagnostic information by showing exactly which records violate your quality rules.
Basic test structure:
#+test sql TestName()
#+meta {
:doc "Description of what this test validates"
}
#+begin
SELECT columns_that_help_debug
FROM table_name
WHERE condition_that_should_never_be_true;
#+end
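For example, a minimal sketch that flags negative order totals (assuming the orders table from the project setup below, with order_id and order_total columns):
#+test sql NoNegativeOrderTotals()
#+meta {
:doc "Order totals should never be negative"
}
#+begin
SELECT
order_id,
order_total
FROM orders
WHERE order_total < 0;
#+end
If every order total is zero or positive, the query returns no rows and the test passes; any negative totals come back as failing records.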
Project Setup
We'll build tests for an e-commerce database with these tables:
- customers - Customer information
- orders - Order records
- order_items - Individual line items
- products - Product catalog
Create this directory structure:
data_quality_tests/
├── schema_tests # Schema validation
├── uniqueness_tests # Uniqueness and duplicate checks
├── completeness_tests # Missing values and nulls
├── integrity_tests # Cross-table relationships
├── volume_tests # Row counts and volume checks
└── run_all_tests # Test orchestration
Step 1: Schema Validation Tests
Start by validating that your tables have the expected structure. Schema tests catch breaking changes from upstream systems.
File: schema_tests
#+test sql CustomersTableStructure()
#+meta {
:doc "Verify customers table has all required columns"
}
#+begin
WITH expected_columns AS (
SELECT unnest(ARRAY[
'customer_id',
'email',
'first_name',
'last_name',
'created_at',
'updated_at'
]) as col_name
),
actual_columns AS (
SELECT column_name as col_name
FROM information_schema.columns
WHERE table_name = 'customers'
AND table_schema = 'public'
),
missing AS (
SELECT col_name FROM expected_columns
EXCEPT
SELECT col_name FROM actual_columns
)
SELECT
'Missing column: ' || col_name as issue
FROM missing;
#+end
#+test sql EmailColumnType()
#+meta {
:doc "Ensure email column is correct data type"
}
#+begin
SELECT
'Email column has wrong type' as issue,
data_type as actual_type,
'character varying' as expected_type
FROM information_schema.columns
WHERE table_name = 'customers'
AND column_name = 'email'
AND data_type != 'character varying';
#+end
#+test sql OrdersColumnCount()
#+meta {
:doc "Verify orders table has expected number of columns"
}
#+begin
SELECT
'Column count mismatch' as issue,
COUNT(*) as actual_count,
8 as expected_count
FROM information_schema.columns
WHERE table_name = 'orders'
AND table_schema = 'public'
HAVING COUNT(*) != 8;
#+end
Run these tests:
#+import "std/test"
#+import "data_quality_tests/schema_tests"
{{ test.Run(packages=[schema_tests], onFailure=test.Stop) }}
What you should see:
- If all tests pass: No output (zero rows returned)
- If tests fail: Rows showing which columns are missing or incorrect
Step 2: Uniqueness Tests
Uniqueness tests prevent duplicate records and ensure primary keys remain unique.
File: uniqueness_tests
#+test sql NoDuplicateEmails()
#+meta {
:doc "Customers should have unique email addresses"
}
#+begin
SELECT
email,
COUNT(*) as duplicate_count,
STRING_AGG(customer_id::TEXT, ', ') as conflicting_ids
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
#+end
#+test sql UniqueCustomerIds()
#+meta {
:doc "Customer IDs must be unique"
}
#+begin
SELECT
customer_id,
COUNT(*) as occurrence_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
#+end
#+test sql UniqueOrderNumbers()
#+meta {
:doc "Order numbers should be globally unique"
}
#+begin
SELECT
order_number,
COUNT(*) as duplicate_count,
ARRAY_AGG(order_id) as order_ids
FROM orders
GROUP BY order_number
HAVING COUNT(*) > 1;
#+end
#+test sql CompositeKeyUnique()
#+meta {
:doc "Each product can only appear once per order"
}
#+begin
SELECT
order_id,
product_id,
COUNT(*) as duplicate_count
FROM order_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
#+end
#+test sql EmailUniquenessRatio()
#+meta {
:doc "At least 98% of emails should be unique"
}
#+begin
SELECT
'Email uniqueness below threshold' as issue,
COUNT(DISTINCT email) as unique_emails,
COUNT(*) as total_customers,
ROUND(100.0 * COUNT(DISTINCT email) / COUNT(*), 2) as uniqueness_pct
FROM customers
HAVING CAST(COUNT(DISTINCT email) AS FLOAT) / COUNT(*) < 0.98;
#+end
Run uniqueness tests:
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
{{ test.Run(packages=[uniqueness_tests], onFailure=test.Stop) }}
Interpreting results:
- Failed tests show the duplicate values and which records are affected
- Use the conflicting_ids column to investigate why duplicates exist (a follow-up query sketch is shown below)
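For instance, if NoDuplicateEmails fails, a follow-up query like this sketch (against the same customers table) pulls the full records behind each duplicate email so you can decide which record to keep:
-- Inspect the full customer records behind duplicate emails
SELECT c.*
FROM customers c
WHERE c.email IN (
SELECT email
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
)
ORDER BY c.email, c.created_at;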
Step 3: Completeness Tests
Completeness tests ensure required data is present and null values only appear where expected.
File: completeness_tests
#+test sql RequiredCustomerFields()
#+meta {
:doc "Critical customer fields must not be null"
}
#+begin
SELECT
customer_id,
CASE
WHEN email IS NULL THEN 'email'
WHEN first_name IS NULL THEN 'first_name'
WHEN created_at IS NULL THEN 'created_at'
END as missing_field
FROM customers
WHERE email IS NULL
OR first_name IS NULL
OR created_at IS NULL;
#+end
#+test sql RequiredOrderFields()
#+meta {
:doc "Every order must have required fields populated"
}
#+begin
SELECT
order_id,
customer_id,
order_date,
status
FROM orders
WHERE customer_id IS NULL
OR order_date IS NULL
OR status IS NULL;
#+end
#+test sql PhoneNumberCompleteness()
#+meta {
:doc "At least 70% of customers should have phone numbers"
}
#+begin
SELECT
'Phone number completeness below threshold' as issue,
COUNT(phone) as customers_with_phone,
COUNT(*) as total_customers,
ROUND(100.0 * COUNT(phone) / COUNT(*), 2) as completeness_pct
FROM customers
HAVING CAST(COUNT(phone) AS FLOAT) / COUNT(*) < 0.70;
#+end
#+test sql ConditionalCompleteness()
#+meta {
:doc "Shipped orders must have tracking numbers"
}
#+begin
SELECT
order_id,
status,
tracking_number
FROM orders
WHERE status IN ('shipped', 'delivered')
AND tracking_number IS NULL;
#+end
#+test sql AtLeastOneContactMethod()
#+meta {
:doc "Customers must have email or phone"
}
#+begin
SELECT
customer_id,
email,
phone
FROM customers
WHERE email IS NULL
AND phone IS NULL;
#+end
Run completeness tests:
#+import "std/test"
#+import "data_quality_tests/completeness_tests"
{{ test.Run(packages=[completeness_tests], onFailure=test.Stop) }}
Debugging tip: When tests fail, check if the issue is:
- Missing data at the source
- Transformation logic dropping values
- Unrealistic completeness thresholds (see the profiling sketch below)
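To check whether a threshold is realistic, a quick profile of the column (a sketch against the same customers table) shows the actual completeness rate before you adjust the test:
-- Profile phone completeness before tuning the 70% threshold
SELECT
COUNT(*) as total_customers,
COUNT(phone) as customers_with_phone,
ROUND(100.0 * COUNT(phone) / NULLIF(COUNT(*), 0), 2) as completeness_pct
FROM customers;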
Step 4: Data Integrity Tests
Integrity tests validate relationships between columns and ensure consistency across tables.
File: integrity_tests
#+test sql OrderBelongsToCustomer()
#+meta {
:doc "Every order must reference a valid customer (no orphans)"
}
#+begin
SELECT
o.order_id,
o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
#+end
#+test sql OrderItemsHaveValidProducts()
#+meta {
:doc "All order items must reference existing products"
}
#+begin
SELECT
oi.order_item_id,
oi.product_id
FROM order_items oi
LEFT JOIN products p ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
#+end
#+test sql OrderTotalMatchesLineItems()
#+meta {
:doc "Order total should equal sum of line items"
}
#+begin
SELECT
o.order_id,
o.order_total,
COALESCE(SUM(oi.quantity * oi.unit_price), 0) as calculated_total,
ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) as difference
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_total
HAVING ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) > 0.01;
#+end
#+test sql DateLogicConsistency()
#+meta {
:doc "Shipped date must be after order date"
}
#+begin
SELECT
order_id,
order_date,
shipped_date
FROM orders
WHERE shipped_date IS NOT NULL
AND shipped_date < order_date;
#+end
#+test sql PositiveQuantitiesAndPrices()
#+meta {
:doc "Quantities and prices must be positive"
}
#+begin
SELECT
order_item_id,
quantity,
unit_price
FROM order_items
WHERE quantity <= 0
OR unit_price < 0;
#+end
Run integrity tests:
#+import "std/test"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(packages=[integrity_tests], onFailure=test.Stop) }}
Common integrity issues:
- Orphaned records from failed deletions
- Rounding errors in calculated fields (the sketch below helps size a tolerance)
- Timezone issues in date comparisons
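For rounding errors in particular, it helps to look at how large the order-total discrepancies actually are before picking a tolerance; a sketch over the same orders and order_items tables:
-- Distribution of order-total discrepancies, to help size the 0.01 tolerance
WITH diffs AS (
SELECT
o.order_id,
ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) as difference
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_total
)
SELECT
ROUND(difference, 2) as difference,
COUNT(*) as order_count
FROM diffs
GROUP BY ROUND(difference, 2)
ORDER BY difference DESC;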
Step 5: Volume Tests
Volume tests validate that row counts are within expected ranges and detect anomalies.
File: volume_tests
#+test sql DailyOrderVolume()
#+meta {
:doc "Daily orders should be between 100 and 10000"
}
#+begin
SELECT
'Daily order volume out of range' as issue,
COUNT(*) as order_count,
CURRENT_DATE as check_date
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
HAVING COUNT(*) NOT BETWEEN 100 AND 10000;
#+end
#+test sql NewCustomerRate()
#+meta {
:doc "Expect 10-500 new customers daily"
}
#+begin
SELECT
'New customer count out of range' as issue,
COUNT(*) as new_customers
FROM customers
WHERE DATE(created_at) = CURRENT_DATE
HAVING COUNT(*) NOT BETWEEN 10 AND 500;
#+end
#+test sql OrderItemsMatchOrders()
#+meta {
:doc "Number of orders should match distinct orders in order_items"
}
#+begin
SELECT
'Mismatch between orders and order_items' as issue,
orders_count,
items_orders_count,
ABS(orders_count - items_orders_count) as difference
FROM (
SELECT COUNT(DISTINCT order_id) as orders_count
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
) o
CROSS JOIN (
SELECT COUNT(DISTINCT order_id) as items_orders_count
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE DATE(o.order_date) = CURRENT_DATE
) oi
WHERE orders_count != items_orders_count;
#+end
#+test sql VolumeGrowthCheck()
#+meta {
:doc "Daily order volume should not drop more than 50% or spike more than 200%"
}
#+begin
WITH today AS (
SELECT COUNT(*) as today_count
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
),
yesterday AS (
SELECT COUNT(*) as yesterday_count
FROM orders
WHERE DATE(order_date) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
'Unexpected volume change' as issue,
today_count,
yesterday_count,
ROUND(100.0 * (today_count - yesterday_count) / NULLIF(yesterday_count, 0), 2) as pct_change
FROM today
CROSS JOIN yesterday
WHERE today_count < yesterday_count * 0.5
OR today_count > yesterday_count * 2.0;
#+end
#+test sql ProductCategoryCoverage()
#+meta {
:doc "Each product category should have at least 5 products"
}
#+begin
SELECT
category,
COUNT(*) as product_count
FROM products
GROUP BY category
HAVING COUNT(*) < 5;
#+end
Run volume tests:
#+import "std/test"
#+import "data_quality_tests/volume_tests"
{{ test.Run(packages=[volume_tests], onFailure=test.Stop) }}
Volume test tips:
- Adjust thresholds based on business seasonality
- Use wider ranges for new systems
- Monitor trends over time to refine ranges (see the profiling sketch below)
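To ground those thresholds in history, a profiling sketch like this (over the same orders table) summarizes recent daily volumes:
-- Summarize the last 90 days of daily order volume to inform the BETWEEN thresholds
WITH daily AS (
SELECT
DATE(order_date) as order_day,
COUNT(*) as order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE(order_date)
)
SELECT
MIN(order_count) as min_daily,
ROUND(AVG(order_count), 0) as avg_daily,
MAX(order_count) as max_daily,
ROUND(STDDEV(order_count), 0) as stddev_daily
FROM daily;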
Step 6: Advanced Pattern Tests
Add sophisticated tests for complex business rules.
File: advanced_tests
#+test sql EmailFormatValidation()
#+meta {
:doc "Email addresses must match valid format"
}
#+begin
SELECT
customer_id,
email
FROM customers
WHERE email NOT LIKE '%_@__%.__%'
OR email LIKE '%..%'
OR email LIKE '%@%@%'
OR email LIKE '.%@%'
OR email LIKE '%@.%';
#+end
#+test sql OrderStatusValidation()
#+meta {
:doc "Order status must be valid enum value"
}
#+begin
SELECT
order_id,
status
FROM orders
WHERE status NOT IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded');
#+end
#+test sql SequenceGapDetection()
#+meta {
:doc "Detect gaps in order numbers (potential missing data)"
}
#+begin
WITH sequence_check AS (
SELECT
order_number,
LEAD(order_number) OVER (ORDER BY order_number) as next_number
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
)
SELECT
order_number,
next_number,
(next_number - order_number) as gap_size
FROM sequence_check
WHERE next_number - order_number > 1;
#+end
#+test sql SuspiciousActivityDetection()
#+meta {
:doc "Flag customers with unusually high order rate"
}
#+begin
SELECT
customer_id,
COUNT(*) as order_count,
MIN(order_date) as first_order,
MAX(order_date) as last_order
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY customer_id
HAVING COUNT(*) > 50;
#+end
#+test sql PriceReasonabilityCheck()
#+meta {
:doc "Product prices should be within reasonable ranges"
}
#+begin
SELECT
product_id,
product_name,
price,
category
FROM products
WHERE price < 0.01
OR price > 100000
OR (category = 'digital' AND price > 1000);
#+end
#+test sql StatisticalOutlierDetection()
#+meta {
:doc "Find order amounts more than 3 standard deviations from mean"
}
#+begin
WITH stats AS (
SELECT
AVG(order_total) as mean,
STDDEV(order_total) as std
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
o.order_id,
o.order_total,
ROUND(ABS(o.order_total - s.mean) / NULLIF(s.std, 0), 2) as z_score
FROM orders o
CROSS JOIN stats s
WHERE DATE(o.order_date) = CURRENT_DATE
AND ABS(o.order_total - s.mean) / NULLIF(s.std, 0) > 3;
#+end
Step 7: Orchestrating All Tests
Create a main test runner that executes all test suites.
File: run_all_tests
#+import "std/test"
#+import "data_quality_tests/schema_tests"
#+import "data_quality_tests/uniqueness_tests"
#+import "data_quality_tests/completeness_tests"
#+import "data_quality_tests/integrity_tests"
#+import "data_quality_tests/volume_tests"
#+import "data_quality_tests/advanced_tests"
-- Run schema tests first (foundational)
{{ test.Run(packages=[schema_tests], onFailure=test.Stop) }}
-- Run data quality tests (fail fast on critical issues)
{{ test.Run(
packages=[
uniqueness_tests,
completeness_tests,
integrity_tests
],
onFailure=test.Stop
) }}
-- Run volume and advanced tests (monitoring)
{{ test.Run(
packages=[
volume_tests,
advanced_tests
],
onFailure=test.Continue
) }}
Run all tests:
#+import "data_quality_tests/run_all_tests"
{{ run_all_tests.Execute() }}
Execution strategy:
- Schema tests run first and stop on failure
- Critical data quality tests fail fast
- Volume/monitoring tests continue on failure (for reporting)
Testing Your Tests
Verify your tests work correctly by intentionally introducing bad data.
Test the uniqueness tests:
-- Insert duplicate email
INSERT INTO customers (customer_id, email, first_name, created_at)
VALUES (99999, 'existing@example.com', 'Test', CURRENT_TIMESTAMP);
-- Run uniqueness tests - should fail
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
{{ test.Run(packages=[uniqueness_tests]) }}
-- Clean up
DELETE FROM customers WHERE customer_id = 99999;
Test the completeness tests:
-- Insert customer with missing required field
INSERT INTO customers (customer_id, first_name, created_at)
VALUES (99998, 'Test', CURRENT_TIMESTAMP);
-- Note: email is NULL
-- Run completeness tests - should fail
#+import "std/test"
#+import "data_quality_tests/completeness_tests"
{{ test.Run(packages=[completeness_tests]) }}
-- Clean up
DELETE FROM customers WHERE customer_id = 99998;
Test the integrity tests:
-- Insert order with non-existent customer
INSERT INTO orders (order_id, customer_id, order_date, status, order_total)
VALUES (99997, 99999, CURRENT_DATE, 'pending', 100.00);
-- customer_id 99999 doesn't exist
-- Run integrity tests - should fail
#+import "std/test"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(packages=[integrity_tests]) }}
-- Clean up
DELETE FROM orders WHERE order_id = 99997;
Scheduling Tests
For automated data quality monitoring, schedule your test suites:
Daily full test run:
-- Schedule: Daily at 2 AM
#+import "data_quality_tests/run_all_tests"
{{ run_all_tests.Execute() }}
Hourly critical tests:
-- Schedule: Every hour
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(
packages=[uniqueness_tests, integrity_tests],
onFailure=test.Stop
) }}
Real-time validation:
Integrate tests into your data pipelines (see WAP/AWAP tutorial).
Best Practices
Writing Effective Tests
- Be specific: Return the problematic rows with diagnostic info
-- Good: Shows which records and why
SELECT customer_id, email, 'duplicate' as issue
FROM customers WHERE ...
-- Bad: Just indicates failure
SELECT 'Test failed' WHERE EXISTS (...)
- Use clear test names: Name should explain what's being validated
-- Good
#+test sql NoDuplicateEmails()
-- Bad
#+test sql Test1()
- Add helpful documentation:
#+meta {
:doc "Customers should have unique email addresses. Duplicates may indicate data quality issues in source system or merge logic errors."
}
- Include context in results:
SELECT
order_id,
customer_id,
order_total,
'Order total negative' as issue,
order_date
FROM orders
WHERE order_total < 0;
Organizing Tests
- Group by domain: Separate tests by subject area (customers, orders, etc.)
- Layer by severity: Critical tests separate from monitoring tests
- Version control: Track test changes alongside schema changes
- Document thresholds: Explain why specific values are used (see the example below)
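One lightweight way to document a threshold is in the test's :doc string itself; for example (the wording here is illustrative):
#+meta {
:doc "At least 70% of customers should have phone numbers. Phone is optional at sign-up, so 100% is not expected; revisit the threshold if the sign-up form changes."
}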
Performance Optimization
- Index test columns: Ensure WHERE/GROUP BY columns are indexed
- Limit scope: Test recent data first for faster feedback, e.g. WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' (see the sketch below)
- Use EXPLAIN: Understand query performance
- Partition large tables: Use date partitioning for time-series data
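As an example of limiting scope, here is a sketch of a recent-data variant of the orphaned-order check from Step 4, filtered to the last seven days of orders:
#+test sql RecentOrderBelongsToCustomer()
#+meta {
:doc "Orphan check limited to the last 7 days for faster feedback"
}
#+begin
SELECT
o.order_id,
o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
AND o.order_date >= CURRENT_DATE - INTERVAL '7 days';
#+end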
Handling Edge Cases
- Account for NULLs: Remember NULL != NULL in SQL
-- Handle nulls explicitly
WHERE column IS NULL OR column = ''
- Avoid division by zero:
-- Safe percentage calculation
CAST(COUNT(field) AS FLOAT) / NULLIF(COUNT(*), 0)
- Use appropriate precision:
-- Allow for rounding errors
WHERE ABS(calculated - expected) > 0.01
- Consider timezones:
-- Be explicit about timezone
WHERE DATE(created_at AT TIME ZONE 'UTC') = CURRENT_DATE
Troubleshooting
Tests failing unexpectedly
Check recent schema changes:
-- Review table structure
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;
Verify data volumes:
-- Compare today vs yesterday
SELECT
DATE(order_date) as date,
COUNT(*) as order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(order_date)
ORDER BY date DESC;
Check for data pipeline issues:
-- Find gaps in data loading
SELECT
DATE(created_at) as date,
COUNT(*) as records,
MIN(created_at) as first_record,
MAX(created_at) as last_record
FROM your_table
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
Tests passing but data quality is poor
Your tests may have coverage gaps. Add tests for:
- Statistical anomalies (outliers, unexpected distributions)
- Cross-field validation (an example sketch follows this list)
- Trend analysis
- Business rule violations
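As an example of cross-field validation, this sketch (reusing the orders columns from earlier steps) flags combinations that should never occur together:
#+test sql CancelledOrdersNotShipped()
#+meta {
:doc "Cancelled orders should not carry a shipped date"
}
#+begin
SELECT
order_id,
status,
shipped_date
FROM orders
WHERE status = 'cancelled'
AND shipped_date IS NOT NULL;
#+end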
Performance issues
Optimize slow tests:
- Add appropriate indexes
- Reduce data range (test recent data)
- Use table samples for statistical tests (see the sketch below)
- Consider materialized views for complex aggregations
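On PostgreSQL, for example, TABLESAMPLE keeps statistical tests cheap on very large tables; here is a sketch of a sampled variant of the outlier statistics (the 10 percent sample rate is an assumption to tune):
-- Compute outlier statistics from a ~10% block sample instead of the full table (PostgreSQL syntax)
SELECT
AVG(order_total) as mean,
STDDEV(order_total) as std
FROM orders TABLESAMPLE SYSTEM (10)
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days';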
What You've Learned
You now know how to:
- Write SQL-based data quality tests using CoginitiScript
- Validate schema, uniqueness, completeness, and data integrity
- Organize tests by domain and severity
- Run and orchestrate test suites
- Debug test failures effectively
- Schedule automated data quality monitoring
Next Steps
- Integrate tests into your WAP/AWAP pipelines
- Create reusable test libraries for common patterns
- Build custom tests for your business rules
- Set up alerting for test failures
- Track data quality metrics over time
- Explore the Great Expectations conversion guide for more test patterns