
Building Data Quality Tests with CoginitiScript

Learn how to build a comprehensive data quality testing suite using CoginitiScript's SQL-based testing framework. This tutorial guides you through creating tests that validate data integrity, completeness, and consistency in a realistic e-commerce database.

What You'll Learn:

  • Write SQL-based data quality tests using CoginitiScript's assertion model
  • Validate schema, uniqueness, completeness, and data integrity
  • Organize and run test suites effectively
  • Debug test failures and interpret results

Prerequisites:

  • Basic SQL knowledge (SELECT, WHERE, GROUP BY, aggregations)
  • Familiarity with CoginitiScript basics
  • Access to a data warehouse with sample data
  • Understanding of common data quality issues

Understanding CoginitiScript Tests

CoginitiScript uses an assertion-based testing model where:

  • Test passes: Query returns zero rows
  • Test fails: Query returns one or more rows (the failing records)

This approach provides immediate diagnostic information by showing exactly which records violate your quality rules.

Basic test structure:

#+test sql TestName()
#+meta {
  :doc "Description of what this test validates"
}
#+begin
SELECT columns_that_help_debug
FROM table_name
WHERE condition_that_should_never_be_true;
#+end
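To make the pass/fail semantics concrete, here is a minimal sketch of the same model in Python with an in-memory SQLite database. The `run_test` helper and the sample `customers` rows are illustrative, not part of the CoginitiScript framework:

```python
import sqlite3

# Illustrative sample data: one customer (id 3) is missing an email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, None)],
)

def run_test(conn, sql):
    """Zero rows means the test passes; any rows returned are the failing records."""
    rows = conn.execute(sql).fetchall()
    return ("pass", []) if not rows else ("fail", rows)

# Condition that should never be true: a customer without an email.
status, failing = run_test(
    conn, "SELECT customer_id FROM customers WHERE email IS NULL"
)
print(status, failing)  # fail [(3,)]
```

Because the query returns the offending rows themselves, a failure immediately tells you which records to investigate.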

Project Setup

We'll build tests for an e-commerce database with these tables:

  • customers - Customer information
  • orders - Order records
  • order_items - Individual line items
  • products - Product catalog

Create this directory structure:

data_quality_tests/
├── schema_tests # Schema validation
├── uniqueness_tests # Uniqueness and duplicate checks
├── completeness_tests # Missing values and nulls
├── integrity_tests # Cross-table relationships
├── volume_tests # Row counts and volume checks
└── run_all_tests # Test orchestration

Step 1: Schema Validation Tests

Start by validating that your tables have the expected structure. Schema tests catch breaking changes from upstream systems.

File: schema_tests

#+test sql CustomersTableStructure()
#+meta {
  :doc "Verify customers table has all required columns"
}
#+begin
WITH expected_columns AS (
    SELECT unnest(ARRAY[
        'customer_id',
        'email',
        'first_name',
        'last_name',
        'created_at',
        'updated_at'
    ]) as col_name
),
actual_columns AS (
    SELECT column_name as col_name
    FROM information_schema.columns
    WHERE table_name = 'customers'
      AND table_schema = 'public'
),
missing AS (
    SELECT col_name FROM expected_columns
    EXCEPT
    SELECT col_name FROM actual_columns
)
SELECT 'Missing column: ' || col_name as issue
FROM missing;
#+end

#+test sql EmailColumnType()
#+meta {
  :doc "Ensure the email column has the correct data type"
}
#+begin
SELECT
    'Email column has wrong type' as issue,
    data_type as actual_type,
    'character varying' as expected_type
FROM information_schema.columns
WHERE table_name = 'customers'
  AND column_name = 'email'
  AND data_type != 'character varying';
#+end

#+test sql OrdersColumnCount()
#+meta {
  :doc "Verify orders table has expected number of columns"
}
#+begin
SELECT
    'Column count mismatch' as issue,
    COUNT(*) as actual_count,
    8 as expected_count
FROM information_schema.columns
WHERE table_name = 'orders'
  AND table_schema = 'public'
HAVING COUNT(*) != 8;
#+end

Run these tests:

#+import "std/test"
#+import "data_quality_tests/schema_tests"

{{ test.Run(packages=[schema_tests], onFailure=test.Stop) }}

What you should see:

  • If all tests pass: No output (zero rows returned)
  • If tests fail: Rows showing which columns are missing or incorrect

Step 2: Uniqueness Tests

Uniqueness tests detect duplicate records and verify that primary keys remain unique.

File: uniqueness_tests

#+test sql NoDuplicateEmails()
#+meta {
  :doc "Customers should have unique email addresses"
}
#+begin
SELECT
    email,
    COUNT(*) as duplicate_count,
    STRING_AGG(customer_id::TEXT, ', ') as conflicting_ids
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
#+end

#+test sql UniqueCustomerIds()
#+meta {
  :doc "Customer IDs must be unique"
}
#+begin
SELECT
    customer_id,
    COUNT(*) as occurrence_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
#+end

#+test sql UniqueOrderNumbers()
#+meta {
  :doc "Order numbers should be globally unique"
}
#+begin
SELECT
    order_number,
    COUNT(*) as duplicate_count,
    ARRAY_AGG(order_id) as order_ids
FROM orders
GROUP BY order_number
HAVING COUNT(*) > 1;
#+end

#+test sql CompositeKeyUnique()
#+meta {
  :doc "Each product can only appear once per order"
}
#+begin
SELECT
    order_id,
    product_id,
    COUNT(*) as duplicate_count
FROM order_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
#+end

#+test sql EmailUniquenessRatio()
#+meta {
  :doc "At least 98% of emails should be unique"
}
#+begin
SELECT
    'Email uniqueness below threshold' as issue,
    COUNT(DISTINCT email) as unique_emails,
    COUNT(*) as total_customers,
    ROUND(100.0 * COUNT(DISTINCT email) / COUNT(*), 2) as uniqueness_pct
FROM customers
HAVING CAST(COUNT(DISTINCT email) AS FLOAT) / COUNT(*) < 0.98;
#+end

Run uniqueness tests:

#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"

{{ test.Run(packages=[uniqueness_tests], onFailure=test.Stop) }}

Interpreting results:

  • Failed tests show the duplicate values and which records are affected
  • Use the conflicting_ids to investigate why duplicates exist

Step 3: Completeness Tests

Completeness tests ensure required data is present and null values only appear where expected.

File: completeness_tests

#+test sql RequiredCustomerFields()
#+meta {
  :doc "Critical customer fields must not be null"
}
#+begin
SELECT
    customer_id,
    CASE
        WHEN email IS NULL THEN 'email'
        WHEN first_name IS NULL THEN 'first_name'
        WHEN created_at IS NULL THEN 'created_at'
    END as missing_field
FROM customers
WHERE email IS NULL
   OR first_name IS NULL
   OR created_at IS NULL;
#+end

#+test sql RequiredOrderFields()
#+meta {
  :doc "Every order must have required fields populated"
}
#+begin
SELECT
    order_id,
    customer_id,
    order_date,
    status
FROM orders
WHERE customer_id IS NULL
   OR order_date IS NULL
   OR status IS NULL;
#+end

#+test sql PhoneNumberCompleteness()
#+meta {
  :doc "At least 70% of customers should have phone numbers"
}
#+begin
SELECT
    'Phone number completeness below threshold' as issue,
    COUNT(phone) as customers_with_phone,
    COUNT(*) as total_customers,
    ROUND(100.0 * COUNT(phone) / COUNT(*), 2) as completeness_pct
FROM customers
HAVING CAST(COUNT(phone) AS FLOAT) / COUNT(*) < 0.70;
#+end

#+test sql ConditionalCompleteness()
#+meta {
  :doc "Shipped orders must have tracking numbers"
}
#+begin
SELECT
    order_id,
    status,
    tracking_number
FROM orders
WHERE status IN ('shipped', 'delivered')
  AND tracking_number IS NULL;
#+end

#+test sql AtLeastOneContactMethod()
#+meta {
  :doc "Customers must have email or phone"
}
#+begin
SELECT
    customer_id,
    email,
    phone
FROM customers
WHERE email IS NULL
  AND phone IS NULL;
#+end

Run completeness tests:

#+import "std/test"
#+import "data_quality_tests/completeness_tests"

{{ test.Run(packages=[completeness_tests], onFailure=test.Stop) }}

Debugging tip: When tests fail, check if the issue is:

  1. Missing data at the source
  2. Transformation logic dropping values
  3. Unrealistic completeness thresholds
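The first two causes can be told apart by comparing completeness upstream and downstream of the transformation. A sketch in Python with SQLite, assuming a hypothetical raw landing table `raw_customers` alongside the transformed `customers` table:

```python
import sqlite3

# Illustrative data: the raw table has 2 of 3 phones; the transform dropped one more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_customers (customer_id INTEGER, phone TEXT);
    CREATE TABLE customers     (customer_id INTEGER, phone TEXT);
    INSERT INTO raw_customers VALUES (1, '555-0100'), (2, '555-0101'), (3, NULL);
    INSERT INTO customers     VALUES (1, '555-0100'), (2, NULL),      (3, NULL);
""")

for table in ("raw_customers", "customers"):
    total, filled = conn.execute(
        f"SELECT COUNT(*), COUNT(phone) FROM {table}"  # COUNT(col) skips NULLs
    ).fetchone()
    print(table, f"{100.0 * filled / total:.1f}%")
```

If the raw table is noticeably more complete than the transformed one, the transformation is dropping values; if both show the same gap, the data is missing at the source.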

Step 4: Data Integrity Tests

Integrity tests validate relationships between columns and ensure consistency across tables.

File: integrity_tests

#+test sql OrderBelongsToCustomer()
#+meta {
  :doc "Every order must reference a valid customer (no orphans)"
}
#+begin
SELECT
    o.order_id,
    o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
#+end

#+test sql OrderItemsHaveValidProducts()
#+meta {
  :doc "All order items must reference existing products"
}
#+begin
SELECT
    oi.order_item_id,
    oi.product_id
FROM order_items oi
LEFT JOIN products p ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
#+end

#+test sql OrderTotalMatchesLineItems()
#+meta {
  :doc "Order total should equal sum of line items"
}
#+begin
SELECT
    o.order_id,
    o.order_total,
    COALESCE(SUM(oi.quantity * oi.unit_price), 0) as calculated_total,
    ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) as difference
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_total
HAVING ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) > 0.01;
#+end

#+test sql DateLogicConsistency()
#+meta {
  :doc "Shipped date must not precede order date"
}
#+begin
SELECT
    order_id,
    order_date,
    shipped_date
FROM orders
WHERE shipped_date IS NOT NULL
  AND shipped_date < order_date;
#+end

#+test sql PositiveQuantitiesAndPrices()
#+meta {
  :doc "Quantities must be positive and prices non-negative"
}
#+begin
SELECT
    order_item_id,
    quantity,
    unit_price
FROM order_items
WHERE quantity <= 0
   OR unit_price < 0;
#+end

Run integrity tests:

#+import "std/test"
#+import "data_quality_tests/integrity_tests"

{{ test.Run(packages=[integrity_tests], onFailure=test.Stop) }}

Common integrity issues:

  • Orphaned records from failed deletions
  • Rounding errors in calculated fields
  • Timezone issues in date comparisons
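The rounding problem is why the order-total test above compares with a 0.01 tolerance rather than strict equality. A quick Python illustration of the underlying floating-point drift:

```python
# Binary floating point cannot represent 0.1 or 0.2 exactly, so a total
# accumulated in floats drifts away from the "same" value computed another way.
stored_total = 0.1 + 0.2   # e.g. order_total accumulated in floating point
calculated_total = 0.3     # e.g. SUM(quantity * unit_price)

print(stored_total == calculated_total)              # False: strict equality fails
print(abs(stored_total - calculated_total) <= 0.01)  # True: tolerance absorbs the drift
```

In the warehouse itself, storing money in a DECIMAL/NUMERIC column avoids the drift entirely; the tolerance guards against sources that don't.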

Step 5: Volume Tests

Volume tests validate that row counts are within expected ranges and detect anomalies.

File: volume_tests

#+test sql DailyOrderVolume()
#+meta {
  :doc "Daily orders should be between 100 and 10000"
}
#+begin
SELECT
    'Daily order volume out of range' as issue,
    COUNT(*) as order_count,
    CURRENT_DATE as check_date
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
HAVING COUNT(*) NOT BETWEEN 100 AND 10000;
#+end

#+test sql NewCustomerRate()
#+meta {
  :doc "Expect 10-500 new customers daily"
}
#+begin
SELECT
    'New customer count out of range' as issue,
    COUNT(*) as new_customers
FROM customers
WHERE DATE(created_at) = CURRENT_DATE
HAVING COUNT(*) NOT BETWEEN 10 AND 500;
#+end

#+test sql OrderItemsMatchOrders()
#+meta {
  :doc "Number of orders should match distinct orders in order_items"
}
#+begin
SELECT
    'Mismatch between orders and order_items' as issue,
    orders_count,
    items_orders_count,
    ABS(orders_count - items_orders_count) as difference
FROM (
    SELECT COUNT(DISTINCT order_id) as orders_count
    FROM orders
    WHERE DATE(order_date) = CURRENT_DATE
) o
CROSS JOIN (
    SELECT COUNT(DISTINCT order_id) as items_orders_count
    FROM order_items oi
    JOIN orders o ON oi.order_id = o.order_id
    WHERE DATE(o.order_date) = CURRENT_DATE
) oi
WHERE orders_count != items_orders_count;
#+end

#+test sql VolumeGrowthCheck()
#+meta {
  :doc "Daily order volume should not drop more than 50% or spike more than 200%"
}
#+begin
WITH today AS (
    SELECT COUNT(*) as today_count
    FROM orders
    WHERE DATE(order_date) = CURRENT_DATE
),
yesterday AS (
    SELECT COUNT(*) as yesterday_count
    FROM orders
    WHERE DATE(order_date) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
    'Unexpected volume change' as issue,
    today_count,
    yesterday_count,
    ROUND(100.0 * (today_count - yesterday_count) / NULLIF(yesterday_count, 0), 2) as pct_change
FROM today
CROSS JOIN yesterday
WHERE today_count < yesterday_count * 0.5
   OR today_count > yesterday_count * 2.0;
#+end

#+test sql ProductCategoryCoverage()
#+meta {
  :doc "Each product category should have at least 5 products"
}
#+begin
SELECT
    category,
    COUNT(*) as product_count
FROM products
GROUP BY category
HAVING COUNT(*) < 5;
#+end

Run volume tests:

#+import "std/test"
#+import "data_quality_tests/volume_tests"

{{ test.Run(packages=[volume_tests], onFailure=test.Stop) }}

Volume test tips:

  • Adjust thresholds based on business seasonality
  • Use wider ranges for new systems
  • Monitor trends over time to refine ranges
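One way to act on these tips is to derive bounds from a trailing window of recent counts instead of hard-coding constants, so thresholds track seasonality automatically. A sketch in Python; the multipliers and window length are illustrative choices, not recommendations:

```python
def volume_bounds(daily_counts, low=0.5, high=2.0):
    """Acceptable range derived from the trailing average of recent daily counts."""
    avg = sum(daily_counts) / len(daily_counts)
    return low * avg, high * avg

# Hypothetical order counts for the trailing seven days.
lo, hi = volume_bounds([950, 1020, 1100, 980, 1050, 990, 1010])
print(lo <= 1000 <= hi)  # True: today's count of 1000 is within range
```

The same idea can be expressed in SQL by replacing the fixed `BETWEEN 100 AND 10000` bounds with values computed from a trailing-window CTE.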

Step 6: Advanced Pattern Tests

Add sophisticated tests for complex business rules.

File: advanced_tests

#+test sql EmailFormatValidation()
#+meta {
  :doc "Email addresses must match valid format"
}
#+begin
SELECT
    customer_id,
    email
FROM customers
WHERE email NOT LIKE '%_@__%.__%'
   OR email LIKE '%..%'
   OR email LIKE '%@%@%'
   OR email LIKE '.%@%'
   OR email LIKE '%@.%';
#+end

#+test sql OrderStatusValidation()
#+meta {
  :doc "Order status must be valid enum value"
}
#+begin
SELECT
    order_id,
    status
FROM orders
WHERE status NOT IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded');
#+end

#+test sql SequenceGapDetection()
#+meta {
  :doc "Detect gaps in order numbers (potential missing data)"
}
#+begin
WITH sequence_check AS (
    SELECT
        order_number,
        LEAD(order_number) OVER (ORDER BY order_number) as next_number
    FROM orders
    WHERE DATE(order_date) = CURRENT_DATE
)
SELECT
    order_number,
    next_number,
    (next_number - order_number) as gap_size
FROM sequence_check
WHERE next_number - order_number > 1;
#+end

#+test sql SuspiciousActivityDetection()
#+meta {
  :doc "Flag customers with unusually high order rate"
}
#+begin
SELECT
    customer_id,
    COUNT(*) as order_count,
    MIN(order_date) as first_order,
    MAX(order_date) as last_order
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY customer_id
HAVING COUNT(*) > 50;
#+end

#+test sql PriceReasonabilityCheck()
#+meta {
  :doc "Product prices should be within reasonable ranges"
}
#+begin
SELECT
    product_id,
    product_name,
    price,
    category
FROM products
WHERE price < 0.01
   OR price > 100000
   OR (category = 'digital' AND price > 1000);
#+end

#+test sql StatisticalOutlierDetection()
#+meta {
  :doc "Find order amounts more than 3 standard deviations from mean"
}
#+begin
WITH stats AS (
    SELECT
        AVG(order_total) as mean,
        STDDEV(order_total) as std
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    o.order_id,
    o.order_total,
    ROUND(ABS(o.order_total - s.mean) / NULLIF(s.std, 0), 2) as z_score
FROM orders o
CROSS JOIN stats s
WHERE DATE(o.order_date) = CURRENT_DATE
  AND ABS(o.order_total - s.mean) / NULLIF(s.std, 0) > 3;
#+end

Step 7: Orchestrating All Tests

Create a main test runner that executes all test suites.

File: run_all_tests

#+import "std/test"
#+import "data_quality_tests/schema_tests"
#+import "data_quality_tests/uniqueness_tests"
#+import "data_quality_tests/completeness_tests"
#+import "data_quality_tests/integrity_tests"
#+import "data_quality_tests/volume_tests"
#+import "data_quality_tests/advanced_tests"

-- Run schema tests first (foundational)
{{ test.Run(packages=[schema_tests], onFailure=test.Stop) }}

-- Run data quality tests (fail fast on critical issues)
{{ test.Run(
    packages=[
        uniqueness_tests,
        completeness_tests,
        integrity_tests
    ],
    onFailure=test.Stop
) }}

-- Run volume and advanced tests (monitoring)
{{ test.Run(
    packages=[
        volume_tests,
        advanced_tests
    ],
    onFailure=test.Continue
) }}

Run all tests:

#+import "data_quality_tests/run_all_tests"

{{ run_all_tests.Execute() }}

Execution strategy:

  • Schema tests run first and stop on failure
  • Critical data quality tests fail fast
  • Volume/monitoring tests continue on failure (for reporting)
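The layered strategy boils down to a simple control flow, sketched here in Python. The suite names and failing rows are illustrative; in CoginitiScript this is what the `onFailure=test.Stop` / `test.Continue` options select between:

```python
def run_suites(suites, stop_on_failure):
    """Run (name, failing_rows) suites; optionally stop at the first failure."""
    results = {}
    for name, failing_rows in suites:
        results[name] = "pass" if not failing_rows else "fail"
        if failing_rows and stop_on_failure:
            break  # fail fast: later suites never run
    return results

critical = [
    ("uniqueness", []),
    ("integrity", [("orphan order", 99997)]),  # hypothetical failing row
    ("completeness", []),
]
print(run_suites(critical, stop_on_failure=True))
# {'uniqueness': 'pass', 'integrity': 'fail'} — completeness never ran
```

With `stop_on_failure=False` every suite runs and all results are collected, which is what you want for monitoring-grade checks.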

Testing Your Tests

Verify your tests work correctly by intentionally introducing bad data.

Test the uniqueness tests:

-- Insert duplicate email
INSERT INTO customers (customer_id, email, first_name, created_at)
VALUES (99999, 'existing@example.com', 'Test', CURRENT_TIMESTAMP);

-- Run uniqueness tests - should fail
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
{{ test.Run(packages=[uniqueness_tests]) }}

-- Clean up
DELETE FROM customers WHERE customer_id = 99999;

Test the completeness tests:

-- Insert customer with missing required field
INSERT INTO customers (customer_id, first_name, created_at)
VALUES (99998, 'Test', CURRENT_TIMESTAMP);
-- Note: email is NULL

-- Run completeness tests - should fail
#+import "std/test"
#+import "data_quality_tests/completeness_tests"
{{ test.Run(packages=[completeness_tests]) }}

-- Clean up
DELETE FROM customers WHERE customer_id = 99998;

Test the integrity tests:

-- Insert order with non-existent customer
INSERT INTO orders (order_id, customer_id, order_date, status, order_total)
VALUES (99997, 99999, CURRENT_DATE, 'pending', 100.00);
-- customer_id 99999 doesn't exist

-- Run integrity tests - should fail
#+import "std/test"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(packages=[integrity_tests]) }}

-- Clean up
DELETE FROM orders WHERE order_id = 99997;

Scheduling Tests

For automated data quality monitoring, schedule your test suites:

Daily full test run:

-- Schedule: Daily at 2 AM
#+import "data_quality_tests/run_all_tests"

{{ run_all_tests.Execute() }}

Hourly critical tests:

-- Schedule: Every hour
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
#+import "data_quality_tests/integrity_tests"

{{ test.Run(
    packages=[uniqueness_tests, integrity_tests],
    onFailure=test.Stop
) }}

Real-time validation:

Integrate tests into your data pipelines (see WAP/AWAP tutorial).

Best Practices

Writing Effective Tests

  1. Be specific: Return the problematic rows with diagnostic info

    -- Good: Shows which records and why
    SELECT customer_id, email, 'duplicate' as issue
    FROM customers WHERE ...

    -- Bad: Just indicates failure
    SELECT 'Test failed' WHERE EXISTS (...)
  2. Use clear test names: Name should explain what's being validated

    -- Good
    #+test sql NoDuplicateEmails()

    -- Bad
    #+test sql Test1()
  3. Add helpful documentation:

    #+meta {
      :doc "Customers should have unique email addresses. Duplicates may indicate data quality issues in source system or merge logic errors."
    }
  4. Include context in results:

    SELECT
        order_id,
        customer_id,
        order_total,
        'Order total negative' as issue,
        order_date
    FROM orders
    WHERE order_total < 0

Organizing Tests

  1. Group by domain: Separate tests by subject area (customers, orders, etc.)
  2. Layer by severity: Critical tests separate from monitoring tests
  3. Version control: Track test changes alongside schema changes
  4. Document thresholds: Explain why specific values are used

Performance Optimization

  1. Index test columns: Ensure WHERE/GROUP BY columns are indexed
  2. Limit scope: Test recent data first for faster feedback
    WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
  3. Use EXPLAIN: Understand query performance
  4. Partition large tables: Use date partitioning for time-series data

Handling Edge Cases

  1. Account for NULLs: Remember NULL != NULL in SQL

    -- Handle nulls explicitly
    WHERE column IS NULL OR column = ''
  2. Avoid division by zero:

    -- Safe percentage calculation
    CAST(COUNT(field) AS FLOAT) / NULLIF(COUNT(*), 0)
  3. Use appropriate precision:

    -- Allow for rounding errors
    WHERE ABS(calculated - expected) > 0.01
  4. Consider timezones:

    -- Be explicit about timezone
    WHERE DATE(created_at AT TIME ZONE 'UTC') = CURRENT_DATE
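Point 1 is easy to verify directly. In SQLite (the same three-valued logic applies in most warehouses), a comparison against NULL never matches, so a test written with `email = NULL` instead of `email IS NULL` silently passes:

```python
import sqlite3

# Illustrative table: one NULL email and one empty-string email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (email TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a@example.com",), (None,), ("",)])

# '=' with NULL yields UNKNOWN for every row, so nothing matches.
print(conn.execute("SELECT COUNT(*) FROM t WHERE email = NULL").fetchone()[0])   # 0
# IS NULL plus an explicit empty-string check catches both bad rows.
print(conn.execute(
    "SELECT COUNT(*) FROM t WHERE email IS NULL OR email = ''"
).fetchone()[0])  # 2
```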

Troubleshooting

Tests failing unexpectedly

Check recent schema changes:

-- Review table structure
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;

Verify data volumes:

-- Review daily order volumes for the past week
SELECT
    DATE(order_date) as date,
    COUNT(*) as order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(order_date)
ORDER BY date DESC;

Check for data pipeline issues:

-- Find gaps in data loading
SELECT
    DATE(created_at) as date,
    COUNT(*) as records,
    MIN(created_at) as first_record,
    MAX(created_at) as last_record
FROM your_table
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Tests passing but data quality is poor

Your tests may have coverage gaps. Add tests for:

  • Statistical anomalies (outliers, unexpected distributions)
  • Cross-field validation
  • Trend analysis
  • Business rule violations
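As an example of a cross-field rule none of the suites above covers: a cancelled order should not carry a shipped date. A sketch of the check in Python with SQLite; the rule and columns are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, shipped_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "cancelled", None),
    (2, "cancelled", "2024-05-01"),  # violates the rule
    (3, "shipped", "2024-05-02"),
])

# Same assertion model: rows returned are the violations.
failing = conn.execute(
    "SELECT order_id FROM orders"
    " WHERE status = 'cancelled' AND shipped_date IS NOT NULL"
).fetchall()
print(failing)  # [(2,)]
```

The same query drops straight into a `#+test sql` block, since it follows the zero-rows-means-pass convention.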

Performance issues

Optimize slow tests:

  1. Add appropriate indexes
  2. Reduce data range (test recent data)
  3. Use table samples for statistical tests
  4. Consider materialized views for complex aggregations

What You've Learned

You now know how to:

  • Write SQL-based data quality tests using CoginitiScript
  • Validate schema, uniqueness, completeness, and data integrity
  • Organize tests by domain and severity
  • Run and orchestrate test suites
  • Debug test failures effectively
  • Schedule automated data quality monitoring

Next Steps

  • Integrate tests into your WAP/AWAP pipelines
  • Create reusable test libraries for common patterns
  • Build custom tests for your business rules
  • Set up alerting for test failures
  • Track data quality metrics over time
  • Explore the Great Expectations conversion guide for more test patterns

Additional Resources