Building Data Quality Tests with CoginitiScript
Learn how to build a comprehensive data quality testing suite using CoginitiScript's SQL-based testing framework. This tutorial guides you through creating tests that validate data integrity, completeness, and consistency in a realistic e-commerce database.
What You'll Learn:
- Write SQL-based data quality tests using CoginitiScript's assertion model
- Validate schema, uniqueness, completeness, and data integrity
- Organize and run test suites effectively
- Debug test failures and interpret results
Prerequisites:
- Basic SQL knowledge (SELECT, WHERE, GROUP BY, aggregations)
- Familiarity with CoginitiScript basics
- Access to a data warehouse with sample data
- Understanding of common data quality issues
Understanding CoginitiScript Tests
CoginitiScript uses an assertion-based testing model where:
- Test passes: Query returns zero rows
- Test fails: Query returns one or more rows (the failing records)
This approach provides immediate diagnostic information by showing exactly which records violate your quality rules.
Basic test structure:
#+test sql TestName()
#+meta {
:doc "Description of what this test validates"
}
#+begin
SELECT columns_that_help_debug
FROM table_name
WHERE condition_that_should_never_be_true;
#+end
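For example, a minimal sketch that flags negative order totals (assuming the orders table from the project setup below, with order_id and order_total columns):
#+test sql NoNegativeOrderTotals()
#+meta {
:doc "Order totals should never be negative"
}
#+begin
SELECT
order_id,
order_total
FROM orders
WHERE order_total < 0;
#+end
If every order total is zero or positive, the query returns no rows and the test passes; any negative totals come back as failing records.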
Project Setup
We'll build tests for an e-commerce database with these tables:
- customers - Customer information
- orders - Order records
- order_items - Individual line items
- products - Product catalog
Create this directory structure:
data_quality_tests/
├── schema_tests # Schema validation
├── uniqueness_tests # Uniqueness and duplicate checks
├── completeness_tests # Missing values and nulls
├── integrity_tests # Cross-table relationships
├── volume_tests # Row counts and volume checks
└── run_all_tests # Test orchestration
Step 1: Schema Validation Tests
Start by validating that your tables have the expected structure. Schema tests catch breaking changes from upstream systems.
File: schema_tests
#+test sql CustomersTableStructure()
#+meta {
:doc "Verify customers table has all required columns"
}
#+begin
WITH expected_columns AS (
SELECT unnest(ARRAY[
'customer_id',
'email',
'first_name',
'last_name',
'created_at',
'updated_at'
]) as col_name
),
actual_columns AS (
SELECT column_name as col_name
FROM information_schema.columns
WHERE table_name = 'customers'
AND table_schema = 'public'
),
missing AS (
SELECT col_name FROM expected_columns
EXCEPT
SELECT col_name FROM actual_columns
)
SELECT
'Missing column: ' || col_name as issue
FROM missing;
#+end
#+test sql EmailColumnType()
#+meta {
:doc "Ensure email column is correct data type"
}
#+begin
SELECT
'Email column has wrong type' as issue,
data_type as actual_type,
'character varying' as expected_type
FROM information_schema.columns
WHERE table_name = 'customers'
AND column_name = 'email'
AND data_type != 'character varying';
#+end
#+test sql OrdersColumnCount()
#+meta {
:doc "Verify orders table has expected number of columns"
}
#+begin
SELECT
'Column count mismatch' as issue,
COUNT(*) as actual_count,
8 as expected_count
FROM information_schema.columns
WHERE table_name = 'orders'
AND table_schema = 'public'
HAVING COUNT(*) != 8;
#+end
Run these tests:
#+import "std/test"
#+import "data_quality_tests/schema_tests"
{{ test.Run(packages=[schema_tests], onFailure=test.Stop) }}
What you should see:
- If all tests pass: No output (zero rows returned)
- If tests fail: Rows showing which columns are missing or incorrect
Step 2: Uniqueness Tests
Uniqueness tests prevent duplicate records and ensure primary keys remain unique.
File: uniqueness_tests
#+test sql NoDuplicateEmails()
#+meta {
:doc "Customers should have unique email addresses"
}
#+begin
SELECT
email,
COUNT(*) as duplicate_count,
STRING_AGG(customer_id::TEXT, ', ') as conflicting_ids
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
#+end
#+test sql UniqueCustomerIds()
#+meta {
:doc "Customer IDs must be unique"
}
#+begin
SELECT
customer_id,
COUNT(*) as occurrence_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
#+end
#+test sql UniqueOrderNumbers()
#+meta {
:doc "Order numbers should be globally unique"
}
#+begin
SELECT
order_number,
COUNT(*) as duplicate_count,
ARRAY_AGG(order_id) as order_ids
FROM orders
GROUP BY order_number
HAVING COUNT(*) > 1;
#+end
#+test sql CompositeKeyUnique()
#+meta {
:doc "Each product can only appear once per order"
}
#+begin
SELECT
order_id,
product_id,
COUNT(*) as duplicate_count
FROM order_items
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;
#+end
#+test sql EmailUniquenessRatio()
#+meta {
:doc "At least 98% of emails should be unique"
}
#+begin
SELECT
'Email uniqueness below threshold' as issue,
COUNT(DISTINCT email) as unique_emails,
COUNT(*) as total_customers,
ROUND(100.0 * COUNT(DISTINCT email) / COUNT(*), 2) as uniqueness_pct
FROM customers
HAVING CAST(COUNT(DISTINCT email) AS FLOAT) / COUNT(*) < 0.98;
#+end
Run uniqueness tests:
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
{{ test.Run(packages=[uniqueness_tests], onFailure=test.Stop) }}
Interpreting results:
- Failed tests show the duplicate values and which records are affected
- Use the conflicting_ids column to investigate why duplicates exist (a follow-up query sketch is shown below)
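For instance, if NoDuplicateEmails fails, a follow-up query like this sketch (against the same customers table) pulls the full records behind each duplicate email so you can decide which record to keep:
-- Inspect the full customer records behind duplicate emails
SELECT c.*
FROM customers c
WHERE c.email IN (
SELECT email
FROM customers
GROUP BY email
HAVING COUNT(*) > 1
)
ORDER BY c.email, c.created_at;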
Step 3: Completeness Tests
Completeness tests ensure required data is present and null values only appear where expected.
File: completeness_tests
#+test sql RequiredCustomerFields()
#+meta {
:doc "Critical customer fields must not be null"
}
#+begin
SELECT
customer_id,
CASE
WHEN email IS NULL THEN 'email'
WHEN first_name IS NULL THEN 'first_name'
WHEN created_at IS NULL THEN 'created_at'
END as missing_field
FROM customers
WHERE email IS NULL
OR first_name IS NULL
OR created_at IS NULL;
#+end
#+test sql RequiredOrderFields()
#+meta {
:doc "Every order must have required fields populated"
}
#+begin
SELECT
order_id,
customer_id,
order_date,
status
FROM orders
WHERE customer_id IS NULL
OR order_date IS NULL
OR status IS NULL;
#+end
#+test sql PhoneNumberCompleteness()
#+meta {
:doc "At least 70% of customers should have phone numbers"
}
#+begin
SELECT
'Phone number completeness below threshold' as issue,
COUNT(phone) as customers_with_phone,
COUNT(*) as total_customers,
ROUND(100.0 * COUNT(phone) / COUNT(*), 2) as completeness_pct
FROM customers
HAVING CAST(COUNT(phone) AS FLOAT) / COUNT(*) < 0.70;
#+end
#+test sql ConditionalCompleteness()
#+meta {
:doc "Shipped orders must have tracking numbers"
}
#+begin
SELECT
order_id,
status,
tracking_number
FROM orders
WHERE status IN ('shipped', 'delivered')
AND tracking_number IS NULL;
#+end
#+test sql AtLeastOneContactMethod()
#+meta {
:doc "Customers must have email or phone"
}
#+begin
SELECT
customer_id,
email,
phone
FROM customers
WHERE email IS NULL
AND phone IS NULL;
#+end
Run completeness tests:
#+import "std/test"
#+import "data_quality_tests/completeness_tests"
{{ test.Run(packages=[completeness_tests], onFailure=test.Stop) }}
Debugging tip: When tests fail, check if the issue is:
- Missing data at the source
- Transformation logic dropping values
- Unrealistic completeness thresholds (see the profiling sketch below)
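To check whether a threshold is realistic, a quick profile of the column (a sketch against the same customers table) shows the actual completeness rate before you adjust the test:
-- Profile phone completeness before tuning the 70% threshold
SELECT
COUNT(*) as total_customers,
COUNT(phone) as customers_with_phone,
ROUND(100.0 * COUNT(phone) / NULLIF(COUNT(*), 0), 2) as completeness_pct
FROM customers;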
Step 4: Data Integrity Tests
Integrity tests validate relationships between columns and ensure consistency across tables.
File: integrity_tests
#+test sql OrderBelongsToCustomer()
#+meta {
:doc "Every order must reference a valid customer (no orphans)"
}
#+begin
SELECT
o.order_id,
o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
#+end
#+test sql OrderItemsHaveValidProducts()
#+meta {
:doc "All order items must reference existing products"
}
#+begin
SELECT
oi.order_item_id,
oi.product_id
FROM order_items oi
LEFT JOIN products p ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;
#+end
#+test sql OrderTotalMatchesLineItems()
#+meta {
:doc "Order total should equal sum of line items"
}
#+begin
SELECT
o.order_id,
o.order_total,
COALESCE(SUM(oi.quantity * oi.unit_price), 0) as calculated_total,
ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) as difference
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_total
HAVING ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) > 0.01;
#+end
#+test sql DateLogicConsistency()
#+meta {
:doc "Shipped date must be after order date"
}
#+begin
SELECT
order_id,
order_date,
shipped_date
FROM orders
WHERE shipped_date IS NOT NULL
AND shipped_date < order_date;
#+end
#+test sql PositiveQuantitiesAndPrices()
#+meta {
:doc "Quantities and prices must be positive"
}
#+begin
SELECT
order_item_id,
quantity,
unit_price
FROM order_items
WHERE quantity <= 0
OR unit_price < 0;
#+end
Run integrity tests:
#+import "std/test"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(packages=[integrity_tests], onFailure=test.Stop) }}
Common integrity issues:
- Orphaned records from failed deletions
- Rounding errors in calculated fields (the sketch below helps size a tolerance)
- Timezone issues in date comparisons
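For rounding errors in particular, it helps to look at how large the order-total discrepancies actually are before picking a tolerance; a sketch over the same orders and order_items tables:
-- Distribution of order-total discrepancies, to help size the 0.01 tolerance
WITH diffs AS (
SELECT
o.order_id,
ABS(o.order_total - COALESCE(SUM(oi.quantity * oi.unit_price), 0)) as difference
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, o.order_total
)
SELECT
ROUND(difference, 2) as difference,
COUNT(*) as order_count
FROM diffs
GROUP BY ROUND(difference, 2)
ORDER BY difference DESC;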
Step 5: Volume Tests
Volume tests validate that row counts are within expected ranges and detect anomalies.
File: volume_tests
#+test sql DailyOrderVolume()
#+meta {
:doc "Daily orders should be between 100 and 10000"
}
#+begin
SELECT
'Daily order volume out of range' as issue,
COUNT(*) as order_count,
CURRENT_DATE as check_date
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
HAVING COUNT(*) NOT BETWEEN 100 AND 10000;
#+end
#+test sql NewCustomerRate()
#+meta {
:doc "Expect 10-500 new customers daily"
}
#+begin
SELECT
'New customer count out of range' as issue,
COUNT(*) as new_customers
FROM customers
WHERE DATE(created_at) = CURRENT_DATE
HAVING COUNT(*) NOT BETWEEN 10 AND 500;
#+end
#+test sql OrderItemsMatchOrders()
#+meta {
:doc "Number of orders should match distinct orders in order_items"
}
#+begin
SELECT
'Mismatch between orders and order_items' as issue,
orders_count,
items_orders_count,
ABS(orders_count - items_orders_count) as difference
FROM (
SELECT COUNT(DISTINCT order_id) as orders_count
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
) o
CROSS JOIN (
SELECT COUNT(DISTINCT order_id) as items_orders_count
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE DATE(o.order_date) = CURRENT_DATE
) oi
WHERE orders_count != items_orders_count;
#+end
#+test sql VolumeGrowthCheck()
#+meta {
:doc "Daily order volume should not drop more than 50% or spike more than 200%"
}
#+begin
WITH today AS (
SELECT COUNT(*) as today_count
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
),
yesterday AS (
SELECT COUNT(*) as yesterday_count
FROM orders
WHERE DATE(order_date) = CURRENT_DATE - INTERVAL '1 day'
)
SELECT
'Unexpected volume change' as issue,
today_count,
yesterday_count,
ROUND(100.0 * (today_count - yesterday_count) / NULLIF(yesterday_count, 0), 2) as pct_change
FROM today
CROSS JOIN yesterday
WHERE today_count < yesterday_count * 0.5
OR today_count > yesterday_count * 2.0;
#+end
#+test sql ProductCategoryCoverage()
#+meta {
:doc "Each product category should have at least 5 products"
}
#+begin
SELECT
category,
COUNT(*) as product_count
FROM products
GROUP BY category
HAVING COUNT(*) < 5;
#+end
Run volume tests:
#+import "std/test"
#+import "data_quality_tests/volume_tests"
{{ test.Run(packages=[volume_tests], onFailure=test.Stop) }}
Volume test tips:
- Adjust thresholds based on business seasonality
- Use wider ranges for new systems
- Monitor trends over time to refine ranges (see the profiling sketch below)
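To ground those thresholds in history, a profiling sketch like this (over the same orders table) summarizes recent daily volumes:
-- Summarize the last 90 days of daily order volume to inform the BETWEEN thresholds
WITH daily AS (
SELECT
DATE(order_date) as order_day,
COUNT(*) as order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE(order_date)
)
SELECT
MIN(order_count) as min_daily,
ROUND(AVG(order_count), 0) as avg_daily,
MAX(order_count) as max_daily,
ROUND(STDDEV(order_count), 0) as stddev_daily
FROM daily;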
Step 6: Advanced Pattern Tests
Add sophisticated tests for complex business rules.
File: advanced_tests
#+test sql EmailFormatValidation()
#+meta {
:doc "Email addresses must match valid format"
}
#+begin
SELECT
customer_id,
email
FROM customers
WHERE email NOT LIKE '%_@__%.__%'
OR email LIKE '%..%'
OR email LIKE '%@%@%'
OR email LIKE '.%@%'
OR email LIKE '%@.%';
#+end
#+test sql OrderStatusValidation()
#+meta {
:doc "Order status must be valid enum value"
}
#+begin
SELECT
order_id,
status
FROM orders
WHERE status NOT IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded');
#+end
#+test sql SequenceGapDetection()
#+meta {
:doc "Detect gaps in order numbers (potential missing data)"
}
#+begin
WITH sequence_check AS (
SELECT
order_number,
LEAD(order_number) OVER (ORDER BY order_number) as next_number
FROM orders
WHERE DATE(order_date) = CURRENT_DATE
)
SELECT
order_number,
next_number,
(next_number - order_number) as gap_size
FROM sequence_check
WHERE next_number - order_number > 1;
#+end
#+test sql SuspiciousActivityDetection()
#+meta {
:doc "Flag customers with unusually high order rate"
}
#+begin
SELECT
customer_id,
COUNT(*) as order_count,
MIN(order_date) as first_order,
MAX(order_date) as last_order
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY customer_id
HAVING COUNT(*) > 50;
#+end
#+test sql PriceReasonabilityCheck()
#+meta {
:doc "Product prices should be within reasonable ranges"
}
#+begin
SELECT
product_id,
product_name,
price,
category
FROM products
WHERE price < 0.01
OR price > 100000
OR (category = 'digital' AND price > 1000);
#+end
#+test sql StatisticalOutlierDetection()
#+meta {
:doc "Find order amounts more than 3 standard deviations from mean"
}
#+begin
WITH stats AS (
SELECT
AVG(order_total) as mean,
STDDEV(order_total) as std
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
o.order_id,
o.order_total,
ROUND(ABS(o.order_total - s.mean) / NULLIF(s.std, 0), 2) as z_score
FROM orders o
CROSS JOIN stats s
WHERE DATE(o.order_date) = CURRENT_DATE
AND ABS(o.order_total - s.mean) / NULLIF(s.std, 0) > 3;
#+end
Step 7: Orchestrating All Tests
Create a main test runner that executes all test suites.
File: run_all_tests
#+import "std/test"
#+import "data_quality_tests/schema_tests"
#+import "data_quality_tests/uniqueness_tests"
#+import "data_quality_tests/completeness_tests"
#+import "data_quality_tests/integrity_tests"
#+import "data_quality_tests/volume_tests"
#+import "data_quality_tests/advanced_tests"
-- Run schema tests first (foundational)
{{ test.Run(packages=[schema_tests], onFailure=test.Stop) }}
-- Run data quality tests (fail fast on critical issues)
{{ test.Run(
packages=[
uniqueness_tests,
completeness_tests,
integrity_tests
],
onFailure=test.Stop
) }}
-- Run volume and advanced tests (monitoring)
{{ test.Run(
packages=[
volume_tests,
advanced_tests
],
onFailure=test.Continue
) }}
Run all tests:
#+import "data_quality_tests/run_all_tests"
{{ run_all_tests.Execute() }}
Execution strategy:
- Schema tests run first and stop on failure
- Critical data quality tests fail fast
- Volume/monitoring tests continue on failure (for reporting)
Testing Your Tests
Verify your tests work correctly by intentionally introducing bad data.
Test the uniqueness tests:
-- Insert duplicate email
INSERT INTO customers (customer_id, email, first_name, created_at)
VALUES (99999, 'existing@example.com', 'Test', CURRENT_TIMESTAMP);
-- Run uniqueness tests - should fail
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
{{ test.Run(packages=[uniqueness_tests]) }}
-- Clean up
DELETE FROM customers WHERE customer_id = 99999;
Test the completeness tests:
-- Insert customer with missing required field
INSERT INTO customers (customer_id, first_name, created_at)
VALUES (99998, 'Test', CURRENT_TIMESTAMP);
-- Note: email is NULL
-- Run completeness tests - should fail
#+import "std/test"
#+import "data_quality_tests/completeness_tests"
{{ test.Run(packages=[completeness_tests]) }}
-- Clean up
DELETE FROM customers WHERE customer_id = 99998;
Test the integrity tests:
-- Insert order with non-existent customer
INSERT INTO orders (order_id, customer_id, order_date, status, order_total)
VALUES (99997, 99999, CURRENT_DATE, 'pending', 100.00);
-- customer_id 99999 doesn't exist
-- Run integrity tests - should fail
#+import "std/test"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(packages=[integrity_tests]) }}
-- Clean up
DELETE FROM orders WHERE order_id = 99997;
Scheduling Tests
For automated data quality monitoring, schedule your test suites:
Daily full test run:
-- Schedule: Daily at 2 AM
#+import "data_quality_tests/run_all_tests"
{{ run_all_tests.Execute() }}
Hourly critical tests:
-- Schedule: Every hour
#+import "std/test"
#+import "data_quality_tests/uniqueness_tests"
#+import "data_quality_tests/integrity_tests"
{{ test.Run(
packages=[uniqueness_tests, integrity_tests],
onFailure=test.Stop
) }}
Real-time validation:
Integrate tests into your data pipelines (see WAP/AWAP tutorial).
Best Practices
Writing Effective Tests
- Be specific: Return the problematic rows with diagnostic info
-- Good: Shows which records and why
SELECT customer_id, email, 'duplicate' as issue
FROM customers WHERE ...
-- Bad: Just indicates failure
SELECT 'Test failed' WHERE EXISTS (...)
- Use clear test names: Name should explain what's being validated
-- Good
#+test sql NoDuplicateEmails()
-- Bad
#+test sql Test1()
- Add helpful documentation:
#+meta {
:doc "Customers should have unique email addresses. Duplicates may indicate data quality issues in source system or merge logic errors."
}
- Include context in results:
SELECT
order_id,
customer_id,
order_total,
'Order total negative' as issue,
order_date
FROM orders
WHERE order_total < 0;
Organizing Tests
- Group by domain: Separate tests by subject area (customers, orders, etc.)
- Layer by severity: Critical tests separate from monitoring tests
- Version control: Track test changes alongside schema changes
- Document thresholds: Explain why specific values are used (see the example below)
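One lightweight way to document a threshold is in the test's :doc string itself; for example (the wording here is illustrative):
#+meta {
:doc "At least 70% of customers should have phone numbers. Phone is optional at sign-up, so 100% is not expected; revisit the threshold if the sign-up form changes."
}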
Performance Optimization
- Index test columns: Ensure WHERE/GROUP BY columns are indexed
- Limit scope: Test recent data first for faster feedback, e.g. WHERE created_at >= CURRENT_DATE - INTERVAL '7 days' (see the sketch below)
- Use EXPLAIN: Understand query performance
- Partition large tables: Use date partitioning for time-series data
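As an example of limiting scope, here is a sketch of a recent-data variant of the orphaned-order check from Step 4, filtered to the last seven days of orders:
#+test sql RecentOrderBelongsToCustomer()
#+meta {
:doc "Orphan check limited to the last 7 days for faster feedback"
}
#+begin
SELECT
o.order_id,
o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
AND o.order_date >= CURRENT_DATE - INTERVAL '7 days';
#+end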
Handling Edge Cases
- Account for NULLs: Remember NULL != NULL in SQL
-- Handle nulls explicitly
WHERE column IS NULL OR column = ''
- Avoid division by zero:
-- Safe percentage calculation
CAST(COUNT(field) AS FLOAT) / NULLIF(COUNT(*), 0)
- Use appropriate precision:
-- Allow for rounding errors
WHERE ABS(calculated - expected) > 0.01
- Consider timezones:
-- Be explicit about timezone
WHERE DATE(created_at AT TIME ZONE 'UTC') = CURRENT_DATE
Troubleshooting
Tests failing unexpectedly
Check recent schema changes:
-- Review table structure
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;
Verify data volumes:
-- Compare today vs yesterday
SELECT
DATE(order_date) as date,
COUNT(*) as order_count
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY DATE(order_date)
ORDER BY date DESC;
Check for data pipeline issues:
-- Find gaps in data loading
SELECT
DATE(created_at) as date,
COUNT(*) as records,
MIN(created_at) as first_record,
MAX(created_at) as last_record
FROM your_table
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
Tests passing but data quality is poor
Your tests may have coverage gaps. Add tests for:
- Statistical anomalies (outliers, unexpected distributions)
- Cross-field validation (an example sketch follows this list)
- Trend analysis
- Business rule violations
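As an example of cross-field validation, this sketch (reusing the orders columns from earlier steps) flags combinations that should never occur together:
#+test sql CancelledOrdersNotShipped()
#+meta {
:doc "Cancelled orders should not carry a shipped date"
}
#+begin
SELECT
order_id,
status,
shipped_date
FROM orders
WHERE status = 'cancelled'
AND shipped_date IS NOT NULL;
#+end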
Performance issues
Optimize slow tests:
- Add appropriate indexes
- Reduce data range (test recent data)
- Use table samples for statistical tests (see the sketch below)
- Consider materialized views for complex aggregations
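On PostgreSQL, for example, TABLESAMPLE keeps statistical tests cheap on very large tables; here is a sketch of a sampled variant of the outlier statistics (the 10 percent sample rate is an assumption to tune):
-- Compute outlier statistics from a ~10% block sample instead of the full table (PostgreSQL syntax)
SELECT
AVG(order_total) as mean,
STDDEV(order_total) as std
FROM orders TABLESAMPLE SYSTEM (10)
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days';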
What You've Learned
You now know how to:
- Write SQL-based data quality tests using CoginitiScript
- Validate schema, uniqueness, completeness, and data integrity
- Organize tests by domain and severity
- Run and orchestrate test suites
- Debug test failures effectively
- Schedule automated data quality monitoring
Next Steps
- Integrate tests into your WAP/AWAP pipelines
- Create reusable test libraries for common patterns
- Build custom tests for your business rules
- Set up alerting for test failures
- Track data quality metrics over time
- Explore the Great Expectations conversion guide for more test patterns