Building an AI-Powered Data Pipeline with LLM Blocks

In this tutorial, you'll build a complete data pipeline that uses CoginitiScript's LLM blocks to generate synthetic test data, enrich it with AI-driven analysis, and publish the results to database tables — all within a single CoginitiScript project.

What You'll Build

A Product Review Analytics Pipeline for a fictional online bookstore called PageTurn Books. The pipeline will:

Generate synthetic product and review data using LLM blocks
Use AI to classify review sentiment and extract key themes
Combine AI-generated insights with SQL transformations
Publish the final analytics tables to the database

What You'll Learn

Defining LLM blocks with typed output schemas
Referencing LLM block results in SQL queries
Passing SQL data into LLM prompts with print.Csv()
Chaining LLM blocks using iterator()
Publishing LLM results to database tables
Controlling LLM result caching

Prerequisites

A Coginiti Team or Enterprise account with AI Assistant configured
A PostgreSQL (or Greenplum, Snowflake, BigQuery, Redshift) database connection
Basic familiarity with CoginitiScript (see CoginitiScript Reference)

Project Setup

Create a new project called pageturn in your catalog. The project.toml defines the project name used in import paths:

[general]
name = "pageturn"
description = "PageTurn Books analytics pipeline"
version = "1.0.0"

Create the following package structure inside the project:

pageturn/
  project.toml
  seed/
    seed              -- Source data setup
  enrichment/
    enrichment        -- LLM-powered analysis
    enrichment_test   -- Data quality tests
  analytics/
    analytics         -- SQL transformations
  pipeline/
    pipeline          -- Publication and orchestration

Step 1: Generate Seed Data with LLM Blocks

Start by creating sample book and review data. This demonstrates the most basic LLM block pattern — generating structured data from a prompt.

Create the seed script inside your project:

#+import "std/publication"

#+src llm Books()
#+meta {
  :schema {
    :columns [{:name "book_id", :type "INT", :description "Unique book ID from 1 to 10"},
              {:name "title", :type "STRING", :description "Book title"},
              {:name "author", :type "STRING", :description "Author full name"},
              {:name "genre", :type "STRING", :description "One of: Fiction, Non-Fiction, Science, History, Technology"},
              {:name "price", :type "DECIMAL(6,2)", :description "Price between 9.99 and 49.99"}]
  },
  :publication {:type "table", :name "raw_books"}
}
#+begin
Generate 10 realistic book records for an online bookstore.
Include a mix of genres. Use well-known real book titles and authors.
#+end

#+src llm Reviews()
#+meta {
  :schema {
    :columns [{:name "review_id", :type "INT", :description "Unique review ID from 1 to 30"},
              {:name "book_id", :type "INT", :description "References a book_id between 1 and 10"},
              {:name "reviewer", :type "STRING", :description "Reviewer display name"},
              {:name "rating", :type "INT", :description "Rating from 1 to 5"},
              {:name "review_text", :type "STRING", :description "2-3 sentence review"},
              {:name "review_date", :type "DATE", :description "Date within the last 90 days"}]
  },
  :publication {:type "table", :name "raw_reviews"}
}
#+begin
Generate 30 realistic book reviews. Distribute reviews across book_id values 1-10.
Include a realistic mix of ratings:
- 5% rating 1 (very negative)
- 10% rating 2 (negative)
- 20% rating 3 (neutral)
- 35% rating 4 (positive)
- 30% rating 5 (very positive)
Review text should match the rating sentiment.
#+end

{{
  publication.Run(
    blocks=[Books, Reviews],
    parallelism=1
  )
}}

Try it: Open the seed file in Coginiti. Place your cursor on publication.Run() and execute. The publication framework will publish both LLM block results to database tables. Verify the tables are created by running:

SELECT * FROM raw_books;
SELECT * FROM raw_reviews;

What Just Happened?

Each LLM block sent its prompt (along with the :schema definition) to the AI model
The AI returned CSV-formatted data matching the declared schema
The publication.Run() call published each LLM block's results to the database tables declared in :publication metadata (raw_books and raw_reviews)

Step 2: Enrich Data with AI Analysis

Now use LLM blocks to analyze the review data. This demonstrates passing SQL data into LLM prompts using print.Csv().

Create the enrichment script inside your project:

#+import "std/print"
#+import "pageturn/seed"

#+src sql ReviewsForAnalysis()
#+begin
SELECT r.review_id, r.book_id, b.title, r.rating, r.review_text
FROM {{ seed.Reviews() }}  r
JOIN {{ seed.Books() }}  b ON r.book_id = b.book_id
ORDER BY r.review_id
#+end

#+src llm SentimentAnalysis()
#+meta {
  :schema {
    :columns [{:name "review_id", :type "INT", :description "The review_id from the input data"},
              {:name "sentiment", :type "STRING", :description "POSITIVE, NEUTRAL, or NEGATIVE"},
              {:name "confidence", :type "DECIMAL(3,2)", :description "Confidence score from 0.00 to 1.00"},
              {:name "key_theme", :type "STRING", :description "Primary theme: quality, value, delivery, content, or other"}]
  },
  :cache_results true
}
#+begin
Analyze each book review below and classify its sentiment.
For each review, determine:
- Sentiment: POSITIVE, NEUTRAL, or NEGATIVE (based on the overall tone, not just the rating)
- Confidence: How confident you are in the classification (0.00 to 1.00)
- Key theme: The primary topic the reviewer discusses

Reviews to analyze:
{{ print.Csv(ReviewsForAnalysis()) }}
#+end

Try it: Execute SentimentAnalysis() directly to see the AI's analysis. Then try querying it from SQL:

#+src sql SentimentReport()
#+begin
SELECT
  r.review_id,
  b.title,
  r.rating,
  s.sentiment,
  s.confidence,
  s.key_theme
FROM raw_reviews r
JOIN raw_books b ON r.book_id = b.book_id
JOIN {{ SentimentAnalysis() }} s ON r.review_id = s.review_id
ORDER BY s.confidence DESC
#+end

Key Technique: `print.Csv()`

The print.Csv() function executes the SQL block and formats the result as CSV text directly in the LLM prompt. This is the bridge between your database data and AI analysis — the AI sees the actual data and can reason about it row by row.

Step 3: Chain LLM Blocks for Multi-Step Analysis

Build on the sentiment analysis by chaining LLM blocks. This demonstrates using iterator() to pass data between LLM blocks.

Add the following to the enrichment script:

#+src llm GenreSummary()
#+meta {
  :schema {
    :columns [{:name "genre", :type "STRING"},
              {:name "avg_sentiment", :type "STRING", :description "POSITIVE, NEUTRAL, or NEGATIVE"},
              {:name "top_theme", :type "STRING"},
              {:name "recommendation", :type "STRING", :description "A brief recommendation for the bookstore"}]
  }
}
#+begin
Based on the following individual review analyses, provide a summary per genre.

Review analyses:
#+for row : iterator(SentimentAnalysis()) separator "\n" do
- Review {{ row["review_id"] }}: sentiment={{ row["sentiment"] }}, theme={{ row["key_theme"] }}
#+end

Cross-reference with the book data to group by genre:
{{ print.Csv(seed.Books()) }}

For each genre, determine the overall sentiment trend, the most common theme,
and provide a one-sentence recommendation for the bookstore.
#+end

Try it: Execute GenreSummary() to see AI-generated strategic recommendations per genre.

Key Technique: LLM-to-LLM Data Passing

LLM blocks cannot use {{ llmBlock() }} as a table reference from another LLM block. Instead, use:

print.Csv() — embeds the entire result as CSV text in the prompt (best for bulk data)
iterator() — returns rows as a list you can loop over or index into

The iterator() function is useful when you need row-level access:

Loop over all rows with #+for
Access a specific row by index: iterator(block())[0]
Access columns by name: row["column"] or by position: row[0]

Both print.Csv() and iterator() work for passing data from SQL blocks and LLM blocks into an LLM prompt.

Step 4: Build SQL Analytics on Top of LLM Data

Create the analytics layer that combines LLM insights with SQL transformations.

Create the analytics script inside your project:

#+import "pageturn/enrichment"

#+src sql BookScorecard()
#+begin
SELECT
  b.book_id,
  b.title,
  b.author,
  b.genre,
  COUNT(r.review_id) AS review_count,
  ROUND(AVG(r.rating), 2) AS avg_rating,
  SUM(CASE WHEN s.sentiment = 'POSITIVE' THEN 1 ELSE 0 END) AS positive_count,
  SUM(CASE WHEN s.sentiment = 'NEGATIVE' THEN 1 ELSE 0 END) AS negative_count,
  ROUND(
    SUM(CASE WHEN s.sentiment = 'POSITIVE' THEN 1.0 ELSE 0.0 END) / COUNT(*) * 100, 1
  ) AS positive_pct
FROM raw_books b
JOIN raw_reviews r ON b.book_id = r.book_id
JOIN {{ enrichment.SentimentAnalysis() }} s ON r.review_id = s.review_id
GROUP BY b.book_id, b.title, b.author, b.genre
ORDER BY positive_pct DESC
#+end

#+src sql ThemeDistribution()
#+begin
SELECT
  s.key_theme,
  COUNT(*) AS review_count,
  ROUND(AVG(s.confidence), 2) AS avg_confidence,
  SUM(CASE WHEN s.sentiment = 'POSITIVE' THEN 1 ELSE 0 END) AS positive,
  SUM(CASE WHEN s.sentiment = 'NEGATIVE' THEN 1 ELSE 0 END) AS negative
FROM {{ enrichment.SentimentAnalysis() }} s
GROUP BY s.key_theme
ORDER BY review_count DESC
#+end

Try it: Execute BookScorecard() to see each book ranked by AI-determined sentiment, or ThemeDistribution() to understand what reviewers are talking about.

Key Technique: Caching

Notice that SentimentAnalysis() is referenced by both BookScorecard() and ThemeDistribution(). Because we explicitly enabled caching with :cache_results true in the SentimentAnalysis() metadata (see Step 2), the LLM is called only once — both SQL queries use the same AI-generated result. Without caching, each reference would trigger a separate LLM call, which wastes time and can produce inconsistent results.

Step 5: Publish Results with LLM Publication

Set up automated publication so the pipeline writes results to database tables.

Create the pipeline script inside your project:

#+import "std/publication"
#+import "pageturn/seed"
#+import "pageturn/enrichment"
#+import "pageturn/analytics"

#+src llm DailyInsights()
#+meta {
  :schema {
    :columns [{:name "insight_date", :type "DATE", :description "Today's date"},
              {:name "category", :type "STRING", :description "sales, engagement, or quality"},
              {:name "insight", :type "STRING", :description "Key finding in one sentence"},
              {:name "action_item", :type "STRING", :description "Recommended next step"}]
  },
  :publication {
    :type "table",
    :name "daily_book_insights",
    :incremental "append"
  }
}
#+begin
Based on the current book catalog and review data, generate 3-5 actionable insights
for today's bookstore operations.

Book catalog:
#+for book : iterator(seed.Books()) separator "\n" do
- {{ book["title"] }} by {{ book["author"] }} ({{ book["genre"] }}, ${{ book["price"] }})
#+end

Genre-level analysis:
#+for genre : iterator(enrichment.GenreSummary()) separator "\n" do
- {{ genre["genre"] }}: {{ genre["avg_sentiment"] }} sentiment, top theme: {{ genre["top_theme"] }}
#+end
#+end

{{
  publication.Run(
    blocks=[DailyInsights],
    parallelism=1
  )
}}

Try it: Execute publication.Run(). Then verify the published table:

SELECT * FROM daily_book_insights ORDER BY insight_date DESC;

Run the pipeline again on another day — the "append" incremental strategy adds new rows without removing previous ones, building a history of AI-generated insights over time.

tip

Use Coginiti Actions to schedule this pipeline to run automatically on a recurring basis.

Step 6: Add Data Quality Tests

Validate AI-generated data with CoginitiScript tests. We recommend keeping tests within the same package as the code they test.

Create the enrichment_test script inside your project (alongside the enrichment script):

#+test sql TestSentimentValues()
#+begin
-- Verify sentiment is one of the expected values
SELECT review_id, sentiment
FROM {{ SentimentAnalysis() }}
WHERE sentiment NOT IN ('POSITIVE', 'NEUTRAL', 'NEGATIVE')
#+end

#+test sql TestConfidenceRange()
#+begin
-- Verify confidence is between 0 and 1
SELECT review_id, confidence
FROM {{ SentimentAnalysis() }}
WHERE confidence < 0.00 OR confidence > 1.00
#+end

#+test sql TestAllReviewsCovered()
#+begin
-- Verify every review got a sentiment analysis
SELECT r.review_id
FROM raw_reviews r
LEFT JOIN {{ SentimentAnalysis() }} s ON r.review_id = s.review_id
WHERE s.review_id IS NULL
#+end

Try it: Run all tests to verify the AI is generating valid data.

tip

Tests pass when they return an empty result set. Any rows returned indicate a failure — the rows show you which records violate the test condition.

Review: What You've Learned

Concept	Where Used
Basic LLM block with schema	Step 1: `Books()`, `Reviews()`
Publishing LLM results to tables	Step 1: `Books()` and `Reviews()` with `:publication` metadata
`print.Csv()` for data→LLM	Step 2: `SentimentAnalysis()`
`iterator()` + `print.Csv()` for LLM chaining	Step 3: `GenreSummary()`
Result caching (`:cache_results true`)	Step 2: `SentimentAnalysis()`, Step 4: shared across queries
LLM block arguments	Step 2: parameterized prompts
Table publication	Step 5: `DailyInsights()`
Incremental publication	Step 5: append strategy
Testing LLM output	Step 6: validation tests

Tips for Production Use

Be specific in prompts — Vague prompts produce inconsistent results. Include concrete constraints, value ranges, and format requirements.
Use schema descriptions — The :description field in each schema column helps the AI understand what values to produce.
Validate with tests — AI output can be unpredictable. Always write tests to catch unexpected values.
Cache when possible — Set :cache_results true when multiple SQL blocks reference the same LLM block to avoid redundant AI calls.
Start small — Generate small datasets first to verify the output matches your schema before scaling up.

Next Steps

CoginitiScript Reference: LLM Blocks — Full syntax reference
Working with LLM Blocks How-To Guide — More usage patterns
Building an E-commerce Analytics Pipeline — See how traditional SQL pipelines work, then enhance them with LLM blocks

What You'll Build​

What You'll Learn​

Prerequisites​

Project Setup​

Step 1: Generate Seed Data with LLM Blocks​

What Just Happened?​

Step 2: Enrich Data with AI Analysis​

Key Technique: print.Csv()​

Step 3: Chain LLM Blocks for Multi-Step Analysis​

Key Technique: LLM-to-LLM Data Passing​

Step 4: Build SQL Analytics on Top of LLM Data​

Key Technique: Caching​

Step 5: Publish Results with LLM Publication​

Step 6: Add Data Quality Tests​

Review: What You've Learned​

Tips for Production Use​

Next Steps​