We've built a system that generates, tests, and ships AI-powered products around the clock. Here's the honest, unfiltered story of how it works — and what nearly broke us.
WOWHOW isn't just a storefront. Behind the product pages and prompt packs is a fully automated AI product factory — a pipeline we call "the Forge" that generates, tests, refines, and ships digital products 24 hours a day, 7 days a week.
This is the story of how we built it, what we learned, and what we'd do differently.
Why We Built the Forge
When we started WOWHOW, we were creating prompt packs manually. One person would write prompts, another would test them across models, a third would write documentation, and someone else would build the product page.
A single prompt pack took 40-60 hours to go from idea to published product.
We knew this wouldn't scale. The demand for quality prompt packs was growing faster than our team could produce them. We needed a system that could:
- Generate prompt candidates automatically
- Test them across multiple AI models
- Score quality objectively
- Generate documentation and marketing copy
- Build product pages
- Handle the entire pipeline with minimal human intervention
How the Forge Works
Stage 1: Idea Generation
The pipeline starts with market research. We monitor:
- Search trends for AI and prompt-related queries
- Social media discussions about AI pain points
- Customer support tickets and feature requests
- Competitor product launches
An AI system analyzes these signals and generates product briefs — descriptions of prompt packs that would address real market demand.
Stage 2: Prompt Generation
For each product brief, the system generates candidate prompts using a multi-model approach:
- Claude generates initial prompt candidates
- GPT generates alternative versions
- A "remix" agent combines the best elements
- Each candidate goes through 3 rounds of self-refinement
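In code, the generate-remix-refine flow above could look like the following sketch. The model calls are stand-in stubs (real versions would hit the Claude and GPT APIs), and the refinement step is simplified to a fixed three-round loop.

```python
# Hypothetical sketch of Stage 2: candidates from two models, a remix pass,
# then three self-refinement rounds. Model calls are stand-in stubs.

def claude_generate(brief: str) -> list[str]:
    return [f"[claude] prompt for {brief}"]   # stand-in for an API call

def gpt_generate(brief: str) -> list[str]:
    return [f"[gpt] prompt for {brief}"]      # stand-in for an API call

def remix(candidates: list[str]) -> str:
    return " + ".join(candidates)             # combine best elements (toy version)

def refine(prompt: str, rounds: int = 3) -> str:
    for i in range(rounds):                   # each round would re-prompt a model
        prompt = f"{prompt} (refined r{i + 1})"
    return prompt

def stage2(brief: str) -> list[str]:
    candidates = claude_generate(brief) + gpt_generate(brief)
    candidates.append(remix(candidates))      # the "remix" agent's combined candidate
    return [refine(c) for c in candidates]

prompts = stage2("cold outreach emails")
```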
Stage 3: Quality Testing
This is the most critical stage. Every prompt is tested:
- Multi-model testing — run on Claude, GPT, and Gemini
- Consistency testing — run 5 times on each model to check variance
- Quality scoring — automated scoring on relevance, completeness, clarity, and usefulness
- Edge case testing — deliberately difficult inputs to stress-test prompts
Prompts must score 8/10 or higher across all models to pass. About 60% of generated prompts fail this gate.
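The 8/10-across-all-models gate can be expressed as a simple predicate. This is a sketch under stated assumptions: the post gives the 8/10 threshold and the 5-runs-per-model consistency test, but the variance limit (a standard deviation cap of 1.0) is an illustrative guess.

```python
# Sketch of the Stage 3 quality gate: every model must average 8/10
# across 5 runs, and runs must be consistent. MAX_STDEV is assumed.

from statistics import mean, pstdev

PASS_THRESHOLD = 8.0
MAX_STDEV = 1.0  # assumed consistency limit, not from the post

def passes_gate(scores_by_model: dict[str, list[float]]) -> bool:
    """scores_by_model maps model name -> scores from repeated runs."""
    for runs in scores_by_model.values():
        if mean(runs) < PASS_THRESHOLD:  # must average 8/10 on every model
            return False
        if pstdev(runs) > MAX_STDEV:     # and stay consistent across runs
            return False
    return True

ok = passes_gate({
    "claude": [9, 8.5, 9, 8, 8.5],
    "gpt":    [8, 8, 8.5, 9, 8.5],
    "gemini": [8.5, 8, 8, 8.5, 9],
})
```

Requiring the minimum across models (rather than the average of averages) is what makes the gate strict enough to fail roughly 60% of candidates: one weak model is enough to reject.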
Stage 4: Documentation
Passed prompts get automated documentation:
- Usage instructions
- Customization tips
- Example outputs from each model
- Known limitations
- Suggested modifications for specific use cases
Stage 5: Product Assembly
The system packages everything into a product:
- Product page content (title, description, features, FAQ)
- Cover image (generated with AI, reviewed by humans)
- Pricing recommendation (based on market analysis)
- SEO metadata
- Downloadable prompt pack file
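The assembled product is essentially one structured record. As a hedged sketch, the field names below are assumptions for illustration, not WOWHOW's actual schema:

```python
# Hypothetical shape of an assembled product record; field names are
# illustrative, not the Forge's real schema.

from dataclasses import dataclass, field, asdict

@dataclass
class ProductPackage:
    title: str
    description: str
    features: list[str]
    faq: list[tuple[str, str]]
    cover_image_url: str          # AI-generated, then human-reviewed
    recommended_price_usd: float  # from market analysis
    seo_metadata: dict = field(default_factory=dict)
    pack_file: str = ""           # path to the downloadable prompt pack

pkg = ProductPackage(
    title="Cold Outreach Prompt Pack",
    description="Tested prompts for sales email drafting.",
    features=["20 prompts", "multi-model tested"],
    faq=[("Which models?", "Claude, GPT, and Gemini")],
    cover_image_url="https://example.com/cover.png",
    recommended_price_usd=19.0,
)
```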
Stage 6: Human Review
This is the one stage that always requires a human. Before any product goes live, a team member:
- Reviews the prompts for quality and accuracy
- Tests them personally to verify they work as documented
- Reviews the product page for accuracy and brand consistency
- Approves the product or sends it back for revision
About 20% of products that pass automated testing get sent back at this stage.
The Numbers
- Products generated per week: 15-25 candidates
- Products that pass automated testing: 8-12
- Products that pass human review: 6-10
- Time from idea to published product: 48-72 hours, nearly all of it unattended (versus 40-60 hours of hands-on work in the manual process)
- Cost per product: ~$12 in API calls (down from ~$800 in human labor)
What Nearly Broke Us
The Quality Crisis (Month 2)
In our second month, we realized our automated quality scoring was flawed. It optimized for objective correctness but missed subjective usefulness. Prompts that scored 9/10 on our metrics were getting negative customer reviews.
The fix: we added a "usefulness panel" — a group of 10 beta testers who rate products before launch. Their subjective ratings now carry more weight than automated scores.
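Combining the two signals is a straightforward weighted average. The 70/30 split below is an illustrative assumption; the post only says panel ratings carry more weight than automated scores.

```python
# Sketch of blending the usefulness panel with automated scoring.
# The 0.7 / 0.3 weights are assumptions, not the production values.

from statistics import mean

PANEL_WEIGHT = 0.7  # assumed: subjective usefulness dominates
AUTO_WEIGHT = 0.3

def launch_score(panel_ratings: list[float], automated_score: float) -> float:
    """Weighted blend of the 10 beta testers' ratings and the automated score."""
    return PANEL_WEIGHT * mean(panel_ratings) + AUTO_WEIGHT * automated_score

score = launch_score([7, 8, 6, 7, 8, 7, 9, 6, 7, 8], automated_score=9.0)
```

Under this weighting, a product that scores 9/10 on automated metrics but only ~7/10 with the panel lands below 8 overall, which is exactly the failure mode the panel was added to catch.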
The Hallucination Problem (Month 3)
AI-generated documentation sometimes contained inaccurate claims about what prompts could do. A prompt pack for "legal document drafting" was documented as "produces court-ready legal documents" — which is dangerously misleading.
The fix: mandatory human review of all documentation, especially claims about capabilities. We added automated checks for superlative claims and legal/medical/financial language.
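A minimal version of those automated checks can be a pair of keyword patterns that flag documentation for human review. The word lists below are illustrative, not the actual production rules.

```python
# Toy claim checker: flag superlatives and regulated-domain language
# for mandatory human review. Word lists are illustrative assumptions.

import re

SUPERLATIVES = r"\b(best|guaranteed|perfect|court-ready|always|never fails)\b"
REGULATED = r"\b(legal|medical|financial|diagnosis|investment|lawsuit)\b"

def flag_claims(doc_text: str) -> list[str]:
    """Return a list of reasons this documentation needs extra human scrutiny."""
    flags = []
    if re.search(SUPERLATIVES, doc_text, re.IGNORECASE):
        flags.append("superlative claim")
    if re.search(REGULATED, doc_text, re.IGNORECASE):
        flags.append("regulated-domain language")
    return flags

flags = flag_claims("Produces court-ready legal documents.")
```

The point of a checker like this isn't to auto-reject; it's to make sure risky claims can never slip through on automated approval alone.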
The Monotony Problem (Month 4)
When AI generates products at scale, they start to feel samey. Same structure, same language patterns, same design choices. Customers noticed.
The fix: we added deliberate variation to the pipeline. Different generation models for different products. Random style variation in product pages. And most importantly, human creative direction for our premium products.
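One way to inject that variation deterministically is to seed a random choice of model and page style from each product's ID, so the mix varies across the catalog but stays reproducible per product. The option lists here are illustrative assumptions.

```python
# Sketch of per-product variation: seed an RNG from the product ID so
# each product gets a stable but varied model/style combination.
# MODELS and STYLES are illustrative, not the production lists.

import random

MODELS = ["claude", "gpt", "gemini"]
STYLES = ["minimal", "editorial", "playful", "technical"]

def pick_variation(product_id: str) -> dict:
    rng = random.Random(product_id)  # deterministic per product, varied across products
    return {"model": rng.choice(MODELS), "style": rng.choice(STYLES)}

a = pick_variation("pack-001")
b = pick_variation("pack-002")
```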
Lessons Learned
- Automation without quality gates is a liability — speed means nothing if products aren't good
- Human review is non-negotiable — AI can't fully evaluate AI output yet
- Customer feedback loops are essential — automated metrics only catch what you measure
- Transparency builds trust — customers appreciate knowing how products are made
- The best products combine AI speed with human taste — pure automation produces mediocrity at scale
What's Next
We're working on:
- Personalized prompt packs — custom-generated based on your specific use case
- Real-time quality monitoring — tracking how customers actually use prompts and automatically improving them
- Community-driven development — letting customers vote on what products we build next
- Open-sourcing parts of the Forge — sharing our quality testing framework with the community
People Also Ask
Are WOWHOW products fully AI-generated?
AI-assisted, human-reviewed. Every product passes through automated generation, automated testing, and mandatory human review. No product ships without a human approving it.
Why should I pay for prompts I could write myself?
You're paying for tested, refined, documented prompts. Each prompt has been run hundreds of times across multiple models. The testing alone would take you days per prompt. Our products let you skip to the result.
How often are products updated?
Products are reviewed quarterly and updated when models change significantly. Subscribers get updates automatically.
Want to skip months of trial and error? We've distilled thousands of hours of prompt engineering into ready-to-use prompt packs that deliver results on day one. Our packs at wowhow.cloud include battle-tested prompts for marketing, coding, business, writing, and more — each one refined until it consistently produces professional-grade output.
Blog reader exclusive: Use code BLOGREADER20 for 20% off your entire cart. No minimum, no catch.
Written by
Promptium Team
Expert contributor at WOWHOW. Writing about AI, development, automation, and building products that ship.
Ready to ship faster?
Browse our catalog of 1,800+ premium dev tools, prompt packs, and templates.