AI prose generation and how to keep it in check
Ben Lesh
I gave the same chapter specification to two different AI models. Same plot beats. Same character states. Same dialogue requirements. Same writing style guide. Same everything.
I got back two completely different chapters.
One read like a novel. The other read like a screenplay treatment. Same skeleton, wildly different flesh. And the gap between them taught me more about my own spec system than months of iteration had.
Quick context
If you haven't read the first post in this series, the short version: I'm building a five-season science fiction fantasy series called The Drakenhart Saga using AI-assisted prose generation. Not AI-authored. AI-assisted. The distinction matters, and this post is going to show you exactly why.
I've built a four-tier specification system that constrains what the AI generates. Canonical worldbuilding facts. Series arc progression. Individual chapter specifications with plot beats, character states, and dialogue requirements. A writing style guide that controls everything from em-dash frequency to how emotions should be rendered on the page.
The specs are the creative work. The AI generates words against those constraints. I review, revise, and quality control the output. That's the workflow.
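The four-tier stack described above can be sketched as a simple data model. This is a minimal, hypothetical illustration of the idea; the class and field names are mine, not the actual schema from the series bible.

```python
# A minimal sketch of the four-tier spec system: every generation prompt
# is assembled from the same four layers of constraints.
# All names, fields, and example strings here are illustrative.
from dataclasses import dataclass


@dataclass
class SpecStack:
    canon: str          # Tier 1: canonical worldbuilding facts
    series_arc: str     # Tier 2: series arc progression
    chapter_spec: str   # Tier 3: plot beats, character states, dialogue requirements
    style_guide: str    # Tier 4: prose-level rules (em-dash frequency, emotion rendering)

    def to_prompt(self) -> str:
        # The generation prompt is the concatenation of all four tiers,
        # most stable constraints first.
        return "\n\n".join(
            [self.canon, self.series_arc, self.chapter_spec, self.style_guide]
        )


stack = SpecStack(
    canon="Dragon consciousnesses can inhabit living tattoos.",
    series_arc="Season 1: Sera resists, then accepts, her heritage.",
    chapter_spec="Ch. 3 'Royal Blood': nine beats, three characters tracked.",
    style_guide="Show emotion through action and sensation; never name it directly.",
)
prompt = stack.to_prompt()
```

The point of the structure is the division of labor the post describes: the creative decisions live in the four fields, and the model only ever sees their concatenation.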
This post is about what happens when the constraints meet reality.
The chapter in question
Chapter 3, "Royal Blood." Sera Drakenhart wakes up on her ship, avoids making a decision she knows she needs to make, tries to teach herself fire magic in the cargo hold, nearly kills herself and her android partner Prime, breaks down, and finally accepts she needs help from the dragon elder Thornwick.
The spec for this chapter runs about two thousand words. It tracks three characters across nine plot beats with detailed starting and ending emotional states. It specifies which information must be conveyed in dialogue without dictating exact words. It defines the emotional arc: denial, frustration, terror, breakdown, acceptance, determination.
Both models received all of this. Both hit every beat. Both produced chapters around seven thousand words.
And then everything diverged.
Show versus tell: where the gap lives
My writing style guide has an entire section on this. It says: don't name emotions directly. Show them through action, dialogue, physical sensation. Let the reader do the work.
Here's how Claude handled Prime's concern for Sera:
Data was how he said I'm worried about you when he thought the direct version would make her defensive.
That's subtext. Sera is interpreting Prime's behavior through three years of partnership. The reader understands that Prime is worried without anyone saying "Prime was worried." The relationship is demonstrated through the specific way these two people communicate around difficult subjects.
Here's Gemini's approach:
It was the simplest promise in the world, and it broke her heart a little.
That's an instruction to the reader. "Feel sad now." The emotion is named, delivered, and moved past. There's nothing left for the reader to discover on their own.
This pattern repeated across every emotional beat in the chapter. One model trusted the reader. The other explained everything. Same spec. Same constraints. Completely different execution.
The spec works. The execution varies.
This is the part that surprised me.
I expected spec compliance differences. Maybe one model would skip a beat, or get a character's knowledge state wrong, or introduce something that contradicts canon. Those are structural failures that better specs can fix.
What I got instead was a quality gap in prose craft while the structural compliance was roughly equal. Both chapters hit all nine plot beats in order. Both tracked character states correctly. Both maintained the causal chain from denial through breakdown to acceptance.
The difference was in how the words landed on the page.
Claude wrote Sera's morning routine like this: the ponytail was described as "a decision made by muscle memory, and she let the muscle memory have it because the rest of her decision-making capacity was otherwise occupied." That's character work disguised as narration. The ponytail becomes a metaphor for cognitive overload without announcing itself as one.
Gemini wrote costume descriptions: "She grabbed the black bodysuit from the hook. It was automatic, a ritual of armor. She pulled it on, the fabric hugging her torso, the high collar framing her neck." This reads like a visual reference sheet for an animator. It tells you what Sera looks like, not what it feels like to be Sera.
Both are technically correct against the spec. The spec says she gets dressed. Both models delivered that. But the quality of how that beat is rendered determines whether the reader is inside the character or watching the character.
Where the spec actually failed
Not everything was an LLM execution difference. The analysis revealed three genuine gaps in my specification system, places where the constraints weren't tight enough to prevent misinterpretation.
The biggest one: Crimson's behavior during the fire escalation.
Crimson is a sentient dragon consciousness living in Sera's tattoo. Frustrated, proud, wants Sera to embrace her heritage. The spec captures all of that.
What the spec didn't say was what Crimson actually does when Sera's emotions start feeding uncontrolled fire in a feedback loop.
Claude's version had Crimson recognizing the danger and trying to warn Sera to stop. Gemini went a different direction and wrote Crimson as actively encouraging it: "Yes. Let it burn." Then "Good. More." Then pivoting to "Control it!" once the cargo hold was on fire.
Same character description. Opposite behaviors. And the second version fundamentally changes the scene's meaning. If Crimson egged her on, the lesson isn't "Sera's stubbornness caused the disaster." It's "Sera's dragon gave her bad advice." Different moral. Different character dynamic going forward.
The fix was simple: add one line to Crimson's behavioral constraints. "Crimson recognizes emotional channeling is dangerous untrained. Should warn, not encourage, during escalation."
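In spec terms, the fix is one added behavioral constraint. Here is a hypothetical rendering of what that entry might look like; the field names are illustrative, not the actual spec format.

```python
# Hypothetical sketch of Crimson's character spec before and after the fix.
# Field names and values are illustrative, not the author's real schema.
crimson_spec = {
    "name": "Crimson",
    "traits": ["frustrated", "proud", "wants Sera to embrace her heritage"],
    "crisis_behavior": [],  # the gap: nothing constrained crisis scenes
}

# The one-line fix: pin down behavior during the fire escalation so that
# different models converge on the same interpretation.
crimson_spec["crisis_behavior"].append(
    "Crimson recognizes emotional channeling is dangerous untrained. "
    "Should warn, not encourage, during escalation."
)
```

General traits leave room for opposite readings; a scene-level behavioral constraint closes that room.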
That's a small addition to the spec that prevents a scene-level misinterpretation. And I never would have found it without running the same chapter through two different models.
What this means for spec-driven creation
The whole point of the specification system is that the creative decisions happen in the spec, not in the generation. I decide what the chapter does. I decide how characters evolve. I decide what themes are present and how they're expressed. The AI's job is to render those decisions into readable prose.
This experiment validated that approach, but with a caveat.
The spec controls what happens. It does a solid job of controlling how characters sound and what information gets conveyed. Where it has less leverage is on the sentence-by-sentence craft decisions that separate good prose from adequate prose. Things like: when to use subtext versus direct statement. When to describe a character's appearance versus their behavior. When a metaphor should announce itself versus dissolve into the narration.
Some of this is trainable through the style guide. My existing guide already has rules about showing versus telling, about filter words, about sentence rhythm. Claude internalized those rules deeply. Gemini followed them superficially.
The takeaway isn't that one model is better than the other as a universal statement. It's that the spec system needs to account for execution variance. If I'm running a chapter through a model that tends toward telling instead of showing, I need more explicit examples in the prompt. If a model tends toward visual description over internal experience, I need constraints that redirect that tendency.
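One way to operationalize that: keep a per-model list of extra constraints targeting each model's known tendencies, and append it at prompt-assembly time. This is a hypothetical mechanism sketch; the model keys and patch strings are placeholders reflecting the observations above, not a real configuration.

```python
# Sketch of per-model prompt augmentation for known failure tendencies.
# Keys and patch strings are illustrative placeholders.
MODEL_PATCHES = {
    "model_a": [],  # internalizes the style guide; no extra constraints needed
    "model_b": [
        "Do not name emotions directly; render them through action and sensation.",
        "Describe established characters through behavior, not appearance.",
    ],
}


def augment_prompt(base_prompt: str, model: str) -> str:
    # Append model-specific constraints only when that model needs them.
    patches = MODEL_PATCHES.get(model, [])
    if not patches:
        return base_prompt
    extra = "\n".join(f"- {p}" for p in patches)
    return f"{base_prompt}\n\nADDITIONAL CONSTRAINTS:\n{extra}"
```

The base spec stays model-agnostic; only the patch list grows as new failure modes surface.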
The spec is a living system. It gets better every time it encounters a failure mode.
The three fixes
Here are the specific gaps this comparison revealed:
Character behavior during crisis scenes. If a character's personality description allows multiple valid interpretations of how they'd act in a specific scenario, the spec needs to constrain the interpretation. General personality traits aren't enough when the scene stakes are high.
Re-description of established characters. I had visual reference documents for image generation that bled into prose generation in Gemini. The fix: add explicit guidance that characters established in prior chapters are shown through behavior, not appearance. You don't describe what your partner of three years looks like every morning.
Formatting instructions buried in reference docs. Dragon telepathic dialogue has specific formatting rules (italicized, distinct from spoken dialogue). Those rules existed in the style guide but not in the chapter spec. Neither model followed them correctly. Lesson: critical formatting must appear in the document the model is actively generating from, not just in a reference file.
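The third fix reduces to a simple assembly rule: copy critical formatting into the document the model generates from, rather than trusting a reference file. A minimal sketch of that idea, with illustrative names throughout:

```python
# Sketch of the lesson from the third fix: non-negotiable formatting rules
# are prepended to the chapter spec at assembly time instead of living
# only in a separate reference file. Names are illustrative.
CRITICAL_FORMATTING = [
    "Dragon telepathic dialogue is italicized and visually distinct "
    "from spoken dialogue.",
]


def assemble_chapter_spec(chapter_body: str) -> str:
    # Put the rules at the top of the active document so the model
    # cannot generate without passing through them.
    rules = "\n".join(f"- {rule}" for rule in CRITICAL_FORMATTING)
    return f"FORMATTING (non-negotiable):\n{rules}\n\n{chapter_body}"


spec = assemble_chapter_spec("Chapter 3 'Royal Blood': nine beats, three characters.")
```

The mechanism is trivial; the insight is where the rules live.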
The creative act is constraint
There's a misconception that using AI for creative work means the AI is being creative on your behalf. That you describe a vibe and the machine produces art.
That's not what's happening here.
The creative act is building the constraints. Deciding that Sera's stubbornness is the cause of the fire, not Crimson's advice. Deciding that Prime communicates concern through data, not through direct statements of worry. Deciding that the emotional arc moves from denial through terror to acceptance, and that the acceptance arrives quietly rather than dramatically.
Those are authorial decisions. They live in the spec. The AI renders them, and the quality of that rendering varies by model, by prompt, by the phase of the moon, apparently. But the creative vision is encoded in the constraints themselves.
A chapter specification is not a suggestion. It's a blueprint. And just like a blueprint, the building might look different depending on who's holding the hammer. But the rooms are where I put them. The load-bearing walls are where they need to be. The foundation is mine.
The AI is a very fast, very capable contractor who sometimes needs more detailed blueprints than I initially thought.
What's next
Season 1 is still in production. This comparison exercise has already improved the spec system for the remaining chapters. Every failure mode I find now is one I don't have to debug across nineteen more chapters.
I'll continue sharing what I learn. The process is as interesting as the product, and I suspect other creators working with AI tools are running into the same walls.
The specs are getting better. The prose is getting better. The system is getting better.
One constraint at a time.
