Course → Module 11: Quality Control & The Human Gate
Session 5 of 7

From "This Feels Right" to "This Scores 38"

Subjective quality assessment does not scale. When you are the only reviewer, "I know good when I see it" works. When you add a second reviewer, your definitions diverge. When you batch-produce 20 pieces per week, your standards drift. A rubric fixes this by encoding your quality standards into measurable dimensions.

The New York Times built exactly this kind of framework. Their internal tool, Stet, codifies institutional editorial knowledge into a concrete rubric to score AI-generated copy. The principle is universal: if you can define what quality means in numbers, you can enforce it consistently.

Quality Rubric: A scoring framework with defined dimensions, each rated on a fixed scale, that converts subjective editorial judgment into a repeatable, auditable number. The rubric encodes your standards so they survive changes in mood, fatigue, and reviewer.

The Five Scoring Dimensions

Your rubric should have about five dimensions. Fewer than five and you miss important quality signals; more than seven and the rubric becomes a chore that reviewers skip. Five is the practical optimum.

The dimensions below are a starting point. Modify them to match your content type.

| Dimension | What It Measures | Score 10 | Score 0 |
| --- | --- | --- | --- |
| Accuracy | Factual correctness of all verifiable claims | Every claim verified, sources cited, no hallucinations | Multiple fabricated facts, invented sources, wrong numbers |
| Voice Consistency | Match to target voice profile | Indistinguishable from author's natural writing | Generic AI voice with no personality markers |
| Structural Clarity | Logical flow, section organization, argument progression | Each section builds on the previous, clear transitions, no redundancy | Random paragraph order, ideas repeated, no coherent argument |
| Originality of Insight | Presence of ideas that could not be generated by prompting any model | Contains practitioner knowledge, specific examples, and positions only the author could take | Entirely generic advice available in any search result |
| AI Artifact Absence | Freedom from the 15 forensic markers (inverse scale) | Zero detectable AI markers | More than 10 markers present across the piece |
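A score card for these five dimensions can be captured in a small data structure that enforces the 0-10 scale. A minimal sketch: the dimension names come from the table above, while the class and example scores are illustrative.

```python
from dataclasses import dataclass

# The five dimensions from the table above.
DIMENSIONS = [
    "Accuracy",
    "Voice Consistency",
    "Structural Clarity",
    "Originality of Insight",
    "AI Artifact Absence",
]

@dataclass
class ScoreCard:
    """One rubric evaluation: every dimension rated 0-10."""
    scores: dict  # dimension name -> score (0-10)

    def __post_init__(self):
        missing = set(DIMENSIONS) - set(self.scores)
        if missing:
            raise ValueError(f"Unscored dimensions: {missing}")
        for dim, value in self.scores.items():
            if not 0 <= value <= 10:
                raise ValueError(f"{dim}: score {value} outside 0-10")

    @property
    def total(self) -> int:
        """Sum across dimensions: 0-50."""
        return sum(self.scores.values())

# Illustrative evaluation of one piece:
card = ScoreCard({
    "Accuracy": 9,
    "Voice Consistency": 7,
    "Structural Clarity": 8,
    "Originality of Insight": 6,
    "AI Artifact Absence": 8,
})
print(card.total)  # 38
```

Validating at construction time means a score card with a missing dimension or an out-of-range rating never enters your log.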

The Scoring Action Matrix

A score without an action is a decoration. Each score range maps to a specific editorial action.

```mermaid
flowchart LR
    A["Score Content<br/>(5 dimensions × 0-10)"] --> B{Total Score?}
    B -->|"40-50"| C["Publish<br/>Light proofread only"]
    B -->|"30-39"| D["Rework<br/>Targeted edits on weak dimensions"]
    B -->|"20-29"| E["Major Revision<br/>Structural and voice overhaul"]
    B -->|"Below 20"| F["Regenerate<br/>Prompt revision required"]
    style C fill:#6b8f71,color:#111
    style D fill:#c8a882,color:#111
    style E fill:#c47a5a,color:#111
    style F fill:#c47a5a,color:#111
```
| Score Range | Action | Typical Time Investment | Expected Output |
| --- | --- | --- | --- |
| 40-50 | Publish after proofread | 5-10 minutes | Ready for audience |
| 30-39 | Targeted rework on lowest-scoring dimensions | 20-40 minutes | Publishable after second review |
| 20-29 | Major revision: restructure, inject voice, verify facts | 45-90 minutes | Might reach publishable; consider regeneration |
| Below 20 | Discard and regenerate with revised prompt | Regeneration time + new review cycle | New output from improved prompt |
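The score-to-action mapping above reduces to a single lookup. A sketch, assuming the four bands from the table; the function name and action strings are illustrative.

```python
def action_for(total: int) -> str:
    """Map a total rubric score (0-50) to the editorial action
    defined in the scoring action matrix."""
    if not 0 <= total <= 50:
        raise ValueError("total must be 0-50 (5 dimensions x 0-10)")
    if total >= 40:
        return "Publish: light proofread only"
    if total >= 30:
        return "Rework: targeted edits on weak dimensions"
    if total >= 20:
        return "Major revision: structural and voice overhaul"
    return "Regenerate: prompt revision required"

print(action_for(38))  # Rework: targeted edits on weak dimensions
```

Encoding the bands in code, rather than in reviewers' heads, is what keeps the action consistent when a piece scores 39 at the end of a long review day.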

Calibration

A rubric is only useful if it produces consistent scores. To calibrate, score 5 pieces of content you already know the quality of: one piece of your own best writing, one piece of writing you admire from someone else, one good AI output, one mediocre AI output, and one obvious slop piece.

Your best writing should score 40+. The admired writing should score 40+. Good AI output should score 28-35. Mediocre AI output should score 18-27. Obvious slop should score below 18.

If the scores do not match your intuitive quality ranking, adjust the rubric. Either the dimension definitions are wrong, the scale anchors are wrong, or you are weighting dimensions incorrectly. Calibration is iterative. Expect 2-3 rounds before the rubric reliably matches your judgment.
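The calibration check itself can be scripted: score the five benchmark pieces, then flag any that land outside the expected band. A minimal sketch; the piece labels and the round-one scores are hypothetical, while the bands follow the targets above.

```python
# Expected score bands from the calibration targets above.
EXPECTED_BANDS = {
    "own_best":     (40, 50),
    "admired":      (40, 50),
    "good_ai":      (28, 35),
    "mediocre_ai":  (18, 27),
    "obvious_slop": (0, 17),
}

def check_calibration(actual: dict) -> list:
    """Return (piece, score, band) for every benchmark piece whose
    rubric score fell outside its expected band."""
    misses = []
    for piece, (lo, hi) in EXPECTED_BANDS.items():
        score = actual[piece]
        if not lo <= score <= hi:
            misses.append((piece, score, (lo, hi)))
    return misses

# Hypothetical round-one scores:
round_one = {"own_best": 44, "admired": 41, "good_ai": 33,
             "mediocre_ai": 29, "obvious_slop": 12}
for piece, score, band in check_calibration(round_one):
    print(f"{piece}: {score} outside {band} -- adjust the rubric")
```

An empty result from `check_calibration` is the signal that the rubric matches your judgment and calibration is done.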

Using the Rubric in Production

Every piece of content that exits your pipeline should have a score card attached. Not stored separately, not remembered vaguely, but recorded alongside the content in a simple log. Over time, this log reveals patterns: which content types consistently score low, which prompt templates produce the highest scores, and whether your quality is improving or degrading as you scale.
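An append-only log is enough for this: one JSON line per published piece. A sketch under illustrative assumptions; the filename, field names, and helper functions are not prescribed by any tool.

```python
import datetime
import json
import pathlib

LOG = pathlib.Path("score_log.jsonl")  # illustrative filename

def log_score(title: str, scores: dict) -> None:
    """Append one score card as a JSON line to the production log."""
    entry = {
        "date": datetime.date.today().isoformat(),
        "title": title,
        "scores": scores,
        "total": sum(scores.values()),
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def dimension_averages() -> dict:
    """Average each dimension across the log to reveal which
    quality signals are chronically weak."""
    entries = [json.loads(line) for line in LOG.read_text().splitlines()]
    dims = entries[0]["scores"].keys()
    return {d: sum(e["scores"][d] for e in entries) / len(entries)
            for d in dims}
```

A consistently low average on one dimension points at a systemic fix (a better prompt template, a voice guide) rather than piece-by-piece rework.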

Google's own Search Quality Evaluator Guidelines use a similar approach. Human quality raters evaluate search results against defined rubrics with specific criteria for each rating level. E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) is a rubric. Your rubric is the content-production equivalent.

Assignment

Build your quality rubric. Define 5 scoring dimensions relevant to your content type (you can use the ones in this session or create your own). For each dimension, define what a 10 looks like and what a 0 looks like. Score 5 pieces of content of varying quality. If the scores do not match your intuitive ranking, adjust the rubric and score again. Repeat until the numbers reflect reality.