Quality Standards

This document defines the automated quality checks that run on every skill PR.

Quality Score

Every skill receives a composite score from 0-100. The score is calculated across five dimensions, each contributing a weighted portion.
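One plausible way to combine the five dimensions is sketched here. The weights are the percentages shown in the Scoring Breakdown headings. The normalization step (earned points divided by possible points within each dimension) and the `composite_score` name are assumptions for illustration, not the checker's confirmed behavior. Some normalization seems likely because the per-check points do not always sum to the dimension weight (Structure, for example, lists 18 points against a 15% weight).

```python
# Dimension weights as listed in the Scoring Breakdown (percent of total).
WEIGHTS = {
    "structure": 15,
    "description": 20,
    "instructions": 25,
    "tests": 25,
    "security": 15,
}


def composite_score(earned: dict, possible: dict) -> float:
    """Return a 0-100 score: sum of weight * (earned / possible) per dimension.

    Assumption: points are normalized within each dimension before weighting.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        if possible[name] <= 0:
            raise ValueError(f"no points available for {name}")
        # Cap the ratio at 1.0 so a dimension never exceeds its weight.
        ratio = min(earned.get(name, 0) / possible[name], 1.0)
        total += weight * ratio
    return round(total, 1)
```

Under this assumption, a skill that earns every point in four dimensions and half the points in Description Quality would land at 90.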

Scoring Breakdown

Structure (15%)

| Check | Points | Criteria |
| --- | --- | --- |
| SKILL.md exists | 3 | Exact case match |
| Valid YAML frontmatter | 3 | Parses without error, has --- delimiters |
| No unexpected keys | 1 | Only allowed frontmatter keys |
| Name field valid | 2 | kebab-case, matches folder, <=64 chars |
| Name matches folder | 1 | Frontmatter name == directory name |
| Description field valid | 2 | Present, >10 chars |
| No angle brackets | 1 | No <> in frontmatter |
| Folder naming | 1 | kebab-case, no spaces/capitals |
| No README.md in skill | 1 | README belongs at repo level only |
| Test directory exists | 2 | tests/test-cases.yml present and valid |
| Status field valid | 1 | If present, must be: active, beta, deprecated, or archived |
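The name and status rules above can be sketched as two small checks. This is a minimal illustration, not the actual checker: the function names are hypothetical, and the regular expression assumes kebab-case means lowercase letters and digits joined by single hyphens.

```python
import re
from typing import Optional

# Assumed definition of kebab-case: lowercase alphanumeric words joined by single hyphens.
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
ALLOWED_STATUSES = {"active", "beta", "deprecated", "archived"}


def check_name(name: str, folder: str) -> list:
    """Return a list of problems with the frontmatter name (empty if valid)."""
    problems = []
    if not KEBAB_CASE.fullmatch(name):
        problems.append("name is not kebab-case")
    if len(name) > 64:
        problems.append("name exceeds 64 characters")
    if name != folder:
        problems.append("name does not match folder")
    return problems


def check_status(status: Optional[str]) -> bool:
    """Status is optional; when present it must be one of the allowed values."""
    return status is None or status in ALLOWED_STATUSES
```

For instance, `check_name("pdf-tools", "pdf-tools")` returns an empty list, while a name containing spaces or capitals is reported as not kebab-case.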

Description Quality (20%)

| Check | Points | Criteria |
| --- | --- | --- |
| Contains action verbs | 4 | "Creates", "Analyzes", "Manages", etc. |
| Contains trigger phrases | 5 | "Use when user says…", "Use for…" |
| Specific, not vague | 4 | Detailed, actionable descriptions |
| Mentions file types | 3 | If applicable |
| Under 1024 chars | 2 | Hard limit |
| Includes negative triggers | 2 | "Do NOT use for…" |
| Owner in metadata | 2 | metadata.owner or metadata.author present |
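A few of the description checks lend themselves to simple pattern matching. The sketch here covers only the length limit, trigger phrases, and negative triggers. The patterns are illustrative guesses based on the example phrases in the table, and `check_description` is a hypothetical name. The real checker's phrase lists are not given in this document.

```python
import re

# Illustrative patterns only. The lookbehind keeps "Do NOT use for" from
# counting as a positive trigger phrase.
TRIGGER_PATTERN = re.compile(r"(?<!not )\buse (?:when|for)\b", re.IGNORECASE)
NEGATIVE_PATTERN = re.compile(r"\bdo not use\b", re.IGNORECASE)
MAX_DESCRIPTION_CHARS = 1024


def check_description(text: str) -> dict:
    """Report which of three description checks pass for the given text."""
    return {
        "under_limit": len(text) < MAX_DESCRIPTION_CHARS,
        "has_trigger": bool(TRIGGER_PATTERN.search(text)),
        "has_negative_trigger": bool(NEGATIVE_PATTERN.search(text)),
    }
```

Checks such as "Specific, not vague" are harder to express as a pattern and are left out of this sketch.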

Instruction Quality (25%)

| Check | Points | Criteria |
| --- | --- | --- |
| Non-empty body | 3 | Content after frontmatter |
| Has step structure | 4 | Numbered steps, ## headers, or clear sequence |
| Includes examples | 5 | Code blocks, user scenarios, expected outputs |
| Includes error handling | 4 | "If X fails…", troubleshooting section |
| Uses progressive disclosure | 4 | References to references/ or scripts/ for detail |
| Actionable language | 3 | "Run X", "Call Y", "Check Z" |
| Word count under 5000 | 2 | Encourages conciseness |
| Instruction coherence | 3 | All referenced paths must actually exist |
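The coherence and word-count checks could look roughly like the following. This is a sketch under stated assumptions: referenced files are taken to be relative paths beginning with references/ or scripts/, words are counted by splitting on whitespace, and both function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: referenced files appear as relative paths under references/ or scripts/.
PATH_PATTERN = re.compile(r"\b(?:references|scripts)/[\w./-]+")


def missing_paths(body: str, skill_dir: Path) -> list:
    """Return referenced paths that do not exist under skill_dir, sorted."""
    # Strip a trailing period so "see references/api.md." resolves correctly.
    referenced = {m.group(0).rstrip(".") for m in PATH_PATTERN.finditer(body)}
    return sorted(p for p in referenced if not (skill_dir / p).exists())


def word_count(body: str) -> int:
    """Whitespace-delimited word count, one simple way to measure length."""
    return len(body.split())
```

An empty list from `missing_paths` would correspond to a pass on instruction coherence, and `word_count(body) < 5000` to a pass on the length check.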

Test Coverage (25%)

| Check | Points | Criteria |
| --- | --- | --- |
| test-cases.yml exists | 3 | Valid YAML structure |
| >=3 should-trigger tests | 4 | Including paraphrased variations |
| >=2 should-not-trigger tests | 3 | Unrelated topics |
| >=2 functional tests | 5 | Input -> expected behavior pairs |
| >=1 negative test | 3 | What the skill should refuse |
| >=1 edge case test | 3 | Special characters, empty inputs |
| Performance baseline | 2 | Before/after comparison |
| Functional tests have assertions | 2 | Each test has >=2 expected_behavior items |
| Trigger diversity | 2 | No near-duplicate triggers |
| Assertion specificity | 2 | No vague assertions |
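The minimum-count rules above can be expressed as a small lookup. The sketch assumes test-cases.yml has already been parsed into a dictionary. Apart from `expected_behavior`, which the table names, every key (`should_trigger`, `should_not_trigger`, `functional`, `negative`, `edge_case`) is a hypothetical placeholder, since the file's schema is not specified in this document.

```python
# Minimum counts taken from the Test Coverage table; key names are hypothetical.
MINIMUMS = {
    "should_trigger": 3,
    "should_not_trigger": 2,
    "functional": 2,
    "negative": 1,
    "edge_case": 1,
}


def coverage_gaps(cases: dict) -> list:
    """List unmet minimums for an already-parsed test-cases.yml mapping."""
    gaps = []
    for key, minimum in MINIMUMS.items():
        found = len(cases.get(key, []))
        if found < minimum:
            gaps.append(f"{key}: found {found}, need {minimum}")
    # Each functional test should carry at least two expected_behavior items.
    for index, test in enumerate(cases.get("functional", [])):
        if len(test.get("expected_behavior", [])) < 2:
            gaps.append(f"functional[{index}]: fewer than 2 expected_behavior items")
    return gaps
```

Checks that need judgment, such as trigger diversity and assertion specificity, are not covered by simple counting.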

Security (15%)

| Check | Points | Criteria |
| --- | --- | --- |
| No secrets detected | 5 | API keys, tokens, passwords |
| No angle brackets in frontmatter | 3 | Prevents prompt injection |
| No reserved names | 3 | "claude" or "anthropic" in skill name |
| No suspicious patterns | 2 | eval(), exec(), system() |
| No hardcoded URLs | 2 | Unless documented |
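Three of the security checks map onto short regular expressions, sketched here. The patterns follow the criteria in the table, but the exact rules the checker applies are not documented here, so treat them and the `security_findings` name as assumptions. Secret detection is omitted because it usually relies on a much larger rule set.

```python
import re

# Illustrative patterns; a real scanner would likely use a broader rule set.
SUSPICIOUS_CALL = re.compile(r"\b(?:eval|exec|system)\s*\(")
RESERVED_NAME = re.compile(r"claude|anthropic", re.IGNORECASE)
ANGLE_BRACKET = re.compile(r"[<>]")


def security_findings(skill_name: str, frontmatter: str, body: str) -> list:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if RESERVED_NAME.search(skill_name):
        findings.append("reserved name in skill name")
    if ANGLE_BRACKET.search(frontmatter):
        findings.append("angle bracket in frontmatter")
    if SUSPICIOUS_CALL.search(body):
        findings.append("suspicious call pattern")
    return findings
```

Note that the call pattern also matches qualified forms such as `os.system(`, which is usually the intent for this kind of scan.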

Score Thresholds

| Score | Result | Action |
| --- | --- | --- |
| 90-100 | Excellent | Auto-approved for maintainer review |
| 70-89 | Good | Approved with suggestions |
| 50-69 | Needs Work | Blocked: must address issues |
| 0-49 | Rejected | Major issues: significant rework needed |
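The threshold table translates directly into a lookup, sketched here. The function name `classify` is hypothetical. The table lists whole-number ranges, so treating each lower bound as inclusive for fractional scores (89.5 counts as Good) is an assumption.

```python
def classify(score: float) -> str:
    """Map a 0-100 composite score to the result label from the threshold table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Needs Work"
    return "Rejected"
```

A CI job could then block a PR whenever `classify(score)` returns "Needs Work" or "Rejected".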