Quality Standards

This document defines the automated quality checks that run on every skill PR.

Quality Score

Every skill receives a composite score from 0 to 100. The score is calculated across five weighted dimensions: Structure (15%), Description Quality (20%), Instruction Quality (25%), Test Coverage (25%), and Security (15%).
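
The source does not say how per-check points are turned into the weighted score, and the points listed in the breakdown tables sum to 18, 22, 28, 29, and 15 rather than matching the weights exactly. The following is a minimal sketch of one plausible approach, assuming each dimension's earned points are normalized against that dimension's maximum before the weight is applied. The names `WEIGHTS`, `MAX_POINTS`, and `composite_score` are hypothetical.

```python
# Weights are taken from the dimension headings in this document.
WEIGHTS = {"structure": 15, "description": 20, "instructions": 25,
           "tests": 25, "security": 15}

# Maximum points per dimension: the sum of the per-check points in each table.
MAX_POINTS = {"structure": 18, "description": 22, "instructions": 28,
              "tests": 29, "security": 15}


def composite_score(earned: dict[str, float]) -> float:
    """Return a 0-100 score from the points earned in each dimension.

    Assumption: each dimension is normalized by its maximum, then weighted.
    """
    total = 0.0
    for dimension, weight in WEIGHTS.items():
        ratio = min(earned.get(dimension, 0) / MAX_POINTS[dimension], 1.0)
        total += weight * ratio
    return round(total, 1)
```

Under this assumption, a skill that earns every point in every table scores 100, and a skill that earns only the 15 Security points scores 15.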

Scoring Breakdown

Structure (15%)

Check	Points	Criteria
SKILL.md exists	3	Exact case match
Valid YAML frontmatter	3	Parses without error, has `---` delimiters
No unexpected keys	1	Only allowed frontmatter keys
Name field valid	2	kebab-case, matches folder, <=64 chars
Name matches folder	1	Frontmatter name == directory name
Description field valid	2	Present, >10 chars
No angle brackets	1	No `<>` in frontmatter
Folder naming	1	kebab-case, no spaces/capitals
No README.md in skill	1	README belongs at repo level only
Test directory exists	2	`tests/test-cases.yml` present and valid
Status field valid	1	If present, must be: active, beta, deprecated, or archived
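
The name checks above could be sketched as follows. This is an illustration only, not the actual checker: the function name `name_problems` is hypothetical, and treating digits as valid kebab-case characters is an assumption.

```python
import re

# Lowercase alphanumeric words joined by single hyphens (digits allowed by assumption).
KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def name_problems(name: str, folder: str) -> list[str]:
    """Return the name-related issues for a skill; an empty list means it passes."""
    problems = []
    if not KEBAB_CASE.fullmatch(name):
        problems.append("name is not kebab-case")
    if len(name) > 64:
        problems.append("name exceeds 64 characters")
    if name != folder:
        problems.append("name does not match folder")
    return problems
```

For example, `name_problems("pdf-tools", "pdf-tools")` returns an empty list, while a name with spaces or capitals is reported as not kebab-case.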

Description Quality (20%)

Check	Points	Criteria
Contains action verbs	4	“Creates”, “Analyzes”, “Manages”, etc.
Contains trigger phrases	5	“Use when user says…”, “Use for…”
Specific, not vague	4	Detailed, actionable descriptions
Mentions file types	3	If applicable
Under 1024 chars	2	Hard limit
Includes negative triggers	2	“Do NOT use for…”
Owner in metadata	2	`metadata.owner` or `metadata.author` present
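
A rough sketch of how the phrase-based checks in this table might be detected. The exact matching rules are not given in the source, so the regular expression, the strict reading of "under 1024", and the function name `description_signals` are all assumptions.

```python
import re

# "use when" / "use for" that is NOT part of "do not use for".
TRIGGER = re.compile(r"(?<!not )\buse (?:when|for)\b")


def description_signals(description: str) -> dict[str, bool]:
    """Report which description-quality signals are present."""
    lowered = description.lower()
    return {
        "has_trigger_phrase": TRIGGER.search(lowered) is not None,
        "has_negative_trigger": "do not use" in lowered,
        # "Under 1024 chars" is read here as strictly fewer than 1024.
        "under_1024_chars": len(description) < 1024,
    }
```

The negative lookbehind keeps a description that only says "Do NOT use for…" from also being counted as having a positive trigger phrase.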

Instruction Quality (25%)

Check	Points	Criteria
Non-empty body	3	Content after frontmatter
Has step structure	4	Numbered steps, ## headers, or clear sequence
Includes examples	5	Code blocks, user scenarios, expected outputs
Includes error handling	4	“If X fails…”, troubleshooting section
Uses progressive disclosure	4	References to `references/` or `scripts/` for detail
Actionable language	3	“Run X”, “Call Y”, “Check Z”
Word count under 5000	2	Encourages conciseness
Instruction coherence	3	All referenced paths must actually exist
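
The instruction coherence check (all referenced paths must exist) might look something like the sketch below. It assumes references appear in the body as relative paths starting with `references/` or `scripts/`; the pattern and the function name `missing_paths` are hypothetical.

```python
import re
from pathlib import Path

# Relative paths under references/ or scripts/ mentioned in the SKILL.md body.
PATH_REF = re.compile(r"\b(?:references|scripts)/[\w./-]+")


def missing_paths(body: str, skill_dir: Path) -> list[str]:
    """Return referenced paths that do not exist inside the skill directory."""
    # Strip a trailing period so "see scripts/run.sh." resolves correctly.
    refs = sorted({m.group(0).rstrip(".") for m in PATH_REF.finditer(body)})
    return [ref for ref in refs if not (skill_dir / ref).exists()]
```

An empty result means every path mentioned in the instructions resolves to a real file or directory.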

Test Coverage (25%)

Check	Points	Criteria
test-cases.yml exists	3	Valid YAML structure
>=3 should-trigger tests	4	Including paraphrased variations
>=2 should-not-trigger tests	3	Unrelated topics
>=2 functional tests	5	Input -> expected behavior pairs
>=1 negative test	3	What the skill should refuse
>=1 edge case test	3	Special characters, empty inputs
Performance baseline	2	Before/after comparison
Functional tests have assertions	2	Each test has >=2 expected_behavior items
Trigger diversity	2	No near-duplicate triggers
Assertion specificity	2	No vague assertions
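
The minimum-count checks above can be sketched against an already-parsed `tests/test-cases.yml`. The schema of that file is not shown in this document, so the top-level keys used here (`should_trigger`, `should_not_trigger`, `functional`, `negative`, `edge_case`) are hypothetical; only `expected_behavior` appears in the source.

```python
# Minimum number of tests per category, taken from the table above.
# The key names are assumed, not taken from the real file format.
MINIMUMS = {"should_trigger": 3, "should_not_trigger": 2,
            "functional": 2, "negative": 1, "edge_case": 1}


def coverage_gaps(cases: dict[str, list]) -> list[str]:
    """Return one message per unmet test-coverage requirement."""
    gaps = []
    for key, minimum in MINIMUMS.items():
        found = len(cases.get(key, []))
        if found < minimum:
            gaps.append(f"{key}: need >= {minimum}, found {found}")
    # Each functional test should carry at least two expected_behavior items.
    for index, test in enumerate(cases.get("functional", [])):
        if len(test.get("expected_behavior", [])) < 2:
            gaps.append(f"functional[{index}]: need >= 2 expected_behavior items")
    return gaps
```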

Security (15%)

Check	Points	Criteria
No secrets detected	5	API keys, tokens, passwords
No angle brackets in frontmatter	3	Prevents prompt injection
No reserved names	3	Skill name must not contain “claude” or “anthropic”
No suspicious patterns	2	eval(), exec(), system()
No hardcoded URLs	2	Unless documented
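
Two of the checks above (reserved names and suspicious patterns) lend themselves to a short sketch. This is an assumption about how such a scan could work, not the actual scanner; `security_findings` is a hypothetical name, and a real secret detector would need far more than this.

```python
import re

# Calls listed in the table: eval(), exec(), system().
SUSPICIOUS_CALL = re.compile(r"\b(eval|exec|system)\s*\(")
RESERVED_WORDS = ("claude", "anthropic")


def security_findings(skill_name: str, text: str) -> list[str]:
    """Return reserved-name and suspicious-call findings for a skill."""
    findings = [f"reserved word in name: {word}"
                for word in RESERVED_WORDS if word in skill_name.lower()]
    calls = sorted({match.group(1) for match in SUSPICIOUS_CALL.finditer(text)})
    findings += [f"suspicious call: {call}()" for call in calls]
    return findings
```

The word boundary in the pattern means an identifier such as `my_eval(` is not flagged, while `os.system(` is.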

Score Thresholds

Score	Result	Action
90-100	Excellent	Auto-approved for maintainer review
70-89	Good	Approved with suggestions
50-69	Needs Work	Blocked — must address issues
0-49	Rejected	Major issues — significant rework needed
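
The threshold table maps directly to a small lookup. In this sketch the function name `review_result` is hypothetical, and treating fractional scores such as 89.5 as belonging to the lower band is an assumption, since the table lists whole-number ranges only.

```python
def review_result(score: float) -> tuple[str, str]:
    """Map a composite score to its (result, action) pair from the table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return ("Excellent", "Auto-approved for maintainer review")
    if score >= 70:
        return ("Good", "Approved with suggestions")
    if score >= 50:
        return ("Needs Work", "Blocked: must address issues")
    return ("Rejected", "Major issues: significant rework needed")
```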