📋

Level 6 · The Meta Layer

The Speculation of Spec Languages

What if you maintained the requirements, not the code? What if the LLM were the compiler, and a twelve-line spec compiled to nearly two hundred lines of working software?


01 – The Bet

Maintain specs, not code

Every language in this museum has a compiler or interpreter: a deterministic program that reads your source code and turns it into something the machine can execute. The compiler never changes on you. The same code produces the same binary. This reliability is the foundation of software engineering.

Spec languages make a different bet. They replace traditional source code with plain-text specifications: structured descriptions of what the software should do, not how it does it. An LLM reads the spec and generates the implementation. You never write or maintain the generated code directly. You maintain the spec, and the system re-derives everything else.

"CodeSpeak calls itself a next-generation programming language powered by LLMs, shrinking codebases 5–10× by replacing code with plain-text specs."

– CodeSpeak project documentation, 2024
the_premise.txt
── Traditional development ──────────────────────────────────────

spec.md    (requirements, may drift from reality)
    ↓
code/      (1,000 lines: what you maintain)
    ↓
binary     (what runs)

── Spec language development ────────────────────────────────────

spec/      (80 lines: what you maintain)
    ↓
LLM        (the compiler: reads spec, writes code)
    ↓
code/      (1,000 lines: generated, not hand-authored)
    ↓
binary     (what runs)

# The spec is the source of truth.
# The code is the build artifact.
# You don't edit build artifacts. You edit the spec.

The inversion is radical: code becomes a build artifact, not a source. Just as you don't hand-edit a compiled binary, you wouldn't hand-edit the generated code. The spec is what gets committed, reviewed, and versioned.
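To make "code is a build artifact" concrete, here is a minimal Python sketch of one way a build step could enforce it. It assumes a hypothetical header convention: the `# GENERATED FROM` line, `stamp` and `check_artifact` are illustrative names, not part of CodeSpeak or any real tool, and the LLM call itself is left out (`code` stands for whatever the model returned).

```python
import hashlib

# Hypothetical header convention for generated files (not a real tool's format).
HEADER_PREFIX = "# GENERATED FROM "


def _short_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def stamp(spec_path: str, spec_text: str, model: str, code: str) -> str:
    """Prepend a header recording which spec, model and output produced this file."""
    header = (
        f"{HEADER_PREFIX}{spec_path} spec={_short_hash(spec_text)} "
        f"model={model} code={_short_hash(code)} (do not edit)\n"
    )
    return header + code


def check_artifact(artifact: str, spec_text: str) -> list[str]:
    """Report a stale spec hash or hand edits to the generated body."""
    header, _, body = artifact.partition("\n")
    if not header.startswith(HEADER_PREFIX):
        return ["missing generated-file header"]
    # Collect the key=value fields from the header line.
    fields = dict(part.split("=", 1) for part in header.split() if "=" in part)
    problems = []
    if fields.get("spec") != _short_hash(spec_text):
        problems.append("spec changed since generation: regenerate")
    if fields.get("code") != _short_hash(body):
        problems.append("generated code was hand-edited: edit the spec instead")
    return problems
```

A CI job could run `check_artifact` on every generated file, so a hand edit fails the build the same way editing a compiled binary would be pointless.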


02 – The Shrinkage Claim

5 to 10 times smaller

CodeSpeak's headline claim is a 5–10× reduction in codebase size. The argument is intuitive: code is verbose because computers need explicit instruction at every step. Specs are terse because they describe intent, not procedure. A single spec line, such as "validate that the email field is a valid RFC 5322 address", might generate twenty lines of validation code, error handling, and test coverage.
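As a rough illustration of that expansion, here is a Python sketch of the validation a generator might emit for the FIELDS lines of the registration spec in the next example (hashing, the database and the HTTP layer are omitted). `validate_registration` and the in-memory `existing_emails` set are hypothetical stand-ins; a real implementation would enforce uniqueness with a database constraint, and the regex is a deliberately simplified check, not full RFC 5322.

```python
import re

# Simplified email check, an assumption for illustration: real RFC 5322
# address syntax also allows quoted local parts, comments and IP literals.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_registration(data: dict, existing_emails: set[str]) -> dict[str, list[str]]:
    """Return field-level errors keyed by field name; an empty dict means valid.

    existing_emails is assumed to hold lower-cased addresses.
    """
    errors: dict[str, list[str]] = {}

    def add(field: str, message: str) -> None:
        errors.setdefault(field, []).append(message)

    email = data.get("email")
    if not email:
        add("email", "required")
    elif not EMAIL_RE.match(email):
        add("email", "must be a valid email address")
    elif email.lower() in existing_emails:
        add("email", "already registered")

    password = data.get("password")
    if not password:
        add("password", "required")
    elif len(password) < 8:
        add("password", "must be at least 8 characters")

    name = (data.get("name") or "").strip()  # spec: "trimmed"
    if not name:
        add("name", "required")
    elif len(name) > 100:
        add("name", "must be at most 100 characters")

    return errors
```

An error dict like `{"password": ["must be at least 8 characters"]}` maps directly onto the spec's "422, return field-level validation errors as JSON".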

The compression ratio depends heavily on the domain. CRUD applications, where the business logic is thin and the boilerplate is thick, compress dramatically. Complex algorithmic code, where the specification and the implementation are nearly the same density of thought, compresses less.

compression.txt
── The spec (12 lines) ──────────────────────────────────────────

FEATURE: User Registration

  ENDPOINT: POST /api/users
  FIELDS:
    - email:    required, valid RFC 5322, unique in users table
    - password: required, min 8 chars, bcrypt hash (cost: 12)
    - name:     required, max 100 chars, trimmed
  ON SUCCESS:  201, return {id, email, name, created_at}
  ON FAILURE:  422, return field-level validation errors as JSON
  SIDE EFFECT: Send welcome email via queue (async, non-blocking)
  RATE LIMIT:  5 registrations per IP per hour

── What the LLM generates (≈ 180 lines) ─────────────────────────

# Route handler, input validation with detailed error messages,
# bcrypt hashing, database insertion with conflict handling,
# async email queue dispatch, rate limiting middleware,
# integration tests for all success and failure paths.
# You maintain 12 lines. The LLM maintains 180.

The registration spec describes every observable behaviour: input rules, output shape, side effects, constraints. The generated code handles the implementation details. The ratio here is roughly 15:1 (180 lines from 12). Across a full application, where less compressible code pulls the average down, 5–10× is realistic for CRUD-heavy domains.
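The ratio is simple arithmetic, but it depends on what you count. A minimal Python sketch, assuming blank lines and `#` comments are excluded on both sides (a counting choice made here for illustration; the figures quoted above may have been counted differently):

```python
def meaningful_lines(text: str, comment_prefix: str = "#") -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in text.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )


def compression_ratio(spec_text: str, generated_text: str) -> float:
    """Generated lines per spec line; higher means more compression."""
    spec_lines = meaningful_lines(spec_text)
    if spec_lines == 0:
        raise ValueError("spec has no content lines")
    return meaningful_lines(generated_text) / spec_lines
```

For an application-wide figure, sum the lines across all features before dividing: an algorithmic module whose spec is nearly as long as its code adds almost as many spec lines as generated lines, which is why the whole-app ratio lands below a single CRUD endpoint's 15:1.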


03 – What a Spec Looks Like

Plain language with precision

The challenge of spec language design is specificity. Natural language is ambiguous by default: the same sentence can mean different things. A spec language that's too loose leaves the LLM too much room to interpret, so the same spec regenerates slightly different code each time, making diffs noisy and debugging much harder.

The better spec language projects impose a constrained vocabulary, a middle ground between full natural language and traditional code. Enough structure to keep the LLM's output consistent from one generation to the next. Enough readability for a non-engineer to understand the intent.
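One payoff of a constrained vocabulary is that a spec can be checked mechanically before any LLM sees it. A minimal Python sketch, assuming a hypothetical keyword set collected from the two example specs on this page (`KEYWORDS` and `lint_spec` are illustrative, not any project's real grammar):

```python
import re

# Hypothetical vocabulary, taken from the example specs on this page.
KEYWORDS = {
    "FEATURE", "VERSION", "ENDPOINT", "FIELDS", "ON SUCCESS", "ON FAILURE",
    "SIDE EFFECT", "RATE LIMIT", "LOGIN", "REFRESH", "INPUT", "OUTPUT",
    "ERRORS", "SECURITY",
}

# An upper-case key (words separated by spaces) followed by a colon.
KEY_LINE = re.compile(r"^\s*([A-Z][A-Z ]*[A-Z]):(.*)$")


def lint_spec(text: str) -> list[str]:
    """Flag 'KEY:' lines whose KEY is not part of the vocabulary."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = KEY_LINE.match(line)
        if match and match.group(1) not in KEYWORDS:
            problems.append(f"line {lineno}: unknown keyword {match.group(1)!r}")
    return problems
```

List items starting with `-` and free-text values pass through untouched; only the structural keywords are policed, so a typo such as `ON SUCESS:` is caught before it becomes an ambiguous prompt.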

auth.spec
FEATURE: JWT Authentication
VERSION: 2.1

LOGIN:
  INPUT:   email (string), password (string)
  OUTPUT:  access_token (JWT, expires 1h),
           refresh_token (opaque, expires 30d, stored in httpOnly cookie)
  ERRORS:
    - Invalid credentials  → 401 {code: "AUTH_INVALID"}
    - Account locked       → 423 {code: "AUTH_LOCKED", until: timestamp}
    - Too many attempts    → 429 {code: "AUTH_RATE_LIMITED"}
  SECURITY:
    - Lock account after 5 failed attempts within 10 minutes
    - Log all attempts: ip, user_agent, success, timestamp
    - Constant-time password comparison (prevent timing attacks)

REFRESH:
  INPUT:  refresh_token from httpOnly cookie
  OUTPUT: new access_token, rotate refresh_token
  ERRORS: expired/invalid token โ†’ 401 {code: "TOKEN_INVALID"}

The spec mentions timing attacks, token rotation, and audit logging โ€” security concerns that junior engineers often miss in hand-written code. A good spec is also a security checklist. The LLM generates the implementation; the spec encodes the requirements that the implementation must satisfy.
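To connect spec lines to code, here is a Python sketch of what two of the SECURITY lines in `auth.spec` might turn into: the lockout rule and the constant-time comparison. It is an illustration under assumptions: `LockoutTracker`, `digests_match` and the sliding-window design are hypothetical, the state lives in memory rather than in a database, and the digests being compared would come from the password-hashing step, which is omitted.

```python
import hmac
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

MAX_FAILURES = 5          # spec: "Lock account after 5 failed attempts"
WINDOW_SECONDS = 10 * 60  # spec: "within 10 minutes"


def digests_match(supplied: bytes, stored: bytes) -> bool:
    """Constant-time comparison (spec: 'prevent timing attacks')."""
    return hmac.compare_digest(supplied, stored)


class LockoutTracker:
    """Sliding-window count of failed logins per account (in memory)."""

    def __init__(self) -> None:
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, account: str, now: Optional[float] = None) -> None:
        self._failures[account].append(time.monotonic() if now is None else now)

    def is_locked(self, account: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        window = self._failures[account]
        # Drop failures older than the window; the rest count toward lockout.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) >= MAX_FAILURES
```

Each constant maps back to a phrase in the spec, which is the property that makes "check the spec" a meaningful review step.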


04 – The LLM as Compiler

A probabilistic compiler

A traditional compiler is deterministic: the same source always produces the same output. Compile the same C code twice with the same compiler version and flags, and you get the same binary. This determinism is foundational to software engineering: version control, reproducible builds and diff-based code review all depend on it.

An LLM is not deterministic. Sampling randomness (tuned by temperature), changes between model versions, and even variation in the serving infrastructure mean the same prompt can produce different code. This creates an entirely new class of problem: how do you do code review when the diff might be noise? How do you track security regressions when the "compiler" changes its output month to month?
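A partial answer to the review problem is to compare generated versions structurally rather than textually. The Python sketch below uses the standard `ast` module to ask whether two generations differ only in formatting and comments. It is one heuristic, not a test of functional equivalence; that would still need a test suite.

```python
import ast


def same_structure(code_a: str, code_b: str) -> bool:
    """True if two Python sources parse to the same abstract syntax tree.

    The parser discards comments, whitespace and redundant parentheses, so
    purely cosmetic regenerations compare equal. Renamed variables or
    reordered statements do not.
    """
    return ast.dump(ast.parse(code_a)) == ast.dump(ast.parse(code_b))
```

Cosmetic churn then drops out of review, while a real change still shows up. The reverse case, code that differs structurally but behaves identically, is beyond what this check can tell.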

determinism_problem.txt
── Traditional compiler: deterministic ──────────────────────────

auth.c     ──→ [GCC 14.2] ──→ auth.o
auth.c     ──→ [GCC 14.2] ──→ auth.o  # identical (same flags)

── LLM compiler: probabilistic ──────────────────────────────────

auth.spec  ──→ [claude-sonnet-4-6] ──→ auth.py  (version A)
auth.spec  ──→ [claude-sonnet-4-6] ──→ auth.py  (version B)
# Functionally equivalent? Probably. Identical? No.

auth.spec  ──→ [claude-opus-4-6]   ──→ auth.py  (version C)
# Different model, different code, same spec.
# git diff between A and C is noise + signal, hard to distinguish.
# Which version had the timing attack protection? Check the spec.
# (This is the right answer: trust the spec, not the artifact.)

The usual mitigation is to pin the model version and sample greedily (temperature=0) for generation runs. That narrows the variation, but on hosted APIs it does not guarantee identical output. And model providers retire versions. At some point you must regenerate, and the "compiled binary" changes without the source changing. This is the unsolved infrastructure problem of spec languages.
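Pinning only helps if the pin is recorded and checked. A minimal Python sketch, assuming a hypothetical lock entry stored next to each generated file (`GenerationLock` and `regeneration_reasons` are illustrative names; the set of available model ids would come from the provider, and here it is a plain set):

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationLock:
    """Hypothetical record of how a generated file was produced."""
    spec_sha256: str
    model: str
    temperature: float = 0.0


def regeneration_reasons(
    lock: GenerationLock, spec_text: str, available_models: set[str]
) -> list[str]:
    """Explain why the pinned build can no longer be reproduced as-is."""
    reasons = []
    if hashlib.sha256(spec_text.encode("utf-8")).hexdigest() != lock.spec_sha256:
        reasons.append("spec changed")
    if lock.model not in available_models:
        reasons.append(f"pinned model {lock.model!r} is no longer available")
    return reasons
```

The second reason is the retirement case: the spec is unchanged, yet the build must be regenerated. Making that event explicit is what lets it trigger re-verification instead of passing silently.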


05 – The Open Questions

The problems we don't know how to solve

Spec languages are a compelling idea in search of engineering solutions. The productivity promise is real: the compression ratio for CRUD-heavy code is demonstrably achievable. But the ecosystem assumptions that make traditional software engineering reliable have not yet been rebuilt for this paradigm.

Debugging is the hardest problem. When generated code has a bug, the fix is in the spec โ€” but tracing from a runtime error through generated code back to the relevant spec line is currently a manual, expert process. There are no stack traces that point to spec lines yet. There are no spec-level debuggers. These tools will need to be invented before spec languages can be used in critical production systems without senior engineering oversight.
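Until such tools exist, one workaround is to make the generator leave breadcrumbs. A minimal Python sketch, assuming a hypothetical convention in which every generated block carries a `# spec: FILE:LINE` comment; `spec_line_for` then maps the line number from a stack trace back to the spec line that produced it.

```python
import re
from typing import Optional

# Hypothetical marker the generator would emit above each block it writes.
MARKER = re.compile(r"#\s*spec:\s*(\S+:\d+)")


def spec_line_for(generated_source: str, error_lineno: int) -> Optional[str]:
    """Return the nearest spec marker at or above a 1-based line number."""
    lines = generated_source.splitlines()
    # Walk upward from the failing line to the closest marker.
    for index in range(min(error_lineno, len(lines)) - 1, -1, -1):
        match = MARKER.search(lines[index])
        if match:
            return match.group(1)
    return None
```

The mapping is only as reliable as the generator's discipline in emitting markers, which is exactly the kind of guarantee dedicated tooling would need to provide.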

open_questions.txt
── Questions that don't have good answers yet ───────────────────

DEBUGGING
  TypeError: 'NoneType' object is not subscriptable  (auth.py:147)
  → auth.py is generated. Which spec line is responsible?
  → No tooling exists for spec-level debugging yet.

VERSION CONTROL
  git blame auth.py  # shows only the generation commit
  git blame auth.spec # shows who changed the requirement
  → Review must happen at spec level, not code level.
  → Code review tools don't understand specs yet.

MODEL RETIREMENT
  OpenAI retires GPT-4-turbo. Your spec was written for it.
  → Regenerate everything. How do you verify correctness?
  → Tests help, provided the tests are in the spec.

SECURITY AUDIT
  → Can a security auditor trust generated code?
  → Does SOC 2 require auditing the spec, the code, or both?
  → Unknown. Regulations haven't caught up.

None of these questions are fatal. HTTP had no security model in 1991; HTTPS was invented later. Git didn't exist when the first distributed version control problems appeared. The tooling follows the paradigm, but there is a gap, and it matters for production use today.


06 – The Whole Picture

Why spec languages are worth watching

📝

The Core Bet

Maintain the spec, not the code. Code becomes a build artifact โ€” generated, not hand-authored. The spec is the source of truth.

📉

Real Compression

5–10× is achievable for CRUD-heavy domains. A 12-line spec generating 180 lines of validated, tested code is real today.

🎲

The Determinism Gap

LLMs are probabilistic. The same spec, recompiled, may produce different code. Model retirement means forced recompilation. This is unsolved.

๐Ÿ›

Debugging is Hard

Errors trace to generated code, not specs. Spec-level debuggers don't exist yet. Expert oversight is still required in production.

🔬

A Live Experiment

CodeSpeak and its peers are early bets on a paradigm shift. The productivity gains are real; the engineering infrastructure is not yet built.

🔭

A Window to the Future

If the tooling catches up, spec languages could compress software the way high-level languages compressed assembly. That's the size of the bet.