Paul Kronlund-Drouault · 12 min read

Truth by Construction

From Natural Language to Logic Machines Through Formal Constraints.

Grammar, Logic and Meaning

Recent developments in the training of Large Language Models (LLMs) have seen the rise of prover models, with DeepSeek-Prover pioneering the domain and OpenAI quickly following with o3-pro and, most recently, GPT-5-Thinking. These models are heavily optimized for logic and proof through Chain-of-Thought (CoT) Reinforcement Learning on hard math problems. While this approach has yielded good results, especially with smaller models developed by companies such as PleIAs or Moonshot AI, it has several fundamental flaws that reduce both accuracy and efficiency.

Under the current training paradigm, LLMs have to internalize three very different notions at the same time:

  • Grammar
  • Logic
  • Meaning/Purpose/Creativity?

This structure makes the whole pipeline very hard to debug. Failures on benchmarks and evaluations can come from syntax errors, logical inconsistencies (what Wittgenstein would call unsinnig1), or outright limitations of the model’s “intelligence”.

Some companies, like .txt (dottxt.co), are developing regex-based sampling solutions to fix the grammar aspect, improving tool integration and reliability. This has proven successful, showing good results and a significant decrease in compute cost in some applications, but it doesn’t go far enough.

For one, this approach only covers inference and doesn’t integrate into the training pipeline. Second, it doesn’t solve the main problem: logic.

Proposition 7 (P7)

At Unsuspicious Industries, we want to go several steps further in the process. In this post, we are introducing Proposition 7, our constrained generation system [WIP] that unifies typed programming, proof building, reasoning and natural language2.

Philosophy

Philosophy is the practice of asking questions. Mathematics is the branch of philosophy in which answers can be checked for truth or falsity.

Intelligence is best understood as the capacity to ask good questions, in other words, to make interesting choices. “Interesting” is not the same as “true”; it resists definition, though we recognize it when we see it3.

Being good at mathematics is less about the mechanics of proof (we can trivially prove statements, it’s just that most statements are boring) and more about choosing what to prove, and when, to advance a goal.

To build intelligent machines, we should optimize for interesting-ness: a coherent model of what is meaningful versus trivial, coupled with a search that stays within the space of statements that are true.

Humans naturally think with vague language that often outruns strict truth conditions; that makes sense in a messy world. In formal reasoning, the tradeoff is different. Some of what we find interesting is also absurd: we call it mysticism4. Mysticism is cool, and we dedicate a lot of time to it, through art, music, literature, and even love, but it’s not something machines can/should do.

In mathematics, our thinking systems must balance interesting-ness with formality: express which questions ought to be asked while respecting which questions can be asked. This is the role of constrained generation.

Our aim is simple: combine the broad priors of language models, as rough approximations of what is interesting, with formal frameworks that admit only true paths, so we can optimize directly for our real target: intelligence, expressed through taste5.

In such a setup, “hallucination” loses its meaning: truth is enforced by the formal layer, while open-ended association remains contained in latent representations, or alternative channels. By itself this is not enough, and freely pretrained models may stall, so Proposition 7 combines it with a training pipeline that applies a custom loss to formally verified outputs to shape intelligence.

In short, mathematics is less about method than about envisioning possibilities and choosing the ones that matter. It is creativity more than computation, and Proposition 7 is a way for machines to cultivate that creativity and build intuition without the noise of unconstrained natural language.

System Overview

  • Constraint Engine: unified grammar + typing/proof rules that produce only valid moves/ASTs within a formal system.
  • Translation system: from proof/reasoning to expression in the formal context.
  • Constrained pretrained LLM: ranks valid moves (tokens) using learnt criteria of intelligence: elegance, generality, reuse, surprise.
  • Mode-switching sampler: a natural-language “reasoning” mode for explanations, creative flow, and mystic inspiration, and a formal mode for structured proof generation and logical reasoning.
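
To make the division of labour concrete, here is a minimal sketch of how these components could be expressed as interfaces. The names (Move, ConstraintEngine, Translator, TasteModel) and signatures are illustrative placeholders, not the actual P7 API:

from dataclasses import dataclass
from typing import Protocol, Sequence

@dataclass
class Move:
    rule_id: str      # which rule licensed this step
    bindings: dict    # metavariable instantiations

class ConstraintEngine(Protocol):
    def valid_moves(self, context, goal) -> Sequence[Move]:
        """Enumerate only moves whose premises are satisfiable (grammar + typing as one contract)."""
        ...
    def apply(self, move: Move, context, goal, path):
        """Advance the derivation; invalid states are unrepresentable by construction."""
        ...

class Translator(Protocol):
    def to_formal(self, external_proof):
        """External proof/program -> neutral IR -> expression in the active formal context."""
        ...

class TasteModel(Protocol):
    def score(self, move: Move, context, goal, path) -> float:
        """Rank an admissible move by learnt interesting-ness (elegance, generality, reuse, surprise)."""
        ...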

Constraint Engine (System-Agnostic)

The engine and the core calculus are the same surface: define what well-formed looks like, and only valid steps exist. Grammar + typing form one contract: a rule licenses a next move, and anything unlicensed is unrepresentable.

Sketch (rule shape)

Production:
  NonTerminal(rule) ::= parts[b] ...
Typing:
  premise_1, ..., premise_k
  ------------------------ (rule)
  judgment

Premise forms (closed):
  Γ[...] ⊢ e : τ         # typing
  x ∈ Γ                 # context membership
  τ₁ = τ₂               # convertibility / equality
  predicate(args)       # side condition (fresh, occurs_check, ...)

Minimal ingredients: a binding pattern, one inference line, premises from the four shapes, and an equality notion (plus optional fast predicates).
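
As a concrete illustration, the rule shape above could be encoded as plain data along these lines. The Python representation below (Typing, InContext, Convertible, SidePredicate, Rule) is a hypothetical encoding, not the prototype’s actual rule-file format:

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Typing:           # Γ[...] ⊢ e : τ
    term: str
    type_: str

@dataclass(frozen=True)
class InContext:        # x ∈ Γ
    var: str

@dataclass(frozen=True)
class Convertible:      # τ₁ = τ₂ (up to the declared conversion)
    left: str
    right: str

@dataclass(frozen=True)
class SidePredicate:    # predicate(args), e.g. fresh(x), occurs_check(x, τ)
    name: str
    args: tuple

Premise = Union[Typing, InContext, Convertible, SidePredicate]

@dataclass(frozen=True)
class Rule:
    name: str
    premises: tuple     # tuple of Premise values
    conclusion: Typing  # the judgment the rule licenses when all premises hold

# Example: a simplified variable rule,  x ∈ Γ  /  Γ ⊢ x : τ
var_rule = Rule(name="var", premises=(InContext("x"),), conclusion=Typing(term="x", type_="τ"))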

Meta‑typing (how we stay system‑agnostic):

  • Two levels: object level (terms / proofs) and meta level (the rule file). The rule file itself is parsed and checked before being admitted; rules are data.
  • Each rule undergoes a meta pass: (1) binding hygiene / scope sanity, (2) premise shape validation (only four primitive premise forms), (3) well‑scoped metas resolved or flagged, (4) side predicates whitelisted & sandboxed.
  • Equality / conversion is a pluggable module (β / δ / η / custom normalizers) declared in the rule file header; the engine never hard‑codes a calculus.
  • Adding features = supplying new constructor symbols + rules that relate them. If the meta checker accepts the file, the engine can generate with them without code changes.
  • Reflection boundary: the system never executes object rules at meta time; it simulates them structurally to guarantee they cannot smuggle imperative effects (soundness guard).

The current prototype already exercises this: intersection / union / negation / refinements / universes sit in one rule file; removing them is deleting lines, not recompiling logic. That illustrates the contract: expressiveness lives in data.
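
A minimal sketch of two of the meta-pass checks, premise-shape validation and side-predicate whitelisting, assuming a hypothetical dict-based rule format; binding hygiene and meta scoping are omitted here:

ALLOWED_FORMS = {"typing", "in_context", "convertible", "side_predicate"}
WHITELISTED_PREDICATES = {"fresh", "occurs_check"}   # hypothetical whitelist

def check_rule(rule: dict) -> list:
    """Return meta-level errors; an empty list means this rule can be admitted.
    Binding hygiene and meta scoping (passes 1 and 3) are omitted in this sketch."""
    errors = []
    for i, premise in enumerate(rule.get("premises", [])):
        form = premise.get("form")
        if form not in ALLOWED_FORMS:
            errors.append(f"premise {i}: unknown form {form!r}")
        elif form == "side_predicate" and premise.get("name") not in WHITELISTED_PREDICATES:
            errors.append(f"premise {i}: predicate {premise.get('name')!r} not whitelisted")
    return errors

# A rule with an unknown premise shape is rejected before it can ever generate.
bad_rule = {"premises": [{"form": "imperative_effect"}]}
print(check_rule(bad_rule))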

Generation loop:

  1. unify with rule conclusions (up to declared conversion)
  2. silently drop rules whose instantiated premises fail
  3. emit admissible refinements (rule id + bindings). Invalid states never materialize.
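
Sketched as code, the loop could look like this, where unify and premises_hold stand in for the declared conversion and the premise checker (both assumed, not shown):

def valid_moves(rules, context, goal, unify, premises_hold):
    """Enumerate admissible refinements; rules that do not apply are silently dropped."""
    moves = []
    for rule in rules:
        bindings = unify(goal, rule.conclusion)              # up to the declared conversion
        if bindings is None:
            continue                                         # conclusion does not match the goal
        if not premises_hold(rule.premises, bindings, context):
            continue                                         # instantiated premises fail
        moves.append((rule.name, bindings))                  # rule id + bindings
    return moves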

Meta benefit (why it matters):

  • All branches valid => capacity goes to ranking, not repair.
  • Canonical context shape => cheap caching of ValidMoves.
  • Swapping a logic fragment => swap a file, the generation engine reuses the same interface.

Translation is done in a multi-step routine: external proof/program → neutral IR → elaboration into the active rules.

One‑line philosophy:

  • correctness = admissibility
  • taste = preference over admissible futures.

Translations and isomorphisms

Using an instantiated core (e.g., a dependent fragment when needed), we can translate structured mathematical proofs and selected formalizable argumentative structures into derivations guaranteed correct by construction. This extends to typed program derivations, enabling generation of formally verified code and mitigating security / robustness issues in unconstrained LLM synthesis.

Interesting-ness: Learning Taste

All outputs are valid; learning only ranks among valid choices. This is the key to Proposition 7: we can focus on what is interesting, not what is true.

Signals (non-exclusive)

  • Structural (Elegance): brevity, abstraction level, lemma reuse, symmetry/patterning.
  • Graph-based: centrality in dependency graphs (children/descendants, betweenness, PageRank).
  • Human proxies: textbook frequency, named theorems, citations, curriculum presence.
  • Discovery difficulty: uniqueness, search depth, failed attempts before success, low prior probability (surprise).
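
One way to fold such signals into a single scalar is a weighted blend, as in the sketch below; the specific proxies (step count, PageRank, textbook frequency, negative log prior) and weights are illustrative assumptions:

import math

def interestingness(proof: dict, weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Blend structural, graph, human-proxy, and surprise signals into one scalar."""
    w_struct, w_graph, w_human, w_surprise = weights
    structural = 1.0 / (1.0 + proof["num_steps"])                   # brevity as a cheap proxy
    graph = proof.get("pagerank", 0.0)                              # centrality in the dependency graph
    human = proof.get("textbook_frequency", 0.0)                    # human-preference proxy
    surprise = -math.log(max(proof.get("prior_prob", 1.0), 1e-12))  # low prior probability => high surprise
    return w_struct * structural + w_graph * graph + w_human * human + w_surprise * surprise

print(interestingness({"num_steps": 4, "pagerank": 0.02, "prior_prob": 0.001}))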

Model sketch

  • Classic pretrained LLM whose own logits guide move selection; optional graph-aware embeddings can be supplied via a dedicated encoder, but the model’s output head is unchanged.
  • Formally constrained outputs, guiding the model towards valid reasoning paths.
  • Multi-state reasoning: formal stage (constraint engine) + informal stage (free language).

Training phases

  • Phase 1: Pre-train on generic natural language (literature, (TPOT) Twitter threads, etc.) to build intuition and taste.
  • Phase 2: Learn from human proofs and heterogeneous reasoning sources, converted to our formal system of choice.
  • Phase 3: Self-play with constraints; generate multiple valid solutions, rank by interesting-ness, optimize preferences (RL).
  • Phase 4: Curiosity and diversity bonuses; encourage reusable abstractions and exploration of less-traveled paths.

Difficulties

  • How to port learnt language patterns to formal outputs?
  • Meta-logic independence: how to ensure the model can reason about its own reasoning without falling into circularity, or dead-ends?

The Mysticism Paradox

  • Constraints create freedom: removing the correctness burden frees the model to explore style and allows for more creative outputs, using mechanisms like temperature or entropy sampling.
  • Form enables expression: like meter in poetry, rules focus creativity6.
  • Interesting-ness resists definition but emerges from examples and use.

Architecture

  • Constraint Engine
    • Grammar builder: typed AST only.
    • Type/proof checker: real-time premise satisfaction, unification.
    • Proof verifier: derivation attached to each node.
  • Language Core
    • parameterized core calculus (rule-file specified).
  • Taste Learner
    • Inputs: Mixed formal/informal context.
    • Outputs: tokens interpreted as value over moves; sampling balances exploration vs. elegance.
  • Modal sampling policy
    • Constrained formal path for logic reasoning and output
    • Free-language sampling for explanations, more open (vague) reasoning, creativity, and mysticism.
    • The model can still switch between modes of reasoning, based on variables such as entropy. This might also help with dead-ends in search space.
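
An entropy-based trigger for that switch might look like the following sketch; the threshold value and the decision to switch on move-distribution entropy alone are assumptions for illustration:

import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_mode(move_probs, threshold=1.0):
    """Stay formal while the move distribution is decisive; drop to free language when it is not."""
    return "formal" if entropy(move_probs) <= threshold else "free"

print(choose_mode([0.9, 0.05, 0.05]))         # peaked distribution -> "formal"
print(choose_mode([0.25, 0.25, 0.25, 0.25]))  # flat distribution   -> "free"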

Data and Preparation (Lean/Mathlib)

  • Extract: goals, tactic/term steps, dependencies, proof lengths.
  • Build: dependency graph; compute centrality metrics.
  • Translate: Lean terms to active core calculus while preserving structure.
  • Label heuristics: brevity, generality, reuse; bootstrap with human/preference proxies.
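
For the graph step, here is a sketch of the kind of preprocessing involved, assuming dependency edges have already been extracted (the edges and names below are made up):

import networkx as nx

# Hypothetical extracted edges: (lemma, result whose proof uses it).
edges = [
    ("lemma_a", "thm_b"),
    ("thm_b", "thm_c"),
    ("lemma_a", "thm_d"),
]

g = nx.DiGraph(edges)                                    # dependency graph
pagerank = nx.pagerank(g)                                # global centrality signal
betweenness = nx.betweenness_centrality(g)               # bridging results
descendants = {n: len(nx.descendants(g, n)) for n in g}  # how much is built on each node

for node in g:
    print(node, round(pagerank[node], 3), round(betweenness[node], 3), descendants[node])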

Post-Training implementation overview

Unification-guided rule selection:

  • Unify goal type with rule conclusions to prune moves.
  • Instantiate premises and check satisfiability in Γ.
  • Produce subgoals; attach derivation trees to typed ASTs.

Preference learning over valid trajectories:

  • Score trajectories by structural and graph metrics plus novelty.
  • Train a value model to rank move sequences; optimize sampling.
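
As a minimal illustration of preference learning over valid trajectories, a Bradley-Terry-style pairwise update on a linear value model could look like this; the feature set and numbers are placeholders:

import numpy as np

def features(traj: dict) -> np.ndarray:
    # Same signals as the reward decomposition below: brevity, reuse, generality, surprise.
    return np.array([traj["brevity"], traj["reuse"], traj["generality"], traj["surprise"]])

def preference_step(w, preferred, other, lr=0.1):
    """One logistic (Bradley-Terry) step pushing the value model to rank `preferred` above `other`."""
    diff = features(preferred) - features(other)
    p = 1.0 / (1.0 + np.exp(-w @ diff))       # current probability of preferring the better trajectory
    return w + lr * (1.0 - p) * diff          # gradient ascent on the log-likelihood

w = np.zeros(4)
better = {"brevity": 0.9, "reuse": 0.7, "generality": 0.6, "surprise": 0.4}
worse = {"brevity": 0.3, "reuse": 0.2, "generality": 0.5, "surprise": 0.1}
print(preference_step(w, better, worse))      # weights move toward features that distinguish the preferred proof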

Algorithm sketch (taste-guided constrained generation)

Given:

  • Problem: (g₀, Γ₀) with goal g₀ and context Γ₀
  • Engine:
    • ValidMoves(Γ, g) → finite set V of moves m that are admissible and whose premises are satisfiable
    • Apply(m; Γ, g, π) → (Γ′, g′, π ∪ {m}), advancing the derivation path π
    • Solved(g) ↔ g is discharged
  • Model:
    • Score(m | Γ, g, π) → s(m) ∈ ℝ
    • Update(π, r) adapts parameters using trajectory π and scalar reward r
  • Evaluator:
    • EvaluatePath(π) → r, aggregating brevity, reuse, generality, surprise

Algorithm P7-TrainStep:

  1. Initialize g ← g₀, Γ ← Γ₀, π ← ∅
  2. While ¬Solved(g):
    • V ← ValidMoves(Γ, g)
    • For each m ∈ V, compute s(m) ← Score(m | Γ, g, π)
    • Select a token, i.e. a move m ∈ V: sample with P(m) ∝ exp(β · s(m)) or choose arg max s(m)
    • (Γ, g, π) ← Apply(m; Γ, g, π)
  3. r ← EvaluatePath(π)
  4. Update(π, r)
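
For readers who prefer code, the same loop transcribed into Python; engine, model, and evaluator are assumed to expose the interfaces named above, and none of this is the actual P7 implementation:

import math
import random

def softmax_sample(moves, scores, beta=1.0):
    weights = [math.exp(beta * s) for s in scores]
    return random.choices(moves, weights=weights, k=1)[0]

def p7_train_step(goal, context, engine, model, evaluator, beta=1.0):
    path = []
    while not engine.solved(goal):
        moves = engine.valid_moves(context, goal)                     # only admissible moves exist
        scores = [model.score(m, context, goal, path) for m in moves]
        move = softmax_sample(moves, scores, beta)                    # or greedy: pick the arg max
        context, goal, path = engine.apply(move, context, goal, path)
    reward = evaluator.evaluate_path(path)                            # brevity, reuse, generality, surprise
    model.update(path, reward)
    return path, reward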

Invariant (correctness by construction):

  • For all iterations, every Γ, g, and π encountered is well-formed/typed, and every m ∈ V preserves validity: invalid states are unreachable. By eliminating the unsinnig (nonsensical) through constraints, we free the model to focus entirely on avoiding sinnlos (senseless) dead-ends and saying genuinely interesting things7. This is what training optimizes for.

Reward decomposition (example):

  • r = w₁ · brevity(π) + w₂ · reuse(π) + w₃ · generality(π) + w₄ · surprise(π), with wᵢ ≥ 0.

Open Questions

  • Exploration vs exploitation: scheduling diversity when all branches are already valid.
  • Expressiveness vs decidability: managing search with intersections, unions, negation, refinements before adding heavier features (inductives, quantifiers).
  • Refinement predicates: trusted boundary or verified? sandboxing / SMT side-check vs string predicate.
  • Normalization / equality: how far to push conversion (β, δ, η, simplifications) without exploding unification cost.
  • Entropy-driven mode switching: optimal triggers that surface genuine abstraction needs rather than noise.
  • Encoding logic symbols in tokens: Should our system use language-like tokens for logical symbols (e.g., “forall”, “exists”) or should it use a more compact representation? How does this affect the model’s ability to learn and generalize? Maybe special tokens?
  • Beyond math: Could it be possible to extend the system to other domains, with different context-bound constraints? How would we adapt the grammar and typing rules to handle different types of reasoning, such as legal reasoning or scientific hypothesis testing?

These questions represent the core research challenges that would need to be addressed in developing P7 from proposition to working system.

Practical Takeaways

  • Practitioners: Use constraints to eliminate invalid states; devote model capacity to style and strategy.
  • Decision-makers: Expect better efficiency (less wasted compute) and higher perceived quality (elegance, reuse) when correctness is guaranteed by construction.
  • Researchers: Treat interesting-ness as a learnable value function over valid moves; compare signals (structural, graph, human preference).

References

Philosophy & Foundations of Mathematics

  • Wittgenstein, L. (1921). Tractatus Logico-Philosophicus.
  • Wittgenstein, L. (1953). Philosophical Investigations.
  • Hardy, G. H. (1940). A Mathematician’s Apology.
  • Poincaré, H. (1905). Science and Hypothesis.
  • Kant, I. (1790). Critique of Judgment.

Core Type Theory & Foundations

  • Martin-Löf, P. (1984). Intuitionistic Type Theory.
  • Girard, J.-Y. (1972). Interprétation fonctionnelle… (System F / cut elimination).
  • Reynolds, J. C. (1974). Towards a theory of type structure. (Parametric polymorphism).
  • Barendregt, H. (1992). Lambda calculi with types.
  • Wadler, P. (2015). Propositions as Types.

Constrained Generation, Types & Synthesis

  • Muendler, N., Antonov, N., & Vechev, M. (2025). Type-aware constraint generation for LLM code synthesis (ETH Zurich).
  • Willard, B., & Louf, R. (2023). Efficient guided generation for large language models.
  • Beurer-Kellner, L., Fischer, M., & Vechev, M. (2022). Prompting is Programming.
  • Polikarpova, Kuraj, & Solar-Lezama (2016). Refinement type synthesis.
  • Knuth, D. E. (1968). Semantics of context-free languages.
  • Pierce, B. C., & Turner, D. N. (2000). Local type inference.

Notes

Footnotes

  1. In Wittgenstein’s Tractatus Logico-Philosophicus, unsinnig (nonsensical) refers to propositions that violate logical grammar and are therefore meaningless, distinct from sinnlos (senseless) which refers to tautologies and contradictions that are logically well-formed but say nothing about the world. See propositions 4.461-4.4611.

  2. The name references Wittgenstein’s final proposition in the Tractatus: “Whereof one cannot speak, thereof one must be silent” (Wovon man nicht sprechen kann, darüber muß man schweigen). Our system aims to clearly delineate what can be formally expressed versus what belongs to the realm of the unsayable (mystical).

  3. This echoes the aesthetic theory of mathematical beauty found in Hardy’s A Mathematician’s Apology and Poincaré’s writings on mathematical intuition, where elegance and surprise are recognized as fundamental but indefinable qualities.

  4. Here we use “mysticism” in the sense of Wittgenstein’s later work, particularly the Philosophical Investigations, where he discusses the limits of what can be meaningfully said versus what can only be shown or experienced.

  5. The notion of “taste” as a form of intelligence draws from Kant’s Critique of Judgment, where aesthetic judgment is seen as a bridge between understanding and reason, involving both subjective experience and claims to universal validity.

  6. This paradox echoes the concept of “creative constraints” found in artistic traditions, like the formal restrictions of sonnets or George Perec’s La disparition (a book he wrote without using the letter “e”), where limitations paradoxically enable rather than restrict creative expression.

  7. Dead-ends in formal reasoning exemplify Wittgenstein’s sinnlos: they are logically well-formed and syntactically valid, but contribute nothing meaningful toward the proof goal. Unlike unsinnig statements that violate logical grammar, sinnlos paths waste computational resources on formally correct but purposeless exploration. By construction (eliminating the unsinnig), P7 can devote all learning capacity to the harder problem of recognizing and avoiding the sinnlos.
