From Natural Language to Logic Machines Through Formal Constraints.
Grammar, Logic and Meaning
Recent developments in the training of Large Language Models (LLMs) have seen the rise of prover models, with deepseek-r1-prover pioneering the domain, OpenAI quickly following with o3-pro, and even the latest GPT-5-Thinking, heavily optimized for logic and proof through Chain-of-Thought (CoT) enabled Reinforcement Learning on hard, proof-based math problems. While this approach has produced good results, especially with small models developed by companies such as PleIAs or Moonshot AI, it has several fundamental flaws that decrease both accuracy and efficiency.
Under the current training paradigm, LLMs have to internalize three very different notions at the same time:
- Grammar
- Logic
- Meaning/Purpose/Creativity?
This structure makes the whole pipeline very hard to debug. Failures on benchmarks and evals can come from syntax errors, logical inconsistencies (or outright nonsense), or genuine limitations of the model's capabilities. With large models and some entropy sampling, syntax is basically solved, but logical fallacies are harder to correct, often requiring very expensive CoT (o3-pro).
Proposition 7 (P7)
Building upon these encouraging successes, we want to go several steps further. We introduce Proposition 7, our constrained generation system that unifies typed programming, proof building, and even natural language.
Philosophy
Philosophy is the practice of asking questions. Mathematics is the branch of philosophy in which answers can be checked for truth or falsity.
Intelligence is best understood as the capacity to pose good questions—in other words, to make interesting choices. “Interesting” is not the same as “true”; it resists definition, though we recognize it when we see it.
Being good at mathematics is less about the mechanics of proof (we can trivially prove statements, it’s just that most statements are boring) and more about choosing what to prove, and when, to advance a goal.
To build intelligent machines, we should optimize for interesting-ness: a coherent model of what is meaningful versus trivial, coupled with a search that stays within the space of statements that are true.
Humans naturally think with vague language that often outruns strict truth conditions; that makes sense in a messy world. In formal reasoning, the tradeoff is different. Some of what we find interesting is also absurd; we call it mysticism. Mysticism is cool, and we dedicate a lot of time to it through art, music, literature, and even love, but it's not something machines can or should do.
In mathematics, our thinking systems must balance interesting-ness with formality: express which questions ought to be asked while respecting which questions can be asked. This is the role of constrained generation.
Our aim is simple: combine the broad priors of language models, as rough approximations of what is interesting, with formal frameworks that admit only true paths, so we can optimize directly for our real target: intelligence, or as we call it, taste.
In such a setup, “hallucination” loses its meaning: truth is enforced by the formal layer, while open-ended association remains contained in latent representations, or alternative channels. By itself this is not enough, and freely pretrained models may stall, so Proposition 7 combines it with a training pipeline that applies a custom loss to formally verified outputs to shape intelligence.
In short, mathematics is less about method than about envisioning possibilities and choosing the ones that matter. It is creativity more than computation, and Proposition 7 is a way for machines to cultivate that creativity and build intuition without the noise of unconstrained natural language.
System Overview
- Constraint Engine: unified grammar + typing/proof rules that produce only valid moves/ASTs.
- Minimal Language Core: small typed calculus (System F or λπ) for expressive proofs/programs.
- Interesting-ness Learner: ranks valid moves and trajectories by elegance, generality, reuse, surprise.
- Dual Channel: a natural-language “reasoning” sidecar for explanations, creative flow, and mystic inspiration while the formal channel stays airtight.
Constraint Engine (Unified Grammar + Types)
Key innovation: parsing and typing are fused. A single specification yields:
- Deterministic productions with semantic bindings.
- Inference-style typing/proof rules.
- A generator that, given a goal type/judgment and context Γ, emits only rules that can conclude it and whose premises are satisfiable.
Sketch
NonTerminal(rule_name) ::= syntax_rhs
premises
---------------- (rule_name)
conclusion
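As a concrete instance of this format (an illustration, not a fixed rule of the system), function application could be written so that the production and the typing rule share one specification:

```
Term(app) ::= Term Term'

Γ ⊢ t : Πx:T.T′     Γ ⊢ t′ : T
------------------------------- (app)
Γ ⊢ t t′ : T′[x := t′]
```

The generator can then only emit an application when both premises are dischargeable in Γ.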
Generation loop (conceptual):
- Filter rules whose conclusions unify with the goal type.
- Instantiate premises as subgoals if they are satisfiable in Γ.
- Yield applied rules; no invalid terms can be constructed.
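The loop above can be sketched in a few lines of Python. The `Rule` structure, `?`-prefixed metavariables, and tuple-shaped type patterns are all assumptions made for illustration; a real engine would use a proper term representation and occurs checks.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    premises: list      # type patterns that become subgoals
    conclusion: object  # type pattern to unify with the goal

def unify(pattern, goal, subst):
    """Toy first-order unification; strings starting with '?' are metavariables."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:
            return unify(subst[pattern], goal, subst)
        return {**subst, pattern: goal}
    if pattern == goal:
        return subst
    if isinstance(pattern, tuple) and isinstance(goal, tuple) and len(pattern) == len(goal):
        for p, g in zip(pattern, goal):
            subst = unify(p, g, subst)
            if subst is None:
                return None
        return subst
    return None

def substitute(pattern, subst):
    """Apply a substitution to a pattern."""
    if isinstance(pattern, str) and pattern in subst:
        return substitute(subst[pattern], subst)
    if isinstance(pattern, tuple):
        return tuple(substitute(p, subst) for p in pattern)
    return pattern

def valid_moves(rules, goal):
    """Keep only rules whose conclusion unifies with the goal; emit their
    instantiated premises as subgoals. Invalid terms are never constructed."""
    for rule in rules:
        s = unify(rule.conclusion, goal, {})
        if s is not None:
            yield rule.name, [substitute(p, s) for p in rule.premises]
```

For example, a modus-ponens rule with conclusion `"?b"` and premises `[("imp", "?a", "?b"), "?a"]` applied to goal `"q"` yields the subgoals `("imp", "?a", "q")` and `"?a"`.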
Benefits
- No syntax errors, no type errors, no proof-rule violations.
- Every emitted path is “correct,” differing only in strategic value.
Minimal Language: λπ (or System F core)
We adopt a small, verification-friendly core that still captures rich mathematics.
Option A — λπ (dependently typed, 5 constructs)
- Terms/Types:
- x ∣ λx:T.t ∣ t t′ ∣ Πx:T.T′ ∣ Type
- Proofs-as-programs, propositions-as-types; dependent products encode general theorems.
- Strong expressiveness; direct route for Lean-like translations.
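A minimal Lean 4 example of the "dependent products encode general theorems" point: Lean's `∀` is the same binder as the λπ core's Π, and `n + 0 = n` holds by definitional reduction.

```lean
-- A general theorem as a dependent product: as a Π type, (n : Nat) → n + 0 = n.
theorem add_zero' : ∀ n : Nat, n + 0 = n := fun n => rfl
```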
Option B — System F (polymorphic lambda calculus)
- Simpler checking, good sweet spot for typed programs and many proof patterns.
- Can serve as stepping stone with a translation path from richer sources.
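For comparison, the classic System F example: polymorphism via type abstraction (Λ) rather than full dependency.

```latex
\mathsf{id} \;=\; \Lambda\alpha.\ \lambda x{:}\alpha.\ x
\;\;:\;\; \forall\alpha.\ \alpha \to \alpha
```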
Translation from Lean (sketch)
- Map Forall to Π/∀ types (or to ∀ in System F).
- Map tactics/terms into core introduction/elimination steps.
- Preserve dependency graphs (lemmas, uses) for interest signals.
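The mapping sketched above, on a hypothetical tuple-shaped mini-AST (the expression shapes here are assumptions for illustration, not Lean's actual export format):

```python
def translate(expr):
    """Map a toy Lean-like expression to λπ-core constructs.

    Expressions are tuples: ("forall", var, ty, body), ("lam", var, ty, body),
    ("app", f, a), or atoms (plain strings)."""
    if isinstance(expr, str):
        return expr
    head = expr[0]
    if head == "forall":              # Lean's ∀ becomes a Π type in the core
        _, var, ty, body = expr
        return ("pi", var, translate(ty), translate(body))
    if head == "lam":                 # λ-abstraction is carried over directly
        _, var, ty, body = expr
        return ("lam", var, translate(ty), translate(body))
    if head == "app":                 # application is carried over directly
        _, f, a = expr
        return ("app", translate(f), translate(a))
    raise ValueError(f"unknown head: {head}")
```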
Translations and isomorphisms
Using the λπ core, we can translate many kinds of structured thought, from Lean proofs to some philosophical arguments, into a formal system that guarantees correctness by construction. This applies not only to mathematics but also to structured reasoning about programming languages, allowing us to generate formally verified programs that are correct by construction, addressing major security and robustness issues in LLM-generated code.
In the end, we want to train English-speaking models whose natural language reasoning is guaranteed correct by construction, using a framework for rationality in language and translating it into the formal system.
Interesting-ness: Learning Taste
All outputs are valid; learning only ranks among valid choices.
Signals (non-exclusive)
- Structural: brevity, abstraction level, lemma reuse, symmetry/patterning.
- Graph-based: centrality in dependency graphs (children/descendants, betweenness, PageRank).
- Human proxies: textbook frequency, named theorems, citations, curriculum presence.
- Discovery difficulty: uniqueness, search depth, failed attempts before success, low prior probability (surprise).
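A sketch of one graph-based signal: a lemma that many later results depend on (directly or transitively) scores higher. The dictionary shape and lemma names are illustrative assumptions.

```python
def descendant_counts(deps):
    """deps maps each lemma to the lemmas that directly use it (its children).
    Returns, per lemma, the number of distinct transitive descendants:
    a crude proxy for how widely a result is reused."""
    def descendants(node, seen):
        for child in deps.get(node, []):
            if child not in seen:
                seen.add(child)
                descendants(child, seen)
        return seen
    return {node: len(descendants(node, set())) for node in deps}
```

Betweenness or PageRank over the same graph would weight lemmas that sit on many proof paths, not just prolific ancestors.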
Model sketch
- Encoders for proof state and goal/context (transformer or graph-aware).
- Value head outputs an “interest” score over valid moves.
- Trained via preference learning/ranking over valid trajectories, with curiosity bonuses for novelty.
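The preference-learning step can be sketched as a pairwise Bradley-Terry update on a linear interest score; the feature vectors and learning rate are placeholder assumptions standing in for the encoder outputs and value head described above.

```python
import math

def preference_step(w, preferred, rejected, lr=0.1):
    """One logistic (Bradley-Terry) update: nudge weights w so the
    preferred trajectory's features outscore the rejected one's."""
    score = lambda f: sum(wi * fi for wi, fi in zip(w, f))
    # P(preferred beats rejected) under the current weights
    p = 1.0 / (1.0 + math.exp(score(rejected) - score(preferred)))
    grad_scale = (1.0 - p) * lr
    return [wi + grad_scale * (fp - fr)
            for wi, fp, fr in zip(w, preferred, rejected)]
```

Both trajectories being compared are already valid, so the update only shapes taste, never correctness.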
Training phases
- Phase 1: Pre-train on generic natural language, literature, and (TPOT) Twitter threads to build intuition and taste.
- Phase 2: Learn from human proofs; contrast elegant vs. valid-but-verbose alternatives.
- Phase 3: Self-play with constraints; generate multiple valid solutions, rank by interesting-ness, optimize preferences.
- Phase 4: Curiosity and diversity bonuses; encourage reusable abstractions and exploration of less-traveled paths.
The Mysticism Paradox
- Constraints create freedom: removing correctness burden frees the model to explore style.
- Form enables expression: like meter in poetry, rules focus creativity.
- Interesting-ness resists definition but emerges from examples and use.
Architecture
- Constraint Engine
- Grammar builder: typed AST only.
- Type/proof checker: real-time premise satisfaction, unification.
- Proof verifier: derivation attached to each node.
- Language Core
- λπ or System F variant with clear typing rules.
- Interesting-ness Learner
- Inputs: valid moves/paths, context, dependency stats.
- Outputs: value over moves; sampling balances exploration vs. elegance.
- Text Channel (mysticism-enabled)
- Formal path is constrained; a language model narrates “why this seems promising,” aiding human interpretability without affecting correctness.
Data and Preparation (Lean/Mathlib)
- Extract: goals, tactic/term steps, dependencies, proof lengths.
- Build: dependency graph; compute centrality metrics.
- Translate: Lean terms to λπ/System F while preserving structure.
- Label heuristics: brevity, generality, reuse; bootstrap with human/preference proxies.
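The label heuristics can be bootstrapped from simple per-proof features; the record fields here are assumptions for illustration, not Mathlib's actual export schema.

```python
def proof_features(proof):
    """proof: dict with 'steps' (list of tactic/term steps) and
    'lemmas_used' (list of lemma names, possibly repeated).
    Returns cheap heuristic labels for bootstrapping preference data."""
    steps = proof["steps"]
    lemmas = proof["lemmas_used"]
    return {
        "brevity": 1.0 / (1 + len(steps)),   # shorter proofs score higher
        "reuse": len(set(lemmas)),           # distinct library lemmas invoked
    }
```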
Post-Training Implementation Overview
Unification-guided rule selection:
- Unify goal type with rule conclusions to prune moves.
- Instantiate premises and check satisfiability in Γ.
- Produce subgoals; attach derivation trees to typed ASTs.
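A minimal sketch of "attach derivation trees to typed ASTs": each node records the rule that produced it and the subderivations for its premises, so the derivation can be replayed or measured at any time. The class shape is an assumption for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Derivation:
    rule: str                                     # name of the applied rule
    conclusion: object                            # judgment this node establishes
    premises: list = field(default_factory=list)  # sub-Derivations for the premises

    def size(self):
        """Total number of rule applications (doubles as a brevity signal)."""
        return 1 + sum(p.size() for p in self.premises)
```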
Preference learning over valid trajectories:
- Score trajectories by structural and graph metrics plus novelty.
- Train a value model to rank move sequences; optimize sampling.
Algorithm sketch (taste-guided constrained generation)
Given:
- Problem: (g0,Γ0) with goal g0 and context Γ0
- Engine:
- ValidMoves(Γ,g)→ finite set V of moves m that are admissible and whose premises are satisfiable
- Apply(m;Γ,g,π)→(Γ′,g′,π∪{m}) advancing the derivation path π
- Solved(g)↔g is discharged
- Model:
- Score(m∣Γ,g,π)→s(m)∈R (interesting-ness only)
- Update(π,r) adapts parameters using trajectory π and scalar reward r
- Evaluator:
- EvaluatePath(π)→r aggregating brevity, reuse, generality, surprise
Algorithm P7-TrainStep:
- Initialize g←g0, Γ←Γ0, π←∅
- While ¬Solved(g):
- V←ValidMoves(Γ,g)
- For each m∈V, compute s(m)←Score(m∣Γ,g,π)
- Select a move m∈V, e.g. sample with P(m)∝exp(β⋅s(m)) or choose arg max_m s(m)
- (Γ,g,π)←Apply(m;Γ,g,π)
- r←EvaluatePath(π)
- Update(π,r)
Invariant (correctness by construction):
- For all iterations, all Γ, g, π encountered are well-formed/typed, and every m∈V preserves validity: invalid states are unreachable.
Reward decomposition (example):
- r=w1⋅brevity(π)+w2⋅reuse(π)+w3⋅generality(π)+w4⋅surprise(π), with wi≥0.
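P7-TrainStep's inner loop can be exercised end to end on a toy domain. Everything below is a stand-in: the "goal" is just a counter to discharge, moves are step sizes, and reward is brevity-only, but the control flow matches the algorithm above (filter valid moves, score, softmax-sample, apply, evaluate).

```python
import math, random

def valid_moves(ctx, goal):
    """Toy ValidMoves: only steps that do not overshoot are admissible."""
    return [m for m in (1, 2, 3) if m <= goal]

def apply_move(m, ctx, goal, path):
    """Toy Apply: advance the derivation path."""
    return ctx, goal - m, path + [m]

def solved(goal):
    return goal == 0

def evaluate_path(path):
    """Toy EvaluatePath: brevity-only reward."""
    return 1.0 / len(path)

def p7_train_step(goal, score, beta=1.0, rng=random):
    """One rollout: sample among valid moves with P(m) ∝ exp(β·s(m))."""
    ctx, path = {}, []
    while not solved(goal):
        moves = valid_moves(ctx, goal)
        weights = [math.exp(beta * score(m)) for m in moves]
        m = rng.choices(moves, weights=weights)[0]
        ctx, goal, path = apply_move(m, ctx, goal, path)
    return path, evaluate_path(path)
```

By construction the loop only ever visits valid states, mirroring the invariant: the sampler chooses among admissible moves, it never repairs inadmissible ones.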
Open Questions
- What’s interesting-ness? How to formalize taste?
- How to balance exploration (diversity) vs exploitation (elegance)?
- Language core: λπ vs System F for tractability vs expressiveness; staged progression?
- Exploration: principled diversity/curriculum to avoid local optima in “taste.”
- Scaling laws: how does data/domain breadth affect learned intuition?
- Beyond math: applying P7 to typed programs, music with harmony constraints, and other rule-rich domains.
Practical Takeaways
- Practitioners: Use constraints to eliminate invalid states; devote model capacity to style and strategy.
- Decision-makers: Expect better efficiency (less wasted compute) and higher perceived quality (elegance, reuse) when correctness is guaranteed by construction.
- Researchers: Treat interesting-ness as a learnable value function over valid moves; compare signals (structural, graph, human preference).
References [WIP]
Foundational Type Theory & Proof Assistants
- Reynolds, J. C. (1974). Towards a theory of type structure. In Programming Symposium (pp. 408-425). Springer. [Seminal System F paper]
- Girard, J. Y. (1972). Interprétation fonctionnelle et élimination des coupures de l’arithmétique d’ordre supérieur. Thèse de doctorat, Université Paris VII.
- Martin-Löf, P. (1984). Intuitionistic type theory. Bibliopolis. [Foundation for dependent types/λπ]
- Barendregt, H. (1992). Lambda calculi with types. Handbook of logic in computer science, 2, 117-309.
- de Moura, L., Kong, S., Avigad, J., Van Doorn, F., & von Raumer, J. (2015). The Lean theorem prover. In International conference on automated deduction (pp. 378-388). Springer.
- The Coq Development Team. (2021). The Coq proof assistant reference manual. Version 8.13.
- Norell, U. (2007). Towards a practical programming language based on dependent type theory. PhD thesis, Chalmers University of Technology.
Constrained Generation & Type Systems
- Vechev, M., et al. (various years). Work on program synthesis and neural code generation under type constraints.
- Muendler, N., Antonov, N., & Vechev, M. (2025). Type-aware constraint generation for LLM code synthesis. ETH Zurich. Available at https://www.sri.inf.ethz.ch/publications/muendler2025typeawareconstraint
- Willard, B., & Louf, R. (2023). Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702. [Outlines library, from the .txt company]
- Beurer-Kellner, L., Fischer, M., & Vechev, M. (2022). Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 6(PLDI), 1-32.
Program Synthesis Under Type Constraints
- Polikarpova, N., Kuraj, I., & Solar-Lezama, A. (2016). Program synthesis from polymorphic refinement types. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 522-538).
- Frankle, J., Osera, P. M., Walker, D., & Zdancewic, S. (2016). Example-directed synthesis: a type-theoretic interpretation. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (pp. 802-815).
- Feng, Y., Martins, R., Wang, Y., Dillig, I., & Reps, T. W. (2017). Component-based synthesis for complex APIs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (pp. 599-612).
Preference Learning & Reinforcement Learning in Structured Domains
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730-27744.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Mathematical Theorem Proving & AI
- Irving, G., Szegedy, C., Alemi, A. A., Eén, N., Chollet, F., & Urban, J. (2016). DeepMath-deep sequence models for premise selection. Advances in neural information processing systems, 29.
- Polu, S., & Sutskever, I. (2020). Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393.
- Han, J. M., Rute, J., Wu, Y., Ayers, E. W., & Bansal, K. (2021). Proof artifact co-training for theorem proving with language models. arXiv preprint arXiv:2102.06203.
- Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., … & Yue, X. (2023). LeanDojo: Theorem proving with retrieval-augmented language models. arXiv preprint arXiv:2306.15626.
Recent Advances in Constrained Language Models
- Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., … & Odena, A. (2021). Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
- Fried, D., Aghajanyan, A., Lin, J., Wang, S., Wallace, E., Shi, F., … & Lewis, M. (2022). InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999.
- Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., … & Sutskever, I. (2022). Competition-level code generation with AlphaCode. Science, 378(6624), 1092-1097.