· Paul Kronlund-Drouault  · 9 min read

Completeability and Regular Expressions

Interesting challenges regarding two-level languages in Proposition 7

Interesting challenges regarding two-level languages in Proposition 7

A quick overview of RegEx

Regular Expressions (RegEx) are a very common tool in computer science, often used for search-and-replace operations, finding specific strings or patterns in large files, and validating user input (see the infamous email regex).

In introductory classes to theoretical computer science, we see regular expressions as an example application of automata theory, more precisely finite automata. The process of converting a RegEx into an automaton is called Thompson’s construction, and is behind grep’s unreal speed.

The problem at hand

When building Proposition 7, I was struggling with a very peculiar problem in program synthesis and input validation.

Correctness by complete-ability

The p7 expression parser is based on a very simple principle:

With Σ\Sigma an alphabet and LL a language, we can define a function

Φ:Σ{accept,reject}\Phi : \Sigma^* \rightarrow \{\text{accept},\text{reject}\}

as follows:

Φ(s)={acceptsΣ,ssLrejectotherwise\Phi(s) = \begin{cases} \text{accept} & \exists s' \in \Sigma^*,\, ss' \in L \\ \text{reject} & \text{otherwise} \end{cases}

Our goal here, for constrained generation, is to ensure that any prefix ss of a valid expression can be completed into a valid expression in LL and thus lead to a valid program of proof.

RegEx sub-problem

Since our language LL is defined by complex mechanisms, such as typing rules and recursive grammars, it must be handled by a Turing-complete model in the general case. However, a subset of this problem requires a generalized solution for RegEx.

Consider the following grammar as an example:

expr ::= /b(u|o|i)p/
term ::= expr ('and' expr)* 
prog ::= term 'at' 'bap'

expressions between slashes (/<regex>/) are regexes

The program bip at bap is a trivial valid expression, as is bip and bop at bap. However, bip and is not a valid expression, but can be completed into one by adding another expr + at bap after the and. We can call it Σ-completeable\Sigma\text{-completeable} in LL.

Now, consider that we are working with a system that produces output byte by byte. How can we validate the expression bi or bip an? Both are prefixes of completeable expressions, but neither is a Σ-completeable\Sigma\text{-completeable} expression.

Formally, we could say that our grammar defines a language LL over an alphabet Σ={/b(u|o|i)p/,’and’,’(’,’)’,’at’,’bap’}\Sigma = \{ \text{/b(u|o|i)p/}, \text{'and'}, \text{'('}, \text{')'}, \text{'at'}, \text{'bap'} \} and that this alphabet is itself a language LΣL_\Sigma over the alphabet of bytes Σbyte=[0..255]\Sigma_{\text{byte}} = [0..255] (or more generally Σvocab\Sigma_\text{vocab} corresponding to any vocabulary where ΣvocabΣbyte\Sigma_\text{vocab} \subseteq \Sigma_{\text{byte}}^* and ΣΣvocab\Sigma \subseteq \Sigma_\text{vocab}^*).

Disambiguation of LΣL_\Sigma

LΣL_\Sigma can be decomposed as the union of all keywords in Σ\Sigma. Each element lΣl \in \Sigma defines a language LlΣvocabL_l \subseteq \Sigma_\text{vocab}^*, where LlL_l is either a singleton set (for fixed keywords) or a regular language (for patterns like identifiers)1:

LΣ=lΣLlL_\Sigma = \bigcup_{l\in \Sigma} L_l

For example, in our grammar:

  • L’and’={’and’}L_{\text{'and'}} = \{\text{'and'}\} (a fixed keyword)
  • L/b(u|o|i)p/={’bup’,’bop’,’bip’}L_{\text{/b(u|o|i)p/}} = \{\text{'bup'}, \text{'bop'}, \text{'bip'}\} (a regex pattern)

Each LlL_l can be represented by a regular expression rlr_l such that L(rl)=LlL(r_l) = L_l. Therefore:

LΣ=lΣL(rl)L_\Sigma = \bigcup_{l\in \Sigma} L(r_l)

Following this, we can define rΣ=lΣrlr_\Sigma = \bigcup_{l \in \Sigma} r_l such that L(rΣ)=LΣL(r_\Sigma) = L_\Sigma

Validation from Σvocab\Sigma_\text{vocab} to LΣL_\Sigma

To validate partial inputs over Σvocab\Sigma_\text{vocab}, we can define a function

π(s):Σvocab{complete,partial,reject}\pi(s) : \Sigma_{\text{vocab}}^* \rightarrow \{\text{complete},\text{partial},\text{reject}\}

as follows:

π(s)={completesLΣpartialsLΣsΣvocab,ssLΣrejectotherwise\pi(s) = \begin{cases} \text{complete} & s \in L_\Sigma \\ \text{partial} & s \notin L_\Sigma \land \exists s' \in \Sigma_{\text{vocab}}^*,\, ss' \in L_\Sigma \\ \text{reject} & \text{otherwise} \end{cases}

This function bridges the gap between the formal language’s alphabet and the model vocabulary, allowing us to validate all possible partial (complete-able) inputs.

Σvocab-completeability\Sigma_{\text{vocab}}\text{-completeability} in LL

Now that we have a function relating our model’s vocabulary Σvocab\Sigma_\text{vocab} to the formal language’s alphabet Σ\Sigma, we can define a function that directly relates the model’s vocabulary to the completeability in LL noted Φ^\hat{\Phi}:

Φ^:Σvocab{accept,reject}\hat{\Phi}: \Sigma_{\text{vocab}}^* \rightarrow \{\text{accept},\text{reject}\}

Let t1tntt_1 \cdots t_n \cdot t' be a tokenization of input sΣvocabs \in \Sigma_{\text{vocab}}^* where t1,,tnΣt_1, \ldots, t_n \in \Sigma are complete tokens and tΣvocabt' \in \Sigma_{\text{vocab}}^* is the remaining partial input (noted ε\varepsilon if empty).

Φ^(s)={acceptt=εΦ(t1tn)=acceptaccepttεπ(t)=partial(qΣvocab,tqLΣ,Φ(t1tn(tq))=accept)rejectotherwise\hat{\Phi}(s) = \begin{cases} \text{accept} & t' = \varepsilon \land \Phi(t_1 \cdots t_n) = \text{accept} \\ \text{accept} & t' \neq \varepsilon \land \pi(t') = \text{partial} \land (\exists q \in \Sigma_{\text{vocab}}^*, t'q \in L_\Sigma, \Phi(t_1 \cdots t_{n}(t' q)) = \text{accept})\\ \text{reject} & \text{otherwise} \end{cases}

For disambiguation see2

This captures that we accept a partial input if either

  • It fully tokenizes into element of Σ\Sigma and parses correctly in LL
  • The partial token tt' can be completed by some qΣvocabq \in \Sigma_{\text{vocab}}^* into a valid token lΣl \in \Sigma such that and appending ll to the complete tokens yields an expression that is Σ-completeable\Sigma\text{-completeable} in LL.

Keeping with our example, bi and bip an are both Σvocab-completeable\Sigma_{\text{vocab}}\text{-completeable} in LL, as they can be completed into valid expressions in LΣL_\Sigma: bip and bip and that are themselves Σ-completeable\Sigma\text{-completeable} in LL. On the other hand, bip at bo is not Σvocab-completeable\Sigma_{\text{vocab}}\text{-completeable} in LL, because while it can be completed into the LΣL_\Sigma expression bip at bop, this expression is not Σ-completeable\Sigma\text{-completeable} in LL (it is even outright invalid).

Implementation

Focusing on the regex aspect of the program, we want to implement π(s)\pi(s) for the case where LΣL_\Sigma is defined by a regex rΣr_\Sigma. We cannot trivially do this by directly manipulating automata, but need a more general mechanism on the expressions themselves. In short, we need something that can tell us if a string sΣvocabs \in \Sigma_{\text{vocab}}^* is a valid prefix of a string in LΣL_\Sigma.

One approach is find the set of suffixes S={sΣvocabssLΣ}S = \{ s' \in \Sigma_{\text{vocab}}^* \mid ss' \in L_\Sigma \}, and check if SS is non-empty. The completing suffix of ss is known as the quotient of LΣL_\Sigma with respect to ss, and is written s1LΣs^{-1}L_\Sigma. The expression that can produce this set of quotients is known as the Brzozowski derivative of LΣL_\Sigma with respect to ss, and is written Ds(LΣ)D_s(L_\Sigma).

We work with a standard regular-expression syntax over the alphabet Σvocab\Sigma_{\text{vocab}} to define LΣL_\Sigma:

  • \emptyset (empty language), ε\varepsilon (empty word)
  • literals aΣvocaba \in \Sigma_{\text{vocab}} and character ranges [a..b]={xΣvocabaxb}[a..b] = \{ x \in \Sigma_{\text{vocab}} \mid a \le x \le b \}
  • union r+sr + s, concatenation rsrs and Kleene star rr^*

Let L(r)ΣvocabL(r) \subseteq \Sigma_{\text{vocab}}^* denote the language of regex rr, so that LΣ=L(rΣ)L_\Sigma = L(r_\Sigma) for some regex rΣr_\Sigma. The (left) quotient of a language by a character cΣvocabc \in \Sigma_{\text{vocab}} is

c1L  =  {wΣvocabcwL}.c^{-1}L \;=\; \{ w \in \Sigma_{\text{vocab}}^* \mid cw \in L \}.

We also introduce ν(r)\nu(r) (“nullable”) as the predicate that holds exactly when εL(r)\varepsilon \in L(r):

  • ν()=false\nu(\emptyset) = \text{false}
  • ν(ε)=true\nu(\varepsilon) = \text{true}
  • ν(a)=false\nu(a) = \text{false}
  • ν([a..b])=false\nu([a..b]) = \text{false}
  • ν(r+s)=ν(r)ν(s)\nu(r + s) = \nu(r) \lor \nu(s)
  • ν(rs)=ν(r)ν(s)\nu(rs) = \nu(r) \land \nu(s)
  • ν(r)=true\nu(r^*) = \text{true}

We can then define our Brzozowski derivative ( with cΣvocab c \in \Sigma_{\text{vocab}}~):

Dc()=Dc(ε)=Dc(a)={εif a=cotherwiseDc([a..b])={εif acbotherwiseDc(r+s)=Dc(r)+Dc(s)Dc(rs)=Dc(r)s    {Dc(s)if ν(r)otherwiseDc(r)=Dc(r)r\begin{aligned} D_c(\emptyset) &= \emptyset \\ D_c(\varepsilon) &= \emptyset \\ D_c(a) &= \begin{cases} \varepsilon & \text{if } a = c \\ \emptyset & \text{otherwise} \end{cases} \\ D_c([a..b]) &= \begin{cases} \varepsilon & \text{if } a \le c \le b \\ \emptyset & \text{otherwise} \end{cases} \\ D_c(r + s) &= D_c(r) + D_c(s) \\ D_c(r\cdot s) &= D_c(r)\cdot s \;\cup\; \begin{cases} D_c(s) & \text{if } \nu(r) \\ \emptyset & \text{otherwise} \end{cases} \\ D_c(r^*) &= D_c(r)\cdot r^* \end{aligned}

We can then extend the derivative to arbitrary strings by left-folding on characters (over Σvocab\Sigma_{\text{vocab}}):

Dε(r)=r,Dxc(r)=Dc(Dx(r)),xΣvocab,  cΣvocab.\begin{aligned} D_{\varepsilon}(r) &= r, \\ D_{xc}(r) &= D_c\bigl(D_x(r)\bigr), \quad x \in \Sigma_{\text{vocab}}^*,\; c \in \Sigma_{\text{vocab}}. \end{aligned}

Equivalently, we apply recursively for s=c1c2cnΣvocabs = c_1 c_2 \cdots c_n \in \Sigma_{\text{vocab}}^*,

Ds(r)=Dcn(Dc2(Dc1(r))).D_s(r) = D_{c_n}\bigl(\cdots D_{c_2}(D_{c_1}(r))\cdots\bigr).

Correctness connects completion to nullability. For the keyword-recognizing regex rΣr_\Sigma (where L(rΣ)=LΣL(r_\Sigma) = L_\Sigma):

ssL(r)    sL(Ds(r)).ss'\in L(r)\iff s'\in L(D_s​(r)).

This is precisely the test we need for complete-ability in π\pi: given a partial input sΣvocabs \in \Sigma_{\text{vocab}}^*, compute Ds(rΣ)D_s(r_\Sigma) and check Ds(rΣ)=D_s(r_\Sigma) = \emptyset:

π(s)={completeν(Ds(rΣ))partialDs(rΣ)rejectotherwise\pi(s) = \begin{cases} \text{complete} & \nu(D_s(r_\Sigma)) \\ \text{partial} & D_s(r_\Sigma) \neq \emptyset \\ \text{reject} & \text{otherwise} \end{cases}

cases evaluated in order for disambiguation

Correspondence to code

The Rust definition below implements DcD_c and then folds it over a string ss:

fn char_derivative(&self, c: char) -> Regex {
    use Regex::*;
    match self {
        Empty => Empty,
        Epsilon => Empty,
        Char(ch) => if *ch == c { Epsilon } else { Empty },
        Range(start, end) => if *start <= c && c <= *end { Epsilon } else { Empty },
        Union(a, b) => Union(Box::new((**a).char_derivative(c)), Box::new((**b).char_derivative(c))),
        Concat(a, b) => {
            let da = a.char_derivative(c);
            if a.nullable() {
                Union(
                    Box::new(Concat(Box::new(da), Box::new((**b).clone()))),
                    Box::new(b.char_derivative(c)),
                )
            } else {
                Concat(Box::new(da), Box::new((**b).clone()))
            }
        }
        Star(r) => Concat(Box::new((**r).char_derivative(c)), Box::new(Star(r.clone()))),
    }
}

pub fn derivative(&self, s: &str) -> Regex {
    s.chars().fold(self.clone(), |acc, c| acc.char_derivative(c))
}

In practice, one also applies algebraic simplifications after each construction (e.g., +r=r\emptyset + r = r, r=\emptyset r = \emptyset, εr=r\varepsilon r = r) to control expression growth.

In the real implementation, we use the Brzozowski derivative not only to implement π(s)\pi(s), but also as a policy for the sampling system in the constrained decoding core of Proposition 7.

Conclusion

In a real world implementaiton with grammar defined in EBNF-like format and an input ss over Σvocab\Sigma_\text{vocab}, we can implement Φ^(s)\hat{\Phi}(s) as follows:

  1. Tokenize ss into tokens t1,t2,,tnt_1, t_2, \ldots, t_n over Σ\Sigma.
  • If tokenization succeeds and ss is fully tokenized, return Φ(s)\Phi(s)
  • If tokenization fails with remaning input tt', continue to step 2
  1. Compute r=Dt(rΣ)r' = D_{t'}(r_\Sigma) where rΣ=lΣrlr_\Sigma = \sum_{l \in \Sigma} r_l
  2. Build the set Tn+1={yΣ (tn+2,tn+3,...tn+k)Σ,Φ(t1t2...tn y tn+2,tn+3,...tn+k)=accept}T_{n+1} = \{y \in \Sigma | ~\exists (t_{n+2},t_{n+3},...t_{n+k}) \in \Sigma^*, \Phi(t_1t_2...t_n~y~t_{n+2},t_{n+3},...t_{n+k}) = \text{accept}\} of possible next continuations in LL
  3. Check if xΣvocab,lΣ\exists x \in \Sigma_\text{vocab}^*,l \in \Sigma such that τTn+1,τ=l x\exists \tau \in T_{n+1}, \tau = l~x where ν(Dtx(rl))=true\nu(D_{t'x}(r_l)) = \text{true}.
    • If yes, return accept
    • Otherwise, return reject

For the precise use case of constrained generation, Tn+1T_{n+1} and rr' are passed to the sampling system and applied as filters on the model’s vocabulary Σvocab\Sigma_\text{vocab} to only allow tokens that can complete ss into a valid expression.

Obviously, this is only a small piece in the puzzle, and a lot of challenges remain with the implementation of Φ\Phi on arbitrary, context-dependent, languages, with context and type dependencies. It’s still cool to see how basic concepts like this have deep implications and practical value in real-world systems.


Footnotes

  1. For the purposes of this article, we assume tokenization follows the longest match principle: when multiple tokens could match a prefix of the input, the longest matching token is selected. This ensures deterministic tokenization and prevents ambiguity in the decomposition above. For instance, if both ‘a’ and ‘and’ are in Σ\Sigma, the input “and” tokenizes uniquely as the token ‘and’ rather than ‘a’ followed by remaining input “nd”.

  2. Since π(t)=partial\pi(t') = \text{partial} guarantees that some completion of tt' exists, the second case checks whether at least one such completion (noted here tqt'q) validates the Φ\Phi condition and yields a Σ\Sigma-completeable expression. For constrained sampling, finding any valid path forward suffices.

Back to Blog

Related Posts

View All Posts »