Options for Type-Level Safety
Since creating specialized data types catches errors ahead of time, I tried my hand at turning schemas into Lean types.
But I became dissatisfied with how JSON Schema code generation produces unergonomic types, no matter what I tried.
This mismatch pains me in several languages.
Meanwhile, most languages can't reason about runtime validation. Even if you know a JSON value passed validation, the compiler
can't guarantee any properties of your data. But in Lean, we can use the validation as an assumption in a theorem.
Then we can prove that fields are populated, or go the other direction and prove that validation would succeed if we ran it.
This idea wouldn't leave me, so I tried to implement it and came up with json-subtyping.
When attaching properties to our JSON data in Lean, it's convenient to emulate a kind of subtyping.
Although the standard sum and product types have many nice properties for theorem proving, plotting with them becomes
excessively verbose. When every union must be tagged with a discriminator,
as in Lean, extensions and escape hatches for plots complicate every union construction. Subtyping is one approach
to minimizing the difficulty of extending types. Furthermore, with some gradual typing,
you can move fast when you don't need generalization, like for one-off plots.
I also have hope for similar schema-based subtyping to provide guarantees for dataframe libraries, SQL results, and
tensor manipulation. Those are places where traditional type systems have dropped the ball, and where even validation
frameworks like Pandera for pandas have high costs
and limited utility. Using dependent types can invite "dependent type hell", but I had luck
with just attaching the validation check as a Prop. By avoiding inductive "indexed families"
in favor of a subtyping pattern,
we can also more easily vary the underlying implementation. That said, the theorem proving for subtyping
still requires a significant amount of work, so I would only consider it when safety is a
requirement, not a luxury. Usually I am plotting very specific data, where it is easy
to check after the fact.
By sharing an excessive level of detail below, I aim to encourage more people to work on libraries like
json-schema-lean--Lean needs more libraries in my opinion. I also want to explore the details
around code generation options in Lean. Finally, I want to show that subtyping is a viable pattern for type
safety in dependent type languages. At the end, I'll also mention several still-unexplored options for better
plotting in Lean.
Step 1: JSON Schema in Lean
Starting out, I was lucky there was already a json-schema-lean library--a GSoC 2024 project.
The main branch was incomplete, but there was nice testing infrastructure.
I figured it wouldn't be too difficult to complete.
JSON Schema
JSON Schema is a schema language: schemas are themselves written in JSON, and implementations let you validate JSON values against them.
The schema language lets you constrain JSON values, take unions of schemas, reference external definitions, etc.
Tools like OpenAPI and openapi-generator generate stubs in different languages. Personally, I find the generated
code substandard but more or less workable, depending on the language and application.
I have primarily used JSON Schema Draft 7, which has a relatively short spec.
With it in hand, I implemented most of the features within a week or two, leaning heavily on Claude Code,
especially once the work got really repetitive. I discovered that there are a lot of features of Draft 7 I never use and never want to:
- You can pattern-match strings against regexes.
- You can have loops in your schema references, which you have to guard against.
- There are properties, pattern properties, additional properties, etc., and they all interact with each other.
The schema turned into a big mutually inductive type,
since schemas have unusual options like a bare true or nested if-then-else schemas.
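For a sense of the shape this takes, here is a rough sketch of such a mutual block; the names are illustrative and are not json-schema-lean's actual definitions.
mutual
  inductive Schema where
    | bool (b : Bool)            -- the bare schemas `true` and `false`
    | node (keywords : Keywords)
  inductive Keywords where
    | mk (ifSchema thenSchema elseSchema : Option Schema)
         (properties : List (String × Schema))
end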
Bowtie's comprehensive tests saved me a bunch. There were quite a few parts around references where the spec confused me.
Lean has a few regex libraries to choose from, like Regex. That also saved me a
lot of trouble, since regex is not an isolated feature; it can be used to select schemas for properties. Sadly, there was
another missing piece, leading to yet another side quest: URI resolution.
Step 1.1: URIs
Uniform Resource Identifiers are special strings that serve
as semantic IDs. They don't guarantee stability or uniqueness, but they often tell you something about some digital resource.
URLs are a subtype of URIs: a URL is a URI plus a protocol for retrieving the resource.
The URI spec (RFC 3986) breaks a URI into URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment],
and it defines percent-encoding and decoding (which is why Google replaces spaces with %20 in its result URLs).
Different schemes can then layer special rules on top, like URL domain resolution or base64 encoding. There are also relative URIs
for accessing "nearby" resources.
So what does this have to do with JSON Schema? Well, $ids and $refs follow the rules of URIs. You can use different
protocols, you can use relative paths, and fragments point to definitions within files. Since Lean didn't have
a URI library, I implemented one at lean-uri.
Since URI testing is not as centralized as JSON Schema's, I mainly picked examples from the RFC and looked at
Haskell equivalents. For the implementation, I started by hoping Claude could read and operationalize the spec, then rewrote the result when it got ugly.
Since I hold parser combinators close to heart, I started with some hand-written Std.Internal.Parsec.String
and tried to work within that monad, making the code more uniform. The final interface provides
parsing, normalization, and resolution, all of which feature in JSON Schema.
#guard_msgs in
#eval match LeanUri.URI.parse "https://EXAMPLE.com/a/./b/../c?x#y" with
| .ok u => (LeanUri.URI.normalize u).toString
| .error e => s!"error: {e}"
Step 1.1.1: Testing
Since URIs (unlike URLs) didn't seem to have a nice testing framework already, I really wanted to rely on my own tests.
Lean has a way to hook up an executable to lake test,
and that attracted me too. I didn't really like LSpec, which appears
to be the only testing library in Lean. My conclusion: time to write a little testing library. I decided to put it in a
Test monad based on IO, where you can log passed, failed, and skipped tests to monad state, group tests, etc.
inductive TestTree where
  | test (name : String) (passed : Bool) (skipped : Bool := false) (message : Option String := none)
  | group (name : String) (children : Array TestTree) (skipped : Bool := false)
  deriving Inhabited, Repr

structure TestState where
  stack : List (String × Array TestTree) := [("", #[])] -- (group name, children)
  filter : String → Bool := fun _ => true
  deriving Inhabited

abbrev TestM := StateT TestState IO
This worked fine for what I needed, and I reused the helpers for testing code gen on a branch of json-schema-lean.
Tiny features like filtering are simple to add. I doubt I'll use this long-term, since having IO in all your
tests is a blessing and a curse. There are too many features I want in a truly excellent testing library to use my hacky version
forever.
-- This is nice, but could it be simpler?
def testHexConversion : TestM Unit := testFunction "Hex Conversion" do
  testEq "hex 0" (uInt8toOneHexChar 0) '0' -- testEq registers the equality test under a name
Once I had the basic URI interface, it was mainly a matter of repeatedly testing the $ref and $id cases until I got tests
passing.
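For a flavor of the cases involved, here is a typical Draft 7 shape, parsed with Lean's built-in JSON parser (the schema is illustrative data, not taken from the test suite): the relative $ref "address.json" resolves against the document's $id, while the fragment #/definitions/friend points inside the same document.
def exampleSchemaJson : Except String Lean.Json := Lean.Json.parse r##"{
  "$id": "https://example.com/schemas/person.json",
  "type": "object",
  "properties": {
    "address": { "$ref": "address.json" },
    "friend":  { "$ref": "#/definitions/friend" }
  },
  "definitions": { "friend": { "type": "string" } }
}"##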
A slight detour: JSON Pointer Fragments
We can also refer to definitions within files with JSON "pointer fragments". The pointer fragment #/definitions/A
refers to obj["definitions"]["A"], and similar notation is available for indexing into arrays, etc.
RFC 6901 describes the format for JSON pointers, which appear here trailing the # of a URI
fragment.
What doesn't work?
Several Bowtie tests require you to read referenced JSON schemas on the fly from a URL. Lean does not yet (as of January 2026) have a standard HTTP library,
so I just decided to give up on that part.
Step 2: JSON Schema Code Gen
Now that I could parse and understood all the features of JSON Schema, I decided to try generating Lean structs from JSON Schema.
My strategy plodded along, compiling Schema values to Std.Format in the Except String monad: first try to build simple
abbreviations, then structures, etc. I considered macros, but inserting comments eluded me. I settled on
Std.Format, Lean's
way to handle indentation, precedence, etc. The API was far nicer than I deserved, even though it is string
manipulation.
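As a tiny illustration of the Std.Format primitives (this is not the generator's code, just the API it leans on): nest controls indentation, and line becomes a newline or a space depending on how the layout fits.
open Std (Format) in
#eval IO.println <| Format.pretty <|
  Format.text "structure Person where" ++
    Format.nest 2 (Format.line ++ Format.text "name : String" ++
      Format.line ++ Format.text "age : Nat")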
Eventually, I needed more code generation context, recursion, and custom toJson/fromJson instances. I settled
on writing all the parsers in a SchemaGen TypeDefinition monad, where SchemaGen carries context and error handling
while TypeDefinition holds all the parsed definitions. Giving the monad a name here is extremely useful, and I now
understand why so many libraries live in a custom named monad transformer.
structure TypeDefinition where
  typeDecl : Std.Format
  fromJsonImpl : Option Std.Format := none
  toJsonImpl : Option Std.Format := none
  dependencies : List TypeDefinition := []
  extraDocComment : Option Std.Format := none
  deriving Inhabited

structure Config where
  sanitizeName : String → String
  indent : String := " "
  generateFromJson : Bool := false
  generateToJson : Bool := false
  includeBaseNamePrefix : Bool := false

structure SchemaID where
  baseURI : LeanUri.URI
  path : List String
  deriving BEq, Hashable, Inhabited

structure CodeGenContext where
  resolver : Resolver
  nameMap : Std.HashMap SchemaID String
  config : Config
  baseURI : LeanUri.URI
  currentTypeName : Option String := none
  currentSchemaID : Option SchemaID := none

abbrev SchemaGen := ReaderT CodeGenContext (Except String ·)

def SchemaGen.run (gen : SchemaGen α) (ctx : CodeGenContext)
    : Except String α := ReaderT.run gen ctx
Lean requires mutually recursive types to be put in mutual blocks, so
Tarjan's strongly connected components algorithm
lets me detect the cyclic dependencies in the schema reference graph.
The implementation can essentially be automated, as there exist examples of it in Mathlib
and Lean.SCC.
So Claude could do it, taking away some of my fun ☹️.
Aside from that, there are so many different special cases: abbreviations, inductives, etc. The tension
between verbose inductives and the unintended consequences of abbreviations was quite annoying.
Step 3: Code Gen with Vega-Lite
How did code gen work with Vega-Lite?
- Mutual inductive types don't generate right, oops?
- It would take several minutes to compile, and it's >15,000 lines.
- All the code is too ugly to use.
Once I saw how I'd have to construct a RectConfig, I realized the library would
not be fun to use as a whole.
structure RectConfig where
  /-- The horizontal alignment of the text or ranged marks (area, bar, image, rect, rule). One of `"left"`, `"right"`, `"center"`.
  __Note:__ Expression reference is *not* supported for range marks. -/
  align : Option (Align ⊕ ExprRef) := none
  /-- The rotation angle of the text, in degrees. -/
  angle : Option (Float ⊕ ExprRef) := none
  /-- A boolean flag indicating if [ARIA attributes](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA) should be included (SVG output only). If `false`, the "aria-hidden" attribute will be set on the output SVG element, removing the mark item from the ARIA accessibility tree. -/
  aria : Option (Bool ⊕ ExprRef) := none
  /-- Sets the type of user interface element of the mark item for [ARIA accessibility](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA) (SVG output only). If specified, this property determines the "role" attribute. Warning: this property is experimental and may be changed in the future. -/
  ariaRole : Option (String ⊕ ExprRef) := none
  /-- A human-readable, author-localized description for the role of the mark item for [ARIA accessibility](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA) (SVG output only). If specified, this property determines the "aria-roledescription" attribute. Warning: this property is experimental and may be changed in the future. -/
  ... -- It goes on like this for all the alphabetically sorted fields.
Now instead of {align: "left", ...}, we have to use .some (.inl .left). Specifying .inl or .inr and finding the
right enum name can be really annoying. The long list of fields makes it difficult to remember which one, if any, is required.
There are definitely ways to simplify this, but doing it automatically might be a bit much.
Back to the drawing board
You know what would be really convenient? Determining whether JSON Schema validation passes at compile time.
If the entire JSON is known at compile time, we can prove it with by native_decide, which
runs the computation validate schema json and verifies it equals true. What if a theorem about your JSON's structure proved validate schema json would pass? That
sounds great, but proving the most useful properties requires proving that validate terminates--not an easy proof at all.
With a more restrictive schema language, we could more easily prove validation ahead of time.
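As a toy illustration of the compile-time half (this is not the library's API, just the mechanism), a trivial check on a literal JSON value can be discharged during elaboration:
def toyValidate : Lean.Json → Bool
  | .str _ => true   -- "schema": the value must be a string
  | _      => false

-- `native_decide` compiles and runs the check at elaboration time
theorem toy_ok : toyValidate (Lean.Json.str "hello") = true := by
  native_decide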
I put this idea away for a week or so. After using TypeScript a bit professionally, I figured
that we could crib TypeScript's homework and use a fragment of it to check JSON. One day
I spent a few hours writing up a blueprint
in Typst (an up-and-coming LaTeX alternative). This was also more fun than improving JSON Schema validation.
I also considered building the schema checking right into the JSON constructors as an inductive "indexed family",
but using a schema.check function let me use the existing JSON data structure as a base.
This brought me to separating out the proof data into a subtype equivalent to { x : Json // schema.check x },
which feels like a breath of fresh air compared to carefully encoding properties inductively.
Overall, I believe that decoupling your evidence from your data is by default the correct approach,
so that you only depend on a minimal amount of structure for your data. For example,
we may want to one day switch to a more optimized tree format, which is easier the fewer
propositions are bundled into it. We will use a few indexed families later for specific proofs
in object construction and subtyping, but they are confined to those procedures.
TypeScript
JavaScript has 6 types: null, string, number, boolean, object, undefined. Many "types" in JavaScript are
really just special objects, like functions. TypeScript lets you gradually assign types to
variables at particular code locations, and then it checks properties like "x has property 'id'" at compile time by inspecting
the control flow graph connecting those locations. TypeScript does not try very hard to be sound.
Blueprint
I've never taken a course on type theory, so I probably did not use proof tree rules correctly, but
I wrote a bunch of them in the blueprint anyway. I found it especially helpful for getting object subtyping
right.

JSON objects are always allowed to have extra fields, which becomes a lot more annoying when you are
checking whether fields exist across big type unions. More on that later.
Strange challenges
Lean has a built-in JSON type at Lean.Json,
which is used for the language server among other applications. Since the JSON type is recursive, and the recursion is contained
in types like Array and maps, Lean identifies it as a "nested inductive type".
Nested inductives are massaged much more to make them appear normal: they are reorganized
into an ordinary inductive, and then new definitions are used to recreate the appearance of
the original nested inductive. The default induction rules are a real pain to use,
since they include the induction rules for Arrays and maps. Worse, you cannot nest the recursion
inside a type whose parameters mention local terms: the kernel will error with nested inductive datatypes parameters
cannot contain local variables, which retains some mystery to me.
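For reference, the built-in type looks roughly like this (abbreviated); the recursion hides inside Array and a tree map, which is exactly what makes it nested:
inductive Json where
  | null
  | bool (b : Bool)
  | num (n : JsonNumber)
  | str (s : String)
  | arr (elems : Array Json)
  | obj (kvPairs : Std.TreeMap.Raw String Json compare)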
Std.TreeMap.Raw
The local variables restriction prevents the use of Std.TreeMap,
a balanced binary search tree, in nested structures like JSON. Std.TreeMap contains
a proof that the tree is properly constructed. Without that "well-formed" predicate,
operations like insertion and lookup have fewer guarantees, but its inclusion does
not play well with the nested inductive. So instead we have to use Std.TreeMap.Raw,
which has no proof the data is "well-formed". The usual solution is to add the well-formedness again
in the inductive, but the Json type doesn't attach the well-formedness anywhere.
I carefully tried to work around potentially ill-formed trees by asserting properties about
.get?
calls or by constructing maps using ofList,
whose output is always well-formed.
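For example, a sketch of the ofList route (assuming Std.TreeMap.Raw.ofList with the usual list-of-pairs signature): the map is built by a function whose output is well-formed by construction, even though the Raw type itself carries no proof.
open Lean in
def wellFormedObj : Json :=
  Json.obj <| Std.TreeMap.Raw.ofList
    [("name", Json.str "lean"), ("kind", Json.str "language")] compare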
Proving anything about JSON
Given that Json sheds so many invariants, it is somewhat surprising how elegant it is to prove JSON properties.
As long as you can prove termination of functions on JSON, you can write your own custom induction rules
by proving that the induction itself terminates.
I spent a day reading Lean's reference manual on "well-founded recursion"
and tried simplifying the termination goals that appear in a Json.beq implementation that implicitly uses such
recursion. The reference manual lays out how termination_by and decreasing_by let you access
these normally hidden goals. As opposed to trying to use induction directly, we write a recursive function that
only calls itself on "smaller" values in some sense. You can use any "well-founded" relation
you want, which guarantees that an execution eventually reduces the data to nothing and no more
recursive calls are possible.
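As a toy example of the machinery (unrelated to JSON): recursion that counts up is not structurally decreasing, so we name a measure with termination_by and discharge the goal in decreasing_by.
def countUp (n k : Nat) : List Nat :=
  if h : k < n then
    k :: countUp n (k + 1)   -- recursive call on a "larger" k...
  else
    []
termination_by n - k          -- ...but the measure `n - k` shrinks
decreasing_by omega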
For instance, when we iterate over .obj map : Lean.Json and recursively call on the values,
we want to know that (k, v) ∈ map implies the size of v is smaller than that of map, which in turn is smaller than that of .obj map.
Lean provides a sizeOf definition automatically for most types, where this property is practically guaranteed.
During automatic termination checking, Lean attaches the membership proofs to elements of lists with
List.attach (l : List α) : List { x // x ∈ l }.
We can then apply List.sizeOf_lt_of_mem {a : α} {as : List α} (h : a ∈ as) : sizeOf a < sizeOf as.
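Here is what that pattern looks like on a toy rose tree, where the recursion goes through a List and the membership proof from attach is what makes the decreasing argument visible:
inductive Rose where
  | node (label : String) (children : List Rose)

def Rose.size : Rose → Nat
  | .node _ children =>
    -- `attach` pairs each child with a proof `c ∈ children`, which the default
    -- termination tactic turns into `sizeOf c < sizeOf (node _ children)`
    1 + (children.attach.map fun ⟨c, _h⟩ => Rose.size c).foldl (· + ·) 0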
Although this "attach" framework works well (and automatically) for lists and arrays, we lack those theorems for Std.TreeMap.
After combing through Std.TreeMap's source code with Claude Code, I learned about Std.DTreeMap.Internal.Impl.toListModel.
This function inefficiently flattens the tree into a list, but reasoning about it is far simpler. It's also proved equivalent
to .toList, which is more efficient. Claude Code was essentially able to one-shot the sizeOf lemmas:
open Std.DTreeMap.Internal in
theorem Impl.sizeOf_lt_of_mem {α : Type u} {β : α → Type v} [SizeOf α] [(a : α) → SizeOf (β a)]
    {t : Impl α β} {k : α} {v : β k} (h : ⟨k, v⟩ ∈ t.toListModel) :
    sizeOf v < sizeOf t := by
  induction t with
  | leaf =>
    -- a leaf's `toListModel` is empty, so the membership hypothesis is impossible
    simp at h
  | inner sz k' v' l r ihl ihr =>
    -- membership in an inner node's `toListModel` splits into membership in the left
    -- subtree, equality with the root entry, or membership in the right subtree
    simp at h ⊢
    rcases h with hl | heq | hr
    · have := ihl hl; omega
    · cases heq; omega
    · have := ihr hr; omega
theorem TreeMap.Raw.sizeOf_lt_of_mem {α β : Type} {cmp : α → α → Ordering}
    [SizeOf α] [SizeOf β] {t : Std.TreeMap.Raw α β cmp} {k : α} {v : β}
    (h : ⟨k, v⟩ ∈ t.inner.inner.toListModel) :
    sizeOf v < sizeOf t := by
  -- unfold the sizeOf equations of the two wrapper structures, then apply the
  -- lemma above to the underlying tree
  have h1 : sizeOf t = 1 + sizeOf t.inner :=
    Std.TreeMap.Raw.mk.sizeOf_spec t.inner
  have h2 : sizeOf t.inner = 1 + sizeOf t.inner.inner :=
    Std.DTreeMap.Raw.mk.sizeOf_spec t.inner.inner
  have h3 := Impl.sizeOf_lt_of_mem h
  omega
So my strategy was to use .toList.attach to attach the membership proof, and then
extract the size lemma with .sizeOf_lt_of_mem when proving the size of recursive arguments decreases.
Compared to the verbosity of ordinary induction here, this is much briefer. Now we can implement better
JSON induction methods:
set_option linter.unusedVariables false in
open Lean (Json) in
def Json.recOn {motive : Json → Sort u}
    (x : Json)
    (null : motive .null)
    (bool : ∀ b, motive (.bool b))
    (num : ∀ n, motive (.num n))
    (str : ∀ s, motive (.str s))
    (arr : ∀ (a : Array Json), (∀ j, j ∈ a → motive j) → motive (.arr a))
    (obj : ∀ (o : Std.TreeMap.Raw String Json compare),
      (∀ field value, ⟨field, value⟩ ∈ o.inner.inner.toList → motive value) →
      motive (.obj o)) :
    motive x :=
  match x with
  | .null => null
  | .bool b => bool b
  | .num n => num n
  | .str s => str s
  | .arr a => arr a (fun j jMem => Json.recOn j null bool num str arr obj)
  | .obj o => obj o (fun k v vMem => Json.recOn v null bool num str arr obj)
termination_by x
decreasing_by
  -- array element: `Array.sizeOf_lt_of_mem` turns `j ∈ a` into a size bound
  · have := Array.sizeOf_lt_of_mem jMem
    simp
    omega
  -- object value: convert the `toList` membership into `toListModel` membership
  -- (the two are proved equivalent in Std) and apply the lemma from above
  · have hmem : ⟨k, v⟩ ∈ o.inner.inner.toListModel := by simpa using vMem
    have := TreeMap.Raw.sizeOf_lt_of_mem hmem
    simp
    omega
Custom "recursors" like this let you split up recursive definitions really easily as long as you are able
to take and apply hypotheses like j ∈ a. As long as you carry around membership information, you can
split up recursive definitions while still proving termination.
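For instance, here is a small illustrative use of the recursor above, counting every node in a JSON value; the membership proofs from attach are exactly what the arr and obj arguments ask for.
def countJsonNodes (j : Lean.Json) : Nat :=
  Json.recOn (motive := fun _ => Nat) j
    1 (fun _ => 1) (fun _ => 1) (fun _ => 1)
    (fun a ih => 1 + (a.attach.map fun ⟨x, hx⟩ => ih x hx).foldl (· + ·) 0)
    (fun o ih =>
      1 + (o.inner.inner.toList.attach.map fun ⟨⟨k, v⟩, hv⟩ => ih k v hv).foldl (· + ·) 0)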
List equality was a pain
Lawful equality already exists for List α, but it expects an existing equality instance for α.
For a particular subtyping proof, I needed ReflBEq Json (∀ x : Json, x == x); I hoped to use ReflBEq (List α).
At the time, I ran out of patience and cleverness trying to prove a termination goal,
so I reimplemented the lemma instead. Relatedly, writing inline definitions
in a typeclass instance makes it strangely difficult to unfold those definitions.
Subtyping
After handling all those JSON difficulties, I went back to my very helpful blueprint for
how subtyping would work: an incomplete set of rules to determine when one type implies another.
From the beginning, I wanted to bundle the "schema" t : JsonType,
the propositional check, and the JSON into an object TypedJson t. Then when you wanted to check whether a typed JSON also satisfied t',
you could run a decision procedure t.subtype t' to prove ∀ x, t.check x → t'.check x. This would allow you to use
JSON in more general contexts without having to care about the exact types involved.
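Concretely, the bundling amounts to something like the following sketch (field names follow the usage later in this post); it is essentially the subtype { x : Json // t.check x = true } with named fields:
structure TypedJson (t : JsonType) where
  val : Lean.Json
  property : t.check val = true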
I applied a bit of cleverness by bundling the proof into the return type of the subtype checker:
inductive DecideSubtype (t1 t2 : JsonType) where
  | none : DecideSubtype t1 t2
  | isSubtype : (∀ j, t1.check j = true → t2.check j = true) → DecideSubtype t1 t2
This allowed me to reuse the convoluted logic of the subtype checking for the proof plumbing. Proving subtyping
correct as a separate theorem would have involved much more casework; it was far easier to piggyback off the existing casework. Truly, dependent
types are wonderful! Especially as long as you remember: "Programs that use these types will find themselves
forced to mirror the algorithm used in the type-level function so that their structure matches the
pattern-matching and recursive behavior of the type."
This is the blessing and curse of dependent types.
There were a lot of rewrites and sublemmas for all the different JsonType constructors, especially for object subtyping,
but I was mainly correcting the plan, ensuring that required and optional types behaved correctly, and trying a bunch
of examples. Once I had a correct human proof, the Lean version held no surprises.
Object Construction
Given some TypedJson values, say x1 x2 : TypedJson t, I would like to be able to construct typed objects at compile time. This
might actually be the ideal use case for heterogeneous lists, a dependently typed list where each
element can have its own type.
I carefully put the keys and value types into the type's index, so I can access them at compile time more easily; the actual typed
JSON is stored in the fields.
/-- Heterogeneous list of object fields indexed by schema.
Each field carries a TypedJson value with proof that it checks against its type. -/
inductive ObjectFields : List (String × JsonType) → Type where
  | nil : ObjectFields []
  | cons {ty : JsonType} {rest : List (String × JsonType)} (name : String) :
      TypedJson ty → ObjectFields rest → ObjectFields ((name, ty) :: rest)
Then I could guarantee objects would pass validation, once you ensure there are no duplicate keys.
/-- Construct an object from ObjectFields -/
def mkObj {req : List (String × JsonType)} (fields : ObjectFields req)
    (noDups : req.Pairwise (fun a b => ¬compare a.1 b.1 = .eq) := by native_decide) :
    TypedJson (.object req []) := sorry
I had Claude Code make a macro that would construct ObjectFields using obj{} notation, since I could just point
it at the existing macro for list syntax [a, b, c]. Now we have a more readable constructor:
def testMkObj (s : String) (n : Nat) : TypedJson (
    .object [("name", .string), ("age", .number)] []
  ) := -- obj{} just created an ObjectFields, which mkObj converts to TypedJson
  mkObj obj{"name": (s : TypedJson .string), "age": (n : TypedJson .number)}
Narrowing
When you are within if statements, TypeScript is capable of "narrowing"
unions to a variant:
type Person =
  | { age: number, id: number }
  | { dob: Date };

const person = getPerson();
if ("age" in person) {
  // person's type is "narrowed" to { age: number, id: number }
  console.log("person.age =", person.age);
  console.log("person.id =", person.id);
}
This is really helpful, but it is also more fraught with difficulty than it appears. Since objects may have
extra fields in JSON, it is possible to have person = {dob: ..., age: 10}. Now TypeScript mistakenly
believes that person.id is defined. When I learned about this, I decided that I should stop looking at the TypeScript
source code. For a proof of concept, I only discriminate using a known property's values.
The blog post
"Flow Nodes: How Type Inference Is Implemented"
really helped me here too. For Lean, we have to start with filters for whether properties can match or not:
def JsonType.canMatchPropertyStr (t : JsonType) (key : String) (str : String) : Bool :=
  t.canBeObject &&
    match t.getKey? key with
    | .some kt => kt.check str
    | .none => true -- we have no information about key
If x has property key with value str, then I prove that t.check x = true implies t.canMatchPropertyStr key str = true.
Similarly, if x does not have property key with value str, then I can prove t.canMismatchPropertyStr key str = true.
If t = t1 ||| t2 ||| t3 ||| t4, then I can turn that into a list [t1, t2, t3, t4] for which at least one must check.
Now if t.check x = true, then I know some ti.check x = true, so ti.canMatchPropertyStr key str. The upshot
is that I can filter [t1, t2, t3, t4] for canMatchPropertyStr, and one of them will still match x.
theorem JsonType.filterUnion_correctness
    {t : JsonType} {f : JsonType → Bool} {x : Json}
    (h : t.check x = true)
    (h' : ∀ t', t'.check x = true → f t' = true) :
    (t.filterUnion f).check x = true := by sorry
Putting the two together in JsonType.narrowKeyStr, we can do narrowing:
def circleType : JsonType :=
  .object [("type", .strLit "circle"), ("radius", .number)] []

def squareType : JsonType :=
  .object [("type", .strLit "square"), ("side", .number)] []

def shapeType : JsonType := circleType.union squareType

def calculateArea (shape : TypedJson shapeType) : IO Unit := do
  match h : shape.val.getObjVal? "type" with
  | .ok "circle" =>
    let circle : TypedJson circleType := (shape.narrowKeyStr "type" "circle" h).coe
    IO.println "Computing circle area"
  | .ok "square" =>
    let square : TypedJson squareType := (shape.narrowKeyStr "type" "square" h).coe
    IO.println "Computing square area"
  | _ => IO.println "Unknown shape type"
Conclusion
Despite the potential, creating schemas for all of Vega-Lite might be more work than I'd
be willing to go through for plotting, where I only need correctness for my data. Even falling
back to JSON Schema, there are plenty of usability improvements there that would
more meaningfully change how I use Lean. The syntax for constructing JSON in Lean is very unergonomic, since we can't easily combine string formatting
with JSON syntax. Creating custom string syntax for JavaScript in general might make it more pleasant
to use JavaScript libraries in widgets.
String formatting can't be combined with raw strings like r##""##, and the default formatting uses {}, a common character pair in
JavaScript and JSON. Mustache templating with multi-line strings would make it fairly
elegant to include d3.js code inline in Lean.
We might even be able to call out to TypeScript and get nice VS Code support for the embedded language.