Safe JSON in Lean: json-schema-lean or subtyping

I wanted to explore using Vega-Lite for plotting in Lean, and I wanted some safety checks. Vega-Lite has a JSON schema. One giant PR later, json-schema-lean can now validate plotting configs. It is ready for broader testing against more diverse schemas, and it should be helpful for statically known single-file schemas. Caveats:

  • Only supports JSON Schema Draft 7

  • Does not support automatic file/http remote reference resolution

  • Not battle-tested, only regularly tested

Along the way, I also made a URI library lean-uri.

A Quick Demo

There is a single file for Vega-Lite's schema, so we load it with include_str, parse the schema, and validate our data.

import JsonSchema.Validation

def exampleVegaLite : Lean.Json := Option.get! <| Except.toOption <| Lean.Json.parse r##"{
  "$schema": "https://vega.github.io/schema/vega-lite/v6.json",
  "data": {"url": "https://vega.github.io/vega-lite/data/seattle-weather.csv"},
  "mark": "bar",
  "encoding": {
    "x": {
      "timeUnit": "month",
      "field": "date",
      "type": "ordinal",
      "title": "Month of the year"
    },
    "y": {
      "aggregate": "count",
      "type": "quantitative"
    },
    "color": {
      "field": "weather",
      "type": "nominal",
      "scale": {
        "domain": ["sun", "fog", "drizzle", "rain", "snow"],
        "range": ["#e7ba52", "#c7c7c7", "#aec7e8", "#1f77b4", "#9467bd"]
      },
      "title": "Weather type"
    }
  }
}"##

def vegaliteSchemaStr : String := include_str "v6.4.1.json"

def vegaliteSchema : JsonSchema.Schema :=
  ((Lean.Json.parse vegaliteSchemaStr) >>= JsonSchema.schemaFromJson).toOption.get!

/-- info: Except.ok () -/
#guard_msgs in
#eval JsonSchema.validate vegaliteSchema exampleVegaLite

We can then embed the JSON into an HTML tag and display it as needed.

Options for Type-Level Safety

Since creating specialized data types identifies errors ahead of time, I tried my hand at turning schemas into Lean types. But I became dissatisfied with how JSON Schema code generation produces unergonomic types, no matter what I tried. This mismatch has pained me in several languages. Meanwhile, most languages can't reason about runtime validation: even if you know a JSON value passed validation, the compiler can't guarantee any properties of your data. But in Lean, we can use the validation as an assumption in a theorem. Then we can prove that fields are populated, or go backwards and prove that validation would succeed if we ran it. This idea wouldn't leave me, so I tried to implement it and came up with json-subtyping.

When attaching properties to our JSON data in Lean, it's convenient to emulate a kind of subtyping. Although the standard sum and product types have many nice properties for theorem proving, they make plotting excessively verbose. When every union is implicitly tagged with a discriminator, as in Lean, extensions and escape hatches for plots complicate every union construction. Subtyping is one approach to minimizing the difficulty of extending types. Furthermore, with some gradual typing, you can move fast when you don't need generalization, as for one-off plots.

I also have hope for similar schema-based subtyping to provide guarantees for dataframe libraries, SQL results, and tensor manipulation. Those are places where traditional type systems have dropped the ball, and where even validation frameworks like Pandera for pandas have high costs and limited utility. Using dependent types can invite "dependent type hell", but I had luck with just attaching the validation check as a Prop. By avoiding inductive "indexed families" in favor of a subtyping pattern, we can also more easily vary the underlying implementation. That said, the theorem proving for subtyping still requires a significant amount of work, so I would only consider it when safety is a requirement, not a luxury. Usually I am plotting very specific data, where it is easy to check after the fact.

By sharing an excessive level of detail below, I aim to encourage more people to work on libraries like json-schema-lean--Lean needs more libraries, in my opinion. I also want to explore the details around code generation options in Lean. Finally, I want to show that subtyping is a viable pattern for type safety in dependently typed languages. And at the end, I'll mention several still-unexplored options for better plotting in Lean.

Step 1: JSON Schema in Lean

Starting out, I was lucky there was already a json-schema-lean library--a GSoC 2024 project. The main branch was incomplete, but there was nice testing infrastructure. I figured it wouldn't be too difficult to complete.

JSON Schema

JSON Schema is a schema specification: schemas are written in JSON, and implementations let you validate JSON values against them. The schema language lets you constrain JSON values, take unions of schemas, reference external definitions, etc. Tools like OpenAPI and openapi-generator generate stubs in different languages. In my experience, the generated code is substandard but more or less workable, depending on the language and application.

I have primarily used JSON Schema Draft 7, which has a relatively short spec. With it in hand, I implemented most of the features within a week or two, leaning heavily on Claude Code, especially once the work got repetitive. I discovered that Draft 7 has a lot of features I never use nor want to use:

  • You can pattern match regexes in strings.

  • You can have loops in your schema references, which you have to guard against.

  • There are properties, pattern properties, additional properties, etc. These all interact with each other (made concrete in the sketch after this list).
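To make that last interaction concrete, here is a sketch using the same API as the demo above (the schema and instance are invented for illustration): "count" is constrained by properties, keys starting with "x-" by patternProperties, and additionalProperties: false rejects everything else.

def interactionSchema : JsonSchema.Schema :=
  ((Lean.Json.parse r##"{
    "type": "object",
    "properties": { "count": { "type": "integer" } },
    "patternProperties": { "^x-": { "type": "string" } },
    "additionalProperties": false
  }"##) >>= JsonSchema.schemaFromJson).toOption.get!

-- "count" matches `properties`, "x-note" matches `patternProperties`,
-- and no keys are left over for `additionalProperties` to reject.
#eval JsonSchema.validate interactionSchema <|
  (Lean.Json.parse r##"{ "count": 3, "x-note": "ok" }"##).toOption.get!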

The schema turned into a big mutually inductive type, since schemas have unusual options like being literally true, or nesting whole schemas inside keywords like if-then-else.
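For a sense of the shape, a hedged sketch (not the library's actual definitions):

mutual
  inductive Schema where
    | bool (b : Bool)        -- `true` and `false` are themselves valid schemas
    | obj (s : SchemaObject) -- an ordinary schema object

  inductive SchemaObject where
    | mk
        (ifSchema thenSchema elseSchema : Option Schema) -- nested conditional schemas
        (items : Option Schema)                          -- schema for array elements
        (allOf anyOf oneOf : List Schema)                -- schema combinators
end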

Bowtie's comprehensive tests saved me a bunch; there were quite a few parts around references where the spec confused me. Lean has a few regex libraries to choose from, like Regex. That also saved me a lot of trouble, since regex is not an isolated feature; it can be used to select schemas for properties. Sadly, there was another missing piece, leading to yet another side quest: URI resolution.

Step 1.1: URIs

Uniform Resource Identifiers are special strings that serve as semantic IDs. They don't guarantee stability or uniqueness, but they often tell you something about some digital resource. URLs are a subtype of URIs: a URL is a URI plus a protocol for retrieving the resource. The URI spec (RFC 3986) breaks a URI into URI = scheme ":" ["//" authority] path ["?" query] ["#" fragment], and it defines percent-encoding and decoding (which is why Google replaces spaces with %20 in its result URLs). Different schemes can then layer on special rules like URL domain resolution or base64 encoding. There are also relative URIs for accessing "nearby" resources.

So what does this have to do with JSON Schema? Well, $ids and $refs follow the rules of URIs. You can use different protocols, you can use relative paths, and fragments point to definitions within files. Since Lean didn't have a URI library, I implemented one at lean-uri. Since testing is not as well centralized as it is for JSON Schema, I mainly picked examples from the RFC and looked at Haskell equivalents. For implementation, I started by hoping Claude could read and operationalize the spec, then rewrote the result when it got ugly. Since I hold parser combinators close to heart, I started with some hand-written Std.Internal.Parsec.String and tried to work within that monad, making the code more uniform. The final interface provides parsing, normalization, and resolution, all of which feature in JSON Schema.

/-- info: "https://example.com/a/c?x#y" -/
#guard_msgs in
#eval match LeanUri.URI.parse "https://EXAMPLE.com/a/./b/../c?x#y" with
  | .ok u => (LeanUri.URI.normalize u).toString
  | .error e => s!"error: {e}"

Step 1.1.1: Testing

Since URIs (unlike URLs) didn't seem to have a nice testing framework already, I really had to rely on my own testing. Lean has a way to hook up an executable to lake test, and that attracted me too. I didn't really like LSpec, which appears to be the only testing library in Lean. My conclusion: time to write a little testing library. I put it in a Test monad based on IO, where you can log passed, failed, and skipped tests to monad state, group tests, etc.

inductive TestTree where
  | test (name : String) (passed : Bool) (skipped : Bool := false) (message : Option String := none)
  | group (name : String) (children : Array TestTree) (skipped : Bool := false)
  deriving Inhabited, Repr

structure TestState where
  stack : List (String × Array TestTree) := [("", #[])] -- (group name, children)
  filter : String → Bool := fun _ => true
  deriving Inhabited

abbrev TestM := StateT TestState IO

This worked fine for what I needed, and I reused the helpers for testing code gen on a branch of json-schema-lean. Small features like filtering are simple to add. Still, I doubt I'll use this long-term: having IO in all your tests is a blessing and a curse, and there are too many features I want in a truly excellent testing library to stick with my hacky version forever.

-- This is nice, but could it be simpler?
def testHexConversion : TestM Unit := testFunction "Hex Conversion" do
  testEq "hex 0" (uInt8toOneHexChar 0) '0' -- testEq registers the equality test under a name

Once I had the basic URI interface, it was mainly a matter of repeatedly testing the $ref and $id cases until I got tests passing.

A slight detour: JSON Pointer Fragments

We can also refer to definitions within files with JSON "pointer fragments". The pointer fragment #/definitions/A refers to obj["definitions"]["A"], and similar notation is available for indexing arrays, etc. RFC 6901 describes the format for JSON pointers, which trail the # anchor of a URI.
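Evaluating a pointer is a fold over the /-separated tokens. A minimal sketch (a hypothetical helper, not the library's API), using Lean's built-in Json accessors and assuming percent-decoding already happened at the URI layer:

open Lean in
def resolvePointer (j : Json) (pointer : String) : Except String Json :=
  -- RFC 6901 escapes: `~1` decodes to `/` first, then `~0` decodes to `~`
  let unescape (s : String) : String := (s.replace "~1" "/").replace "~0" "~"
  ((pointer.splitOn "/").drop 1).foldlM (init := j) fun cur tok =>
    match cur with
    | .arr a =>
      match tok.toNat? with
      | some i => if h : i < a.size then .ok a[i] else .error s!"index {tok} out of range"
      | none => .error s!"expected an array index, got {tok}"
    | _ => cur.getObjVal? (unescape tok)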

What doesn't work?

Several Bowtie tests require you to read referenced JSON schemas on the fly from a URL. Lean does not yet (as of January 2026) have a standard HTTP library to handle URLs, so I just decided to give up on that part.

Step 2: JSON Schema Code Gen

Now that I could parse and understand all the features of JSON Schema, I decided to try generating Lean structs from schemas. My strategy plodded along, translating a Schema inside the Except String Std.Format monad: first building simple abbreviations, then structures, etc. I considered macros, but inserting comments eluded me, so I settled on Std.Format, Lean's way of handling indentation, precedence, etc. The API was far nicer than I deserved, even though it is still string manipulation.

Eventually, I needed more code generation context, recursion, and custom toJson/fromJson instances. I settled on writing all the generators as SchemaGen TypeDefinition: the SchemaGen monad carries context and error handling, while TypeDefinition holds all the parsed definitions. Having a custom monad name here is extremely useful, and I now understand why so many libraries live in a custom named monad transformer.

/-- A complete type definition including the type declaration and optional JSON instances -/
structure TypeDefinition where
  /-- The main type declaration (structure, inductive, or abbrev) -/
  typeDecl : Std.Format
  /-- Optional FromJson instance implementation -/
  fromJsonImpl : Option Std.Format := none
  /-- Optional ToJson instance implementation -/
  toJsonImpl : Option Std.Format := none
  /-- Nested type definitions that should be prepended before this definition -/
  dependencies : List TypeDefinition := []
  /-- Extra doc comment for use in structures and inductives -/
  extraDocComment : Option Std.Format := none
  deriving Inhabited

/-- Configuration for code generation -/
structure Config where
  /-- How to sanitize names to valid Lean identifiers -/
  sanitizeName : String → String
  /-- Indentation string -/
  indent : String := " "
  /-- Whether to generate FromJson instances -/
  generateFromJson : Bool := false
  /-- Whether to generate ToJson instances -/
  generateToJson : Bool := false
  /-- Whether to include the base URI filename in generated type names.
      When true: user.json with definition "Address" → "UserAddress"
      When false: user.json with definition "Address" → "Address" -/
  includeBaseNamePrefix : Bool := false

/-- Identifies a schema by its canonical URI and path -/
structure SchemaID where
  baseURI : LeanUri.URI
  path : List String
  deriving BEq, Hashable, Inhabited

/-- Extended code generation context with reference support -/
structure CodeGenContext where
  /-- Resolver for looking up schemas -/
  resolver : Resolver
  /-- Mapping from SchemaID to generated type name -/
  nameMap : Std.HashMap SchemaID String
  /-- Configuration -/
  config : Config
  /-- Base URI which gets updated as we traverse -/
  baseURI : LeanUri.URI
  /-- Current type name being generated (for error reporting) -/
  currentTypeName : Option String := none
  /-- Current schema ID being processed (for error reporting) -/
  currentSchemaID : Option SchemaID := none

abbrev SchemaGen := ReaderT CodeGenContext (Except String ·)

def SchemaGen.run (gen : SchemaGen α) (ctx : CodeGenContext) : Except String α :=
  ReaderT.run gen ctx

Lean requires mutually recursive types to be put in mutual blocks, so I used Tarjan's strongly connected components algorithm to detect cycles in the schema reference graph. The implementation can essentially be automated, since examples of it exist in Mathlib and Lean.SCC. So Claude could do it, taking away some of my fun ☹️.

Aside from that, there are so many different special cases, abbreviations, inductives, etc. The tension between verbose inductives and the unintended consequences of abbreviations was quite annoying.

Step 3: Code Gen with Vega-Lite

How did code gen work with Vega-Lite?

  • Mutual inductive types don't generate right, oops?

  • It would take several minutes to compile, and it's >15,000 lines.

  • All the code is too ugly to use.

Once I saw how I'd have to construct a RectConfig, I realized the library would not be fun to use as a whole.

structure RectConfig where
  /-- The horizontal alignment of the text or ranged marks (area, bar, image, rect, rule). One of `"left"`, `"right"`, `"center"`.

    __Note:__ Expression reference is *not* supported for range marks. -/
  align : Option (Align ⊕ ExprRef) := none
  /-- The rotation angle of the text, in degrees. -/
  angle : Option (Float ⊕ ExprRef) := none
  /-- A boolean flag indicating if [ARIA attributes](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA) should be included (SVG output only). If `false`, the "aria-hidden" attribute will be set on the output SVG element, removing the mark item from the ARIA accessibility tree. -/
  aria : Option (Bool ⊕ ExprRef) := none
  /-- Sets the type of user interface element of the mark item for [ARIA accessibility](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA) (SVG output only). If specified, this property determines the "role" attribute. Warning: this property is experimental and may be changed in the future. -/
  ariaRole : Option (String ⊕ ExprRef) := none
  /-- A human-readable, author-localized description for the role of the mark item for [ARIA accessibility](https://developer.mozilla.org/en-US/docs/Web/Accessibility/ARIA) (SVG output only). If specified, this property determines the "aria-roledescription" attribute. Warning: this property is experimental and may be changed in the future. -/
  ... -- It goes on like this for all the alphabetically sorted fields.

Now instead of {align: "left", ...}, we have to write .some (.inl .left). Specifying .inl or .inr and finding the right enum name can be really annoying, and the long list of fields makes it difficult to remember which one, if any, is required. There are definitely ways to simplify this, but doing it automatically might be a bit much.

Back to the drawing board

You know what would be really convenient? Determining at compile time whether JSON schema validation passes. If the entire JSON is known at compile time, we can prove it using by native_decide, which runs the computation validate schema json and verifies the result. And what if a theorem about your JSON's structure could prove that validate schema json would pass? That sounds great, but proving the most useful properties requires proving that validate terminates--not an easy proof at all. With a more restrictive schema language, we could more easily prove validation ahead of time.
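The compile-time half is already within reach using the validator above; a sketch (stated through Except.isOk, since the exact statement depends on validate's result type):

-- `native_decide` compiles and runs the validator, then certifies the
-- boolean result, discharging the proof by computation.
example : (JsonSchema.validate vegaliteSchema exampleVegaLite).isOk := by
  native_decide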

I put this idea away for a week or so. After using TypeScript a bit professionally, I figured that we could crib TypeScript's homework and use a fragment of it to check JSON. One day I spent a few hours writing up a blueprint in Typst (an up-and-coming LaTeX alternative). This was also more fun than improving JSON Schema validation.

I also considered building the schema checking right into the JSON constructors as an inductive "indexed family", but using a schema.check function let me keep the existing JSON data structure as a base. This brought me to separating out the proof data into a subtype equivalent to { x : Json // schema.check x }, which feels like a breath of fresh air compared to carefully encoding properties inductively. Overall, I believe that decoupling your evidence from your data is the correct default, so that you depend only on a minimal amount of structure in your data. For example, we may one day want to switch to a more optimized tree format, which is easier the fewer propositions are bundled into it. We will use a few indexed families later for specific proofs in object construction and subtyping, but they are confined to those procedures.

TypeScript

JavaScript has only a handful of runtime types: null, string, number, boolean, object, and undefined (plus newer primitives like symbol and bigint). Many "types" in JavaScript are really just special objects, like functions. TypeScript lets you gradually define types that assign properties to variables at particular code locations, and it checks claims like "x has property 'id'" at compile time by inspecting the control-flow graph connecting those locations. TypeScript does not try very hard to be sound.

Blueprint

I've never taken any course on type theory, so I probably did not use proof tree rules correctly, but I wrote a bunch of them in the blueprint anyway. I found it helpful especially for subtyping objects correctly.

TypeScript lets you combine types to make array types, tuple types, union types, and intersection types.

JSON objects are always allowed to have extra fields, which becomes a lot more annoying when you are checking whether fields exist in big type unions. More on that later.

Strange challenges

Lean has a built-in JSON type at Lean.Json, which is used for the language server among other applications. Since the JSON type is recursive, and the recursion is contained in types like Array and maps, Lean identifies it as a "nested inductive type".

Nested inductives are massaged quite a bit to make them appear normal: they are compiled to an ordinary inductive, and new definitions recreate the appearance of the original nested type. The default induction rules are a real pain to use, since they include the induction rules for Arrays and maps. Worse, you cannot include dependent terms as the nested types' parameters. The kernel errors with nested inductive datatypes parameters cannot contain local variables, which retains some mystery to me.

Std.TreeMap.Raw

The local variables restriction prevents the use of Std.TreeMap, a balanced binary search tree, in nested inductives like Json. Std.TreeMap contains a proof that the tree is properly constructed. Without that "well-formed" predicate, operations like insertion and lookup have fewer guarantees, but including it does not play well with the nested inductive. So instead we have to use Std.TreeMap.Raw, which carries no proof that the data is well-formed. The usual solution is to reattach well-formedness in the inductive itself, but the Json type doesn't attach it anywhere.

I carefully worked around potentially ill-formed trees by asserting properties about .get? calls or by constructing maps with ofList, whose output is always well-formed.
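For example (a sketch; the exact Raw API may differ slightly), a fact about get? on an ofList-built map can be discharged by evaluation, because ofList is well-formed by construction:

-- `ofList` output is well-formed by construction, so statements about
-- `get?` on it can be checked by running the code.
example : (Std.TreeMap.Raw.ofList [("a", 1), ("b", 2)] compare).get? "a" = some 1 := by
  native_decide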

Proving anything about JSON

Given that Json sheds so many invariants, it is somewhat surprising how elegant proving properties about JSON can be. As long as you can prove termination of functions over JSON, you can write your own custom induction rules as recursive functions whose termination you prove.

I spent a day reading Lean's reference manual on "well-founded recursion" and tried simplifying the termination goals that appear in a Json.beq implementation that implicitly uses such recursion. The reference manual lays out how termination_by and decreasing_by let you access these normally hidden goals. Instead of using induction directly, we write a recursive function that only calls itself on "smaller" values. You can use any "well-founded" relation you want; well-foundedness guarantees that execution eventually reduces the data to nothing, at which point no more recursive calls are possible.

For instance, when we iterate over .obj map : Lean.Json and recursively call on the values, we want to know that (k, v) ∈ map implies the size of v is smaller than that of map, which in turn is smaller than that of .obj map. Lean automatically provides a sizeOf definition for most types, for which this property practically always holds. During automatic termination checking, Lean attaches the membership proofs to elements of lists with List.attach (l : List α) : List { x // x ∈ l }.

We can then apply List.sizeOf_lt_of_mem {a : α} {as : List α} (h : a ∈ as) : sizeOf a < sizeOf as.
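Here is the pattern in isolation, on a toy Tree type rather than Json (a standard trick, not specific to this library): mapping over cs.attach hands each recursive call its membership proof, and the termination checker closes the size goal with List.sizeOf_lt_of_mem automatically.

inductive Tree where
  | node (children : List Tree)

def Tree.count : Tree → Nat
  | .node cs =>
    -- `cs.attach : List {t // t ∈ cs}`, so each recursive call carries
    -- the membership proof the termination checker needs
    1 + (cs.attach.map (fun ⟨c, _hc⟩ => c.count)).sum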

Although this "attach" framework works well (and automatically) for lists and arrays, we lack those theorems for Std.TreeMap. After combing through Std.TreeMap's source code with Claude Code, I learned about Std.DTreeMap.Internal.Impl.toListModel. This function inefficiently flattens the tree into a list, but reasoning about it is far simpler. It's also proved equivalent to .toList, which is more efficient. Claude Code was essentially able to one-shot the sizeOf lemmas:

open Std.DTreeMap.Internal in
theorem Impl.sizeOf_lt_of_mem {α : Type u} {β : α → Type v}
    [SizeOf α] [(a : α) → SizeOf (β a)]
    {t : Impl α β} {k : α} {v : β k}
    (h : ⟨k, v⟩ ∈ t.toListModel) : sizeOf v < sizeOf t := by
  induction t with
  | leaf =>
    -- `toListModel` of a leaf is `[]`, so the hypothesis is absurd
    simp [toListModel] at h
  | inner sz k' v' l r ihl ihr =>
    -- `toListModel` of an inner node is `l.toListModel ++ ⟨k', v'⟩ :: r.toListModel`,
    -- so `⟨k, v⟩` is in the left subtree, at the root, or in the right subtree.
    -- Each case bounds `sizeOf v` by a summand of
    -- `1 + sizeOf sz + sizeOf k' + sizeOf v' + sizeOf l + sizeOf r`.
    simp only [toListModel, List.mem_append, List.mem_cons] at h
    rcases h with hl | heq | hr
    · have := ihl hl; simp; omega
    · cases heq; simp; omega
    · have := ihr hr; simp; omega

theorem TreeMap.Raw.sizeOf_lt_of_mem {α β : Type} {cmp : α → α → Ordering}
    [SizeOf α] [SizeOf β]
    {t : Std.TreeMap.Raw α β cmp} {k : α} {v : β}
    (h : ⟨k, v⟩ ∈ t.inner.inner.toListModel) : sizeOf v < sizeOf t := by
  -- Unfold `sizeOf` through the two wrapper structures, then defer to the
  -- lemma about the internal tree.
  have h1 := Std.TreeMap.Raw.mk.sizeOf_spec t.inner
  have h2 := Std.DTreeMap.Raw.mk.sizeOf_spec t.inner.inner
  have h3 := Impl.sizeOf_lt_of_mem h
  omega

So my strategy was to use .toList.attach to attach the membership proof, and then extract the size lemma with .sizeOf_lt_of_mem when proving the size of recursive arguments decreases. Compared to the verbosity of ordinary induction here, this is much briefer. Now we can implement better JSON induction methods:

set_option linter.unusedVariables false in
open Lean (Json) in
def Json.recOn {motive : Json → Sort u} (x : Json)
    (null : motive .null)
    (bool : ∀ b, motive (.bool b))
    (num : ∀ n, motive (.num n))
    (str : ∀ s, motive (.str s))
    (arr : (a : Array Json) → (∀ j, j ∈ a → motive j) → motive (.arr a))
    (obj : (o : Std.TreeMap.Raw String Json compare) →
      (∀ field value, ⟨field, value⟩ ∈ o.inner.inner.toList → motive value) →
      motive (.obj o)) : motive x :=
  match x with
  | .null => null
  | .bool b => bool b
  | .num n => num n
  | .str s => str s
  | .arr a => arr a (fun j jMem => Json.recOn j null bool num str arr obj)
  | .obj o => obj o (fun k v vMem => Json.recOn v null bool num str arr obj)
termination_by x
decreasing_by
  · -- Array case: `j ∈ a` bounds `sizeOf j` below `sizeOf a`, which is
    -- below `sizeOf (.arr a)`.
    suffices h : sizeOf j < sizeOf a by simp; omega
    exact Array.sizeOf_lt_of_mem jMem
  · -- Object case: convert `.toList` membership to `.toListModel`
    -- membership, then apply the lemma from above.
    suffices h : sizeOf v < sizeOf o by simp; omega
    exact TreeMap.Raw.sizeOf_lt_of_mem (by simpa using vMem)

Custom "recursors" like this let you split up recursive definitions really easily: as long as you carry around and apply membership hypotheses like j ∈ a, the split-out pieces still pass the termination check.
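As a quick payoff, here is a leaf count written with the recursor, with no fresh termination proof; Array.attach and List.attach supply the membership proofs it demands (a sketch, assuming the definitions above):

def countLeaves (j : Lean.Json) : Nat :=
  Json.recOn (motive := fun _ => Nat) j
    1            -- null
    (fun _ => 1) -- bool
    (fun _ => 1) -- num
    (fun _ => 1) -- str
    (fun a ih => (a.attach.map fun ⟨x, hx⟩ => ih x hx).foldl (· + ·) 0)
    (fun o ih => (o.inner.inner.toList.attach.map fun ⟨⟨k, v⟩, h⟩ => ih k v h).sum)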

List equality was a pain

Lawful equality already exists for List α, but it expects an existing instance of lawful equality for α. For a particular subtyping proof, I needed ReflBEq Json (∀ x : Json, x == x), and I hoped to derive it via ReflBEq (List α). At the time, I ran out of patience and cleverness trying to prove a termination goal, so I reimplemented the lemma instead. Relatedly, writing inline definitions in a typeclass instance makes it strangely difficult to unfold those definitions.

Subtyping

After handling all those JSON difficulties, I went back to my very helpful blueprint for how subtyping would work: an incomplete set of rules to determine when one type implies another. From the beginning, I wanted to bundle the "schema" t : JsonType, the propositional check, and the JSON into an object TypedJson t. Then when you want to check whether a typed JSON also satisfies t', you can run a decision procedure t.subtype t' to prove ∀x, t.check x → t'.check x. This allows you to use JSON in more general contexts without having to care about the exact types involved.
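Concretely, the bundle looks something like this (a sketch; the field names match the usage below, but the real definition may differ):

structure TypedJson (t : JsonType) where
  val : Lean.Json                -- the underlying JSON value
  property : t.check val = true  -- evidence that the schema check passes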

I applied a bit of cleverness by bundling the proof into the return type of the subtype checker:

inductive DecideSubtype (t1 t2 : JsonType) where
  | none : DecideSubtype t1 t2
  | isSubtype : (∀j, t1.check j = true → t2.check j = true) → DecideSubtype t1 t2

This let me reuse the convoluted logic of the subtype checking for the proof plumbing. Proving subtyping correct separately would have involved far more casework; piggybacking on the existing casework was much easier. Truly, dependent types are wonderful! Especially as long as you remember: "Programs that use these types will find themselves forced to mirror the algorithm used in the type-level function so that their structure matches the pattern-matching and recursive behavior of the type." This is the blessing and curse of dependent types.
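For a flavor of how the checker carries its own soundness, a hypothetical fragment (not the library's actual code): when the two types are identical primitives, the required implication is just the identity.

def JsonType.subtype : (t1 t2 : JsonType) → DecideSubtype t1 t2
  | .string, .string =>
    -- `t1.check` and `t2.check` are the same function here, so the
    -- hypothesis is already the conclusion
    .isSubtype (fun _ h => h)
  | _, _ => .none -- `.none` is always a sound, if unhelpful, answer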

There was a lot of rewriting and many sublemmas for all the different JsonType constructors, especially for object subtyping, but it was mainly a matter of correcting the plan, ensuring that required and optional types behaved correctly, and trying a bunch of examples. Once I had a correct human proof, the Lean version held no surprises.

Object Construction

Given some TypedJson values, say x1 x2 : TypedJson t, I would like to be able to construct typed objects at compile time. This might actually be the ideal use case for heterogeneous lists: dependently typed lists where each element can have its own type.

I carefully put the key-value types as type parameters, so I can access them more easily at compile time; the actual typed JSON is included as a field.

/-- Heterogeneous list of object fields indexed by schema.
    Each field carries a TypedJson value with proof that it checks against its type. -/
inductive ObjectFields : List (String × JsonType) → Type where
  | nil : ObjectFields []
  | cons {ty : JsonType} {rest : List (String × JsonType)} (name : String) :
      TypedJson ty → ObjectFields rest → ObjectFields ((name, ty) :: rest)

Then I could guarantee objects pass validation, provided there are no duplicate keys.

/-- Construct an object from ObjectFields -/
def mkObj {req : List (String × JsonType)} (fields : ObjectFields req)
    (noDups : req.Pairwise (fun a b => ¬compare a.1 b.1 = .eq) := by native_decide) :
    TypedJson (.object req []) := sorry

I had Claude Code write a macro that constructs ObjectFields using obj{} notation, since I could just point it at the existing macro for list syntax [a, b, c]. Now we have a more readable constructor:

def testMkObj (s : String) (n : Nat) : TypedJson (
      .object [("name", .string), ("age", .number)] []
    ) := -- obj{} just created an ObjectFields, which mkObj converts to TypedJson
  mkObj obj{"name": (s : TypedJson .string), "age": (n : TypedJson .number)}

Narrowing

When you are within if statements, TypeScript is capable of "narrowing" unions to a variant:

type Person =
    | { age: number, id: number }
    | { dob: Date }

const person = getPerson();

if ("age" in person) {
    // person's type is "narrowed" to {age : number, id: number}
    console.log("person.age =", person.age);
    console.log("person.id =", person.id);
}

This is really helpful, but it is also more fraught than it appears. Since JSON objects may have extra fields, it is possible to have person = {dob: ..., age: 10}. Now TypeScript mistakenly believes that person.id is defined. When I learned about this, I decided I should stop looking to the TypeScript source code for guidance. For my proof of concept, I only discriminate on a known property's values.

The blog post "Flow Nodes: How Type Inference Is Implemented" really helped me here too. For Lean, we have to start with filters for whether properties can match or not:

def JsonType.canMatchPropertyStr (t : JsonType) (key : String) (str : String) : Bool :=
  t.canBeObject &&
  match t.getKey? key with
  | .some kt => kt.check str
  | .none => true -- we have no information about key

If x has property key with value str, then I prove that t.check x = true implies t.canMatchPropertyStr key str = true. Similarly, if x does not have property key with value str, then I can prove t.canMismatchPropertyStr key str = true.
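Stated as theorems, the two lemmas look roughly like this (a sketch with hypothesized names mirroring the prose; proofs elided as elsewhere):

theorem JsonType.check_canMatch {t : JsonType} {x : Lean.Json} {key str : String}
    (hkey : x.getObjVal? key = .ok (.str str)) (hcheck : t.check x = true) :
    t.canMatchPropertyStr key str = true := by sorry

theorem JsonType.check_canMismatch {t : JsonType} {x : Lean.Json} {key str : String}
    (hkey : x.getObjVal? key ≠ .ok (.str str)) (hcheck : t.check x = true) :
    t.canMismatchPropertyStr key str = true := by sorry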

If t = t1 ||| t2 ||| t3 ||| t4, then I can turn that into a list [t1, t2, t3, t4], of which at least one must check. Now if t.check x = true, then I know some ti.check x = true, so ti.canMatchPropertyStr key str = true. The upshot is that I can filter [t1, t2, t3, t4] by canMatchPropertyStr, and one of the survivors will still match x.

theorem JsonType.filterUnion_correctness
    {t : JsonType} {f : JsonType → Bool} {x : Json}
    (h : t.check x = true)
    (h' : ∀ t', t'.check x = true → f t' = true) :
    (t.filterUnion f).check x = true := by sorry

Putting these two together in JsonType.narrowKeyStr, we can do narrowing:

def circleType : JsonType :=
  .object [("type", .strLit "circle"), ("radius", .number)] []

def squareType : JsonType :=
  .object [("type", .strLit "square"), ("side", .number)] []

def shapeType : JsonType := circleType.union squareType

def calculateArea (shape : TypedJson shapeType) : IO Unit := do
  match h : shape.val.getObjVal? "type" with
  | .ok "circle" =>
    let circle : TypedJson circleType := (shape.narrowKeyStr "type" "circle" h).coe
    IO.println "Computing circle area"
  | .ok "square" =>
    let square : TypedJson squareType := (shape.narrowKeyStr "type" "square" h).coe
    IO.println "Computing square area"
  | _ => IO.println "Unknown shape type"

Conclusion

Despite the potential, creating schemas for all of Vega-Lite might be more work than I'm willing to go through for plotting, where I only need correctness for my own data. Even falling back to JSON Schema, there are plenty of usability improvements that would more meaningfully change how I use Lean. The syntax for constructing JSON in Lean is very unergonomic, since we can't easily combine string formatting with JSON syntax. Creating custom string syntax for JavaScript in general might be more useful for driving JavaScript libraries in widgets.

String formatting can't be combined with raw strings like r##""##, and the default formatting uses {}, a common string in JavaScript and JSON. Mustache templating with multi-line strings would make it fairly elegant to include d3.js code inline in Lean. We might even be able to call out to TypeScript and get nice VS Code support for the embedded language.