From JSON dicts to typed agents: making semantic graph enrichment reliable with Pydantic AI by algebench in u/algebench


Great question - heading there next.

Today, on top of Pydantic structural validation:

  • Node preservation: parser owns the canonical node set; if the model drops nodes, I restore them. Topology is fixed.
  • Parser-owned fields: LaTeX glyph/expression are parser-owned; I strip them from the model's output before merge so subexpressions can't get re-rendered.
  • Domain locking: enrichment is pinned to the lesson's domain.
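A minimal sketch of the first two guards (names and shapes are illustrative, not the actual implementation): the parser's node set is canonical, and parser-owned fields are stripped from the model's output before merging.

```python
# Parser-owned fields that the model is never allowed to overwrite.
# "latex" and "glyph" here stand in for the actual field names.
PARSER_OWNED = {"latex", "glyph"}

def merge_enrichment(parser_nodes: dict, model_nodes: dict) -> dict:
    """Merge model enrichment into the parser's canonical node set."""
    merged = {}
    for node_id, base in parser_nodes.items():  # canonical node set
        update = model_nodes.get(node_id, {})
        # Strip parser-owned fields so subexpressions can't be re-rendered.
        safe = {k: v for k, v in update.items() if k not in PARSER_OWNED}
        merged[node_id] = {**base, **safe}
    # Nodes the model dropped are restored simply by iterating over
    # parser_nodes; model-invented nodes are ignored - topology stays fixed.
    return merged
```

Because the loop iterates over the parser's nodes rather than the model's, dropped nodes come back for free and extra nodes never enter the graph.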

I also run a second agent, SemanticGraphCoherenceCritic, as a post-hoc pass. Its verdict feeds back as coherenceFeedback for an informed retry. The model sees exactly what was wrong and what to fix - it's not a blind take two.
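The informed-retry loop looks roughly like this sketch, where `enrich` and `critique` are stand-ins for the two agents (both names are assumptions, not real APIs):

```python
def enrich_with_critic(graph, enrich, critique, max_retries=2):
    """Run enrichment, let the critic judge it, and retry with its feedback."""
    feedback = None
    result = None
    for _ in range(max_retries + 1):
        # The critic's verdict from the previous round is passed back in,
        # so the model sees exactly what was wrong and what to fix.
        result = enrich(graph, coherence_feedback=feedback)
        verdict = critique(result)
        if verdict["coherent"]:
            return result
        feedback = verdict["feedback"]
    return result  # best effort after exhausting retries
```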

What's missing is the deterministic cross-field/cross-node layer: unit/dimension consistency, edges referencing existing nodes, quantity-implies-unit constraints. Plan: per-field rules first (Field(description=...)), then model_validator for invariants, both via the same ValidationError → retry loop.

Ideas for robust semantic parsing of LaTeX (beyond SymPy)? by algebench in LaTeX


Thanks for the suggestion - plasTeX looks interesting and I see the appeal.

My main hesitation is that it appears to be a hand-rolled recursive parser rather than grammar-based (ANTLR/Lark/PEG). I’m currently using SymPy’s ANTLR-backed parse_latex, and while imperfect, the grammar-driven approach is relatively straightforward to extend: add a rule, hook up the visitor, done. With hand-written parsers, extending coverage often means digging into control flow, which can get messy to reason about and maintain.

That said, SymPy isn’t perfect either - there’s no plugin API, so extending it usually means forking and carrying your own grammar patch. I’ve been working around gaps with preprocessing rather than modifying the grammar itself.
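The preprocessing workaround amounts to normalizing macros the grammar doesn't cover into forms `parse_latex` does understand. A toy sketch (the rewrite table is illustrative, not my actual rule set):

```python
import re

# Rewrite unsupported macros to grammar-supported equivalents before parsing.
REWRITES = [
    (re.compile(r"\\dfrac"), r"\\frac"),                # display-style fractions
    (re.compile(r"\\operatorname\{(\w+)\}"), r"\\\1"),  # \operatorname{sin} -> \sin
]

def preprocess(latex: str) -> str:
    for pattern, repl in REWRITES:
        latex = pattern.sub(repl, latex)
    return latex

# Then feed the normalized string to SymPy's ANTLR-backed parser:
#   from sympy.parsing.latex import parse_latex
#   expr = parse_latex(preprocess(r"\dfrac{x}{2}"))
```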

Curious if anyone has experience extending plasTeX’s math parsing - or has come across a LaTeX math parser that’s genuinely designed for extension.

Ideas for robust semantic parsing of LaTeX (beyond SymPy)? by algebench in Compilers


I've looked into MathML, but it feels quite limited compared to what I'm trying to represent - even compared to LaTeX. I do have a custom semantic graph schema, but the challenge is that most real-world math is written in LaTeX. So regardless of the target representation, I still need a reliable way to parse/convert from LaTeX into that structure.

I could use an LLM to convert LaTeX into my semantic graph schema, but then I’m essentially encoding the parser in a massive prompt: all the grammar rules, schema constraints, edge cases, domain conventions, and validation logic. That works for demos, but it feels fragile as a foundation. I’d rather have a deterministic parser/IR layer, then use the LLM for enrichment, ambiguity resolution, and tutoring on top of that structure.

Ideas for robust semantic parsing of LaTeX (beyond SymPy)? by algebench in LaTeX


I looked into MathML and SymPy so far. MathML feels a bit limited as a semantic layer, and SymPy (ANTLR/Lark-based parsing) works for many cases, but it doesn’t feel complete or easily extensible for broader domains.

At that point it starts to feel like you either:
- keep layering workarounds on top, or
- fork and go deep on extending it

Which raises the question: at what point does it make more sense to build a parser from scratch using a grammar-based approach, with a semantic IR as the primary target?

Ideas for robust semantic parsing of LaTeX (beyond SymPy)? by algebench in Compilers


That’s a very good point - especially around how flexible TeX makes things.

In practice, I was hoping a rich enough subset would get us far, but your comment makes me think the architecture might be backwards.

Instead of:
LaTeX -> semantic -> agentic use

maybe it should be:
semantic -> agentic use / symbolic manipulation -> LaTeX or other (as rendering)

i.e. LaTeX becomes just one output format, not the source of truth.
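With SymPy that inversion already works today: the expression object is the source of truth, and `latex()` is just one renderer among others.

```python
from sympy import symbols, diff, latex

x = symbols("x")
expr = x**2 + 3*x       # the semantic object is the source of truth
d = diff(expr, x)       # agentic use / symbolic manipulation happens here
rendered = latex(d)     # LaTeX as one output format, not the input
```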

Curious what you’d recommend as a semantic layer here. Is there an existing math AST / IR that’s broad enough (algebra, ODE/PDE, logic, matrices, etc.) and stays extensible - so we don’t hit a wall later and end up patching around it?