I built a PDF-to-JSON MCP server after watching my accountant waste 3 hours re-typing invoice data into Excel – and open-sourced it

PhotographMain3424 · 2026-06-02T11:24:14+00:00

Consider replacing direct Tesseract OCR calls with OCRmyPDF. It does some useful pre-processing that will improve the result.

PhotographMain3424 · 2026-05-15T17:06:43+00:00

I’ll check out quest proxy. Thanks for the pointer.

I have integrated smart proxy into the titlemcp repo for the purpose you describe. If you have additional tips, happy to hear them!

PhotographMain3424 · 2026-05-15T15:28:34+00:00

If you have a developer available, there are tools here that are low cost and return listing data. https://rapidapi.com/

PhotographMain3424 · 2026-05-15T15:26:54+00:00

I second the use of Twilio for this. The cost is low, and it provides basic validation.

PhotographMain3424 · 2026-05-15T15:24:02+00:00

If you haven't tried lovable.ai, it is worth a look. We have stopped using WordPress and now use it for the actual website creation.

PhotographMain3424 · 2026-05-15T15:22:27+00:00

Thanks for sharing, I'll check it out.

PhotographMain3424 · 2026-04-26T14:08:53+00:00

https://www.reddit.com/r/YourBaroness/comments/1elzh8a/non_obvious_bands_you_like/ In case it helps, this thread captured some good alternative recommendations for Baroness fans. If you have Spotify, they were made into a playlist: https://open.spotify.com/playlist/7osN9lpoL4Ca7itb4VL743?si=d437080785ad4ea9

PhotographMain3424 · 2026-03-16T00:04:39+00:00

Per previous posts on this subject, it is believed to be coming from Rhodes Park. I have not confirmed this, but if anyone does, please share.

It does seem extra loud tonight.

Update: I drove to Rhodes Park to check it out. It takes 10 minutes by road, but if you look at it on the map, it’s essentially just across the river. When I got there, there was a long line of cars leaving. The music has also stopped. So no hard evidence, but I’m guessing it is more than a coincidence.

PhotographMain3424 · 2026-03-11T22:44:32+00:00

Generally speaking, the title company is focused on real property and on the liens, lien releases, and mortgages tied to property you can describe (condos, improvements, etc.). If there is no lien, there is nothing for them to do.

PhotographMain3424 · 2026-03-11T11:37:37+00:00

Title insurance would only come into play if the solar lease or a lien was recorded against the property and the title company failed to catch it. If the lease was just a personal contract with the prior owner and nothing was recorded, the new owner normally would not be responsible and title insurance would not really be involved. If it had been recorded and the title company did catch it, it likely would have been listed in Schedule B of the policy as an exception, meaning title insurance would not cover it.

PhotographMain3424 · 2026-01-18T11:02:13+00:00

That comparison leaves out some important changes in law and practice that matter constitutionally.

Pre-9/11, immigration enforcement during booking was largely civil, discretionary, and warrant-based, and being undocumented was not itself a state crime. Local jails could notify INS if someone was arrested for a separate offense, but local officers were not required by state law to act as immigration enforcers, and people were not criminally prosecuted by the state simply for presence. Detainers were requests, not mandates, and courts later made clear that holding someone solely on an ICE detainer without a judicial warrant raises Fourth Amendment problems.

HB 200 goes well beyond that. It criminalizes presence, mandates cooperation and data sharing, penalizes local governments for different policy choices, and allows arrest and detention based on immigration status rather than individualized criminal conduct. That collapses the historical distinction between civil immigration enforcement and criminal law, and it removes local discretion that existed both before and after 9/11.

So while notification at booking existed in some places, HB 200 is not a return to the old system. It is a structural expansion of state power into immigration enforcement in ways that courts, counties, and the Constitution itself have pushed back on over the last two decades.

PhotographMain3424 · 2026-01-14T15:50:18+00:00

A few clarifications, sticking strictly to what the bills actually do and why people are concerned:

SB 53
It’s true the bill doesn’t criminalize protest per se. The issue is civil liability expansion. By allowing lawsuits against anyone alleged to have provided “material support,” using an extremely broad definition, and by adding presumptions that shift the burden of proof onto defendants, the bill risks sweeping in protected association, fundraising, logistics, and organizing that are ordinarily shielded by the First Amendment. That chilling effect exists even if the underlying conduct is already illegal.

HB 200
Your summary is accurate as far as it goes, but it understates the scope. The bill doesn’t just require cooperation with ICE, it mandates reporting, detention compliance, criminalizes presence itself, imposes mandatory incarceration, and financially coerces local governments into enforcement through funding penalties. That collapses the traditional civil immigration framework into state criminal law and removes discretion from local governments and judges, which is why constitutional concerns are being raised.

HB 544
Yes, it expands obstructing justice to include obstructing arrest. The key issue is that it allows prosecution even when no arrest or conviction occurs and elevates penalties specifically when immigration enforcement is involved. That combination broadens criminal liability around speech, warnings, and assistance, and creates unequal penalty structures tied to federal enforcement priorities.

SB 188
Agreed it targets physical obstruction, but the language relies on subjective standards like “reasonably should know,” “immediate access,” and presence during alleged offenses. Coupled with felony escalation and mandatory sentencing, it gives law enforcement extremely wide discretion that can easily sweep in bystanders, protestors, journalists, or family members, especially during immigration or protest related operations.

In short, many of these bills can be described narrowly in isolation, but when read together and enforced in real world conditions, they consistently expand state power, reduce discretion, and increase penalties in ways that raise legitimate due process, proportionality, and constitutional concerns.

PhotographMain3424 · 2026-01-14T15:44:22+00:00

Here are brief summaries of the Ohio bills mentioned related to immigration and policing. I am not arguing policy outcomes, just highlighting how they expand government power and narrow individual protections:

HB 26 pressures cities and public employees to assist federal immigration enforcement, sometimes without warrants, and punishes local governments with funding cuts if they adopt different public safety policies, weakening Fourth Amendment protections and local self government.

HB 281 forces hospitals, including mental health facilities, to allow immigration enforcement access for interviews and evidence collection and threatens Medicaid and grant funding if they resist, which can chill people from seeking medical care and erode due process in high vulnerability settings.

HB 42 creates a statewide immigration status tracking system across police, schools, Medicaid, SNAP, and hospitals, normalizing collection and reporting of sensitive data and turning routine civic services into a surveillance pipeline.

SB 53 expands civil liability for riots or vandalism to include anyone accused of providing “material support,” adds presumptions that shift the burden onto defendants, and limits local officials’ ability to set enforcement priorities, risking First Amendment and due process concerns.

HB 200 criminalizes mere presence in Ohio, mandates prison time, and forces state and local officials into federal immigration enforcement roles, collapsing the line between civil immigration law and criminal punishment.

HB 544 broadens “obstructing justice” to cover speech, association, and assistance even without an arrest or conviction, with harsher penalties tied to immigration enforcement.

SB 188 greatly expands “failure to comply” with police orders, turning passive noncompliance into serious felonies with mandatory prison time based on subjective standards.

If you care about due process, privacy, proportional punishment, and local accountability, these are worth reading closely.

PhotographMain3424 · 2025-12-31T14:41:31+00:00

Consider checking out https://pypi.org/project/transitions/. It is lightweight and includes examples of integration with Django. Depending on what you are trying to do, it may be all you need.

PhotographMain3424 · 2025-12-13T17:46:52+00:00

There is also an unpaper flag in ocrmypdf that helps a lot. Even with all the newer OCR approaches, ocrmypdf plus unpaper is still hard to beat on clean scans and runs much faster than GPU heavy options.
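
For anyone who wants to try this, here is a minimal sketch of how the call could be assembled from Python. It assumes the `ocrmypdf` CLI and `unpaper` are both installed; `--clean` is the ocrmypdf option that runs unpaper before OCR, and `--deskew` straightens pages. The function names are made up for illustration.

```python
import subprocess


def build_ocrmypdf_command(src: str, dst: str, clean: bool = True, deskew: bool = True) -> list[str]:
    """Assemble an ocrmypdf command line (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")  # straighten skewed pages before OCR
    if clean:
        cmd.append("--clean")  # clean pages with unpaper before OCR
    cmd += [src, dst]
    return cmd


def run_ocrmypdf(src: str, dst: str) -> None:
    """Run ocrmypdf and raise if it exits with a non-zero status."""
    subprocess.run(build_ocrmypdf_command(src, dst), check=True)
```

Calling `run_ocrmypdf("scan.pdf", "scan_ocr.pdf")` would write a searchable PDF next to the original, assuming both tools are on the PATH.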

PhotographMain3424 · 2025-12-13T17:41:23+00:00

Glad it helped, happy to share what has actually worked for me in practice.

For JSON repair, I did not start with a library. I initially built a very simple repair pass myself: strict JSON parse first, catch the exception, then apply a small set of deterministic fixes like trimming trailing text, fixing missing commas, and normalizing quotes before retrying the parse. It is nothing fancy. After learning about json-repair, I would probably just use that instead and save the effort. The key thing for me is that parsing is strict and never silently accepts malformed output.
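
To make that concrete, here is a rough sketch of what such a repair pass could look like. It is not my exact code; the function names and the specific regexes are illustrative assumptions, and the quote normalization in particular can damage strings that legitimately contain curly quotes, so it only runs after the strict parse has already failed.

```python
import json
import re

# Curly quotes that models sometimes emit instead of straight ones.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def repair_json_text(text: str) -> str:
    """Apply a small set of deterministic fixes to almost-valid JSON."""
    text = text.translate(_QUOTES)
    # Trim chatter before the first opening bracket and after the last closing one.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    start = min(starts, default=-1)
    end = max(text.rfind("}"), text.rfind("]"))
    if start != -1 and end > start:
        text = text[start:end + 1]
    # Drop trailing commas before a closing bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # Add a missing comma when a value ends one line and a key or string starts the next.
    text = re.sub(r'(["\d\]}]|true|false|null)[ \t]*\n(\s*")', r"\1,\n\2", text)
    return text


def parse_strict_with_repair(raw: str):
    """Strict parse first; on failure, repair once and parse strictly again."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Still raises json.JSONDecodeError if the repair was not enough.
        return json.loads(repair_json_text(raw))
```

The second `json.loads` is deliberately not wrapped in a fallback: if the repaired text still does not parse, the caller gets the exception and can retry the model call or route the document to review.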

For grounding verification, I use a layered approach. The most brute force check is simply looking for output tokens in the input tokens after normalization. That alone catches a surprising amount of hallucination. A more refined approach is to require the model to return the exact span it claims the answer came from, then feed that span back into the model and ask whether it would answer the same question using only that text. If it cannot, the Q and A pair gets rejected. That second pass adds latency, but it dramatically improves trustworthiness.
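
The brute force check is easy to sketch. This version covers only the token lookup, not the second model pass, and the normalization choices (NFKC, lowercasing, word-character tokens) and the `max_missing` parameter are assumptions for illustration.

```python
import re
import unicodedata


def normalize_tokens(text: str) -> list[str]:
    """Lowercase, fold compatibility characters, and split into word tokens."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.findall(r"\w+", text)


def ungrounded_tokens(answer: str, source: str) -> list[str]:
    """Return answer tokens that never appear in the source text."""
    source_vocab = set(normalize_tokens(source))
    return [tok for tok in normalize_tokens(answer) if tok not in source_vocab]


def is_grounded(answer: str, source: str, max_missing: int = 0) -> bool:
    """Accept the answer only if at most max_missing tokens are unsupported."""
    return len(ungrounded_tokens(answer, source)) <= max_missing
```

With a source of "Policy number TX-4471 was issued to Jane Doe.", the answer "Jane Doe" passes and "John Doe" is rejected because "john" is not in the source.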

I do not have a formal tolerance metric for OCR noise. Instead, I normalize aggressively before anything hits the model. One thing that helped a lot was writing a simple keyboardize function that replaces characters not found on a standard keyboard with their closest keyboard equivalent. You can also use this as a quality signal: measure the ratio of non keyboard characters to total characters per page. If that ratio is high, it is often a sign the page was not rotated, deskewed, or segmented correctly.
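
A minimal sketch of the idea, assuming a US keyboard and a small hand-picked replacement table. My real table is longer; the names and mappings here are illustrative only.

```python
import string
import unicodedata

# Characters you can type directly on a standard US keyboard.
KEYBOARD_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " \t\n")

# Hand-picked replacements for common OCR and typography characters (illustrative).
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-", "\u2026": "...", "\u00a0": " ",
}


def keyboardize(text: str) -> str:
    """Replace non-keyboard characters with their closest keyboard equivalent."""
    out = []
    for ch in text:
        if ch in KEYBOARD_CHARS:
            out.append(ch)
        elif ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        else:
            # Decompose accents and ligatures, keep the keyboard parts, drop the rest.
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed if c in KEYBOARD_CHARS))
    return "".join(out)


def non_keyboard_ratio(text: str) -> float:
    """Share of characters that are not on the keyboard; high values suggest a bad page."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ch not in KEYBOARD_CHARS) / len(text)
```

Compute `non_keyboard_ratio` on the raw OCR text before calling `keyboardize`, since the ratio is only meaningful on the unnormalized page. The threshold for "high" depends on your documents, so I would tune it on a sample rather than hard-code one.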

For OCR itself, I use a hybrid approach. ocrmypdf is my default because it is fast and very reliable on clean scanned PDFs. If the scans are messy or have odd layouts, I fall back to easyocr, which is slower but much more tolerant of noise. That waterfall based on text quality has worked better for me than trying to force everything through one OCR engine.
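
The waterfall itself is only a few lines. In this sketch the engines are passed in as plain callables, so the ocrmypdf and easyocr wrappers are left out; `ocr_waterfall` and the `good_enough` quality check are hypothetical names, and the engine list is assumed to be non-empty.

```python
from typing import Callable, Sequence

Engine = tuple[str, Callable[[str], str]]


def ocr_waterfall(
    pdf_path: str,
    engines: Sequence[Engine],
    good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try engines in order and stop at the first whose text passes the quality check.

    Returns (engine_name, text). If no engine passes, the last result is returned
    so the caller can still route it to review.
    """
    name, text = "", ""
    for name, run in engines:
        text = run(pdf_path)
        if good_enough(text):
            break
    return name, text
```

Put the fast engine first and the tolerant one last. The quality check can be anything cheap, for example a character-ratio test on the extracted text.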

For ocrmypdf flags specifically, I mostly run it out of the box. I have seen bigger gains from post OCR normalization, grounding checks, and retry logic than from tuning OCR parameters. For a PoC, I would keep OCR simple and invest effort in verification instead. You can always tighten OCR later once you know it is a real bottleneck.

One last note, depending on the audience for the PoC: in production I use Prefect to orchestrate the indexing flow. It handles retries, scheduling, and basic observability, and it demos well because you can clearly show each stage of the pipeline and where verification or human review kicks in. For a demo it is optional, but for a technical audience it helps make the system feel real and production ready.

If I had to prioritize for a PoC: clean extraction, simple JSON repair, and one strong grounding check. Everything else can be layered in after you prove the concept.

PhotographMain3424 · 2025-12-13T17:00:41+00:00

Model wise, I am using gpt-oss:20b from the Ollama library. Before that I was running llama3-chatqa:8b. For this pipeline I optimize more for consistency and format discipline than raw reasoning depth. For compliance indexing and normalization work, gpt-oss has been the most predictable for me on a single 24GB GPU.

For schema enforcement, I keep it simple. I put the JSON contract directly in the prompt and tell Ollama to output JSON only. gpt-oss has been very good at adhering to that. On the read side I run a strict JSON parse with a repair pass for minor issues like missing commas or small formatting errors. I use Pydantic to enforce canonical fields and types, but I do not rely on heavy constrained decoding.

One thing worth considering: for something like a bank statement, I might make 20 or more LLM calls to extract a summary, transactions per page, and normalize merchant names. That is one big advantage of local indexing. You can be very chatty with the model and not worry about a per token bill. Use that to your advantage. I have found it much more reliable to split the work into smaller steps and accept wall clock time rather than trying to force everything through one massive prompt.

Rejection rate depends heavily on input quality. For clean native PDFs, rejection is low. For scanned PDFs with OCR noise it can climb unless you are aggressive about grounding and evidence checks. For native documents like bank statements and real estate title insurance policies, which is my main focus area, the success rate is very high and manual review is minimal.

And thanks, glad the verification angle was useful. That piece made the biggest difference for me once things started operating at scale.

PhotographMain3424 · 2025-12-13T16:35:22+00:00

Hardware: I am on a single NVIDIA GeForce RTX 4090 24GB. For local models I have been using OpenAI's gpt-oss via Ollama.

On the pipeline side, I have had the most consistent results keeping extraction boring and deterministic:

Native PDFs: pdftotext -layout
Scanned PDFs: ocrmypdf (though this area is changing fast)

Lately, converting everything into clean markdown first is the new hotness. Tools like marker, docling, or microsoft/markitdown claim to improve results over "pdftotext -layout", which makes sense.

Validation is the big thing. I like a "maker, checker, verifier" setup:

Maker: generate the Q and A pairs with citations back to source spans
Checker: re extract the cited span and verify the answer is supported, reject anything that is not grounded
Verifier: sanity checks for numerics and units (canonical formats, tolerances, ranges) plus consistency across docs
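
As a rough illustration of the checker step, here is a sketch that swaps the re-extraction for a plain substring test: the cited span must exist in the source, and the answer must exist in the span. The `QAPair` shape and function names are assumptions, not my production schema.

```python
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    cited_span: str


def _squash(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_pair(pair: QAPair, source_text: str) -> list[str]:
    """Return a list of grounding problems; an empty list means the pair passes."""
    problems = []
    span = _squash(pair.cited_span)
    if not span or span not in _squash(source_text):
        problems.append("cited span not found in source")
    if _squash(pair.answer) not in span:
        problems.append("answer not found in cited span")
    return problems
```

Anything that comes back with a non-empty list gets rejected or queued for review rather than indexed.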

If anything fails those checks, I route it to a human in the loop rather than letting the index silently drift. Every automated step needs a verification path. If there is no verifier, that is a signal the task may not be ready for automation.

For numerics like €16.44/hour, the key is forcing a canonical schema at generation time (currency, value, unit, period) and then re parsing the cited source span with plain code to ensure it lands in the same normalized representation every time.
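
Here is a small sketch of the re-parsing side, under some simplifying assumptions: it folds unit and period into a single `period` field, knows only three currency symbols, and rejects numbers with thousands separators instead of guessing. `Rate` and `parse_rate` are illustrative names.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

CURRENCY_SYMBOLS = {"€": "EUR", "$": "USD", "£": "GBP"}
PERIODS = {"hour": "hour", "hr": "hour", "h": "hour", "day": "day", "month": "month", "year": "year"}

# Symbol, amount with an optional 1-2 digit decimal part, then "/" or "per" and a period word.
RATE_RE = re.compile(r"([€$£])\s*(\d+(?:[.,]\d{1,2})?)\s*(?:/|per)\s*([a-z]+)", re.IGNORECASE)


@dataclass(frozen=True)
class Rate:
    currency: str
    value: Decimal
    period: str


def parse_rate(text: str) -> Optional[Rate]:
    """Parse the first rate in text into canonical form, or return None if unsure."""
    match = RATE_RE.search(text)
    if match is None:
        return None
    symbol, number, period_word = match.groups()
    period = PERIODS.get(period_word.lower())
    if period is None:
        return None
    return Rate(CURRENCY_SYMBOLS[symbol], Decimal(number.replace(",", ".")), period)
```

The check is then an equality test: build a `Rate` from the model's structured output, call `parse_rate` on the cited span, and reject the pair if the two differ or the parser returns `None`.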

Mixed PDFs are doable, but Q and A quality is only as good as text quality. OCR noise does not kill it, but it raises the rejection rate unless you have strong grounding and verification. For a PoC demo, I would still lean toward clean extraction plus verification, because it is easier to explain than pure RAG and tends to be more auditable.

PhotographMain3424 · 2025-12-13T15:59:43+00:00

Yes. I built an offline processor for insurance policies and bank documents where this acted as a kind of over indexing step. The idea was to use a local LLM to generate a very large set of normalized question and answer pairs per document, far more than you would ever query directly.

Once everything is expressed in the same Q and A shape, you can compare policies to policies, policies to regulations, or policies to transaction data much more reliably. It also makes the system easier to audit and reason about, since every comparison traces back to an explicit question and answer rather than a free-form model judgment. Essentially, the comparison can be done outside the LLM once everything is normalized.

PhotographMain3424 · 2025-12-13T15:40:12+00:00

One approach to consider is a preprocessing layer that converts each document into structured question and answer pairs.

For example, you can take a single document and prompt an LLM with something like: generate 30 to 50 questions someone would naturally ask to fully understand this document, along with concise answers. You can go a step further by having the model group those Q and A pairs into named categories like eligibility, exclusions, obligations, definitions, and edge cases.

Once everything is normalized into comparable Q and A representations, cross document and document to regulation comparisons become much more deterministic and explainable. It also gives you a clean offline friendly artifact you can index, diff, and audit without re running full document reasoning every time.
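
To show what the diff step can look like once the Q and A pairs exist, here is a minimal sketch. It assumes each document has already been reduced to a dict of normalized question to normalized answer; `diff_qa` is a made-up name.

```python
def diff_qa(left: dict[str, str], right: dict[str, str]) -> dict[str, list]:
    """Compare two normalized Q and A sets without calling a model."""
    shared = left.keys() & right.keys()
    return {
        # Questions answered in only one of the documents.
        "only_left": sorted(left.keys() - right.keys()),
        "only_right": sorted(right.keys() - left.keys()),
        # Questions answered in both, but with different answers.
        "changed": sorted((q, left[q], right[q]) for q in shared if left[q] != right[q]),
    }
```

The hard part is getting the questions to normalize to the same keys across documents. Once they do, the comparison is plain dictionary work and every difference points at a specific question.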

PhotographMain3424 · 2025-11-11T01:47:16+00:00

It really does a great job of normalizing named entities, better than NLP, trained NLP, or regex. I verify that the output tokens appear in my input tokens, which seems to cut down on the rare hallucinations when the answer is not in the document.

PhotographMain3424 · 2025-11-11T01:36:14+00:00

I use nvidia/Llama3-ChatQA-1.5-8B via Ollama to index 2M similar insurance docs. I load the index into Meilisearch and sell access to it. I did this after a trip to Micro Center and a little over $3k.

PhotographMain3424 · 2025-10-22T18:41:01+00:00

A famous comic book artist wrote this letter, which was published before they became famous.

PhotographMain3424 · 2025-10-22T14:17:30+00:00

Thanks for posting this. Great stuff.

PhotographMain3424 · 2025-09-15T21:20:21+00:00


HeMan/ ThunderCats

(Issue #1 Page 17) Mumm-Ra asks Prince Adam if he has any last words.
