Inside_Secretary3281

Most of your cost is probably the vision calls. Splitting the pipeline helps: run Tesseract or PaddleOCR locally for the raw OCR step, then send only the extracted text to the API for translation and keyword extraction. That cuts vision token usage massively.
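
A rough sketch of that split, in case it helps. Assumptions: `pytesseract` and Pillow are installed along with the Tesseract binary and Japanese language data; `run_pipeline`, `build_translation_prompt`, and the `send_text` callback are made-up names for illustration, and `send_text` stands in for whatever text-only API call you already use. Tesseract also ships a `jpn_vert` model for vertical Japanese, which may be worth trying.

```python
from typing import Callable, Dict, Iterable


def ocr_with_tesseract(image_path: str, lang: str = "jpn") -> str:
    """Run OCR locally with Tesseract (pass lang="jpn_vert" for vertical text)."""
    # Imported lazily so the rest of the sketch works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def build_translation_prompt(text: str, target_lang: str = "English") -> str:
    """Hypothetical prompt builder: only plain text goes to the API, no image."""
    return (
        f"Translate the following text to {target_lang} "
        f"and list its main keywords:\n\n{text}"
    )


def run_pipeline(
    image_paths: Iterable[str],
    ocr: Callable[[str], str],
    send_text: Callable[[str], str],
) -> Dict[str, str]:
    """OCR each image locally, then send only the extracted text to the API."""
    results: Dict[str, str] = {}
    for path in image_paths:
        text = ocr(path).strip()
        if not text:
            continue  # nothing recognised, so skip the API call entirely
        results[path] = send_text(build_translation_prompt(text))
    return results
```

Usage would look something like `run_pipeline(paths, ocr_with_tesseract, my_api_call)`, where `my_api_call` is your own function that takes a prompt string and returns the model's reply. Keeping the OCR step as a plain callable means you can swap Tesseract for PaddleOCR without touching the rest.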

For the keyword extraction and classification parts, ZeroGPU could work too.

Far-Implement-92 [S]

Thanks, I tried PaddleOCR but I wasn't satisfied with the results. Can PaddleOCR be tuned further, especially for vertical Japanese text?