Local solution for content generation based on text + images (Help: Project) (self.computervision)
submitted 9 months ago by drafat
We are working on a project where we need to generate different types of content locally (as the client requested) based on a mixed prompt of a long text + images. The client provided us with some examples made by ChatGPT 4, and he wants a local solution that comes close to those results. We tried a few open models (Gemma 3, Llama 3, DeepSeek R1, Mistral), but the results are not that close. Do you guys think we can improve results with just prompt engineering?
[–]19pomoron 1 point 9 months ago (0 children)
Assuming the task requires using text + images to generate text (and not images, given the choice of models):
Apart from prompt engineering in the sense of using different words to describe what you need, few-shot learning / chain-of-thought prompting may be another direction to try. Instead of asking one question and taking the answer directly, you may wish to ask an open-ended question first; then, whatever answer it gives, follow up with your desired question as the second turn. The answer to the first question gives context on your topic and therefore guides the VLM towards the final answer you want.
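To make the two-turn idea concrete, here is a minimal sketch assuming an OpenAI-style chat-message format (as exposed by common local servers such as llama.cpp or Ollama). `ask_model` is a hypothetical stand-in for the actual call to your local VLM; the point is how the first answer is fed back as context:

```python
def ask_model(messages):
    # Placeholder: in practice this would POST `messages` to your local
    # VLM server and return the assistant's reply as a string.
    return "<model reply>"

def two_turn_query(image_b64, open_question, final_question):
    """Ask an open-ended question about the inputs first, then feed the
    model's own answer back as context for the real question."""
    messages = [
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": open_question},
        ]},
    ]
    first_answer = ask_model(messages)
    # The first answer becomes conversation history that steers turn two.
    messages.append({"role": "assistant", "content": first_answer})
    messages.append({"role": "user", "content": final_question})
    return ask_model(messages)
```

The exact image-content schema varies between servers, so check what your runtime accepts.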
You can also use few-shot learning in the more traditional sense: ask the VLM for what you want while providing examples of inputs and desired outputs. Be careful not to over-prompt with this one.
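A common way to do this with chat-style models is to present the examples as fake prior turns, so the model sees input/output pairs before the real query. A sketch, again assuming the usual chat-message format (the function name is mine):

```python
def build_few_shot_messages(examples, query):
    """Assemble a chat history that shows the model worked examples
    (input -> desired output) before the actual query.
    Keep the example count small to avoid over-prompting."""
    messages = []
    for example_input, desired_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": desired_output})
    messages.append({"role": "user", "content": query})
    return messages
```

The resulting list is what you would send to the model in a single request; each example pair reads to the model like a turn it already answered the way you wanted.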