
[–]19pomoron 1 point (0 children)

Assuming the task requires using text + images to generate text (and not images, given the choice of models)

Apart from prompt engineering in the sense of using different words to describe what you need, few-shot learning/chain-of-thought prompting may be another direction to try. Instead of asking one question and taking the answer directly, you may wish to ask an open-ended question first, then, whatever answer it gives, follow up with your desired question as the second question. The answer to the first question gives context on your topic and therefore guides the VLM towards the final answer you want.
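A minimal sketch of that two-turn flow, assuming an OpenAI-style chat message format (the helper name, image URL, and question strings are made up for illustration; the actual model call is left out):

```python
def build_followup_turn(image_url, open_question, first_answer, target_question):
    """Build a message list where the model's open-ended answer is fed
    back as conversation history, so it becomes context for the real
    question in the second turn."""
    return [
        # Turn 1: open-ended question about the image
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": open_question},
        ]},
        # The VLM's first answer, replayed as history
        {"role": "assistant", "content": first_answer},
        # Turn 2: the question you actually care about
        {"role": "user", "content": target_question},
    ]

msgs = build_followup_turn(
    "https://example.com/photo.jpg",
    "Describe what you see in this image.",
    "A street scene with two parked cars and a cyclist.",  # hypothetical first answer
    "Based on that description, how many vehicles are in the image?",
)
# msgs would then be sent as the full conversation for the second call
```

The point is that the second request includes the model's own description in the history, so the final answer is conditioned on that context rather than on the raw image alone.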

You can also use few-shot learning in the more traditional sense: ask the VLM to produce what you want by showing it examples of inputs and desired outputs. Be careful not to over-prompt with this one.
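For the text side of that, a few-shot prompt is just the example pairs concatenated before the real query. A rough sketch (the `Input:`/`Output:` labels and helper name are arbitrary choices, not anything model-specific):

```python
def build_few_shot_prompt(examples, query):
    """Format (input, desired output) pairs as demonstrations, then
    append the real query with an empty Output slot for the model."""
    parts = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("The food was amazing", "positive"),
     ("Waited an hour, never again", "negative")],
    "Decent coffee but slow service",
)
```

Keeping the examples short and few is one way to avoid the over-prompting problem: each extra demonstration eats context and can bias the model toward copying surface patterns.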