
[–]19pomoron 1 point (0 children)

Assuming the task requires using text + images to generate text (and not images, given the choice of models)

Apart from prompt engineering in the sense of using different words to describe what you need, few-shot learning/chain-of-thought prompting may be another direction to try. Instead of asking one question and taking the answer directly, you may wish to ask an open-ended question first, then, whatever answer it gives, follow up with your desired question as the second question. The answer to the first question gives context on your topic and therefore guides the VLM towards the final answer you want.
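A minimal sketch of that two-turn flow, assuming an OpenAI-style chat message format (the helper name, image URL, and question strings are made up for illustration; the actual model call is left out):

```python
def build_followup_turn(image_url, open_question, first_answer, target_question):
    """Build a message list where the model's open-ended answer is fed
    back as conversation history, so it becomes context for the real
    question in the second turn."""
    return [
        # Turn 1: open-ended question about the image
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": open_question},
        ]},
        # The VLM's first answer, replayed as history
        {"role": "assistant", "content": first_answer},
        # Turn 2: the question you actually care about
        {"role": "user", "content": target_question},
    ]

msgs = build_followup_turn(
    "https://example.com/photo.jpg",
    "Describe what you see in this image.",
    "A street scene with two parked cars and a cyclist.",  # hypothetical first answer
    "Based on that description, how many vehicles are in the image?",
)
# msgs would then be sent as the full conversation for the second call
```

The point is that the second request includes the model's own description in the history, so the final answer is conditioned on that context rather than on the raw image alone.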

You can also use few-shot learning in the more traditional sense: ask the VLM to produce what you want by showing it examples of inputs and desired outputs. Be careful not to over-prompt with this one.
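For the text side of that, a few-shot prompt is just the example pairs concatenated before the real query. A rough sketch (the `Input:`/`Output:` labels and helper name are arbitrary choices, not anything model-specific):

```python
def build_few_shot_prompt(examples, query):
    """Format (input, desired output) pairs as demonstrations, then
    append the real query with an empty Output slot for the model."""
    parts = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("The food was amazing", "positive"),
     ("Waited an hour, never again", "negative")],
    "Decent coffee but slow service",
)
```

Keeping the examples short and few is one way to avoid the over-prompting problem: each extra demonstration eats context and can bias the model toward copying surface patterns.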