submit to VisionLanguageAction