Yeah, this is the problem with Frankensteining two systems together. Giving an LLM a prompt plus a separate module that interprets images for it leads to exactly this.
The image parser goes "a crossword, with the following hints", when what the AI needs to do the job is an actual understanding of the grid. If one single system understood both images and text, it could hypothetically understand the task well enough to fetch the information it needed from the image. But LLMs aren't really an approach to any true "intelligence", so bolted together like this they'll never be able to do that.
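Rough sketch of the bottleneck I mean (every name here is made up for illustration, not any real API): the caption-style handoff flattens the puzzle into prose, so the grid geometry is gone before the LLM ever sees it.

```python
from dataclasses import dataclass

@dataclass
class CrosswordGrid:
    """What a solver actually needs: cell layout plus clue numbering."""
    width: int
    height: int
    blocked_cells: set[tuple[int, int]]  # positions of black squares
    clues: dict[str, str]                # e.g. {"1-Across": "Feline pet"}

def image_module_caption(image_bytes: bytes) -> str:
    # The bolted-on vision module summarizes the picture as prose.
    # Grid structure is already lost at this point.
    return "A crossword puzzle with the following hints: 1-Across: Feline pet, ..."

def llm_solve(prompt: str) -> str:
    # The language model only ever receives this flat string.
    return f"(attempted solution for: {prompt!r})"

# Pipeline as it exists today: image -> caption -> LLM.
caption = image_module_caption(b"...png bytes...")
print(llm_solve(caption))  # no idea how long 1-Across is, or what crosses it

# What a genuinely unified system could work from instead:
grid = CrosswordGrid(
    width=15, height=15,
    blocked_cells={(0, 4), (0, 10)},
    clues={"1-Across": "Feline pet"},
)
```

The point being: the interface between the two pieces is a string, and a string can't carry "this answer is five letters and shares its third letter with 4-Down".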
Honestly, that makes sense: the active voice version is just… more efficient and easier to parse quickly.