// image placeholder — probe examples
Motivation
Modern VLMs look impressive in demos, yet a line of work on "VLM blindness" shows they routinely fail tasks a child handles: counting overlapping objects, judging relative positions, and recognizing simple relationships. This write-up reimplements those probes from scratch and extends them, aiming to build an honest picture of where these models actually break.
What I'm looking at
- Reimplementing the core probe suite rather than trusting reported numbers.
- Stress-testing counting, spatial reasoning, and attribute binding.
- Checking whether failures are consistent across prompting and resolution changes.
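
To make the consistency check in the last bullet concrete, here is a minimal sketch of one way it could be scored. Everything in it is an assumption made for illustration: the prompt labels, the resolutions, the probe IDs, and the `failure_consistency` helper are hypothetical and not taken from the actual probe suite. The idea is to run each probe under every (prompt, resolution) pair and then label it by whether it always passes, always fails, or flips between conditions.

```python
from collections import defaultdict
from itertools import product

# Hypothetical condition grid; names and values are illustrative only.
PROMPTS = ["direct", "step_by_step"]
RESOLUTIONS = [224, 448]


def conditions():
    """Return every (prompt, resolution) pair a probe is run under."""
    return list(product(PROMPTS, RESOLUTIONS))


def failure_consistency(records):
    """Label each probe by its behavior across conditions.

    records: iterable of (probe_id, prompt, resolution, correct) tuples.
    Returns {probe_id: "always_pass" | "always_fail" | "mixed"}.
    """
    outcomes = defaultdict(list)
    for probe_id, _prompt, _resolution, correct in records:
        outcomes[probe_id].append(bool(correct))

    labels = {}
    for probe_id, results in outcomes.items():
        if all(results):
            labels[probe_id] = "always_pass"
        elif not any(results):
            labels[probe_id] = "always_fail"
        else:
            labels[probe_id] = "mixed"
    return labels


if __name__ == "__main__":
    # Made-up outcomes purely to exercise the function; not real results.
    demo = [("count_03", p, r, False) for p, r in conditions()]
    demo += [("spatial_07", p, r, r == 448) for p, r in conditions()]
    print(failure_consistency(demo))
```

A probe labeled "always_fail" points to a failure that survives prompting and resolution changes, while "mixed" suggests sensitivity to the setup rather than a stable blind spot. A three-way label is a deliberately coarse choice; a per-condition accuracy table would carry more detail.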
Findings
// in progress — detailed results, figures, and failure case studies to follow as the analysis matures.