// image placeholder — probe examples

Motivation

Modern vision-language models (VLMs) look impressive in demos, yet a line of work on "VLM blindness" shows they routinely fail tasks a child handles: counting overlapping objects, judging relative positions, recognizing simple relationships between shapes. This write-up reimplements those probes from scratch and extends them, to build an honest picture of where these models actually break.

What I'm looking at

  • Reimplementing the core probe suite rather than trusting reported numbers.
  • Stress-testing counting, spatial reasoning, and attribute binding.
  • Checking whether failures persist across changes in prompt wording and image resolution.
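
To make the last point concrete, here is a minimal sketch of how a consistency check across prompts and resolutions could be structured. It is not the actual harness behind this write-up. Every name in it (`CountingProbe`, `make_probe`, `evaluate`, `consistent_failures`, `stub_model`) is hypothetical, the prompt strings and resolutions are illustrative assumptions, and probes are kept as scene descriptions rather than rendered images so the example stays dependency-free. A real run would render each probe at the requested resolution and call an actual VLM in place of the stub.

```python
import random
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CountingProbe:
    """One counting item: circles as (cx, cy, r) in unit coordinates."""
    circles: Tuple[Tuple[float, float, float], ...]

    @property
    def answer(self) -> int:
        # Ground truth is simply the number of circles placed.
        return len(self.circles)


def make_probe(n_circles: int, rng: random.Random) -> CountingProbe:
    """Place circles in a small central region so they are likely to overlap."""
    circles = tuple(
        (rng.uniform(0.35, 0.65), rng.uniform(0.35, 0.65), 0.15)
        for _ in range(n_circles)
    )
    return CountingProbe(circles)


# Assumed conditions, chosen only for illustration.
PROMPTS = [
    "How many circles are in the image?",
    "Count the circles. Answer with a single number.",
]
RESOLUTIONS = [224, 448, 896]
CONDITIONS = list(product(PROMPTS, RESOLUTIONS))

# A model is anything that maps (probe, prompt, resolution) to a predicted count.
ModelFn = Callable[[CountingProbe, str, int], int]


def evaluate(model: ModelFn, probes: List[CountingProbe]) -> Dict[Tuple[str, int], float]:
    """Accuracy for each (prompt, resolution) condition."""
    return {
        (prompt, res): sum(model(p, prompt, res) == p.answer for p in probes) / len(probes)
        for prompt, res in CONDITIONS
    }


def consistent_failures(model: ModelFn, probes: List[CountingProbe]) -> List[CountingProbe]:
    """Probes the model gets wrong under every condition."""
    return [
        p for p in probes
        if all(model(p, prompt, res) != p.answer for prompt, res in CONDITIONS)
    ]


def stub_model(probe: CountingProbe, prompt: str, resolution: int) -> int:
    """Stand-in for a real VLM call: a toy failure mode that never counts past 3."""
    return min(probe.answer, 3)


rng = random.Random(0)
probes = [make_probe(n, rng) for n in range(2, 6)]  # 2, 3, 4, and 5 circles
accuracy = evaluate(stub_model, probes)
stubborn = consistent_failures(stub_model, probes)
```

Splitting per-condition accuracy from the set of probes that fail under every condition is the point of the design. The first shows whether a prompt or resolution change moves the numbers at all. The second isolates the items that no such change rescues, which are the ones worth turning into case studies.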

Findings

// in progress — detailed results, figures, and failure case studies to follow as the analysis matures.