I think using vision models for browsing is the wrong approach. It's like running OCR on a PDF whose underlying text is already in digital form. It would make more sense to establish a standard, similar to meta tags, that enables the agentic web.
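As a purely hypothetical illustration (no such standard exists today, and the names below are invented), such markup might look like page-level hints pointing an agent at machine-readable actions:

```html
<!-- hypothetical markup: "agent-manifest" and "agent-actions" are invented names -->
<meta name="agent-manifest" content="/.well-known/agent.json">
<link rel="agent-actions" href="/api/actions" type="application/json">
```

The idea is the same as existing conventions like robots.txt or Open Graph tags: the page declares a structured interface so an agent doesn't have to infer one from pixels.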
If you're working from the markup rather than the appearance of the page, you're probably increasing the incentives for metacrap, "invisible text spam", and similar tactics.
PDFs are more akin to SVG than to a Word document: the format stores individually positioned glyph runs, not flowing text, so the text is often very far from "available". OCR can be the only way to reconstruct the document as it appears on screen.
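To see why, here is a minimal sketch of what a PDF content stream actually contains (the coordinates and font name are illustrative). A single word can be split across several show-text operations, each placed at explicit coordinates:

```
BT                              % begin text object
/F1 12 Tf                       % select font F1 at 12 pt
72 700 Td                       % move text cursor to (72, 700)
[(Ava) -20 (ilab) 10 (le)] TJ   % glyph runs with kerning adjustments between them
ET                              % end text object
```

Nothing guarantees that the order of these operations matches reading order, or that spaces exist as characters at all; an extractor has to reassemble words and lines from glyph positions, which is essentially the same problem OCR solves from pixels.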