Good question. How does one simply find the bounding quad of rotated perspective text? Will that handle perspective distortion?
I guess the author partly answers your question early on with discussion of the Merino-Gracia paper, which fits a quad to individual lines of text, and a comment about how that relies on first being able to detect lines of text.
Matt also doesn’t claim this method is better. He says “I’m sure its neither as accurate or as useful as the Merino-Gracia approach.” I assume the example text “Needlessly Complex” is a bit of self-deprecating humor, acknowledging he may not be taking the easiest path there is. But the method here seems interesting and useful to me for its approach: it doesn’t have to identify word or page boundaries, or lines of text, as a prerequisite. The assumptions are simple and the optimization is simple; it’s a nice study in different ways to think about the problem.
Finding linear boundaries of a whole block of text is much easier than finding letter boundaries.
It's a 1980s textbook matter of finding lines where the brightness gradient is extremely large.
Which algorithm are you referring to? I have a copy of Jain et al. and I can’t find what you’re describing. Do you have a link to something? The Hough transform is used in this article, if that’s what you’re thinking of, but that will not work to find the bounding box of text; the lines have to be solid, contiguous, and linear for that to work. Note the method in the article doesn’t depend on the text having a solid surround color, or even on the text being arranged in a roughly rectangular shape. And it also doesn’t depend on the text being linear. These differences are valuable: because it doesn’t make the same assumptions you’re making, this method (whether or not it’s “better”) may work in a wider variety of situations, or may make a very good complement to existing methods.
The method does have to identify lines of text to find the rotation angle, but doing so after perspective correction using the "all letters should be about the same size" assumption means that a Hough transform is enough for that step, since the lines should already be roughly parallel.
(Having to identify page boundaries is handwaved away with "I’m going to make a huge simplifying assumption that the image we’re processing basically contains only the text that we want to rectify".)
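For anyone curious what that Hough step looks like in practice, here is a rough sketch of estimating the dominant angle of a set of roughly parallel lines. This is not the article's code. The function names, the one-degree angle step, and the "sum of squared votes" scoring are my own assumptions for illustration, and the input is assumed to be a list of (x, y) points (say, letter centroids) rather than an image.

```python
import math
from collections import Counter


def estimate_skew_degrees(points, angle_step=1):
    """Estimate the common direction of roughly parallel lines, in degrees.

    Hypothetical helper: each point votes for every line (theta, rho) that
    passes through it, where rho = x*cos(theta) + y*sin(theta) and theta is
    the angle of the line's normal. The result is in [-90, 90), measured in
    the same (x, y) frame as the input points.
    """
    best_theta, best_score = 0, -1
    for theta_deg in range(0, 180, angle_step):
        theta = math.radians(theta_deg)
        c, s = math.cos(theta), math.sin(theta)
        # Bin rho to the nearest integer; collinear points share one bin
        # when theta matches the normal of their line.
        votes = Counter(round(x * c + y * s) for x, y in points)
        # Squaring rewards angles where votes pile into a few bins,
        # which is what several parallel lines produce.
        score = sum(v * v for v in votes.values())
        if score > best_score:
            best_theta, best_score = theta_deg, score
    # Convert the normal angle to the direction of the lines themselves.
    return best_theta - 90


def demo_points(slope_deg):
    """Three synthetic parallel 'text lines' sampled at every other x."""
    t = math.tan(math.radians(slope_deg))
    return [(x, offset + x * t)
            for offset in (20, 50, 80)
            for x in range(0, 200, 2)]


if __name__ == "__main__":
    print(estimate_skew_degrees(demo_points(10.0)))
```

The point of the sketch is only that once the lines are close to parallel, a single angle gets nearly all the concentrated votes, so a plain accumulator is enough. It says nothing about how well this holds up on real, noisy text.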