Have you considered some sort of "crowdsourcing" / voluntary botnet type approach?
The ArchiveTeam[1] have a simple VM image that anyone can use to schedule and coordinate large site-archival jobs, which might already address some of the issues.
Might be tricky to find people willing to provide resources, but even with a smallish group it could work out. To guard against abuse you may need to run the same query on several hosts and compare the results, which adds to the overall request cost (rough sketch below).
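A minimal sketch of that quorum idea, assuming each volunteer host reports back a content hash for the same query; the names and threshold here are placeholders, not anything ArchiveTeam actually ships:

    from collections import Counter

    def quorum_result(results, min_agree=2):
        # Accept a volunteer-submitted answer only if at least
        # min_agree independent hosts returned the same payload.
        # `results` maps host id -> hashable response (e.g. a hash).
        if not results:
            return None
        payload, votes = Counter(results.values()).most_common(1)[0]
        return payload if votes >= min_agree else None

    # Three volunteers fetch the same page; one is stale or lying.
    submissions = {
        "host-a": "sha256:9f2c...",
        "host-b": "sha256:9f2c...",
        "host-c": "sha256:0e41...",
    }
    print(quorum_result(submissions))  # -> "sha256:9f2c..." (2 of 3 agree)

A no-quorum result just goes back into the queue for more hosts, at the cost of yet more requests.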
My approach is sufficiently fluid that this would mean pushing pretty crude code to a bunch of hosts frequently and on an irregular basis. The runs themselves are fairly ad hoc.
Being able to directly query a corpus (IA, DDG, Bing, etc.) is another option.
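For IA at least that's already doable today via the Wayback CDX server; the endpoint below is real, though I'm only showing the common parameters (url, output, limit):

    import json
    import urllib.request

    # Ask the Wayback Machine's CDX index for captures of a URL.
    api = ("http://web.archive.org/cdx/search/cdx"
           "?url=example.com&output=json&limit=5")
    with urllib.request.urlopen(api) as resp:
        rows = json.load(resp)

    # First row is the field names; the rest are captures.
    header, captures = rows[0], rows[1:]
    for cap in captures:
        rec = dict(zip(header, cap))
        print(rec["timestamp"], rec["original"], rec["statuscode"])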
Search across large corpora remains fairly expensive, so I can understand the hesitancy here.
The lack of standardisation across sites' search APIs is another frustration.
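The usual workaround is a thin adapter per backend behind one normalised interface. Everything below is hypothetical scaffolding (the field names are a guess and _fetch is a stub), just to show the shape:

    from dataclasses import dataclass
    from typing import Iterable, Protocol

    @dataclass
    class Hit:
        # One normalised search result, whatever the backend.
        url: str
        title: str
        snippet: str

    class SearchBackend(Protocol):
        def search(self, query: str, limit: int = 10) -> Iterable[Hit]: ...

    class DDGBackend:
        # Hypothetical shim: map one backend's response fields
        # onto the common Hit shape.
        def search(self, query: str, limit: int = 10) -> Iterable[Hit]:
            raw = self._fetch(query)[:limit]
            return [Hit(r["FirstURL"], r["Text"], r["Result"]) for r in raw]

        def _fetch(self, query):
            return []  # stub so the sketch runs; real code hits the API

    def federated_search(backends, query):
        # Fan one query out across backends and pool the results.
        hits = []
        for b in backends:
            hits.extend(b.search(query))
        return hits

Each new corpus then costs one shim, rather than changes everywhere the results get consumed.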
[1] https://www.archiveteam.org/