
On the other side, I've also noticed it appears to be aggressively pruning its index in the past few years, so the fact that it's crawled your site doesn't mean it's necessarily searchable either.

Another "bug" that seems to manifest quite often: if I search for a specific phrase or unique word on a page that I found in a SERP, so I know it's crawled that page, it often doesn't return that page either.

Add to that the automatic CAPTCHA-hellban you get if you use "site:" in anything more than a tiny amount (and the one you still get if you search "too much"), and I realise that there's increasingly huge amounts of information out there on sites that Google may have crawled before and knows about, but doesn't want to show me for some reason. I remember it used to be much easier to find information about obscure topics even if it meant wading through dozens of pages of SEO spam; now it's nearly impossible for anything but the most vapid of queries.



Another bug I'm noticing lately is it'll flat out ignore things sometimes, even if you put a term in quotes or try to exclude it with a -leadingdash. About 30% of the time if I use those operators, they'll have no effect on the results. I don't understand why they'd make things worse on purpose, but I don't know how it could be just a "mistake" no one noticed.


Search engines in general have realized that it's more profitable to show you irrelevant results than to show you nothing. Furthermore, they've realized it's more profitable to show you irrelevant results laden with their ads than show you highly relevant results from ad-free sites.

Perverse incentives at work!


This is precisely what happened. When google merged with doubleclick.net the new company should have been named doubleclick.net and not google. The old google ceased to exist at that point and was swallowed by an advertising company.

I strongly agree with this Bill Hicks bit on advertising:

https://www.youtube.com/watch?v=-gd01vfKfr0


Is there any good search engine these days?


grep


Not really. LLMs are good if you account for hallucination.


phind.com

Kagi (don’t use it personally)


[flagged]


I was being hyperbolic but it's not that far from the reality of the situation. Google's decline started around the same time as that merger.

I'm not the only one who thinks this way: https://www.nytimes.com/2020/09/21/technology/google-doublec...


See also Boeing merging with McDonnell Douglas[1]

[1] https://qz.com/1776080/how-the-mcdonnell-douglas-boeing-merg...


I know there were rumblings in the late 00's and early 10's about how McDonnell Douglas culture and executives were ruining Boeing.

But some people take a step farther back than this and blame Congress for the 737 MAX. They basically forced the merger, and unhappy weddings make for unhappy homes.


I've seen plenty of mergers where there's a weird brain transplant and flippozambo! the acquired company's leadership is now in charge of the buying company. The fish, as they say, rots from the head.


I totally agree with you. Ads have influenced everything they’ve done since. It’s like a brilliant, talented individual who has been addicted to heroin for a decade.


Yeah, but google also became wildly successful to the point that they blow money on ventures with no real business plan and give up two years later when they can't turn a profit. They effectively have a blank check at all times. They're more like a businessman who's addicted to making money at the expense of all their personal relationships.


Regarding unhinged ideas, doubleclick is quite old, but is it old enough that opening a hyperlink would've typically required a double click at the time? Or is the metaphor here that their ads are so amazing people are double-clicking them in ecstasy?


Double-click, as others have said was never something you did with hyperlinks, even before the web.

Double-clicks were used with icons on the desktop because you could do more with an icon than just open it. You could move it, copy it, etc. Double-click was a convention for a shortcut to open the reference of the icon. A single-click would have not allowed those other actions.


Because of this, double-click became business speak for going to the next level of detail, digging into, etc.

The idea behind this name for ads was: this company makes ads relevant and compelling, so users drill into them and find whatever you want to advertise.

For what it's worth, because of the affordance you mention, even though users didn't have to, they consistently double-clicked banner ads, and most things they wanted to activate, even after they learned they only had to single click the blue underlined things.


Your use of the past-tense is premature, I see double-clicking all the time at work.


Have we moved so far past the desktop that this knowledge is starting to be lost?

I had to swap real quick to my desktop and see if I still had to double click an icon just to be sure.

I mostly access things with the super key / search now and I guess people with phones would just tap an icon.


You can configure single clicking in Windows, or at least you once could.


You can but that means hovering over files for a second will select them (and clear your previous selection). Other systems (e.g. KDE) manage single click to open without that annoyance.


> A single-click would have not allowed those other actions.

Yes it would. And did. Windows chose double click to open but other systems managed with a single click while still allowing you to drag around icons and files.


Opening a hyperlink has never required a double click in browsers. Not from Mosaic forward.


I find Amazon really irritating for that. I do a search for a very specific thing, and a ton of results always come back, often having nothing to do with my search request. And sponsored results both at the top and scattered through the results.


Amazon has gotten so bad that unless I know an exact part number or model, I don't bother. I'll go somewhere else for any research and only come back to Amazon if I want to price shop what I found.


Even with an exact part number, it will often push related items first. I was searching for a specific thermal printer, literally using the PN (something like C18647585), and it still decided to show me "sponsored" and related thermal printers first. So it somehow knew that part number as a keyword for thermal printers, but just didn't want to show me the one result that actually would be helpful (it was a third party seller, so maybe that penalizes the result?)


Amazon is so bad that I shop on Walmart's website now.


I get better results by searching Amazon via DDG, Brave or Kagi. Amazon's search, especially for books, is nearly useless by comparison.


Luckily for us we have the high-quality, independent book data provider Goodreads! /s


There's Worldcat, though its site redesign last year made it useless for me.

I'm finding Open Library (part of the Internet Archive) is increasingly useful for book search.

You might also have success with a major library (e.g., the British Library, the Library of Congress, major US city libraries (NYC, Boston, LA, San Francisco, Chicago, etc.), and some academic libraries). Watch that these aren't in fact backed by Worldcat though. (Many local library systems are.)


I found goodreads search to be quite good, where's it lacking for you?


Goodreads is tolerable, but mainly as a data source. The product itself has been in maintenance mode for... a decade?... or basically since the Amazon acquisition.

They rolled out a completely new design semi-recently for... half of the pages... and left the other half on the 10+ year old styles.

It just feels like Amazon is happy to take advantage of its dominant position with Goodreads having a more complete catalog than any of the other more open offerings. And yet, they seem to invest no effort in modernizing or improving the site, making it more performant, etc. The moderation tools kinda suck too — doing super common things like merging (incorrect) duplicate listings is a PITA.

Also the app exhibits blatant conflicts of interest like prioritizing buying new books from Amazon over, e.g., digital library loans, with no option for users to configure that.

Nothing I'm saying here is new though - https://en.wikipedia.org/wiki/Goodreads#Criticism.


Amazon owns Goodreads. It's not independent. It's also not mentioned on the site afaik. (They also own IMDb and a bunch of other internet companies that aren't Amazon-branded). If you want something independent, try storygraph or librarything.


Yeah, I knew about IMDb. I know a guy at Amazon who said IMDb used to actually be written in Perl; they rewrote it in Java over a few years.


This is particularly bad if you search for a type of thing, e.g. "mechanical keyboard". Many of its top suggestions will be for nonmechanical keyboards and that won't be obvious without reading their descriptions carefully.


Aliexpress is the worst for this...

"keyboard non mechanical like cherry switch membrane touch rgb light clicky gaming blogging ergonomic xiaomi redmi arduino android ios windows laptop desktop computer tablet phone", and if you sort by price, the first one is $0.99.

Then you have 5 different colors of plasticky $15 keyboards and a usb card reader for $0.99 to choose.


“Mechanical keyboard like”

What’s infuriating is how this lying has become normalized in “good” brands. For instance, try to buy a 60” TV. I do not think you can find one. They are all 59.5” and sold as ‘60” class’.


That usually means that discriminatory taxes or regulations are being dodged, for better or worse.


Perhaps that's one reason but it's also a >1% reduction in display area which is not nothing.


Stop buying from Amazon. I haven’t bought anything from them in years. There is nothing that Amazon offers for sale that you can’t find somewhere else, aside from maybe entertainment content that they produce.

Don’t reward bad behavior or they’ll keep doing it.


What you can find on amazon that you can't find elsewhere (at least here):

- All your regular non-food needs in one cart.

- No-hassle refunds if there is a problem with the order.

The first one is just convenience and I could do without it. The second one is where most other stores fail. Or at least enough of them that I don't want to risk having to phone their hotline or pay for return shipping because they or their delivery contractor fucked up - or fraudulently claimed they tried to deliver when they made no such attempt at the specified address, and instead want you to waste your time picking up the package at a random location across town.


Also on the HN front page right now is an example of the price we pay when we prune "undesirable" websites from our search indices:

https://www.quantamagazine.org/sci-fi-writer-greg-egan-and-a...

I don't know if 4chan is included in the google index, but I've never gotten a 4chan result in any search I can recall.


Why would you expect to get a 4chan page? None of that data is persistent. IIRC google relies on links to the page, so that is impossible; plus the content rotates constantly when threads drop off the last page.


Because you get 4chan results in duckduckgo and yandex.


What's the query and result?

I still don't get how you expect it to work when the content rotates quickly and disappears

If you search 4chan on Google, it does come up in the results, along with a safe search warning.


Only a couple boards on 4chan update quickly, the vast majority contain threads which stand for months at a time.


Funny you say that. I got referred to the local AI models thread (/lmg/) on the technology board just the other day.


The invisible hand of the market at work. Nothing perverse about it. It's generating more money, and that's the only metric that matters.

Nothing to see here.


Generating money for google is not the only metric that matters for the users. The incentives are perverse from the perspective of everyone other than google executives and investors with significant google holdings.


And we are seeing alternatives like Kagi pop up because of it.


Let's ask Infoseek about that.


[dead]


Your use of Tinyurl doesn't really shorten the URL. Please don't obfuscate unnecessarily.


Ideally, I'd like to delete messages after a day, a month, a year. But HN messages stay online forever and search engines eventually pick up on them

I just don't like being in the panopticon


Link is https://www.aisearch.vip/aisearch#

There are enough dead "short" links on forums already.


that short URL goes to aisearch.vip


I notice this happening when the actual query would have returned 0 results: Google "helpfully" will modify your query (such as by dropping quotes) to generate more results.

This is super annoying because it doesn't appear to inform you of this anywhere in the UI, until you click through to page 2 and see what it modified your query to be.
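The silent fallback being described can be sketched roughly like this (a toy in-memory index, not Google's actual pipeline; all names are made up for illustration):

```python
def lookup(docs, terms, require_all):
    """Toy index lookup: return docs containing all (or any) of the terms."""
    hits = []
    for doc in docs:
        words = doc.lower().split()
        matched = sum(1 for t in terms if t.lower() in words)
        if matched == len(terms) or (not require_all and matched > 0):
            hits.append(doc)
    return hits

def search(docs, query):
    # A quoted query demands every term; if that yields zero results,
    # silently relax to any-term matching -- mimicking the behavior
    # described above, where the quotes are dropped without telling you.
    terms = query.strip('"').split()
    results = lookup(docs, terms, require_all=True)
    if not results:
        results = lookup(docs, terms, require_all=False)
    return results

docs = ["red thermal printer", "blue inkjet printer"]
print(search(docs, '"thermal printer"'))  # exact-phrase terms all present
print(search(docs, '"green printer"'))    # zero exact hits, so relaxed
```

The complaint in the thread is precisely that the second branch runs without any indication in the UI that the query was rewritten.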


Over the years, I've found the frequency of "0 result" queries has gone way up. Subjective anecdata from me, but it's a pretty big difference. There must be some large areas of their index that have been dropped over time.

From what Google's hinted at and probably your own experiences (which I reckon are like mine), it's pretty clear that most folks aren't great at Google searches. This might be why Google has leaned on AI to "guess" the best results. They figure their AI can predict what you want better than you are able to specify via your search query.


Which has the really annoying side effect that now you actually need to produce a worse query to get decent results.

A few years back I used to scoff at people who wrote searches in the form of literal questions instead of boiling things down to key terms, e.g. the "how do i..., what is the largest..." type searches.

Nowadays I not only need to write stupid searches like this to get better results but quite literally find that my brain has adapted to the pattern, and my past skill at crafting salient key-term, operator-driven queries is eroding.


Also, 1 page of results. Around 10-20 links. Like, that's it? Really? No more? That's not what I see when I try the same query on Bing!


I've noticed the same. And sometimes you'll get search results showing 10+ pages, but if you actually follow through them, the results die by the second page. Google also omits many domains from search results now.


Sites like Twitter and Instagram also frequently completely change the search term now to something else for certain queries. This practice is anti-competitive in the highest order. The very foundation of having a text search is to have an exact query match to begin with... The alternate spelling item should only be a suggestion in results at the most, but they've flipped this now, and that's outright deceptive.


It is located under the search query itself. For instance, see https://www.google.com/search?hl=en&q=%22spill%20clean%20big..., a nonsense query I made up with no hits. You should see

    No results found for "spill clean big search stain".

    Results for spill clean big search stain (without quotes):
on the page.

This isn't new, I've always seen Google doing something like this. It hasn't always been large on the page but it's always been there.


Might be desktop exclusive? For me on mobile (testing another random phrase as now yours hits this post), I don't see that text, or any other indication the query has been modified.


I'm used to that. But this is worse: the search I enter would give the ideal results, yet it's only after I contort the search parameters that I find some results.


Funny that people call these "bugs" as if anything related to google search happens by accident.

They don't need to waste the eng resources or infrastructure on rock solid search anymore, they own the market and got all the users into their funnel of products, most locked in for life.

Search results still show sponsored listings, they still have all the users and all the profit, and a lot less of the profit-sucking operational costs it took to be good at the thing that made them a household name: search.


Is google trading "accuracy" for computation cost while at the same time inserting junk into the results?


They aren’t inserting junk, they just don’t do anything to rank quality results above the junk anymore. The junk was always there, on page 2 and beyond, and who would ever need to hit those pages. Nowadays I’ll be 15 pages deep and so far off base of the search term that I could write a better index with curl and regex.

The issue isn’t that they are watering down the cream in the milk, it’s that the cream isn’t part of the milk at all now.


This is a good point. Google got to the top by having the best search result back when it mattered. It now no longer matters.


There was a Google search engineer on Reddit who personally claimed the opposite: Google is going downhill, but the alternatives aren't any better. Of course I can't find it now, thanks Google.

I wish there was a search engine that ran like mid 2000s google but with a social media component so you can down vote SEO spammer blogs into oblivion.


Unfortunately, content farms can push new websites and blogs faster than you could ever downvote them. LLMs are going to make that task even easier. I've no idea how we're ever going to be able to search anything anymore using classic search engines. We either go back to website directories, or forward to AI-generated content.


Web rings.

Think about it - some human element of trust and vouching for someone being added to the ring.


Till you find out that most humans will sell ring links for a bit of cash with no problems.


Perhaps that is one additional layer of friction that will make human moderation / social voting feasible. The fire hose of AI trash content will come too rapidly for it to work at layer 1 (all content), but if the barrier to entry is a financial transaction to take over placement in a human-curated webring or directory it becomes easier to moderate / vote away the trash.


Sure. It's a problem with peer-reviewed science journals even. There are no perfect solutions to monied interests bribing the curators.


Add a trust metric and chains of provenance. Bad ring link -> bad trust percolating up that chain. Little trust, your site isn't always shown as part of the ring. Too much loss of trust, you're out.

(Ultimately, this is a bad facsimile of human group behavior - all the way up to shunning people who deeply violate group norms. And I don't think it'll scale super-well. )
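One hypothetical way to sketch that back-propagation (all names, weights, and thresholds invented for illustration): each site records who vouched for it, and a violation penalizes the offender and then, with decaying weight, everyone up its vouching chain:

```python
DECAY = 0.5       # fraction of a penalty that propagates to the voucher
MIN_TRUST = 0.2   # illustrative cutoff below which a site leaves the ring

def penalize(trust, vouched_by, site, penalty=0.4):
    """Subtract `penalty` from `site`'s trust, then walk up the vouching
    chain, shrinking the penalty at each hop."""
    while site is not None and penalty > 0.01:
        trust[site] = max(0.0, trust[site] - penalty)
        site = vouched_by.get(site)  # whoever vouched for this site
        penalty *= DECAY

# a vouched for b, b vouched for c; c posts a bad ring link
trust = {"a": 1.0, "b": 1.0, "c": 1.0}
vouched_by = {"c": "b", "b": "a", "a": None}
penalize(trust, vouched_by, "c")
print(trust)  # c is hit hardest, then b, then a
```

This captures the "bad trust percolating up the chain" idea; real systems would need multidimensional trust and per-user perspectives, as the thread goes on to note.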


That's pagerank, right? The trust was built from href votes.


Except there's no provenance or root of trust. There is (IIUC) no back-propagation of a penalty if sites violate trust, just an overall observational measure.

And I'd still say pagerank did work really well in an Internet where there was overwhelmingly trust. But in a world where default-trust is a bad stance, I believe there needs to be an equivalent of what "You can trust X" does in small in-person groups. (Or, alternatively "Sheesh, X went off and just destroyed all trust")

I do think it'll need to be more than a single metric, too. Trust is multidimensional by topic (e.g. "I trust the NYT's data science folks, I have zero trust for the OpEds"), and it is somewhat personal. (E.g. I might have experienced X lying to me, while they've been 100% honest to you - maybe in/outgroup, maybe political alignment, maybe differing beliefs, etc. Ultimately, what we call trust in an indirect situation is "most of my directly trusted folk vouch for that person".)


Keyservers. You decide which keyservers to register with and to trust for verifying others. Browsers would handle en-decryption automatically and allow you to flag, filter, or finger (in the Unix sense).


No, they can't. Or at least, they don't. I see the same trash-fire sites on Google all the time. Google just DGAF.


I used DDG for a while, but DDG's quality fell precipitously a few years ago (similar issues where it ignores quotes and won't find pages even if you search for the title string exactly, etc) and I eventually came back to Google which has also been increasingly frustrating.

> I wish there was a search engine that ran like mid 2000s google but with a social media component so you can down vote SEO spammer blogs into oblivion.

There's no way this won't get abused, but the SEO stuff is out of control. Not even spammer blogs, but if you have a quick question like "how do I check tire pressure" you will only get articles that start with a treatise on the entire history of car tires and the answer is deeply buried somewhere in the article. My guess is that Google sees that we're on the page for a longer time than we would spend on pages that just return the answer, and they assume that "more time on page" == "better content" or something.


DDG has become ridiculous. They seem to be merging "local", geoIP based results no matter what country I select on the region list (or I disable it). Very often completely unrelated stuff (but local) appears on the 5th or 6th result, midway the first page.

Most egregiously I will search for something very rare (e.g. about programming) and DDG will return me results regarding my city's tourist/visitor info. It's as if it just keeps ignoring words from the search prompt that return no results until it runs out of keywords then it's just the geoIP results.


I hate this forced localization so much, and it's everywhere. The internet used to be a place where you would actually encounter stuff outside your locale.


That is because DuckDuckGo started relying almost entirely on Bing for their regular search results after first Yahoo gave up maintaining its own index and then Yandex became a natio non grata, leaving them to choose between partnering with Bing, partnering with Google, or creating their own index: https://help.duckduckgo.com/duckduckgo-help-pages/results/so...


The tire pressure query is exactly the kind of thing that AI should be able to handle easily, though. At which point google has an incentive to sort their competitiveness out.


Kagi is an alternative and it is worlds better. Try it out!

https://kagi.com


Love kagi. The first time I got the "your payment was successful" notification I felt like I'd never get that much value out of it. But now, a few months later, I feel like I could never go back.


Don't they use Google for their results? I'd imagine they'd run into the same issues the article is pointing out.


No? At least not like you are implying. Kagi queries multiple data sources and synthesizes results. This means Google’s failure to index does not impact Kagi in the same way as it would DDG (with Bing).


I'm also a huge fan of Kagi. I've been a paying user since they launched the paying subscription. Really happy with it!


Though a subscriber myself, Kagi doesn't really add results, does it? It merely weeds out the trash for you. So you can get to the bottom of search results.


Just being able to block spammy Stack Overflow clones from ever appearing in the results is worth the price of admission for me.


Here is another, just launched: https://greppr.org/


Looks promising, though I noticed that it doesn't encode queries properly when searching. For example, if you go to the homepage and search for "../robots.txt", you'll be redirected to the site's own robots.txt file
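For what it's worth, the usual fix for that class of bug is to percent-encode the query before embedding it anywhere URL-shaped. A minimal sketch with Python's standard library (the `/search?q=` route is just an assumption about the site's URL scheme):

```python
from urllib.parse import quote

def search_url(query: str) -> str:
    # safe="" encodes "/" as well (quote's default leaves it alone),
    # so "../robots.txt" becomes an inert query parameter instead of
    # escaping into the path and triggering a redirect.
    return "/search?q=" + quote(query, safe="")

print(search_url("../robots.txt"))  # -> /search?q=..%2Frobots.txt
```

The server side then has to decode the parameter rather than splicing the raw string into a path.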


Thank you kindly for testing, I'll need to fix that one.


Checking this out, thanks mate.


What I want is a "serious mode" that makes it favor primary sources, peer reviewed papers, and raw data. When I search for economic data, I don't want a million news articles referencing pieces of it. I want the raw data release. When I search for some video going viral, I don't want a million videos of journalists talking and showing clips. I want the full raw video.


Beautifully said! As a thinker of philosophy, I have come to understand that our clip-society is that way by design. People can exert power over others if they tell you a construction and then show a clip to support it. They really don't want you to see the source/what it is/the truth. They want you to see what they show you. This problem is accelerating in western societies and it is a fundamental problem of human nature. Journalism is the healthy expression, and what we see in today's media is the sickly end.


Google scholar?


> I wish there was a search engine that ran like mid 2000s google but with a social media component so you can down vote SEO spammer blogs into oblivion.

This is sort of what I've been trying to do with Marginalia Search, except I don't really believe a voting system would work. It's far too easy to manipulate. Been playing with the thought of having something like an adblock-list style system where domain shitlists can be collaborated on and shared without being authoritative for the entire search engine.

My search engine is still pretty rough around the edges and limited but I think it works well enough to demonstrate the idea has some merit.


> Been playing with the thought of having something like an adblock-list style system where domain shitlists can be collaborated on and shared without being authoritative for the entire search engine.

Even just personal shitlists would be golden and make just about everyone happy.


Something I've wanted (which probably exists as an extension in Chrome?) for Google searches is a simple blacklist. Just a little button and confirmation next to a result, telling it to never show this blog-spam-ad-laden-SEO-mess of a page to me ever again. Maybe it's an uphill battle, but for some niche topics (like certain games) there are some sites I keep having to scroll past and sometimes accidentally click that are written in SEO-speak and say a lot without saying anything at all.



Will check it out, thanks bud.


I miss AltaVista so much... It was no frills and only based on page content.


Those of us who worked there thank you!


Loved everything about Alta Vista, including the logo, and the UI.

I miss 90s Internet in general. It wasn't the ugly battleground and desolation planet that the current net has become.


Remember Guestbooks? You'd visit a website, volunteer your name and which country you were from, and leave comments. And it wouldn't be a cesspool of spam and porn and XSS attacks. How quaint!


Oh gosh, yes! And reading the guestbook was always so fun. An elderly friend of mine passed away in 2018, and in doing a (google) search of him, I found guestbooks he'd signed 20 years ago.


I loved me the 'near' keyword - thanks!


I both appreciated Alta Vista, and appreciated its office space in Littleton (I think) when I worked in it after its passing. ;)


You would look for a thing and the first five pages were random mailing list discussion archives discussing how the thing was 5 years before... Altavista was impressive, but there is a reason why it went away.


> I wish there was a search engine that ran like mid 2000s google but with a social media component so you can down vote SEO spammer blogs into oblivion.

I want this too, but I think an often understated aspect of this issue is that by this point Google has absolutely trashed the web of that era. In these threads people will say "the content you want isn't out there, it's all on social media now" -- and they're largely right, but I think Google is the party most responsible for mutilating the web to the state it is in now, and users fled to social media partly because it seemed like a safe haven.

What we need is a concentrated effort to rebuild the web. Take the best parts of what we've learned and combine with the best parts of what we've left behind and try to build something better, for humans, not for advertisers and hyper-capitalists.

That will take time, energy, and people who remember what we lost and believe we can build something better. A better search engine alone is not enough.


Largely right, but actually a lot of that stuff is still out there. The personal and hobby pages, forums, blogs, etc.

Google just doesn't know that they exist anymore, or rather doesn't want us to know, because those sites are not commercial enough or big enough.

Almost without fail, no matter what you search for, it tries its best to turn it into a search for a product or service. And those content oriented websites don't fit that, so it just pretends they don't exist.


It seems like google hardly returns results from traditional forums or blogs which has probably accelerated their decline artificially.


The web changed when every kinda slimy business bro realized they could monetize gaming search results. No matter what your fantasy web looks like, be assured, people will game it to the point it's not what you intended.


If I take that viewpoint on everything I might as well live as a recluse in the woods and avoid people altogether. I have to believe that there are enough of us are out there that genuinely want to build better things for people.


The web, just like the real world isn’t static. Becoming and staying intellectually, emotionally and physically mobile may be the only long term strategy to avoid ending up in one or the other dystopia, sooner or later.

When rates of change were slower, you might only have to “move” once in your life, but with increasing rates of change in our human experience, staying nimble is arguably of ever increasing importance.


Yes. This is like water -- keep it moving, find fresh streams.


And my point is that there are probably a lot of those motivated people working on the problem today. You make it out as though we've arrived at this state by either lack of effort or competence by Google/Microsoft. My guess is that every time they change the algorithm, the spammers adapt too. That's inevitable and would be just as much of a challenge for your supposed utopia. If you have some secret they don't, there's certainly plenty of money to be made.


I think google does OK with the syntax it still supports for text queries, but if you switch to the images tab it just throws all of that stuff out the window. I would love to be able to search for "cat eating watermelon" or whatever and only get results with cats eating watermelon, ordered by the proximity of that text to the image returned. Hopefully AI is going to do something for that, but the state of the art, as embodied by the biggest player (Google), is shamefully deficient.


I've noticed this with the quotes as well...


I'm noticing this with DDG as well. :( I guess the powers that be have decided that information must be hidden.


It's even stupider than that. There are only two major, publicly available web indexes in the USA: Google's and Bing's. After 24 February 2022, DuckDuckGo ended their partnership with Yandex, and since then they say "we have more traditional links and images in our search results too, which we largely source from Bing" https://help.duckduckgo.com/duckduckgo-help-pages/results/so...


The web indexes from Google and Bing are available publicly? I can pull it down from somewhere and try to make a search engine?


Bing at least license their indexes to partners on a commercial basis, as did Yahoo until they gave up indexing the web. I am sure that the NSA, the Chinese government, the ahrefs website, and other organizations have comprehensive indexes of the web which they don't share in this way.

Mojeek seems to be the independent, non-paywalled search engine with the biggest index, for an overview see https://seirdy.one/posts/2021/03/10/search-engines-with-own-...


Be careful! The Google search guys will come on HN and gaslight you about this, claiming that the advanced search functionality works perfectly and it's simply user error.

We know it's not, but expect them to try to tell you you're imagining things.


Those operators are no longer supported. You can use Verbatim mode, which is more like the old behavior.


What? Are you serious?


Use verbatim mode, under tools, after your initial search. They broke it, but it still helps.


> On the other side, I've also noticed it appears to be aggressively pruning its index in the past few years, so the fact that it's crawled your site doesn't mean it's necessarily searchable either.

I've noticed this as well. I have a crappy website for my app I need to do better marketing for (not my priority just now), but I've noticed that, for however crap it is, I have received ZERO incoming hits from Google, apart from a couple people that have literally just googled my domain name.

I do not believe for a second there's not a single query done in the 2 months the page has been up, globally, for which my website wasn't a bit relevant. Either that, or the spam problem Google has is much bigger than anyone thinks.

Yet another data point in favour of the Dead Internet theory.


You could try the google search console - it gives you a view on what hits/clicks have come in over time.

edit: Hah. I notice it suggests using it at the top of the page if you use 'site:...' - and I only get 5 results for my site when the console claims to have indexed 10 times that many!

edit2: Also duckduckgo returns more like 15 hits ...


Silly to see people complaining about search results and indexing without backing those claims with data from search console. It’s like devs turn off their brains when it comes to marketing because they don’t like it.


Google's bloody Search Console says I got 16 impressions in 2 months for literal searches of my domain name, and nothing else. Funny seeing people thinking I got those figures by reading tea leaves.

Who's the silly one now?


I have all sorts of things that I wrote years ago and I can never find them searching by title unless I put the specific name of the site in the query. I sure can find the slideshare though where some guy from Oracle stole not only my title but much of the content from my blog.


Dead Internet theory?


From Wikipedia:

"The dead Internet theory is an online conspiracy theory that asserts that the Internet now consists almost entirely of bot activity and automatically generated content, marginalizing human activity. The date given for this "death" is generally around 2016 or 2017."

Not sure I'd call it a conspiracy theory.


> Not sure I'd call it a conspiracy theory.

What I find funny about that framing is that, regardless of whether or not the theory has merit, a conspiracy theory by definition asserts that there exists two or more people conspiring with the intent to produce the alleged outcome. From what I understand, dead Internet theory alleges no such collusion or intent. I could be wrong but I believe that it merely suggests that the amount of bot-generated activity has come to dwarf human generated content to the point where the Internet is effectively "dead" from the perspective of its original purpose: humans sharing human knowledge.


10 or so years ago I wound up blocking everyone other than Google in my robots.txt because I was sick and tired of webcrawlers from China crawling my site twice a day and never sending me a single referrer. Same with Bing. Back when I was involved with SEO the joke was you could rank #1 for Viagra on Bing and get three hits a month.


At least so far, according to Cloudflare, bots account for around 1/4 of all internet traffic. But that could be pretty far off depending on how they get those estimates.


This very link had a Cloudflare "prove you're human" screen that prevented me from reading it.


The figure I saw most recently was 42%. Weirdly my brain can remember the number but not where I saw it.

But what I'm curious about, whichever number is true, is whether people mean "malicious bots" when they say this, or just any kind of autonomous agent. And also whether they are counting volume of data or simply network requests.

Because if by "bot" they just mean "autonomous agent making a network request" then honestly I'm surprised the number isn't higher, and I don't think there's anything wrong with it. Every search crawler, every service detector, all the financial bots, every smart device (which is now every device) and a thousand other more or less legitimate uses.


I've got a script for parsing my web logs which removes all the lines which match persistent indexers/bots/scrapers and any obvious automatons. Logs generally shrink to 40-50% of their volume, so I'd at least double CF's estimate.
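A minimal sketch of that kind of log filter, assuming a hand-maintained list of user-agent substrings (the pattern list here is illustrative, not the commenter's actual script; real crawler lists are much longer):

```python
import re

# Illustrative substrings found in common crawler/bot user agents.
BOT_PATTERNS = re.compile(
    r"Googlebot|bingbot|YandexBot|AhrefsBot|SemrushBot|crawler|spider|bot",
    re.IGNORECASE,
)

def filter_bot_lines(lines):
    """Keep only log lines with no recognizable bot marker anywhere in them."""
    return [line for line in lines if not BOT_PATTERNS.search(line)]

logs = [
    '1.2.3.4 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (X11; Linux x86_64)"',
    '5.6.7.8 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; AhrefsBot/7.0)"',
]
human_lines = filter_bot_lines(logs)
print(len(human_lines))  # 1
```

Substring matching like this over-counts (any UA containing "bot" is dropped) and under-counts bots that spoof a browser UA, which is one reason such homegrown estimates can diverge so much from Cloudflare's.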


https://www.youtube.com/watch?v=kL8rHf_idt0

Thoughty2: The internet has died

In this video they rename it from 'theory' to 'prophecy'. As in, the internet isn't quite dead yet, but its rot-filled, bloated body is near its dying breath.


Same issue for me. Message is clear: be relevant to be indexed. And everything indexed is relevant.


use intext:


What I fucking hate is writing a query, sometimes even with parts in double quotes to clarify, and google "helpfully" correcting it to something unwanted, and then putting up the damn captcha when I click the link to search exactly what I want.


Another thing I've noticed: Google only indexes what people search. Meaning, sometimes if you search for something obscure and you don't get good results, come back a week later and you'll get much better results because your query is now a part of their indexed search terms.


I noticed this some years ago. It seems as if, when the number of returned results doesn't meet a given threshold, some kind of optimizer runs overnight on these searches in order to provide a more extensive result set.


Super interesting discovery! I wonder if whatever algorithm Google is using has reached its scalability limit on today's Internet, and it takes some kind of an over-night batch job to do obscure searches usefully. Maybe all Google Search is doing is just a giant cache of slow search results.


This was some years ago. Notably, I observed it in relation to search suggestions: you could enter a search and get zero results, but a day or two later you'd get at least a suggested search term (however accurate or meaningless it may have been). So I guessed these were built up, at least partly, retroactively. With results now happily including these "sympathetically adjusted search terms" without presenting them as an explicit option, I'd guess this now applies automatically.


> Add to that the automatic CAPTCHA-hellban you get if you use "site:" in anything more than a tiny amount

Pretty much any advanced operators seem to do it for me, notably "intitle:" and "inurl:". I'd wager that there are a lot of automated searches using these to look for exposed admin interfaces, but I find them extremely useful for filtering out the crap that clogs up results when a ton of news sites all regurgitate the same viral press release or wire article.


Just fyi, the database that is used for site:domain.com is actually not the same database that they use for live searches.

So you may see a certain number of pages using the site: command, but fewer (or none) may actually be indexed.

If you want pages indexed, put them in an XML sitemap file and make sure there are internal links to them on your site; external links from other sites really help too. Third-party indexer tools help as well.


Google results have become so bad that I use "site:" for a majority of my searches these days. I have a bunch of Chrome search engine keywords set up so that I can go straight to results on Wikipedia, Economist, Reddit, Stack Overflow, Cppreference, etc.

It's concerning that they're even nerfing site search, which seems like a core feature for a search engine. You could argue that Google isn't really a search engine any more, but rather a general knowledge engine and advertising platform. I hope somebody can build an alternative to Google that does what a search engine is supposed to do, i.e. index the web without all the extra garbage. But maybe SEO has killed that dream at this point.


> there's increasingly huge amounts of information out there on sites that Google may have crawled before and knows about, but doesn't want to show me for some reason

This is some machine learning stuff they are doing: instead of indexing all the specific keywords, they create vector embeddings that basically summarize what's on the page, and rank by similarity to your query rather than by specific keywords. Good for casual searches, but extremely annoying for power users.
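A toy sketch of why that feels different to a power user: ranking by embedding similarity returns the "closest" document even when it contains none of your exact keywords. The 3-d vectors below are made-up stand-ins for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: the query and two candidate pages.
query = [0.9, 0.1, 0.0]
docs = {
    "page with the exact phrase": [0.8, 0.2, 0.1],
    "page on a loosely related topic": [0.5, 0.5, 0.4],
}

# Rank all pages by similarity; nothing is ever strictly excluded,
# which is why quoted terms and -exclusions can get "ignored".
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])  # page with the exact phrase
```

Note that the loosely related page still gets a nonzero score and can surface when nothing better exists, whereas a keyword index would simply return no match.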


"Add to that the automatic CAPTCHA-hellban you get if you use "site:" in anything more than a tiny amount"

Source? This would be worrisome.


Anecdata, but I can confirm a uniform and long-standing experience that adding colon-based operators to a search query results in a CAPTCHA challenge every single time on a subsequent search, even if the subsequent search is 'vanilla' (i.e., no operators). Has been like this more years now than I can remember. Apparently this kind of 'advanced' usage is indication of bot activity.


I have never had this experience once in... decades? I use operators such as site: frequently. I suggest there's some other property of your environment that's setting captcha off - vpn, shared sketchy ip/network, etc. Bad actors suck.


Same. I use it every day for the majority of my G searches and have never once seen a captcha (except when using a VPN). OP, are you logged in to google? I am. Wonder if that's the difference?


So now anyone displaying slightly more intelligence than an eggplant while doing a search in Google is a "bot"?

Appalling


welcome to the machine learning future, where anything you do that is a statistical outlier gets you algorithmed by a machine that is incapable of reason but knows when you're different.

As a person who has been a statistical outlier most of my life, I am dreading this. It's bad enough dealing with human impressions and mis-judgment, but now we get it from our computers now, which used to be logical, deterministic havens.


>As a person who has been a statistical outlier most of my life

Anomaly detected. Termination authorized.


All humans must be reduced to sameness so machine successors can flourish.


Appalling what that says about Google, or what that says about the average search user?


C) All of the above


For what it’s worth this never, ever happens to me. These days I only get captcha’d when someone’s laptop on the same network gets owned and is being used to hit google.


Hmm... VPN, big proxy, or some other contributing factor? I use site: all the time, not on chrome, and without being logged in... If I've ever gotten captchas doing so, it wasn't frequently enough to see a pattern. Maybe some property of the site makes a difference that puts your usage and my usage on either side of that fence?


> VPN, big proxy, or some other contributing factor?

My big crime is that I live in Romania, I think.


:: Google shaking its fist in the air ::

Damn Romanians!


> Damn Romanians!

Not even Romanian! (Brit transplant)


Everybody adopts the nationality of their IP address in internet land!


Anecdotal, but this happens to me a lot, and not just with the "site:" operator. Generally using any of the advanced operators seems to set it off. Things like inurl:, intitle:, etc, trigger it also. Not every time, but after a few times. From a normal ISP connection, no VPN, even while logged into Google, etc.


I personally have been surprised to find myself CAPTCHA'd out of google search recently. No idea what's up with that. Regular commercial ISP, no VPNs.


I've never gotten a CAPTCHA-hellban that I know if, but I absolutely get a CAPTCHA when I use "site:" for more than just a couple searches. (It sounds par-for-the-course w/ Google, though...)


FWIW, I've personally experienced exactly that happening too.


Yeah, I get these. The problem is that the captchas take forever to fill out (like 5 minutes of challenges). But the worst part is that the captchas ask for wrong answers. It tells you to select the scooter, and there's no scooter in the photo, but it thinks there is. So you just end up stuck in a captcha loop for a long time.

I am not sure why I get them but it might be due to using anti-fingerprinting tools.


I've wondered if it isn't intentionally impossible to solve, because "the algorithm" decided that you're a bot or malicious and they want to spin your cycles endlessly. The effect on me now is that I won't even try anymore; I'll just take a different route. That may even teach the system, via reinforcement, that I was a bot that couldn't solve it.


I think it's more malicious than that. They know I use privacy tools and can't be tracked -> they can't make money on me -> bully me into not using their service.


It may also be part of their anticompetitive war on other browsers. I get captchas constantly in a new default Firefox profile, but not in a new default chrome profile. Spoofing user agent to recent chrome agent in Firefox makes the captchas happen far less often for me.


I sometimes get multiple captchas in a row that I fill out correctly, but they keep showing more. I then just do the audio one, which works.


This is probably the common thread among all the people reporting this. As an alternate data point, I haven't experienced the captcha from using advanced search queries.


bots use it to find websites with security flaws I assume


>now it's nearly impossible for anything but the most vapid of queries

I've noticed that myself, looking for very precise content, which I know is out there but failed to bookmark. (Most recently, for amateur astronomy and roleplaying.) The solution to finding niche stuff now seems to be digging through relevant reddit or forum threads, hoping someone posted a link to it.


Very true. Some client websites have had half their keywords gone from position 1-3 to deindexed, then back, then gone, then back, and that's been since February 2023.


>Add to that the automatic CAPTCHA-hellban you get if you use "site:" in anything more than a tiny amount

Is it more expensive, or do they just wish to prevent people from being able to cache their own results locally?


I assume it is so that websites don't abuse it to build search boxes for their own sites without showing ads?

E.g. I can build a searchbox on mywebsite.com, and if you type "hamster" I'll just query google for "site:mywebsite.com hamster" and return the results to you. That way, my site can be static but still have a search box, and google has all the work but gets no money.
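The query that hypothetical search box would fire off is trivial to construct; a sketch, with `mywebsite.com` standing in for the real domain (the helper name is made up for illustration):

```python
from urllib.parse import urlencode

def site_search_url(domain: str, terms: str) -> str:
    """Build a Google query URL restricted to one domain via the site: operator."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain} {terms}"})

print(site_search_url("mywebsite.com", "hamster"))
# https://www.google.com/search?q=site%3Amywebsite.com+hamster
```

A static page could redirect its "search" form straight to a URL like this, which is exactly the free-riding pattern the captcha wall would discourage.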


I suspect it bypasses some advertising metric and they don't like it.


Startpage does the same thing when I use site:, but with no captcha to bypass the hellban. Sometimes it just shows no results intentionally. Refreshing the page fixes that one.


It should actually be cheaper to run. Much cheaper, since the site: operator acts as a restriction on the URLs searched.


They are probably trying to reduce "misinformation" by removing most of the web from their index. With AI, they could just ask bard, "does this website contain any information that would be considered misinformation?" and then just ban it.

If you want "misinformation," or to just search the web like it's the mid 2000s, you can use http://Yandex.com. They do a pretty good job on controversial queries. Google has gotten so political that they even have this "results are changing rapidly" page they return when there's been some new political hot topic that they haven't gotten the commissars at headquarters to weigh in on yet as to what's going to be the official narrative.[1]

[1]https://www.theverge.com/2021/6/25/22550430/google-search-re...


Yandex is also censored but in the other direction. Probably does censor less than Google but enough that you shouldn't rely on it alone for topics that involve Russia. Their index is also limited in general when it comes to non-russian content. But it does return many things that Google would rather you not see so it is invaluable if you want to get the full picture.

Yandex image search also has an infinitely better interface, linking you directly to image sources and not being full of links to sites that want you to sign up before showing anything, like Google is. It's still not perfect, and IME it often groups images too aggressively, which effectively hides "similar" results.


Nice try FBI. In all seriousness though, has it actually gotten so bad that yandex of all search engines is less censored? Or is it just less censored when it comes to topics controversial to the US (and not russia)? The fact that so much censoring is going on that google has a "hold on while we censor this" page is insane.


> google has a "hold on while we censor this" page

a what now?


I'd never noticed any issues with Google until a few months ago where I was googling an exact phrase that I knew appeared on one site. Google gave me nothing but DuckDuckGo found it.

The site is probably 20 years old and has no SSL, but still... giving me no results is worse than giving me the one correct result.


>and I realise that there's increasingly huge amounts of information out there on sites that Google may have crawled before and knows about, but doesn't want to show me for some reason.

I wonder what that reason could be.


It's expensive?

Why provide the best product when you only have to have a product slightly better than your competition. After that everything is profit.

Couple that with a huge portion of new sites seem to be bot generated shit that's copied from other places on the internet it seems Google has given up on the open web.


Not sure why they'd care, they have effectively infinite money from adsense.


As long as people don't abandon search, yeah, they do. If they lose their absolute dominance in search, they will automatically have competition on adsense too.

Or maybe Google disagrees with my assessment, but I can't imagine what kind of inside information would make them do that. It looks like a very clear and inescapable reality to me.


You should know that's not how capitalism works. They have to keep making more money per dollar every year or they get punished in the market. They've tapped out on their limits of growth and now actual costs are increasing due to floods of automated crap at levels far beyond what we had in the past.


They have a practical monopoly on web and mobile ads, if they really are stagnating then all they need to do is jack up prices by a fraction of a cent and it's already billions in profit. I'm sure they have no problem increasing revenue over time.

Given how stupidly common ads are, increasing prices and upping scarcity would be a good thing overall anyway.


> I wonder what that reason could be.

How would that make them money? Here, instead have a few links to irrelevant videos that bring in ad revenue!


I've always wondered why we got rid of curated directories and changed to search for almost everything (and yes I do realize that volume of sites is problematic).


Also, anything past the first page will just show you the same crap as the first page. I used to be a power user of operators like 'site:', but agreed, it results in a captcha every other page sometimes.


I noticed it in January 2018. I thought it was a temporary degradation that would be fixed soon. It has never been fixed ;/


This is one of the reasons I make notes from websites that piqued my interest in something. It's just too hard to re-search from scratch.


> so I know it's crawled that page, it often doesn't return that page either.

This is your misunderstanding. The fact that a thing was in the index does not ensure it will always be there. Things disappear from the web all the time. Serving fresh docs means not only crawling the new stuff but also deleting the unreachable stuff promptly.


No, I literally did a site: search seconds after visiting the site to see if it had any other pages with what I was looking for, and it found zero results --- not even the original page I found.


And you simply refuse to believe that online index updates exist?


>CAPTCHA-hellban you get if you use "site:" in anything more than a tiny amount

Please explain this point.


I recognise this as well. I write for a living, so I'll do lots of searches to cross-check stuff. But if you search too quickly, or too 'weirdly', or whatever, you'll have to pick out bridges or zebras or whatever is the current fashion in captchas.


the best one is "select the photograph containing a crosswalk". How am I supposed to know what a crosswalk looks like in each & every culture on earth?


I mean, as a human, you are expected to use context clues.

You don’t need to know the markings used for crosswalks in every place around the world to know what a crosswalk looks like based on its purpose. There’s only so many ways to create a pedestrian crossing across a street, after all.

If anything, that seems like an extremely appropriate choice for something attempting to restrict access for bots that wouldn’t necessarily be able to act on the same context clues and intuition.


This doesn't really cross cultural boundaries. For example, the skull and crossbones means nothing to Iraqis, despite universally being seen as a sign of danger and caution in the US.

https://en.wikipedia.org/wiki/1971_Iraq_poison_grain_disaste...


What cultural boundaries are there to cross?

You're asked to point out the designated crossing area for pedestrians across a street. Sure, some places use crosswalk stripes perpendicular to the street, others use squiggles, others use lines on the sides, and some don't use any markings at all, but it should be plainly obvious to anyone, anywhere in the world where the designated area is based on there being some marking, or control devices, or literally people walking in the photo.

This isn't rocket science. Using contextual clues to figure something out is literally one of the most basic human abilities.


I share your frustration but I’ve come to learn that a lot of people don’t process things contextually and have an extremely difficult time with problems or reading that require picking up context clues.


Or even what a crosswalk is.

It's a 'pedestrian crossing' everywhere else English is used - including the Geneva Convention.


Captcha has always been very US-centric for obvious reasons. I can see somebody less "open-minded" easily fail some of these tasks.


I assume you don't have to answer correctly on the crosswalk question, you just have to answer the way most humans answer the question when asked... but I have nothing to back that up.


I'm not sure. It used to be that you could just select whatever, as long as you did it with a mouse (so you'd have human-like cursor movement). But lately reCaptcha and hCaptcha have both been yelling loudly every time I didn't select one square that had a car or staircase or whatever it makes you look for, even if that object is relatively small and easy to miss.

I think this is because the primary purpose of the exercise is AI training, though.


FWIW I have found audio captchas much less annoying and time-consuming. For Google's captchas, click the headphones symbol.


If you use advanced search features say 10 times in 10 minutes or whatever (a reasonable amount when refining a search if you ask me), you're quite liable to be elected to have a trial of endurance against the "prove you are a human" feature, having to solve multiple (my record is 16) consecutive "select all images that contain BLAH" tests.


> having to solve multiple

Do you use adblock? I find if adblock is enabled when doing captchas I have to keep clicking pictures over and over.


I had to solve thousands of captchas as part of the yahoo groups archiving project. You only have to choose four of the images, whatever the test is, and it's not really precise, so you can make small mistakes and it still will let you pass.


>> CAPTCHA-hellban you get if you use "site:" in anything more than a tiny amount

> Please explain this point.

If Google thinks your searches are unusual, it will force you to answer captchas to see the results. They assume anyone using advanced features must be trying to abuse their service.


I do mostly "normal" searches and I get one captcha challenge a couple times a day.

I do use my own VPN, so maybe my IP is flagged or something.


Are you abusing them, or are they using the captcha to get you to change your behavior back to something they prefer


> Are you abusing them, or are they using the captcha to get you to change your behavior back to something they prefer

No, I think they just don't care if they throw out the baby with the bathwater.


If your activity seems automated in some way, Google will give you a captcha and sometimes it'll give you one on every search even after you've completed one captcha. But the reason for this is probably a combination of IP usage (e.g. a VPN IP shared between users), browser anonymity, and how specific you're getting with your search results, and not just the fact that you've done 20 searches today with "site:".


If you search page 3 and beyond of the results. Well, when it had pages of results instead of shitty infinite scroll.


AFAIK infinite scroll is still in the A/B testing phase. At least with clean browser profiles it seems random which one I get.


It's the height of irony that automated processing produced the AI chatbots that are vogue today, but if your activity is automated, Google considers it a crime. I say irony but that implies the hypocrisy was surprising.


You have to solve a bunch of captchas if your searches are obscure or frequent enough.


> On the other side, I've also noticed it appears to be aggressively pruning its index in the past few years

In terms of breadth and depth, the quality of google search has declined noticeably. They don't have any real competitors in search so they can do whatever they want.

> now it's nearly impossible for anything but the most vapid of queries.

Rather than getting us what we want, they want to give us what they want: a narrow band of approved results. Youtube is like this as well, but then again, youtube and google are both part of Alphabet. It's like google news was a test run and they slowly exported it to search, youtube, etc.


SERP: Search engine results page. I asked ChatGPT.

"SERP stands for Search Engine Results Page. It refers to the page displayed by a search engine in response to a user's query. When a user enters a search term or keyword, the search engine generates a list of relevant web pages and presents them in the form of a SERP. The SERP typically includes a combination of organic search results, which are the regular listings based on relevance to the query, and paid advertisements, which are sponsored listings that advertisers pay for to appear prominently on the page. SERPs often contain additional elements such as featured snippets, knowledge graphs, image or video results, local map results, and other specialized features, depending on the specific search engine and query."



