Hacker News

Not a super well thought out article. Example: lots of speculative complaints that ChatGPT will lead to an explosion of low quality and biased editorial material, without a single mention of what that problem looks like today (hint: it was already a huge problem before ChatGPT).

Ditto with the “ChatGPT gave me wrong info for a query” complaint. Well, how does that compare to traditional search? I’m willing to believe a Google search produced better results, but it seems like something one should check for an article like this.

IMO we’re not facing a paradigm change where the web was great before and now ChatGPT has ruined it. We may be facing a tipping point where ChatGPT pushes already-failing models to the breaking point, accelerating creation of new tools that we already needed.

Even if I’m wrong about that, I’m very confident that low quality, biased, and flat out incorrect web content was already a problem before LLMs.



> without a single mention of what that problem looks like today (hint: it was already a huge problem before ChatGPT).

I see this counter-argument all the time and it makes no sense to me.

Yes, the web is already filled with SEO trash. How is that an argument that ChatGPT won't be bad? It's a force multiplier for garbage. The pre-existence of garbage does not at all invalidate the observation that producing more garbage more efficiently is even worse.


Yeah, exactly. It’s like saying, “What’s wrong with having a bus-sized nuclear-powered garbage cannon aimed at my head? I already have to take out my trash once a week.”


Because you already only view like 0.0001% of the web's content. Garbage is already filtered by algos. Those algos just have to keep up with chatGPT the same way they've already been keeping up with spam, the 95% of the web that is a dumpster fire, etc.

Potentially it doesn't really become more difficult.


The difference is it takes me less than a second to immediately identify that content as garbage. And the places I frequent are good at stopping that garbage from getting on their website.

Between the 0.0001% we care about and the 99% that's automated trash, there's a solid 1% of content churned out by actual humans at very low quality. Think about things like recipe fluff, "news" articles from no-name organizations, and all the super-low-effort blogs giving bird's-eye-view summaries of things like Kubernetes, ripped right off some other intro material.

ChatGPT produces straight up better and more informative content than those actual humans, and I am almost sure that it does so much faster and at a lower price. Actually, I think in some ways ChatGPT produces better content than most of the users on Reddit these days too.


Sure, again it really only matters if the search engine can filter the stuff we don't want to see, and show the stuff we want to see. If they can do that, we don't have a problem. It's the same as it was.

If they can't do it, we have a problem.

I guess there's the additional problem of bots posting comments everywhere, but that's really just a problem for social media sites and so I'm fairly unsympathetic.

People do spend a lot of their time these days on social media, but that's a new phenomenon, and I doubt it will last, so I don't think the future web is ruined.


> Those algos just have to keep up with chatGPT the same way they've already been keeping up with spam, the 95% of the web that is a dumpster fire, etc.

That "just" is an arms race so fantastically difficult that the current leading business doing it has a market cap of $1.4 trillion.

Those algos are, to date, some of the world's most sophisticated uses of AI.

This is like observing that the howitzer was just invented and saying, "Don't worry, we've got chainmail armor."


I don't think that's a keen analogy at all. We don't currently have chainmail. Google search/Gmail spam filtering/the YouTube suggestion algorithm is a dike holding back an ocean of shit. I'm not sure how to make an armor analogy, so let's just say it's really good armor.

Also assuming you meant 1.4T=Alphabet, I cannot go along with your pretending that the 1.4 trillion dollar cap is a function of PageRank, nor can I pretend that it's remotely related to whether they can continue providing good results post-chatGPT.

Why don't you think they can handle it?


I certainly hope they can handle it. But it looks to me like generative AI is poised to give a huge new weapon to bad actors, and I absolutely think that's a bad thing, regardless of whether the good actors are somehow able to defend themselves from it.


On the other hand, the system for finding non-garbage content is the same: read publications and writers that you (and other real people you know) already like. If there are 10 good websites and 100 garbage ones, you probably find out about 2 or 3 of the good ones by word of mouth. If there are 10 good websites and 10^32 garbage ones, you will still be able to read those 2-3 good ones.


The system for finding a needle in a sprinkling of hay is the same as finding a needle in a mountain-sized haystack but I would sure as hell prefer to be given the former task over the latter.


This analogy does not work because we search a haystack with our eyes, but we search for information with far more selective tools. You are comparing apples and oranges.


The smart way to find a needle in a haystack is to burn the haystack and pass the ashes in front of a magnet. I'm not sure what this analogy means for AI-generated SEO spam though.


Search engines are already unusable for certain things that they used to be usable for. There's not really such a thing as "even more unusable." If I offer you an oven that doesn't get hot, that's not any better than an oven that makes things colder; you would not want either.


I think we call that a "fridge", and they're quite popular actually...


I also see this argument all the time:

“New thing X is going to destroy the world!”

“Actually it’s an extension of decades-long trends and may accelerate issues we already face”

“Well it’s still bad, so any negative statement should be treated as true, even if it’s false!”

The article didn’t say ChatGPT was making low quality content worse. It said, in as many words, that ChatGPT will create this problem.


Because most garbage is already produced by scripts. ChatGPT will actually improve it.


Better garbage? More convincing bullshit?

Are we winning?


It's already pretty bad with Github/SO threads. Guys will scrape threads on GH/SO and repost them to their sites, usually with a ton of ads but the post ranks higher than the original thread so it will come up first when you google an error.


How could it rank higher, though? SO has a huge domain ranking. How can an arbitrary website compete with that?

I always thought it was the opposite: platforms like SO and Medium incentivise posting there precisely via their crazy domain ranking.


> How could it rank higher, though? SO has a huge domain ranking. How can an arbitrary website compete with that?

It's unclear, but they do. My guess would be that they're willing to do shadier SEO than SO will, and any that get caught just stand up more domains.


It's usually temporary, until Google tags the copycat site as a spammy content farm and destroys its ability to rank. I haven't seen a site sustain high ranking / lots of traffic through copying Stackoverflow in maybe a decade at this point (since Panda etc.).


It doesn't matter, though. It's a hydra. Different sites over time but the result is that google results on programming topics are reliably, and increasingly, shit. I gave up on it and pay for Kagi.


It doesn't need to sustain it, it just needs to be there when you search. I generally get SO first, but I see a LOT of copycats on the first page of DDG/Google when I search.


Even if it doesn't, you now suddenly have the top 20+ results with the exact same info


That's why I have Firefox bookmarks where I type `s <query>` into the address bar which enters `site:stackoverflow.com <query>` into my search engine. Likewise for `r <query>` => `site:reddit.com <query>`.

This annihilates the SEO spam and is useful for most of my searches. It's glorious finding recipe ingredients without wading through a blogger's life story or a search result page filled exclusively with ads above the fold.
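For what it's worth, the keyword trick is just string manipulation; here's a toy helper showing the expansion (the shortcut names and the DuckDuckGo URL are my own illustrative choices, not anything built into the browser):

```python
from urllib.parse import quote_plus

# Illustrative shortcut table mirroring the keyword-bookmark trick.
SHORTCUTS = {
    "s": "site:stackoverflow.com",
    "r": "site:reddit.com",
}

def expand(shortcut_query, base="https://duckduckgo.com/?q="):
    """Expand a `<prefix> <query>` string into a site:-restricted search URL."""
    prefix, _, rest = shortcut_query.partition(" ")
    operator = SHORTCUTS.get(prefix)
    if operator is None:
        # Unknown prefix: pass the query through unchanged.
        return base + quote_plus(shortcut_query)
    return base + quote_plus(f"{operator} {rest}")

url = expand("s python list comprehension")
```

In Firefox itself this is just a bookmark whose location is `https://duckduckgo.com/?q=site%3Astackoverflow.com+%s` with keyword `s`; the snippet only makes the mechanics explicit.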


They probably rank higher on long-tail keywords, usually for more in-the-weeds issues that don't get as much search volume.


> How could it rank higher, though?

Because they sell more clicks, impressions, etc...


> Well, how does that compare to traditional search?

Poorly.

Traditional search is a dumb pipe, it gives you multiple links to review and evaluate on the basis of a well-understood PageRank algorithm. It's gotten a lot worse, but humans adapted to its limitations, and know what not to click on (affiliate marketing sites that rank #1 for instance).
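"Well-understood" is fair: the core of classic PageRank fits in a few lines. A toy power-iteration sketch, ignoring the many other signals real ranking mixes in (the damping factor and graph here are just illustrative):

```python
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each node to its list of outbound links."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # Every node keeps a baseline of "random surfer" probability.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            if not outs:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    new[m] += damping * rank[node] / n
            else:
                # Otherwise split its rank across its outbound links.
                for m in outs:
                    new[m] += damping * rank[node] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)  # "c" ends up highest: it has the most inbound links
```

The point stands either way: the link-based part is transparent, and people have learned to read around its failure modes.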

GPT3 is a dead end: it provides a single response and you can either accept what it tells you or not. It is not going to disclose what links it scraped to provide the information, and it's not going to change its mind about how it put that info together. This is because of the old Arthur C. Clarke axiom: "Any sufficiently advanced technology is indistinguishable from magic."

AI peddlers will use every UX dark pattern possible to make it look like what you are seeing really is magic.


For sure, though it's easy to imagine a search results page that mixes current organic search results, search ads, and also some kind of AI 'answers' or 'suggestions'. Then we just have to vet those as possibly-dubious-but-maybe-helpful along with the rest.


It doesn't have to be this way. Check out Perplexity AI. It's like GPT in that it's conversational, but like Google in that it provides references.


The difference is we can improve the AI to be more accurate, and I suspect before long it'll generate better content than a human would, verifiable with citations. There may come a time when writing is done by a machine, much as a calculator does our math. But knowledge maybe shouldn't be canonically encoded in ASCII blobs randomly strewn over the web; maybe instead our accumulated knowledge needs to be structured in a semantic web sort of model. We can use the machine to describe to us the implications of the knowledge and its context in human language. But I get a feeling that in 20 years it'll be anachronistic to write long form.


The model needs known "good" feedback to improve. The problem is that the quality of its training data declines as more output is produced. It's rather inevitable that we'll be drowning in AI-generated garbage before long. A lot of people are confusing LLMs with true intelligence.


That's why I think knowledge needs to be better structured than blobs of text scattered everywhere. An AI can be more than an LLM; Wolfram posted recently about that. You can use the LLM to convert a question into a semantic query, check and amend it with a semantic validator, have a semantic knowledge graph provide and explain an answer, and let the LLM convert it back to meat language. I think people confuse LLMs with true intelligence, but the cynics also confuse LLMs with a complete and final endpoint.
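A minimal sketch of the loop I mean, with `llm` and `kg` as hypothetical stand-ins for a real model and a real graph store (nothing here is an actual API):

```python
# Hypothetical LLM + knowledge-graph loop: the graph supplies the
# facts, and the LLM only translates in and out of natural language.
def answer(question, llm, kg):
    # 1. LLM turns the natural-language question into a structured query.
    query = llm(f"Translate to a semantic query: {question}")
    # 2. The knowledge graph, not the LLM, provides the verified facts.
    facts = kg(query)
    # 3. LLM verbalizes those facts back into "meat language".
    return llm(f"Explain these facts plainly: {facts}")
```

The key property is that step 2 is grounded: the model can't make facts up, only phrase them.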

Your point also seems to assume no curation can happen on what is ingested. Simply because that might be what's happening now, you could also simply train the LLM on known-good sources and be as permissive or restrictive as necessary. Depending on how good the classifiers are for detecting LLM output (OpenAI released one recently) or other generated/automatically derived content, you can start to be more permissive.

My point is people seem to be blinded by what is vs. what may be. This is not the end of the development cycle of this tech; it's the pre-alpha release by the first meaningful market entrant. I'd be slower to judge what the future looks like rather than assuming everything stays fixed in time as it is.


Oh definitely. There's bound to be improvements especially when you glue an LLM to a semantic engine, etc.

The issue is again, fundamentally, one of data. Without authenticating what's machine generated and what's "trusted", the proliferation of AI-generated content is bound to reduce data quality. This is a side effect of these models being trained to fool discriminators.

Ultimately now I think there is going to be a more serious look around the ethics of using these models and putting guard rails around what exactly is permissible. I suspect the US will remain a wild west for some time but the EU will be a test-bed.

Ultimately, I'm fairly excited about the applications of all this.


Good point. I was already concerned about people's reliance on Google's zero-click answers as the deepest level of inquiry before ChatGPT hit the scene. ChatGPT feels like a multiplier of this convenience factor, being also slightly more specific and generally more consistent.


There's also just that Google's search ranking doesn't work anymore.

I searched "lowest temperatures in boston every year" and got some shit-looking MySpace-like website with a table of temperatures, hell knows where it got its data, instead of a link to the correct page on NOAA or something more authoritative.


This is a fun example;

First hit in DDG for that query is a trash site but at least the data is there…

https://www.currentresults.com/Yearly-Weather/USA/MA/Boston/...

Versus trying to pull the data from NOAA;

https://www.ncdc.noaa.gov/cdo-web/search

The way that the first site works the keywords into the intro text repeatedly to juice their rank is almost impressive. Can the search engines really not see that the page is garbage?


I mean, what exactly makes the first page garbage? I'm not disagreeing, but "is this site garbage" is not a question that a search engine can ask.


I agree. The real "problem" in this specific case is that the authoritative source (NOAA) seemingly doesn't make the data available in a manner that's discoverable by crawlers.

The currentresults.com page seems... fine? It has a proper source cited at the bottom of the data. I wish it didn't have display ads, but that's the nature of the web nowadays. That's not a problem solvable by a traditional search engine.


> "is this site garbage"

Why not? If it has headers that say it was made with FrontPage 2003 and has five thousand AdSense boxes, uses old world fonts like Arial instead of HelveticaNeue Light, uses 16-bit VGA colors like #0000ff, or has bgsound and blink tags, it should perhaps be downranked.


Because those things you listed are (potential) answers to my first question, not the one you quoted.

A search engine should not see a site written in Arial and derank it for that reason. Blink tags, sure, they're obviously wrong for accessibility reasons, but there's a huge gap between those two things - and even so, how badly should they affect ranking?

I'm saying "garbage" can be subjective, and when there are objective "garbage" indicators, it's not obvious how to deal with them. What you've listed is only a small set of indicators from a small niche of so-called "garbage" sites. And personally, I don't even want to see old or old-styled sites dismissed from the web if they have good content.


Full of Google ads probably


> Even if I’m wrong about that, I’m very confident that low quality, biased, and flat out incorrect web content was already a problem before LLMs.

Definitely, and I believe the post admits as much. The point he's making is that it's going to get exponentially worse, until the web is useless (the "tipping point" you mention).

What are the "new tools that we already needed" though? I think I'm too pessimistic in my outlook on these things, and would be interested to hear your optimistic future scenarios.

Right now, my view is that as long as something is profitable, it'll continue. A glimmer of hope is that once the web is completely useless, people will stop using it, and we can rebuild.


Cost

Back in the day you'd have to pay to print your bullshit. Imagine if printing bullshit were free and instant?


One major difference is that generated content up until recently was pretty obvious. Tons of stuff like finance articles are autogenerated using templates, and SEO spam is obviously not intended for you as a human.

The rest is generally churned out en masse at the cheapest price, so in practice it contains no content and is very poorly written.

ChatGPT can produce decent quality content faster and cheaper than most humans. Despite not being fully accurate, and falling apart in certain domains like math, it has an amazing breadth of topics and things it can do at an acceptable level.

Right now, enough prompt engineering work is required that it still takes handholding to get ChatGPT to churn out content. But given where we are now it seems well within reach for the next gen of models to be able to go from “Write me an article about X that covers Y and Z” to “Write me 100 articles about varying topics in X” to “Take in the information from this corpus and distill it into 50 articles based on the most interesting parts.”

The main thing that should stay safe is detailed technical content like programming guides where you need to actually be able to reason about the material to produce good content, and can’t just paraphrase the ten thousand related sample materials in your training set. ChatGPT is decent about giving mostly-working code snippets (especially if it can use a library, although it may just make one up) but getting it to reason through things will probably require an entirely different approach to how it works. Still, because it’s already capable of producing technical content that passes a basic first glance, it could precipitate a trust crisis. I worry more about what happens when people try to get ChatGPT to generate recipes, or give medical advice, or operate in the support group/personal advice/etc. space.


I agree that the article doesn't really bring up anything new or interesting.

One important implication of a ChatGPT-centered web is the removal of reward/credit to content creators. Now when you Google for something you'll probably arrive at some StackOverflow, blog, or Reddit post where there's at least an author's name attached to an answer. But ChatGPT just crawls that content without citing sources, reducing any reward for contributing. Maybe this doesn't have serious implications, since after all most people contribute under pseudonyms, but it's worth bringing up.


And most people are thinking of ChatGPT as if it couldn't evolve, as if it were statically attached to its current state. They are not considering its astonishing potential to evolve.

It's just the beginning, just like the internet in the early 90s. Give it 30 more years and we'll all be AI-dependent, like we are on the internet. In the coming decades, future generations won't even be able to imagine life before AIs.


Agreed, particularly given the grammar mistakes. Although, ironically, the grammar mistakes increase confidence that this is an article written by a human.


Agreed. What is it with everyone wanting to see ChatGPT only in the light of accessing information and then complaining about it?


Because that's what you do to evaluate it as a service.


It makes me wonder if these people have ever talked to a real person before. Spoiler: real people can be wrong too!


Real people take longer to be wrong. The potential volume one bad actor can generate matters; https://en.wikipedia.org/wiki/Gish_gallop is a dangerous enough technique when someone has to actually physically come up with the bullshit.

"Gish gallop as a service", essentially.


You can trust your friends not to straight up make up facts when they don't know something, like ChatGPT does, for example.


Quantity has a quality all its own.



