AI didn't solve the problem of summarizing large, complex datasets. For example, a common way to deal with such a dataset is to work with a random subset of it, which is potentially a single line of code.
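Something like this, to be concrete (a sketch assuming pandas; the file path is just a placeholder):

```python
import pandas as pd

# Placeholder path -- stand-in for however the large dataset is loaded.
df = pd.read_csv("large_dataset.csv")

# The "single line of code": a reproducible 1% random sample.
sample = df.sample(frac=0.01, random_state=42)
```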
But with AI you don't need to settle for a random subset. You can summarize everything, then summarize the summaries, and so on.
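Roughly the map-reduce-style scheme below -- a sketch only; the OpenAI client, model name, and prompt are stand-ins for whatever chat API you're actually using:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completion API would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(text: str) -> str:
    """Ask the model for a short summary of a single chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"Summarize in 3 sentences:\n\n{text}"}],
    )
    return resp.choices[0].message.content

def hierarchical_summary(chunks: list[str], fan_in: int = 10) -> str:
    """Summarize chunks, then summarize the summaries, until one summary remains."""
    layer = [summarize(c) for c in chunks]
    while len(layer) > 1:
        # Group the current layer of summaries and summarize each group.
        grouped = ["\n\n".join(layer[i:i + fan_in]) for i in range(0, len(layer), fan_in)]
        layer = [summarize(g) for g in grouped]
    return layer[0]
```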
I will say that at least gpt4 and gpt3, after many rounds of summaries, tend to flatten everything out into useless "blah". I tried this with summarizing school board meetings and it's just really bad at picking out important information -- it just lacks the specific context required to make summaries useful.
A seemingly bland conversation about meeting your friend Molly could mean something very different in certain contexts, and I'm just trying to imagine the prompt engineering and fine tuning required to get it to know about every possible context a conversation could be happening in that alters the meaning of the conversation.
That's the exact issue with gpt. You don't know how it's making the summary. It could very well be wrong in parts. It could be over-summarized into "blah blah" like you say. There's no telling whether you've output garbage or not, at least not without secondary forms of evidence that you might as well use anyway and drop the unreliable language model. You can summarize everything with traditional statistical methods too. On top of that, people understand exactly what tradeoffs each statistical method makes, and you can calculate error rates and statistical power to see whether your model is even worth a damn. Even just doing some ML modelling yourself, you can decide what tradeoffs to make or how to set up the model to best fit your use case. You can bootstrap all of this and optimize.
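To be concrete about the error-bar point: a bootstrap confidence interval for even a simple summary statistic is only a few lines (synthetic data here, just to show the shape of it):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=3.0, size=5_000)  # stand-in for a numeric column of the dataset

# Bootstrap a 95% confidence interval for the mean: resample with replacement,
# recompute the statistic each time, and read off the percentile interval.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(2_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```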
What LLMs can do efficiently is crawl through and identify the secondary forms of evidence you mentioned. The real power behind retrieval architectures with LLMs is not the summarization part -- the power comes from automating the retrieval of relevant documents from arbitrarily large corpora that weren't included in the training set.
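A minimal sketch of that retrieval step, assuming a sentence-transformers embedding model; the corpus and query are made up:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes sentence-transformers is installed

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Board approved the budget for the new gymnasium.",
    "Minutes of the curriculum committee meeting.",
    "Public comment on school bus routes.",
]
query = "budget discussion about the new gymnasium"

# Embed corpus and query, then rank documents by cosine similarity.
doc_vecs = model.encode(corpus, normalize_embeddings=True)
q_vec = model.encode([query], normalize_embeddings=True)[0]
scores = doc_vecs @ q_vec
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```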
What makes a document relevant or not? Provenance? Certain keywords? A lot of the retrieval people cite LLMs as being good at can be done with existing search algorithms too. These are imo nicer because they at least provide a score for how well a given document fits the query.
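e.g. BM25 gives you exactly that kind of explicit score -- a sketch using the rank_bm25 package, with a made-up toy corpus:

```python
from rank_bm25 import BM25Okapi  # assumes the rank_bm25 package is installed

corpus = [
    "board approved the budget for the new gymnasium",
    "minutes of the curriculum committee meeting",
    "public comment on school bus routes",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "gymnasium budget".split()
scores = bm25.get_scores(query)  # one explicit relevance score per document
for doc, score in sorted(zip(corpus, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```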