Here is my high-level take: most AI researchers I trust recognize that AI alignment is at least fiendishly hard and probably impossible. This breaks down into at least two parts. First, codifying the values of a group of people is hard and impossible to do neutrally, since many sets of reasonable desiderata fail various impossibility theorems, not to mention the practical organizing difficulties. Second, ensuring the AI generalizes and behaves correctly based on supervised learning over a set of examples is likely impossible, due to the well-known problems of out-of-distribution behavior.
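To make the first point concrete, here is a small, purely illustrative Python sketch of a Condorcet cycle: three voters with individually consistent rankings yield a cyclic majority preference, which is the kind of failure that Arrow-style impossibility theorems generalize. The options and ballots are hypothetical.

```python
# Illustration (not from the original text) of why aggregating group values is
# hard: three internally consistent ballots produce a cyclic majority preference.

# Each ballot ranks options best-to-worst.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y):
    """True if a strict majority of ballots rank x above y."""
    wins = sum(ballot.index(x) < ballot.index(y) for ballot in ballots)
    return wins > len(ballots) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three lines print True, so the "group preference" is A > B > C > A:
# no single ranking of values satisfies the majority on every pairwise comparison.
```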
Of course, we can't let the perfect be the enemy of the better. We must strive to align our systems better over time. Some ways include: (a) hybrid systems that use provably-correct subsystems; (b) better visibility, vetting, and accountability around training data; (c) smart regulation that requires meaningful disclosure (such as system cards); (d) external testing, including red-teaming; (e) reasoning out loud in English (not in neuralese!); and more.
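As one illustration of (a), here is a hedged sketch (hypothetical names, not any real system's API) of a hybrid design: an untrusted model's moderation decision is accepted only if it passes a small, separately auditable rule check that stands in for a provably-correct subsystem, and anything outside that envelope fails closed to a human.

```python
# Hypothetical sketch of a hybrid system: a simple, auditable guard wraps an
# untrusted learned model, so unsafe outputs are caught by the small subsystem
# rather than trusted to the large one.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    rationale: str  # idea (e): keep the reasoning in plain English

# A small, fixed rule set that can be reviewed (or formally verified) on its own.
ALLOWED_ACTIONS = {"approve", "flag_for_human", "reject"}

def untrusted_model(post: str) -> Decision:
    # Stand-in for an LLM-based moderator; its internals are not trusted.
    return Decision(action="approve", rationale="No policy violation found.")

def guarded_moderate(post: str) -> Decision:
    """Accept the model's decision only if it passes the simple checks."""
    decision = untrusted_model(post)
    if decision.action not in ALLOWED_ACTIONS or not decision.rationale.strip():
        # Fail closed: anything outside the verified envelope goes to a human.
        return Decision(action="flag_for_human", rationale="Guard rejected model output.")
    return decision

print(guarded_moderate("example post"))
```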
Below I'll list each quote and rewrite it with elaboration to unpack some unstated assumptions. (These are my interpretations; they may differ from what the authors intended.)
> 1: This is oddly a case to signify there is value in an AI moderation tools - to avoid bias inherent to human actors.
"To the extent (1) AI moderation tools don't have conflicting interests (such as an ownership stake in a business); (2) their decisions are guided by some publicly stated moderation guidelines; (3) they make decisions openly with chain-of-thought, then such decisions may be more reliable and trustworthy than decisions made by a small group of moderators (who often have hidden agendas)."
> 2: Do you understand how AI tools are trained?
"In the pretraining phase, LLMs learn to mimic the patterns in the training text. These patterns run very deep. To a large extent, fine-tuning (e.g. with RLHF) shapes the behavior of the LLM. Still, some research shows the baseline capabilities learned during pretraining still exist after fine-tuning, which means various human biases remain."
Does this sound right to the authors? From what I understand, when unpacked in this way, both argument structures are valid. (This doesn't mean the assumptions hold, though.)
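To make the training story in rewrite 2 concrete, here is a simplified toy sketch (assuming PyTorch; not a real LLM or RLHF pipeline) of the two phases: pretraining minimizes next-token cross-entropy over raw text, and fine-tuning starts from those same weights and only nudges them, which is why pretrained patterns, biases included, tend to persist.

```python
# Toy sketch of pretraining vs. fine-tuning; the model and data are stand-ins.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()

def next_token_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Predict token t+1 from token t: the core pretraining objective."""
    logits = model(tokens[:-1])
    return loss_fn(logits, tokens[1:])

# Phase 1: pretraining on (toy) raw text.
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    tokens = torch.randint(0, vocab_size, (16,))  # stand-in for a text corpus
    pretrain_opt.zero_grad()
    next_token_loss(tokens).backward()
    pretrain_opt.step()

# Phase 2: fine-tuning reuses the SAME weights, with a smaller learning rate
# and far less data; it shapes behavior but does not erase what phase 1 learned.
# (A real RLHF setup would replace this toy loss with a reward-based objective.)
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
curated = torch.randint(0, vocab_size, (16,))  # stand-in for curated examples
finetune_opt.zero_grad()
next_token_loss(curated).backward()
finetune_opt.step()
```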