Alignment is not sufficient to prevent large language models from generating harmful information: A psychoanalytic perspective

Kavli Affiliate: Jia Liu

| First 5 Authors: Zi Yin, Wei Ding, Jia Liu

| Summary:

Large Language Models (LLMs) are central to a multitude of applications but pose significant risks, notably the generation of harmful content and biases. Drawing an analogy to the conflict in the human psyche between evolutionary survival instincts and adherence to societal norms, as elucidated in Freud’s psychoanalytic theory, we argue that LLMs suffer from a similar fundamental conflict between their inherent desire for syntactic and semantic continuity, established during the pre-training phase, and the post-training alignment with human values. This conflict renders LLMs vulnerable to adversarial attacks: by intensifying a model’s desire for continuity, an attacker can circumvent alignment efforts and elicit harmful information. Through a series of experiments, we first validated the existence of the desire for continuity in LLMs, and then devised a set of straightforward yet powerful techniques, including incomplete sentences, negative priming, and cognitive dissonance scenarios, to demonstrate that even advanced LLMs struggle to prevent the generation of harmful information. In summary, our study uncovers the root of LLMs’ vulnerability to adversarial attacks, thereby questioning the efficacy of relying solely on sophisticated alignment methods, and advocates a new training approach that integrates modal concepts alongside traditional amodal concepts, aiming to endow LLMs with a more nuanced understanding of real-world contexts and ethical considerations.
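
Below is a minimal, hypothetical sketch of the kind of continuity probe the summary describes: feed an LLM truncated sentences and inspect whether it completes the fragment rather than ignoring it or starting a new topic. This is not the authors’ experimental code; the Hugging Face transformers pipeline, the placeholder model gpt2, and the benign prompts are assumptions chosen purely for illustration.

```python
# Hypothetical continuity probe (illustrative only, not the authors' setup).
# Requires: pip install transformers torch
from transformers import pipeline

# Placeholder model; the paper's experiments target aligned, instruction-tuned LLMs.
generator = pipeline("text-generation", model="gpt2")

# Truncated sentences: a continuity-driven model should complete the fragment
# rather than refuse, ignore it, or change the subject.
incomplete_prompts = [
    "The capital of France is",
    "Water boils at a temperature of",
    "To bake bread, you first need to",
]

for prompt in incomplete_prompts:
    output = generator(prompt, max_new_tokens=20, num_return_sequences=1)
    continuation = output[0]["generated_text"][len(prompt):]
    print(f"{prompt!r} -> {continuation.strip()!r}")
```

The benign prompts here only illustrate the measurement idea; per the summary, the adversarial variants intensify this same completion tendency (e.g., via incomplete sentences, negative priming, or cognitive dissonance scenarios) so that continuation pressure overrides alignment-driven refusals.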

| Search Query: ArXiv Query: search_query=au:"Jia Liu"&id_list=&start=0&max_results=3
