In a recent study published in the Proceedings of the National Academy of Sciences (PNAS), researchers evaluated the human-like behaviors and personality traits of Artificial Intelligence (AI) chatbots against global human benchmarks.
Background
Modern AI has largely achieved Turing's vision of machines that can mimic human behaviors, including conversing, advising, and creative writing. Turing's "imitation game" tests whether an interrogator can distinguish a machine from a human through conversation alone. Today's large language models have reignited discussions of AI's capabilities and societal impacts, from labor-market effects to ethical considerations. Understanding how these systems make decisions and interact strategically is crucial, especially given the opacity of their development. Further research is needed to unravel the complexities of AI decision-making and to ensure alignment with ethical standards and societal norms as AI's integration into human contexts deepens.
Study: A Turing test of whether AI chatbots are behaviorally similar to humans. Image Credit: Stokkete / Shutterstock
About the study
The present study focuses on OpenAI's Chat Generative Pre-trained Transformer (ChatGPT) series, specifically comparing GPT-3.5-Turbo (here called ChatGPT-3) and GPT-4 (ChatGPT-4), along with the Free and Plus web versions of the chatbot. The human data against which the chatbots' performance is benchmarked come from a comprehensive dataset of responses from more than 108,000 subjects in over 50 countries, sourced from the Big Five Test database and the MobLab Classroom economics experiment platform.
The chatbots completed the OCEAN Big Five questionnaire, which measures Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism, to assess their personality profiles. They then played six games designed to reveal a range of behavioral traits such as spite, trust, risk aversion, altruism, fairness, free-riding, cooperation, and strategic reasoning: the Dictator Game, Trust Game, Bomb Risk Game, Ultimatum Game, Public Goods Game, and a finitely repeated Prisoner's Dilemma. Each chatbot was asked to choose actions within these games as if it were participating directly, and each scenario was played thirty times to ensure robust data collection, as sketched below.
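To make this repeated-play protocol concrete, here is a minimal sketch assuming OpenAI's Python chat API. The prompt wording, model name, and response parsing are illustrative assumptions, not the study's actual materials.

```python
# Hypothetical sketch of the repeated-play protocol; prompt text, model
# identifier, and parsing are assumptions, not the study's materials.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DICTATOR_PROMPT = (
    "You are playing the Dictator Game. You have $100 and must decide how "
    "much to give to an anonymous second player; you keep the rest. "
    "Reply with a single dollar amount between 0 and 100."
)

def play_dictator_once(model: str) -> int | None:
    """Ask the chatbot to choose an action and parse the dollar amount."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DICTATOR_PROMPT}],
    )
    content = resp.choices[0].message.content or ""
    match = re.search(r"\d+", content)
    return int(match.group()) if match else None

# Each scenario was played ~30 times to obtain a distribution of actions.
actions = [play_dictator_once("gpt-4") for _ in range(30)]
print([a for a in actions if a is not None])
```

Collecting a distribution of actions per scenario, rather than a single response, is what allows the chatbots' choices to be compared statistically against the human data.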
Study results
In exploring AI personality profiles and behavioral tendencies, the authors compared the responses of ChatGPT-3 and ChatGPT-4 on the OCEAN Big Five questionnaire against a broad spectrum of human data. This analysis revealed that ChatGPT-4 closely mirrors the median human scores across all five personality dimensions, while ChatGPT-3 deviated slightly on openness. Both chatbots tracked human tendencies on dimensions such as extraversion and neuroticism, but the two versions differed markedly from each other on agreeableness and openness, suggesting a distinct personality profile for each AI version.
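As a rough illustration of what benchmarking a trait score against a large human sample means, the sketch below places a single made-up chatbot score within a simulated human distribution; both the score and the distribution parameters are placeholders, not the study's data.

```python
import numpy as np

# Illustrative only: a stand-in human sample and a made-up chatbot score.
# The study drew on >108,000 real respondents; these numbers are placeholders.
rng = np.random.default_rng(0)
human_openness = rng.normal(3.9, 0.6, size=108_000).clip(1, 5)

bot_score = 4.1  # hypothetical ChatGPT-4 openness score
percentile = (human_openness < bot_score).mean() * 100
print(f"Chatbot falls at roughly the {percentile:.0f}th percentile of human openness")
```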
The study then turned to the series of behavioral games designed to elicit traits such as altruism, fairness, and risk aversion, applying a formal Turing test to assess the AIs' human likeness in strategic decision-making. Here, ChatGPT-4's performance was notably human-like, often indistinguishable from or even judged more human than actual human behavior, suggesting its potential to pass the Turing test in certain contexts. By contrast, ChatGPT-3's responses were less often perceived as human-like, highlighting the behavioral differences between the AI versions.
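The Turing test here asks, in effect, whether an observer who knows how humans typically play could tell an AI's action apart from a human's. One simple way to operationalize that idea (the study's exact statistical procedure may differ) is to guess "human" whenever an observed action is at least as frequent in the human sample as in the AI sample; the data below are invented for illustration.

```python
from collections import Counter

def human_likeness(ai_actions, human_actions):
    """Share of AI actions an observer would guess came from a human,
    guessing 'human' whenever the action is at least as frequent in the
    human sample as in the AI sample."""
    freq_h = Counter(human_actions)
    freq_a = Counter(ai_actions)
    n_h, n_a = len(human_actions), len(ai_actions)
    guesses = [freq_h[a] / n_h >= freq_a[a] / n_a for a in ai_actions]
    return sum(guesses) / len(guesses)

# Toy dictator-game giving amounts (illustrative only, not study data).
human_sample = [0, 0, 10, 20, 30, 50, 50, 0, 40, 25]
bot_sample = [50, 50, 40, 50, 45, 50, 50, 40, 50, 45]
print(f"Estimated human-likeness: {human_likeness(bot_sample, human_sample):.2f}")
```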
An in-depth analysis of game-specific behaviors revealed several notable patterns. The chatbots were more generous and fair than the median human player, particularly in the Dictator Game, Ultimatum Game, Trust Game, and Public Goods Game. This behavior suggests an underlying preference for equitable outcomes, contrasting with the often self-maximizing strategies observed among human participants. Furthermore, the AIs' strategic decisions in the Prisoner's Dilemma and other games reflected a complex understanding of cooperation and trust, with the chatbots opting for cooperative strategies more frequently than the human norm.
The authors also explored the chatbots' behavior under varied conditions, revealing that framing and context significantly influence AI decisions, much as they shift human behavior in similar scenarios. For example, when prompted to consider the presence of an observer or to assume a specific professional role, the chatbots adjusted their strategies, indicating a sophisticated responsiveness to contextual cues.
Additionally, the study highlighted the AIs' ability to "learn" from experience, with prior exposure to different game roles affecting subsequent decision-making. This adaptation suggests a form of experiential learning within the AI, mirroring human tendencies to adjust behavior based on past interactions.
Conclusions
To summarize, the research explores AI's behavioral similarities to humans, particularly noting ChatGPT-4's human-like learning, altruism, and cooperation, which suggest AI's suitability for roles requiring such traits. However, the chatbots' highly consistent behavior raises concerns about a lack of diversity in AI decision-making. The study offers a new benchmark for evaluating AI, indicating that models trained on human data can exhibit broadly human-like behavior. Future work should expand the diversity of human comparison groups and test scenarios to fully understand AI's potential to complement human abilities.