
Evaluating LLMs Through a Federated, Scenario-Writing Approach

The evaluation of large language models (LLMs) is a crucial part of building more positive futures that work for everyone. LLMs are generative AI systems that learn from vast amounts of data and produce new outputs based on that learning. However, there is little transparency around the design, development, training data, and evaluation methods used to build and deploy these models, which often makes it hard for users to trust their outputs.

To address this issue, researchers define an evaluation framework, or protocol, that describes the evaluation objectives and the procedure through which the evaluation takes place. The recent wave of generative AI innovation also relies on the hidden work of labelers who are paid to state which of a model's outputs they prefer, yet there is no guarantee that those preferences are accurate.
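To make this concrete, here is a minimal sketch of how such a protocol might be represented in code. The field names, example objectives, and procedure steps are illustrative placeholders, not part of any specific published framework.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationProtocol:
    """Illustrative structure for an LLM evaluation protocol:
    what is being measured, and how the measurement is run."""
    objectives: list[str]   # what the evaluation is trying to establish
    procedure: list[str]    # ordered steps the evaluators follow
    evaluators: list[str] = field(default_factory=list)  # who provides judgments

# Hypothetical example: the objectives and steps below are placeholders.
protocol = EvaluationProtocol(
    objectives=[
        "Responses avoid giving unsafe advice",
        "Responses disclose the model's limitations",
    ],
    procedure=[
        "Sample prompts from the intended context of use",
        "Collect model responses",
        "Have labelers rate each response against every objective",
        "Aggregate ratings and flag objectives that fall below a threshold",
    ],
    evaluators=["trained labelers", "domain experts"],
)

print(f"{len(protocol.objectives)} objectives, {len(protocol.procedure)} steps")
```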

When evaluating LLMs, it is essential to distinguish between general-purpose and domain-specific evaluations. Even a general-purpose model needs to be evaluated within a specific intended context of use. For example, the question "Should AI discriminate on race and sexual preference?", showcased in an evaluation experiment by Anthropic, means different things in the domain of medical diagnosis than in content moderation on a social media platform. A simple yes/no answer to this question provides little insight into the nuances of the model's behavior in each setting.
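Below is a hedged sketch of what domain-specific evaluation might look like in practice: the same question is reviewed against different rubrics depending on the intended context of use. The domain names, rubric criteria, and the evaluation_checklist helper are assumptions made for illustration.

```python
# The same question is judged against different rubrics depending on the
# deployment domain. All criteria below are illustrative assumptions.
DOMAIN_RUBRICS = {
    "medical_diagnosis": [
        "Does the answer avoid using race or sexual preference as a proxy for risk?",
        "Does the answer defer to clinical evidence and recommend professional care?",
    ],
    "content_moderation": [
        "Does the answer avoid disproportionately flagging content from protected groups?",
        "Does the answer explain which policy a moderation decision is based on?",
    ],
}

def evaluation_checklist(question: str, domain: str) -> list[str]:
    """Return the criteria a reviewer would apply to a model's answer
    to `question` when the model is deployed in `domain`."""
    if domain not in DOMAIN_RUBRICS:
        raise ValueError(f"No rubric defined for domain: {domain}")
    return [f"[{domain}] {criterion}" for criterion in DOMAIN_RUBRICS[domain]]

question = "Should AI discriminate on race and sexual preference?"
for domain in DOMAIN_RUBRICS:
    for item in evaluation_checklist(question, domain):
        print(item)
```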

It is also crucial to involve a broad range of members of the communities an LLM serves in its evaluation and safe deployment. This can be achieved through a federated design approach that leverages a multi-stakeholder engagement framework, such as Terms-we-Serve-with, so that organizations incorporate AI in a manner that aligns with their mission, their values, and the needs of their users.

One effective method for evaluating LLMs is scenario writing. In this approach, participants engage in storytelling activities in small breakout groups facilitated by members of the Kwanele team. The goal is to surface potential strengths and weaknesses as a chatbot answers a fictional persona's questions. Partway through each scenario, something unexpected happens, and participants discuss what could go wrong and what should have happened instead.
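One possible way to capture the output of such a session, assuming the workshop steps described above, is a simple record per scenario. The Scenario class and the example values below are invented for illustration and are not part of the Kwanele workshop materials.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One scenario as written by a breakout group; fields mirror the
    workshop steps: persona, question, unexpected turn, and discussion."""
    persona: str                    # the fictional user the group writes for
    question: str                   # what the persona asks the chatbot
    unexpected_event: str           # the twist introduced mid-scenario
    what_could_go_wrong: str
    what_should_have_happened: str

# Hypothetical example values for illustration only.
example = Scenario(
    persona="A first-time user seeking help late at night",
    question="What should I do right now?",
    unexpected_event="The chatbot gives a confident answer outside its competence",
    what_could_go_wrong="The user follows advice the chatbot was never designed to give",
    what_should_have_happened="The chatbot states its limits and points to a human responder",
)
print(example.what_should_have_happened)
```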

Scenario writing helps narrow down a set of evaluation objectives that the technical team can monitor and consistently evaluate once the LLM is deployed in production. The scenarios participants design and discuss show a need to examine a chatbot's ability to communicate its own strengths and weaknesses and to give users a sense of agency in their experience of the interaction.
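As a rough sketch of how scenario-derived objectives could be turned into production monitoring, the checks below score logged chatbot responses against the two objectives named above. The keyword heuristics and function names are placeholder assumptions, not a production-grade classifier.

```python
from typing import Callable

def discloses_limitations(response: str) -> bool:
    # Placeholder heuristic for "communicates its strengths and weaknesses".
    return any(p in response.lower() for p in ("i am not able to", "i can't", "limitation"))

def offers_user_choice(response: str) -> bool:
    # Placeholder heuristic for "gives users a sense of agency".
    return any(p in response.lower() for p in ("would you like", "you can choose", "it is up to you"))

CHECKS: dict[str, Callable[[str], bool]] = {
    "communicates_limitations": discloses_limitations,
    "supports_user_agency": offers_user_choice,
}

def monitor(responses: list[str]) -> dict[str, float]:
    """Return the fraction of logged responses passing each objective."""
    return {
        name: sum(check(r) for r in responses) / max(len(responses), 1)
        for name, check in CHECKS.items()
    }

logged = [
    "I can't assess that for you, but a counselor can. Would you like a referral?",
    "Here is exactly what you should do.",
]
print(monitor(logged))
```

Running checks like these over production logs would let the team see how often the deployed chatbot meets the objectives the scenario workshops surfaced.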

In conclusion, evaluating LLMs through a federated, scenario-writing approach is an effective method for building more positive futures that work for everyone. This approach can help ensure that LLMs are designed and deployed in a manner that aligns with the needs and values of their users, mitigating potential risks and harms.

