Evaluating Evaluations: NeurIPS Workshop 2024

Workshop Overview

Generative AI systems are becoming increasingly prevalent in society, producing content such as text, images, audio, and video with far-reaching implications. While the NeurIPS Broader Impact statement has notably shifted norms for AI publications to consider negative societal impact, no standard exists for how to approach these impact assessments. This workshop aims to address this critical gap by bringing together experts on evaluation science and practitioners who develop and analyze technical systems.

This workshop builds on our previous initiatives, including the FAccT 2023 CRAFT session "Assessing the Impacts of Generative AI Systems Across Modalities and Society" and our initial "Evaluating the Social Impact of Generative AI Systems" report. Through these efforts, we collaboratively developed an evaluation framework and guidance for assessing generative systems across modalities. We have since crowdsourced evaluations and analyzed gaps in the literature and systemic issues around how evaluations are designed and selected, resulting in a more comprehensive second edition of the report.

The goal of this workshop is to share existing findings with the NeurIPS community and collectively develop future directions for effective community-driven evaluations. A key focus is participatory AI: broadening the scope of evaluation work and involving all participants, not just domain experts, can yield wide benefits. By encouraging collaboration among experts, practitioners, and the wider community, the workshop aims to create more comprehensive evaluations and develop urgent policy recommendations for governments and AI safety organizations.

Workshop Objectives

Share existing findings and methodologies with the NeurIPS community
Collectively develop future directions for effective community-built evaluations
Address barriers to broader adoption of social impact evaluation of Generative AI systems
Develop policy recommendations for investment in future directions for social impact evaluations
Create a framework for documenting and standardizing evaluation practices

Call for Tiny Papers

We are soliciting tiny papers (up to 2 pages long) in the following formats:

Extended Abstracts: Short but complete research papers presenting original or interesting results around social impact evaluation for generative AI.
"Provocations": Novel perspectives or challenges to conventional wisdom around social impact evaluation for generative AI.

Themes for Submissions

We welcome submissions addressing, but not limited to, the following themes:

Conceptualization and operationalization issues in evaluations of:
- Bias, stereotypes, and representational harms
- Cultural values and sensitive content
- Community-centered definitions of disparate performance and privacy
- Documentation frameworks for financial and environmental costs of evaluations
Ethical or consequential validity considerations for:
- Data protection
- Data and content moderation labor
- Historical implications of evaluation data or practices for evaluation validity
Interrogating or critiquing the theoretical basis of existing evaluations
Novel methodologies for evaluating social impact across different AI modalities
Comparative analyses of existing evaluation frameworks and their effectiveness
Case studies of social impact evaluations in real-world AI applications

Submission Guidelines

Paper Length: Maximum 2 pages, excluding references (no limit on the number of references)
Format: PDF file, using the NeurIPS 2024 LaTeX style file
Submission Portal: [Insert submission portal link here]
Anonymity: Submissions should be anonymous for two-way anonymized review.
This is a participatory, in-person event. Accepted authors are encouraged to present their work and discuss it at the event.
The broader impact statement and limitations section do not count toward the page limit.

Important Dates

Submission Deadline: August 1, 2024
Notification of Acceptance: September 1, 2024
Workshop Date: [Insert workshop date here]

Workshop Structure

Total Duration: 8 Hours

Time	Session	Description
9:00 AM - 9:30 AM	Welcome and Introduction	Opening remarks; overview of workshop structure and objectives
9:30 AM - 11:00 AM	Reflections on the Landscape	Collaborative reflection on the existing landscape. Talks, panels, and breakouts by modality (text, images, audio, video, and multimodal data). Topics: Underlying frameworks, Contextualization challenges, Defining robust evaluations, Incentive structures
11:00 AM - 11:15 AM	Break
11:15 AM - 12:45 PM	Talks + Provocations	Invited speakers present on current technical evaluations for base models across all modalities. Key social impact categories covered: Bias and stereotyping, Cultural values, Performance disparities, Privacy, Financial and environmental costs, Data and content moderation labor. Presentations of accepted provocations
12:45 PM - 1:45 PM	Lunch Break
1:45 PM - 3:45 PM	Group Activity	Participants break into groups focusing on key social impact categories. Activities include: Choosing Evaluations; Reviewing Tools and Datasets; Examining construct reliability, validity, and ranking methodologies
3:45 PM - 4:00 PM	Break
4:00 PM - 5:45 PM	What's Next? Documentation + Resources	Develop policy guidance highlighting impact categories, subcategories, and modalities requiring further investment. Discussions on: Documenting Methods, Developing Shareable Resources, Underlying Frameworks, Contextualization Challenges, Defining Robust Evaluations
5:45 PM - 6:00 PM	Closing Remarks

Invited Speakers

Confirmed Speakers:

Abigail Jacobs
- Assistant Professor, School of Information
- Assistant Professor of Complex Systems, College of Literature, Science, and the Arts
- University of Michigan
Nitarshan Rajkumar
- Cofounder of the UK AI Safety Institute
- Adviser to the Secretary of State of the UK Department for Science, Innovation and Technology
Su Lin Blodgett
- Senior Researcher, Microsoft Research Montreal

Tentative Speaker:

Abeba Birhane
- Adjunct Lecturer/Assistant Professor, Trinity College Dublin
- Senior Fellow in Trustworthy AI at Mozilla Foundation

Expected Outcomes

Three months after the workshop, we aim to achieve the following outcomes:

Evaluation Report and Resources/Repository:
- Publish a comprehensive summary of the workshop findings
- Update resources including:
  - Documentation framework for standardizing evaluation practices
  - Open source repository addressing identified barriers to broader adoption of social impact evaluation of Generative AI systems
Policy Recommendations:
- Share detailed policy recommendations for investment in future directions for social impact evaluations based on group discussions and workshop outcomes
Knowledge Sharing:
- Foster a more systematic and effective approach to evaluating the social impact of generative AI systems by disseminating lessons and findings to the broader AI research community

Evaluating Evaluations

Examining Best Practices for Measuring Broader Impacts of Generative AI