When Hanna Wallach first started testing machine learning models, the tasks were well-defined and easy to evaluate. Did the model correctly identify the cats in an image? Did it accurately predict the ratings different viewers gave to a movie? Did it transcribe the exact words someone just spoke?
That work of evaluating a model’s performance has been transformed by the advent of generative AI, such as large language models (LLMs) that interact with people. So Wallach’s focus as a researcher at Microsoft has shifted to measuring AI responses for potential risks that aren’t easy to quantify — “fuzzy human concepts,” she says, such as fairness or psychological safety.