Efforts are underway to provide a common set of benchmarks to assess generative artificial intelligence (AI) products and to create a “body of knowledge” on how these tools should be tested. 

The aim is to provide a standard approach to the evaluation of generative AI applications and to galvanize efforts to address the risks. This common approach is a shift away from existing “piecemeal” efforts.

Dubbed Sandbox, the initiative is led by Singapore’s Infocomm Media Development Authority (IMDA) and AI Verify Foundation, and has garnered support from global market players, such as Amazon Web Services (AWS), Anthropic, Google, and Microsoft. These organizations are among the current group of 15 participants, which also includes Deloitte, EY, and IBM, as well as Singapore-based OCBC Bank and telco Singtel.

Sandbox is guided by a new draft catalog that categorizes current benchmarks and methods used to evaluate large language models (LLMs). The catalog compiles commonly used technical testing tools, organizes them by what they test for and the methods they use, and recommends a baseline set of tests for evaluating generative AI products, IMDA said.

The goal is to establish a common language and support “broader, safe and trustworthy adoption of generative AI,” IMDA said.

“Systematic and robust evaluation of models is a critical component of LLM governance and helps form the bedrock of trust in the use of these technologies,” IMDA said. 

“Through rigorous evaluation, the capabilities of a model are revealed, which can assist in determining its intended uses and potential limitations. Evaluation [also] provides a vital roadmap for developers to make improvements.”

Achieving this common language requires a standardized taxonomy and baseline set of pre-deployment safety evaluations for LLMs, it noted. The Singapore government agency hopes the draft catalog offers a starting point for global discussions, with the aim of driving consensus on safety standards for LLMs. 

Moving toward common standards also means involving other stakeholders in the ecosystem, beyond the model developers, such as application developers that build on…


