Researchers with the Center for AI Safety and Scale AI are gathering submissions for Humanity’s Last Exam. The submission form is here. The idea is to develop an exam with questions from a breadth of academic specializations that current LLMs can’t answer.
While current LLMs achieve very low accuracy on Humanity’s Last Exam, recent history shows that benchmarks saturate quickly, with models progressing from near-zero to near-perfect performance in a short timeframe. Given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025. High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not by itself suggest autonomous research capabilities or “artificial general intelligence.” HLE tests structured academic problems rather than open-ended research or creative problem-solving, making it a focused measure of technical knowledge and reasoning. HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.
One wonders whether it really will be the last exam. Perhaps we will see more complex exams that test for integrated skills. Andrej Karpathy has criticised the exam on X, and I agree: what we need are AIs that can carry out intern-level complex tasks, not just answer questions.