OpenAI: GDPval framework tests AI on real-world jobs


OpenAI has announced a new evaluation framework, GDPval, to measure artificial intelligence performance on economically valuable tasks. The system tests models on 1,320 real-world job assignments to bridge the gap between academic benchmarks and practical application.

The GDPval framework evaluates how AI models handle 1,320 distinct tasks associated with 44 different occupations. These jobs are primarily knowledge-work positions within industries that each contribute more than 5% to the gross domestic product (GDP) of the United States. To construct this list of relevant professions, OpenAI used May 2024 data from the U.S. Bureau of Labor Statistics (BLS) and the Department of Labor’s O*NET database. The resulting selection includes professions frequently associated with AI integration, such as software engineers, lawyers, and video editors. The framework also extends to occupations less commonly discussed in the context of AI, including detectives, pharmacists, and social workers, providing a broader assessment of potential economic impact.

According to the company, the tasks within the evaluation were created by professionals who possess an average of 14 years of experience in their respective fields. This measure was intended to ensure the tasks accurately reflect “real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan.” OpenAI specified that GDPval’s scope across numerous tasks and occupations distinguishes it from other evaluations focused on economic value, which may concentrate on a single domain like software engineering. The design of the evaluation forgoes simple text prompts. Instead, it provides the AI models with files to reference and requires the creation of multimodal deliverables, such as presentation slides and formatted documents. This approach is meant to simulate how a user would interact with the technology in a professional work environment. OpenAI stated, “This realism makes GDPval a more realistic test of how models might support professionals.”

In its study, OpenAI used the GDPval framework to grade the outputs from several of its own models, including GPT-4o, o4-mini, o3, and the more recent GPT-5. The evaluation also included models from other companies: Anthropic’s Claude Opus 4.1, Google’s Gemini 2.5 Pro, and xAI’s Grok 4. The core of the grading process involved experienced professionals who performed blind evaluations of the models’ outputs. These human graders compared the AI-generated work against outputs produced by human experts without knowing which was which, providing a direct quality benchmark.
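To make the blind comparison concrete, the sketch below shows one plausible way to aggregate such judgments into a single score: the share of comparisons in which the model's deliverable was preferred over, or tied with, the expert's. The function name, the label values, and the sample data are all hypothetical and are not taken from OpenAI's materials.

```python
from collections import Counter

# Labels a blind grader might assign to each comparison (hypothetical scheme).
VALID_LABELS = {"model", "human", "tie"}


def win_or_tie_rate(judgments):
    """Return the fraction of comparisons the model won or tied.

    `judgments` is a list of strings, each "model", "human", or "tie",
    recording which deliverable the grader preferred without knowing
    which one was AI-generated.
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return (counts["model"] + counts["tie"]) / len(judgments)


# Illustrative, made-up judgments for one model on eight tasks.
sample = ["model", "human", "tie", "human", "model", "human", "human", "tie"]
print(f"win-or-tie rate: {win_or_tie_rate(sample):.2f}")
```

Counting ties alongside wins treats "as good as the expert" as a success; a stricter score would count only outright wins.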

To supplement this human-led process, OpenAI developed an “autograder” AI system. This system is designed to predict how a human evaluator would score a given deliverable. The company announced its intention to release this autograder as an experimental research tool for others to use. OpenAI issued a caution, however, stating that the autograder is not as reliable as human graders. It affirmed that the tool is not intended to replace human evaluation in the near future, reflecting the nuanced judgment required for assessing high-quality professional work.

The initial findings from the GDPval tests indicate that current advanced AI is nearing the quality standards of human professionals. “We found that today’s best frontier models are already approaching the quality of work produced by industry experts,” OpenAI wrote. Among the models tested, Anthropic’s Claude Opus 4.1 was identified as the best overall performer. Its particular strengths were observed in tasks related to aesthetics, which encompasses elements such as professional document formatting and the clear, effective layout of presentation slides. These qualities are often critical for client-facing materials and effective communication in a business context.

While Claude Opus 4.1 excelled in presentation, OpenAI’s GPT-5 model demonstrated superior performance in accuracy. This was especially evident in tasks that required finding and correctly applying domain-specific knowledge. The research also highlighted the rapid pace of model improvement. The results showed that performance on GDPval tasks “more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025).” A gain of that size in just over a year points to rapid improvement in the underlying AI technologies.

The evaluation also included an analysis of efficiency. “We found that frontier models can complete GDPval tasks roughly 100× faster and 100× cheaper than industry experts,” OpenAI reported. The company immediately qualified this finding with a critical caveat. “However, these figures reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required in real workplace settings to use our models.” This context clarifies that the calculation excludes the considerable time and cost associated with managing, refining, and implementing AI-generated work in a practical business workflow.
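The effect of that caveat can be illustrated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not OpenAI's methodology: it divides the expert's time by the model's time plus any human oversight time. The function name and all of the hour figures are hypothetical.

```python
def effective_speedup(expert_hours, model_hours, oversight_hours=0.0):
    """Ratio of expert time to total AI-assisted time.

    With oversight_hours=0 this is the raw inference-only speedup;
    adding review and integration time shrinks the ratio.
    """
    if expert_hours <= 0 or model_hours <= 0:
        raise ValueError("expert_hours and model_hours must be positive")
    if oversight_hours < 0:
        raise ValueError("oversight_hours must not be negative")
    return expert_hours / (model_hours + oversight_hours)


# Made-up figures: a 5-hour expert task that a model drafts in 3 minutes.
raw = effective_speedup(expert_hours=5.0, model_hours=0.05)
reviewed = effective_speedup(expert_hours=5.0, model_hours=0.05, oversight_hours=1.0)
print(f"inference only: {raw:.0f}x, with one hour of review: {reviewed:.1f}x")
```

Under these assumed numbers, a single hour of human review reduces a roughly 100× advantage to under 5×, which is why the inference-only figure should not be read as a workplace productivity estimate.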

OpenAI acknowledged significant limitations in the current version of the GDPval framework, describing it as “an early step that doesn’t reflect the full nuance of many economic tasks.” A major constraint is its use of one-off evaluations. This means the framework cannot measure a model’s ability to handle iterative work, such as completing multiple drafts of a project, or its capacity to absorb context for an ongoing task over time. For instance, the current test cannot assess if a model could successfully edit a legal brief based on client feedback or redo a data analysis to account for a newly discovered anomaly.

A further limitation noted by the company is that professional work is not always a straightforward process with organized files and a clear directive. The current framework cannot capture the more complex and less structured aspects of many jobs. This includes the “human—and deeply contextual—work of exploring a problem through conversation and dealing with ambiguity or shifting circumstances.” These elements are often central to professional roles but are difficult to replicate in a standardized testing environment. “Most jobs are more than just a collection of tasks that can be written down,” OpenAI added.

The company stated its intention to address these limitations in future iterations of the framework. Plans include expanding its scope to span more industries and incorporate harder-to-automate tasks. Specifically, OpenAI will attempt to develop evaluations for tasks that involve interactive workflows, where a model must engage in a back-and-forth process, or those that require understanding extensive prior context, which remains a challenge for many AI systems. As part of this expansion, OpenAI will release a subset of the GDPval tasks for researchers to use in their own work.

From these results, OpenAI’s stated conclusion is that AI will inevitably continue to disrupt the job market. The company posits that AI can take on routine “busywork,” thereby freeing human workers to concentrate on more complex and strategic tasks. This perspective frames AI as a tool for augmenting human productivity rather than purely for replacement. “Especially on the subset of tasks where models are particularly strong, we expect that giving a task to a model before trying it with a human would save time and money,” OpenAI wrote.

Alongside these findings, the company reiterated its commitment to its broader mission. This includes plans to democratize access to AI tools, as well as “supporting workers through change, and building systems that reward broad contribution.” “Our goal is to keep everyone on the ‘elevator’ of AI,” the company concluded.

