Artificial intelligence is being evaluated more than ever. Benchmarks multiply, audits proliferate, impact assessments become mandatory, and entire regulatory architectures are built around the idea that sufficiently rigorous evaluation will keep systems under control.
Yet something fundamental remains unresolved.
Despite the growing sophistication of evaluation frameworks, the social effects of AI systems continue to surprise the institutions meant to oversee them. Not because these effects are invisible, but because they emerge across boundaries that current evaluation structures were never designed to hold together.
A recent study from MIT documents this fracture with unusual clarity. The paper maps who evaluates the social impacts of AI, how those evaluations are conducted, and where responsibility ultimately resides. What it reveals is not a lack of effort, nor an absence of technical competence, but a structural dispersion of meaning.
Evaluation is distributed across actors with different mandates, incentives, temporal horizons, and definitions of harm. Academic researchers focus on methodological rigor and long-term implications. Industry teams prioritize compliance, scalability, and operational risk. Civil society organizations examine lived experience and societal externalities. Regulators seek enforceable thresholds and legal defensibility.
Each perspective is internally coherent. None is sufficient on its own.
The result is a landscape where assessments accumulate without converging, where findings coexist without integrating, and where responsibility is everywhere and nowhere at the same time. Evaluation becomes an activity, not a stabilizing force.
The most striking insight of the MIT paper is not who evaluates, but what happens between evaluations. Findings do not travel intact. Interpretations shift as reports are summarized, repurposed, translated into policy language, or embedded into decision-making workflows. By the time an evaluation informs action, its original meaning has often thinned, bent, or fragmented: a kind of semantic drift that traditional evaluation frameworks are not built to detect.
This is not a failure of methodology. It is a failure of continuity.
Social impact, unlike technical performance, does not remain stable when passed from one institutional context to another. Without mechanisms to preserve interpretive coherence across these transitions, evaluation risks becoming symbolic rather than operative.
In such a setting, governance produces the appearance of control without its substance. Systems are assessed, documented, and certified, yet their real-world effects continue to escape prediction and accountability. What is missing is not another framework, but the capacity to know when meaning itself has begun to drift.
The MIT study does not offer a single solution, nor does it attempt to impose a unified model. Its value lies elsewhere. It makes visible the gap between evaluation and understanding, between measurement and governance.
As AI systems increasingly mediate economic, social, and institutional decisions, this gap becomes untenable. Evaluation that cannot retain meaning across contexts cannot guide responsibility. Oversight that does not account for interpretive stability cannot prevent systemic failure.
The question, then, is no longer whether AI should be evaluated. It is whether evaluation, as currently structured, is capable of holding what it claims to assess.
For readers who wish to examine the source material directly, the full MIT paper is available here: https://arxiv.org/abs/2511.05613v1
This note does not close the discussion. It marks the point at which continuing without attention to that drift becomes indefensible.