Art. 10 EU AI Act: data and data governance for high-risk AI
Art. 10 requires that the training, validation, and testing data for high-risk AI systems meets quality criteria: relevant, sufficiently representative, and as free of errors and complete as possible for the intended purpose. It also requires documented data governance practices covering collection, preparation, bias examination, and gap mitigation, and it permits the limited processing of special-category data where strictly necessary to detect and correct bias, under safeguards.
Updated: June 2026
This is an explicit provider obligation under the EU AI Act. It falls on whoever develops the high-risk AI system or places it on the market. Deployers carry a related input-data duty under Art. 26(4).
Introduction: data as the root of most AI risk
Most of the failure modes the EU AI Act is concerned with originate in data. A biased outcome is usually a biased dataset expressed through a model. A privacy exposure is usually data that should not have been collected, retained, or used. A performance failure is often a training set that no longer represents the population the system serves. Art. 10 is the obligation that addresses risk at its source, by setting quality and governance requirements on the data that high-risk AI systems are built and run on.
Art. 10 applies primarily to providers, who develop the system and control its training. But its logic reaches deployers too, because the input data a deployer supplies in operation must be relevant and sufficiently representative for the intended purpose, to the extent the deployer controls it. That duty appears separately as the deployer obligation in Art. 26(4).
What the data must be
Art. 10 requires that training, validation, and testing datasets are subject to appropriate data governance and meet quality criteria. The datasets must be:
- Relevant to the intended purpose of the system.
- Sufficiently representative of the persons and situations the system will be used on, so the system does not perform well on one group and poorly on another.
- As free of errors and as complete as possible in view of the intended purpose.
- Appropriate in their statistical properties, including for the groups the system is intended to affect.
These are not absolute standards. The article qualifies them with "to the best extent possible" and "in view of the intended purpose", which means the provider must make and document a reasoned judgement about what level of quality is adequate for the stakes of the use case, rather than meeting a fixed numerical bar.
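One way to make the representativeness judgement inspectable is to compare group shares in a dataset against a reference population and record which groups fall outside a chosen tolerance. The sketch below is illustrative only: the function name `representation_gaps`, the reference shares, and the 5-point tolerance are assumptions made for this example, not thresholds taken from the Act. The provider still has to justify the tolerance for the stakes of the use case.

```python
from collections import Counter


def representation_gaps(groups, reference_shares, tolerance=0.05):
    """Return groups whose dataset share deviates from the reference share.

    `tolerance` is a hypothetical, provider-chosen value, not a legal threshold.
    """
    total = len(groups)
    if total == 0:
        raise ValueError("dataset is empty")
    counts = Counter(groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = {"observed": round(observed, 3), "expected": expected}
    return gaps


# Illustrative data: group labels of 100 training records.
sample = ["a"] * 70 + ["b"] * 25 + ["c"] * 5
reference = {"a": 0.60, "b": 0.25, "c": 0.15}
gaps = representation_gaps(sample, reference)
print(gaps)
```

In this example group "a" is over-represented and group "c" under-represented, so both would be recorded as findings alongside the reasoning for the tolerance that was chosen.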
What the governance must cover
Beyond the quality of the data itself, Art. 10 requires documented data governance and management practices. These cover:
- The design choices and the origin of the data.
- The collection process and provenance.
- The preparation operations, such as labelling and cleaning.
- The formulation of assumptions about what the data measures.
- An assessment of whether the data is available, suitable, and sufficient.
- An examination for possible biases that could affect health, safety, or fundamental rights, together with measures to detect, prevent, and mitigate those biases.
This is the governance trail a conformity assessment expects: not just a clean dataset, but a documented account of where it came from, how it was prepared, what was assumed, and how bias was looked for and addressed.
The special-category data provision and the real-time data angle
Art. 10(5) contains an important and often-misread provision. To detect and correct bias, providers may exceptionally process special categories of personal data, the sensitive data the GDPR otherwise restricts. They may do so only where strictly necessary and under safeguards: the bias cannot be effectively detected and corrected by processing other data, including synthetic or anonymised data; the data is subject to technical limits on reuse; security and privacy-preserving measures apply; and the data is deleted once the bias is corrected or its retention period ends, whichever comes first.
This provision is also where data protection at the point of use becomes an operational question. For systems that process data in real time, minimising and masking sensitive data before it reaches the model is the operational expression of the same principle: process the least sensitive data necessary, protect what must be processed, and document why. Masking or redacting sensitive data at the input level is a concrete control that serves both the Art. 10 data governance obligation and the GDPR's minimisation principle.
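As a rough illustration of input-level masking, the sketch below replaces email addresses and phone-like numbers with placeholders before the text is passed on, and returns counts of what was masked so the control can be logged without storing the sensitive values. The two regular expressions and the name `mask_input` are assumptions for this example. They are deliberately simple and would not catch every form of sensitive data, so a production control would need a fuller detection approach.

```python
import re

# Simplified, illustrative patterns; not a complete detector of sensitive data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def mask_input(text):
    """Replace matches with placeholders; return the masked text and match counts."""
    masked = text
    counts = {}
    for label, pattern in (("EMAIL", EMAIL), ("PHONE", PHONE)):
        masked, n = pattern.subn(f"[{label}]", masked)
        counts[label] = n
    return masked, counts


masked, counts = mask_input(
    "Contact jane.doe@example.com or +44 20 7946 0958 about the claim."
)
print(masked)   # Contact [EMAIL] or [PHONE] about the claim.
print(counts)   # {'EMAIL': 1, 'PHONE': 1}
```

Only the masked text would be sent to the model. The counts give an audit trail showing that the control ran and what it acted on.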
Why it matters
Data governance failures are doubly exposed, because the same dataset can carry both a fairness defect and a privacy defect, and the two are policed by different parts of the law. A training set that over-represents one group creates an Art. 10 quality failure and a fairness risk under the risk management system, while the same set, if it contains personal data that should not have been collected, creates a GDPR exposure. Addressing data governance well closes several risks at once; neglecting it opens several at once.
Governing data quality and governance
The controls treat data as a managed asset with a documented lifecycle, not a raw input that happens to be available.
The core artefact is a data sheet for each dataset, recording its origin and provenance, its size and population characteristics, the preparation and labelling operations applied, the assumptions made, the bias examination performed and its findings, and the known limitations. This sheet becomes part of the technical documentation and is the evidence a conformity assessment examines.
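The data sheet described above could be kept as a structured record so that incomplete entries are easy to spot before they reach the technical documentation. The following is a minimal sketch: the class name `DataSheet` and its field names are hypothetical and simply mirror the items listed in the paragraph, not a schema prescribed by the Act.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DataSheet:
    """Hypothetical data sheet record; fields mirror the items named in the text."""
    dataset_name: str
    origin_and_provenance: str = ""
    size: int = 0
    population_characteristics: str = ""
    preparation_operations: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)
    bias_examination: str = ""
    known_limitations: list = field(default_factory=list)

    def missing_fields(self):
        """List fields that are still empty and need to be completed."""
        return [name for name, value in asdict(self).items() if not value]


sheet = DataSheet(
    dataset_name="claims-train-v1",
    origin_and_provenance="Export from internal claims system",
    size=1000,
    preparation_operations=["deduplication", "manual labelling"],
)
print(sheet.missing_fields())
```

The sketch treats an empty field as incomplete. Where a dataset has no known limitations, for example, that should be recorded as an explicit statement rather than left blank.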
For systems processing personal data, the data governance controls integrate with the organisation's GDPR controls rather than running in parallel: one minimisation discipline, one lawful-basis analysis, one retention schedule, applied to the AI data lifecycle. Where special-category data is processed under the Art. 10(5) exception, the strict-necessity justification and the safeguards are documented before processing begins, not reconstructed afterward.
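The "documented before processing begins" rule can be expressed as a gate that refuses to run a bias analysis on special-category data unless the justification and safeguards are already on record. The sketch below is an assumption-laden illustration: `SpecialCategoryRecord`, `deletion_due`, and `may_process` are hypothetical names, the record covers only a subset of the Art. 10(5) safeguards, and treating the retention end date itself as the point at which deletion is due is a choice made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SpecialCategoryRecord:
    """Hypothetical record of the Art. 10(5) justification and safeguards."""
    necessity_justification: str      # why other data cannot reveal the bias
    reuse_limited: bool               # technical limits on reuse in place
    security_measures_applied: bool   # security and privacy-preserving measures
    retention_end: date
    bias_corrected: bool = False


def deletion_due(record, today):
    """Deletion is due once the bias is corrected or retention has ended."""
    return record.bias_corrected or today >= record.retention_end


def may_process(record, today):
    """Allow processing only if documentation and safeguards exist and deletion is not due."""
    documented = bool(record.necessity_justification.strip())
    safeguarded = record.reuse_limited and record.security_measures_applied
    return documented and safeguarded and not deletion_due(record, today)


record = SpecialCategoryRecord(
    necessity_justification="Error-rate gaps could not be measured with proxy data.",
    reuse_limited=True,
    security_measures_applied=True,
    retention_end=date(2026, 12, 31),
)
allowed_now = may_process(record, date(2026, 6, 1))
record.bias_corrected = True
allowed_after_fix = may_process(record, date(2026, 6, 2))
print(allowed_now, allowed_after_fix)
```

Once the bias is marked as corrected, the gate closes even though the retention period has not ended, which reflects the deletion safeguard described for Art. 10(5).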
Compliance checklist
- Is there a documented data sheet for each training, validation, and testing dataset, covering provenance, preparation, and limitations?
- Has each dataset been assessed for relevance, representativeness, error rate, and completeness against the intended purpose, with the judgement documented?
- Has each dataset been examined for biases that could affect health, safety, or fundamental rights, with mitigation measures recorded?
- Where special-category data is processed to detect or correct bias, is the strict-necessity justification documented and are the Art. 10(5) safeguards in place?
- For systems processing personal data in real time, is sensitive data minimised or masked before it reaches the model?
- Do the data governance controls integrate with the organisation's GDPR controls rather than duplicate them?