An AI system is only as good as the data it learns from. That sounds like a truism, but in practice that truism turns into real harm for patients, job applicants, and citizens on a regular basis. In 2019, researchers from the University of Chicago and Brigham and Women's Hospital revealed that a widely used algorithm in American healthcare systematically disadvantaged Black patients. Not because the design was explicitly racist, but because the training data used healthcare spending as a proxy for healthcare needs. Black patients historically had less access to care and therefore lower costs, leading the algorithm to label them as "less sick". This is precisely the kind of problem Article 10 of the EU AI Act aims to prevent.
Why data sits at the core of AI regulation
The European legislator understood clearly that you cannot regulate AI without addressing the data underneath it. It does not matter how sophisticated your model is: if the training data is skewed, incomplete, or contaminated, the system produces skewed, incomplete, or contaminated results. Recital 67 of the AI Act puts it sharply: high-quality data plays a vital role in the performance of AI systems, and deficient datasets can become a source of discrimination prohibited under Union law.
Article 10 translates that principle into concrete obligations. It applies specifically to high-risk AI systems, the category subject to the strictest requirements: think AI in healthcare, recruitment and selection, education, credit scoring, or law enforcement.
The six paragraphs of Article 10: a complete walkthrough
Paragraph 1: The main rule
The first paragraph lays the foundation. High-risk AI systems that use techniques involving the training of models with data must be developed on the basis of training, validation, and testing datasets that meet the quality criteria of paragraphs 2 to 5.
The wording is deliberately broad: it covers not just deep learning or neural networks, but any technique that uses data to train a model. At the same time, the law recognizes that not every AI system works the same way. Paragraph 6 therefore specifies that for systems not using training techniques, the requirements apply only to testing data.
Paragraph 2: Data governance and management
Paragraph 2 forms the heart of the article. It requires that training, validation, and testing datasets be subject to data governance and management practices appropriate for the intended purpose of the high-risk AI system. It then lists eight specific areas of concern:
(a) Relevant design choices. The law requires you to document which choices you made when designing your dataset and why.
(b) Data collection processes and origin. You must be able to demonstrate where your data comes from. For personal data, you must also document the original purpose of data collection, a direct link with the GDPR.
(c) Data preparation operations. Annotation, labelling, cleaning, updating, enrichment, and aggregation: all of these processing operations must be accounted for.
(d) Assumptions. What assumptions underlie your data? What do you assume the data measures and represents?
(e) Availability and suitability. An assessment of the availability, quantity, and suitability of the required datasets must take place.
(f) Examination of bias. This is one of the most impactful requirements: an examination of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law. The law explicitly points to the risk of feedback loops, where system outputs flow back as inputs for future operations.
(g) Measures against bias. It is not enough to identify bias. You must take appropriate measures to detect, prevent, and mitigate the biases identified.
(h) Identification of gaps. Finally, you must identify relevant data gaps or shortcomings that prevent compliance with the Regulation, and document how you will address them.
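For documentation purposes, the eight areas above lend themselves to a structured record. A minimal sketch in Python; the field names are our own and the Act prescribes no particular format, so treat this as one possible shape, not the compliant shape:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetGovernanceRecord:
    """Illustrative documentation record loosely following Article 10(2)(a)-(h)."""
    design_choices: str                                          # (a) why the dataset was designed this way
    collection_origin: str                                       # (b) sources; original purpose for personal data
    preparation_steps: list[str] = field(default_factory=list)   # (c) labelling, cleaning, enrichment, ...
    assumptions: list[str] = field(default_factory=list)         # (d) what the data is assumed to measure
    suitability_assessment: str = ""                             # (e) availability, quantity, suitability
    biases_examined: list[str] = field(default_factory=list)     # (f) biases examined, incl. feedback loops
    mitigation_measures: list[str] = field(default_factory=list) # (g) measures against identified biases
    known_gaps: list[str] = field(default_factory=list)          # (h) remaining gaps and how they are addressed

# Illustrative entry for a (hypothetical) healthcare dataset
record = DatasetGovernanceRecord(
    design_choices="Balanced sampling across age groups",
    collection_origin="Hospital EHR exports, 2015-2020; originally collected for billing",
    assumptions=["Healthcare cost is NOT used as a proxy for healthcare need"],
)
```

The point of a structure like this is traceability: every field maps back to one lettered item of paragraph 2, so an auditor can see at a glance which obligations have been addressed.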
Paragraph 3: Quality requirements for datasets
Paragraph 3 formulates the core quality requirements. Datasets must be relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose. They must have the appropriate statistical properties, including with regard to the persons or groups for whom the system is intended.
An important detail: these characteristics may be met at the level of individual datasets or at the level of a combination thereof. This gives organizations flexibility. You do not need one perfect dataset; you may combine datasets as long as the whole meets the requirements.
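At its simplest, "appropriate statistical properties" can be checked by comparing group shares in a dataset with those of the intended population. A minimal sketch; the group labels, target shares, and the 5% tolerance are illustrative choices, not values taken from the Act:

```python
from collections import Counter

def representation_gaps(samples, population_shares, tolerance=0.05):
    """Flag groups whose share in the dataset deviates from the
    target population share by more than `tolerance`."""
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, target in population_shares.items():
        actual = counts.get(group, 0) / total
        if abs(actual - target) > tolerance:
            gaps[group] = round(actual - target, 3)
    return gaps

# Illustrative dataset, heavily skewed towards group "A"
dataset = ["A"] * 80 + ["B"] * 15 + ["C"] * 5
gaps = representation_gaps(dataset, {"A": 0.5, "B": 0.3, "C": 0.2})
# gaps now reports over-representation of "A" and under-representation of "B" and "C"
```

A real assessment would go further (confidence intervals, intersectional groups), but even this simple comparison makes the flexibility of paragraph 3 concrete: if one dataset shows gaps, a combination of datasets can close them.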
Paragraph 4: Context and geography
Paragraph 4 adds a dimension that is often overlooked in practice. Datasets must take into account the characteristics that are particular to the specific geographical, contextual, behavioural, or functional setting in which the AI system is intended to be used.
In concrete terms: an AI system trained on North American data cannot simply be deployed in Europe. Cultural, legal, and demographic differences matter. A facial recognition system that performs excellently on a dataset of predominantly white faces can fail systematically for other ethnicities. A credit scoring model trained on American financial data does not reflect the European market.
Paragraph 5: The special categories exception
Paragraph 5 is legally the most complex part and directly touches the interplay with the GDPR. It allows providers of high-risk AI systems to exceptionally process special categories of personal data, but exclusively for detecting and correcting bias.
This is a remarkable provision. The GDPR generally prohibits the processing of data concerning race, ethnicity, political opinions, health, and other sensitive categories (Article 9 GDPR). But the AI Act acknowledges a paradox: to verify whether your system discriminates based on race or gender, you sometimes need to know the race or gender of data subjects.
The law sets six strict conditions for this exception:
- Bias detection cannot be effectively achieved by processing other data, including synthetic or anonymised data.
- Technical limitations on re-use apply, plus state-of-the-art security and privacy-preserving measures, including pseudonymisation.
- Strict access controls and documentation: only authorised persons may access the data.
- The data must not be transmitted to third parties.
- The special categories of personal data must be deleted once the bias is corrected or the retention period expires, whichever comes first.
- The records of processing activities must document why processing was strictly necessary.
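One of the measures named in the second condition, pseudonymisation, can be sketched with a keyed hash. This is a minimal illustration only: key management belongs in a proper secrets store, and pseudonymisation alone satisfies none of the other five conditions:

```python
import hmac
import hashlib

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256).
    The key must be stored separately under strict access control;
    without it, the pseudonym cannot be linked back to the person."""
    return hmac.new(secret_key, identifier.encode(), hashlib.sha256).hexdigest()

key = b"rotate-me-and-store-separately"   # illustrative; use a key management service in practice
token = pseudonymise("patient-12345", key)
```

The same identifier always maps to the same token, so bias analysis across records remains possible, while the mapping back to the individual requires the separately held key.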
A 2025 study by the European Parliament emphasised that this interplay between the AI Act and the GDPR must be navigated carefully, as both regulations sometimes create contradictory incentives.
Paragraph 6: Systems without training
Paragraph 6 clarifies that for AI systems not using training techniques, paragraphs 2 to 5 apply only to testing datasets. Think of rule-based systems or expert systems: they do not need to subject their "knowledge base" to the same requirements, but their test data must comply.
The recitals: context and background
The recitals of the AI Act provide essential context. Recital 67 emphasises that bias can be inherent in underlying datasets, especially with historical data, and that feedback loops can gradually reinforce and perpetuate discrimination, particularly for vulnerable groups. Recital 68 points to the importance of European data spaces, such as the European Health Data Space, as instruments for trustworthy and non-discriminatory access to high-quality data. Recital 69 underscores that the right to privacy must be guaranteed throughout the entire lifecycle of the AI system, and mentions techniques such as anonymisation, encryption, and federated learning as possible safeguards.
Real-world evidence: why this matters
Amazon and the recruitment algorithm
In 2018, Reuters revealed that Amazon had built an AI recruitment tool that systematically disadvantaged women. The system had been trained on ten years of CVs submitted to the company, a dataset that predominantly contained male candidates. The model learned that "male" was the norm and penalised CVs that contained references to women, down to the word "women's" in "women's chess club captain". Had Article 10, paragraph 2(f) and (g) already been in force, Amazon would have been required to examine the dataset for gender bias and take corrective measures before deploying the system.
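An examination of this kind can start with something as simple as comparing selection rates across groups. A minimal sketch; the data is invented, and the 0.8 threshold is the US "four-fifths rule", a common heuristic rather than anything the AI Act prescribes:

```python
def selection_rate_ratio(outcomes_by_group):
    """Return (ratio, rates): the ratio of the lowest to the highest
    selection rate across groups, plus the per-group rates.
    Ratios below ~0.8 are a common red flag (the US 'four-fifths rule',
    a heuristic -- the AI Act itself sets no numeric threshold)."""
    rates = {g: sum(o) / len(o) for g, o in outcomes_by_group.items()}
    return min(rates.values()) / max(rates.values()), rates

# Illustrative outcomes: 1 = advanced to interview, 0 = rejected
ratio, rates = selection_rate_ratio({
    "men":   [1, 1, 1, 0, 1, 1, 0, 1],   # selection rate 0.75
    "women": [1, 0, 0, 1, 0, 0, 0, 0],   # selection rate 0.25
})
```

A low ratio does not prove prohibited discrimination on its own, but it is exactly the kind of signal that paragraph 2(f) requires providers to look for and paragraph 2(g) requires them to act on.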
Healthcare and the proxy trap
The healthcare algorithm mentioned earlier illustrates what happens when the assumptions behind data (paragraph 2(d)) are not made explicit. The developers chose healthcare costs as a proxy for healthcare needs without examining whether that assumption held for all demographic groups. Under Article 10, this would constitute a violation: assumptions must be formulated and tested.
Feedback loops in law enforcement
The warning in paragraph 2(f) about feedback loops is not theoretical. Predictive policing systems direct patrols to neighbourhoods where historically more arrests were made. Greater police presence leads to more arrests, which confirms and reinforces the model. The result: a self-reinforcing cycle of over-policing in certain communities, often with a disproportionate impact on ethnic minorities.
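The dynamic can be reproduced with a toy simulation: each round, the patrol goes to the district with the most recorded arrests and records a new arrest there, even though the true crime rate is identical everywhere. All numbers are illustrative:

```python
def simulate_feedback(arrests, rounds=10):
    """Toy feedback-loop model: allocation follows the data, and presence
    produces the observations that feed the next allocation."""
    arrests = list(arrests)
    shares = [arrests[0] / sum(arrests)]      # district 0's share of recorded arrests
    for _ in range(rounds):
        target = arrests.index(max(arrests))  # "data-driven" patrol allocation
        arrests[target] += 1                  # more presence, more recorded arrests
        shares.append(arrests[0] / sum(arrests))
    return arrests, shares

# A small initial disparity (6 vs 4 recorded arrests) keeps growing
final, shares = simulate_feedback([6, 4], rounds=20)
```

Every new arrest in the leading district confirms the allocation that produced it: the recorded-arrest share of district 0 only ever increases, with no change in underlying crime. That self-confirmation is precisely the feedback loop paragraph 2(f) asks providers to examine.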
The interplay with the GDPR
Article 10 does not operate in a vacuum. For every organisation processing personal data for AI training, GDPR obligations apply in full. The AI Act adds a layer on top. Recital 69 emphasises that data minimisation and privacy by design remain applicable.
The tension is real: the GDPR limits data collection and processing, while Article 10 demands representative and complete datasets. Organisations must navigate both interests. The special categories exception in paragraph 5 is an attempt to bridge that tension, but the conditions are deliberately strict to prevent abuse.
Academic research has noted that the GDPR and the AI Act sometimes create contradictory incentives in combating algorithmic discrimination, and that the exception in Article 10(5) forms a necessary but insufficient bridge.
Connection to other articles
Article 10 does not stand alone. It forms a triptych with Article 9 (risk management system) and Article 15 (accuracy, robustness, and cybersecurity). The risk management system of Article 9 must identify risks arising from data problems; Article 10 prescribes how to address those problems; and Article 15 requires that the final system performs accurately and robustly on the basis of that data.
Centuro Global notes that organisations are best served by building these data governance requirements on top of their existing GDPR compliance structure, with the Chief Data Officer (CDO) role at the centre.
What should you do now?
The data governance requirements of Article 10 apply to high-risk AI systems from 2 August 2026. That sounds like plenty of time, but the required changes are fundamental. Some concrete steps:
- Inventory your datasets. Map which data you use for training, validation, and testing. Document origin, processing operations, and assumptions.
- Conduct a bias audit. Examine your datasets for possible bias, with particular attention to protected characteristics and feedback loops.
- Bridge the GDPR gap. Ensure your data processing records (Article 30 GDPR) align with the documentation requirements of Article 10.
- Involve domain experts. Data quality is not a purely technical issue. Involve lawyers, ethicists, and domain specialists in formulating and testing assumptions.
- Use European data spaces. Recital 68 points to European data spaces as a source of trustworthy, non-discriminatory data.
- Document everything. The thread running through Article 10 is documentation. Every choice, every assumption, every measure must be traceable.
Conclusion
Article 10 is not the most widely read article of the AI Act, but it is one of the most consequential. Data is the fuel of AI, and whoever fails to control the quality of that fuel cannot guarantee that the end product is safe, fair, and reliable. The European legislator sent a clear message with this article: data governance is not a side issue, but a core obligation.
The examples from Amazon, the American healthcare system, and predictive policing show that this is not abstract regulation. It concerns real people who are affected by deficient data. Article 10 provides the legal framework to prevent that. The task for organisations now is to fill that framework with substance.