The European Commission has published an official template that model providers must use to publish a publicly accessible summary of their training content. This is not a voluntary exercise: the template is the only permitted format for complying with this transparency obligation for general-purpose AI models.
Critical deadline: The template is part of the GPAI package that came into force on August 2, 2025, together with the scope guidelines and the Code of Practice. For existing models, a transition period runs until August 2, 2027.
What it is and why now
The AI Act requires providers of general-purpose AI models to publish a public overview of the content used to train the model. The Commission published a template plus explanatory note for this purpose on July 24, 2025. The goal is to promote transparency so that stakeholders such as rights holders can effectively exercise their rights.
The template ensures uniformity, less room for interpretation, and better comparability between models. Meanwhile, the GPAI guidelines clarify exactly who falls under these obligations, what "placing on the market" means, how to deal with changes, and when an actor that adapts an existing model becomes a provider themselves. They thus place the template in a broader context of governance and responsibility.
For whom does this apply and when
Scope and deadlines
The obligation applies to all providers of general-purpose AI models offered on the EU market, including models available under a free or open-source license. The summary must be available at the latest when the model is placed on the market.
For models that were already on the market before August 2, 2025, a transition period runs until August 2, 2027. Additionally, there is an update obligation: update the summary at least every six months or earlier if new training data warrants it.
Failure to publish can lead to enforcement and, from August 2, 2026, to fines of up to 3% of global annual turnover or €15 million, whichever is higher. The Commission's FAQ also confirms a reasonableness test. If you cannot retrieve certain information despite demonstrable effort, or if retrieval would be disproportionate, you must explicitly mention and justify that gap in the publication.
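The "whichever is higher" rule is simple arithmetic. The sketch below only illustrates the ceiling described above; the function name `max_fine_eur` is an assumption for illustration, and the actual amount of any fine is determined by the enforcing authority, not by this formula.

```python
def max_fine_eur(global_annual_turnover_eur: int) -> int:
    """Upper bound of the fine: 3% of turnover or EUR 15 million, whichever is higher."""
    three_percent = global_annual_turnover_eur * 3 // 100  # integer euros, rounded down
    return max(three_percent, 15_000_000)


# A provider with EUR 2 billion turnover: 3% is the higher amount.
large = max_fine_eur(2_000_000_000)
# A provider with EUR 100 million turnover: the fixed amount is higher.
small = max_fine_eur(100_000_000)
```

Below a turnover of €500 million the fixed €15 million amount is the higher of the two; above it, the 3% figure takes over.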
What needs to be included
The template divides the information into three main parts. Each part has mandatory elements and room for optional clarification.
General information
The first section requires you to identify the provider and the model. You indicate per modality which types of content have been used, for example text, image, audio or video, and outline general characteristics of the dataset. This also includes the training data size per modality, indicated within predefined ranges rather than as exact figures. This provides readers with an initial overview of what the model has seen during training.
List of data sources
| Source type | Required detail level | Specific requirements |
|---|---|---|
| Web scraping | High | Crawler names, period, top 10% domains |
| Public datasets | Medium | Dataset name, administrator, license |
| Private datasets | Low | General description |
| User data | Medium | Modality, product/service, opt-in process |
| Synthetic data | Low | Generation method, source model |
The template distinguishes between different types of sources and prescribes a different level of detail for each type. For web scraping, for example, you must mention the crawler(s) used, the collection period, a description of the content that was scraped, and a list of the top 10% of domains from which content was scraped. For SMEs, the top 5% or a maximum of 1,000 domains applies, whichever is lower.
You also publish an overview of large publicly available datasets with relevant licenses. For user data, you must clearly indicate whether user interactions with your services have been used for training, which modalities this concerns, and for which products or services this applies.
Relevant aspects of data processing
The third section describes points that stakeholders need in order to exercise their rights. Think about how you have dealt with copyright, how you have cleaned up or removed unlawful content, and other processing that is important for the exercise of rights.
Balance between transparency and trade secrets: The Commission emphasizes that all this is intended to provide transparency, within limits that respect trade secrets. The required detail deliberately varies per source, so you don't have to reveal sensitive know-how but do publish useful information.
What you don't need to publish
The template does not ask for a complete dump of your training corpus. You don't have to reveal individual documents or exact data points and you don't have to disclose personal data. Further details about the processing of personal data belong in your privacy statement.
There are also limits to reconstructability. Information that can factually no longer be retrieved does not have to be reproduced at any cost. You then justify why this data is missing and what efforts you have undertaken to collect it anyway.
How to get this right quickly
The approach below works for organizations that work with multiple models, sources and teams. It's not a paper exercise. It forces an internal inventory that makes your governance stronger and improves your position towards rights holders and supervisors.
Organization and responsibilities
Appoint an owner who coordinates content, legal review and publication. Establish coordination with Legal, Privacy, Security and Communication. This prevents inconsistencies between your website, your model card and your technical documentation. A clear owner ensures that the template doesn't fall between different departments and that a consistent story emerges.
Inventorying data sources
Practical mapping of data sources
Map your data sources directly to the template categories: public, private, web-scrape, user-data, synthetic. Link the following information per source:
- Modality (text, image, audio, video)
- Collection period and frequency
- Selection or filtering rules applied
- License status and origin determination
- For web scraping: crawler names and crawl windows
- For publicly available datasets: dataset names and license terms
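The per-source information listed above could be held in one record per source. The following is a minimal sketch in Python, not a format prescribed by the template: the names `SourceType`, `SourceRecord` and `missing_fields`, and all field names, are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class SourceType(Enum):
    # The five categories used for the mapping in this article.
    PUBLIC = "public"
    PRIVATE = "private"
    WEB_SCRAPE = "web-scrape"
    USER_DATA = "user-data"
    SYNTHETIC = "synthetic"


@dataclass
class SourceRecord:
    name: str
    source_type: SourceType
    modalities: list[str]                 # e.g. ["text", "image"]
    collection_period: str                # free text, e.g. "2024-01 to 2024-04"
    filtering_rules: str = ""
    license_status: str = ""
    crawler_names: list[str] = field(default_factory=list)
    crawl_windows: list[str] = field(default_factory=list)
    dataset_license: str = ""

    def missing_fields(self) -> list[str]:
        """Return the fields that are still empty for this source type."""
        missing = []
        if not self.modalities:
            missing.append("modalities")
        if not self.collection_period:
            missing.append("collection_period")
        if self.source_type is SourceType.WEB_SCRAPE:
            if not self.crawler_names:
                missing.append("crawler_names")
            if not self.crawl_windows:
                missing.append("crawl_windows")
        if self.source_type is SourceType.PUBLIC and not self.dataset_license:
            missing.append("dataset_license")
        return missing
```

Running `missing_fields()` over every record gives the owner a quick list of gaps to chase with the responsible teams before drafting the summary.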
Automating domain selection
Ensure that your scraping pipeline can output domain frequency per model version. Record how you determine "top 10%" and keep the complete ranked list internally. For SMEs, apply the top 5% or the 1,000-domain cap, whichever is lower. This makes the six-monthly updates feasible without having to reanalyze all data each time.
Documenting copyright and compliance
Briefly describe how you comply with TDM rules and how you handle opt-outs. Refer to your copyright policy and explain how you remove illegal content. This aligns with what the Commission expects from providers and what is also reflected in the GPAI Code of Practice. A clear explanation of your compliance process strengthens trust with rights holders and supervisors.
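If robots.txt is one of the machine-readable opt-out signals you honor, the check can be documented with a few lines of code. This is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `is_opted_out` and the crawler names are assumptions for illustration, and robots.txt is only one possible opt-out mechanism.

```python
from urllib.robotparser import RobotFileParser


def is_opted_out(robots_txt: str, crawler_name: str, url: str) -> bool:
    """Return True if the given robots.txt content disallows this crawler for the URL."""
    parser = RobotFileParser()
    # parse() takes the file content as lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(crawler_name, url)


# Hypothetical robots.txt that blocks one crawler for the whole site.
robots = "User-agent: AlphaCrawler\nDisallow: /\n"
blocked = is_opted_out(robots, "AlphaCrawler", "https://example.com/articles/1")
allowed = is_opted_out(robots, "WebSift", "https://example.com/articles/1")
```

Logging the result of this check per domain at crawl time gives you evidence for the copyright paragraph of the summary.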
Publication and version management
The summary should be displayed prominently on your own website and alongside the distribution channels where the model is available. Keep version numbers and dates synchronized with your model releases and add a brief changelog for updates. Plan a fixed update moment every six months and link it to your retraining or fine-tuning moments. If you update your model continuously, you will need to update the summary sooner. Record the update process in your quality management system (QMS), also with post-market monitoring in mind.
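The six-month cycle can be checked automatically in a release pipeline. The sketch below assumes the cycle is counted in calendar months from the last publication date; the function names `add_months`, `next_update_due` and `is_update_overdue` are hypothetical.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def next_update_due(last_published: date) -> date:
    """Latest date for the next summary update under a six-month cycle."""
    return add_months(last_published, 6)


def is_update_overdue(last_published: date, today: date) -> bool:
    return today > next_update_due(last_published)


due = next_update_due(date(2025, 8, 2))
```

A release job could call `is_update_overdue` and fail the build, or open a ticket for the summary owner, when the date has passed.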
Common mistakes and how to prevent them
Writing too technical or too vague
An overly technical listing doesn't help the target audience. Overly vague language raises questions among rights holders. Write concretely, with recognizable categories and source examples per modality, but without promotional language. The goal is to inform, not to impress with technical details.
Data silos that don't communicate with each other
Without a data catalog and source labels that align with the template, you quickly get bogged down. Start with mapping to the five source types and work back to the teams. Organizations that don't have their data well organized get stuck in the inventory phase.
Web scraping without origin administration
If crawler names and periods are not logged, the top-10% list becomes guesswork. Ensure in your data engineering that this metadata is recorded as standard. Without proper logging, compliance becomes impossible retroactively.
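Recording this metadata can be as light as one structured log line per crawl run. The following sketch assumes a JSON-lines log; the field names, the function names `crawl_log_entry` and `summarize_crawls`, and the model version labels are illustrative assumptions, not a prescribed format.

```python
import json
from datetime import date


def crawl_log_entry(crawler: str, version: str, start: date, end: date,
                    model_version: str) -> str:
    """Serialize one crawl run as a single JSON line."""
    return json.dumps({
        "crawler": crawler,
        "version": version,
        "start": start.isoformat(),
        "end": end.isoformat(),
        "model_version": model_version,
    }, sort_keys=True)


def summarize_crawls(log_lines: list[str], model_version: str) -> dict:
    """Collect crawler names and the overall collection period for one model version."""
    runs = [json.loads(line) for line in log_lines]
    runs = [r for r in runs if r["model_version"] == model_version]
    if not runs:
        return {"crawlers": [], "period": None}
    crawlers = sorted({f'{r["crawler"]} {r["version"]}' for r in runs})
    # ISO dates sort correctly as strings, so min/max give the overall window.
    period = (min(r["start"] for r in runs), max(r["end"] for r in runs))
    return {"crawlers": crawlers, "period": period}


log = [
    crawl_log_entry("AlphaCrawler", "1.2", date(2024, 1, 8), date(2024, 4, 19), "orion-2"),
    crawl_log_entry("WebSift", "0.9", date(2024, 8, 5), date(2024, 10, 25), "orion-2"),
]
summary = summarize_crawls(log, "orion-2")
```

With entries like these written at crawl time, the crawler names and collection period in the summary come from the log rather than from memory.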
No story with user data
Saying you use "user data" without clarifying modality, product or service backfires. Provide the framework: where does it come from, in what form, and how do you ensure privacy. Transparency means users understand what happens to their data.
Publishing in one place and forgetting updates
The obligation applies to your website and your distribution channels. Moreover, you must update every six months. Automate this in your release process so updates aren't forgotten when you're busy with new developments.
Example paragraphs for different source categories
Model and modalities
Orion-2 is a multimodal language model trained on text, image and audio. The total training data size per modality falls within the ranges specified in the European Commission template. The model is designed for diverse applications in natural language processing and multimodal analysis.
Publicly available datasets: We have used large publicly available corpora for the basic training of the model. For each dataset we mention name, administrator and license for full transparency. Examples include dataset A under license X, managed by organization Y, and dataset B under license Z.
Web scraping: Scraping took place in four windows across two periods: January to April 2024 and August to October 2024. The crawlers used were AlphaCrawler version 1.2 and WebSift version 0.9. The top 10% of domains, from which most content comes, are listed at the bottom of this page with corresponding percentages.
User data: Interactions with our chat service, exclusively textual input, were used after explicit opt-in from users. This concerns prompt data that was used for further training after filtering and anonymization. Further information about the processing of personal data is in the privacy statement of the chat service.
Copyright and removals: We respect the TDM rules from the DSM directive and honor machine-readable opt-outs specified in robots.txt files. Illegal content was detected and removed prior to training through automated detection systems and manual verification. See our copyright policy for more details about these procedures.
How this fits with other GPAI instruments
The Commission explicitly positions the template alongside the GPAI guidelines and the Code of Practice. The guidelines explain who has which obligations and when, including the notification obligation for models with systemic risk. The Code contains operational expectations that help you secure policy and processes.
Triad of governance: Think of it as a triad in which the guidelines define the playing field, the Code helps you work to the expected standard, and the template provides visibility and traceability to the outside world.
These instruments reinforce each other and together form a coherent framework for GPAI governance. Organizations that take all three seriously build a solid foundation for sustainable compliance and stakeholder trust.
Checklist for your publication
Organizational preparation
- Model lead, Legal and Data Engineering involved and one owner appointed
- Clear division of responsibilities between departments established
- Coordination with Communication and Privacy teams arranged
Data sources and content
- Data sources mapped to template categories
- For web scraping: crawler name, period, content description and top 10% domains available
- For publicly available datasets: names and licenses recorded
- User data clearly described, with reference to privacy statement
- Brief copyright paragraph about TDM rules and opt-outs
Publication and process
- Publication page on own site and distribution channels prepared
- Release and update process established, six-month cycle secured
- Version management and changelog functionality implemented
- Justification paragraph ready for any unavailable information
With this checklist you not only comply with the letter of the obligation, but also strengthen your legal and reputational position. The official sources with the template, Q&A and context are available through the press release of July 24, 2025, the comprehensive FAQ with concrete implementation guidance, and the GPAI guidelines with the delineation of roles and timing.