The Language Archive

General information
Description The Language Archive (TLA) is a unit of the Max Planck Institute for Psycholinguistics concerned with digital language resources and tools. It is a large data archive holding resources on languages worldwide.
Organisation Max Planck Institute; National service
Type of service Storage facility
Legislation Netherlands
Usage and appreciation Large number of partners, see:

Support organisation There is a contact form and a support forum.
Stage in the research project During
Position within the research process 3. Analysing data, 4. Preserving data
Domain 3. Shared Research Domain, 5. Public Domain
Type of data 2. Data collections and structured databases
Data curation 5. Store, 6. Access, use and re-use
Data classification Class 3 for public information, Class 2 for internal information [for a limited group]
Administrative information
Funding Three funding bodies: BBAW, KNAW and MPG

Depositor agreement The depositors themselves are responsible for compliance with any legal regulations in the area where the data is collected. Where required by national regulations the archive also signs contracts with national/regional institutions. All ethical issues are dealt with by using Codes of Conduct, such as the DOBES Code of Conduct for the DOBES part of the archive. The repository enables the depositors to restrict access to their resources at various levels. All distributed copies elsewhere are stored under the agreement that they are made available under the same access restrictions, if they are made available.

User agreement The depositor decides on access permissions; a code of conduct is available
Policy Two copies of every resource are stored within the MPI and at least 4 additional copies are stored in different physical locations in Germany. The storage hardware is replaced at regular intervals with current state-of-the-art equipment. Archival content is checked regularly for file and format integrity. The Sun SAM-FS HSM system used for storage also checks file integrity upon file access. The repository will have 2 identical archive access setups at the backup sites in Göttingen and Munich, so that in case of an emergency the data can be accessed via one of these sites.

Intellectual property TLA requires the right to archive, but does not claim copyright
Data curation strategy Largely based on open standards; regular quality assessment via Data Seal of Approval
Target group
Faculty Humanities
Primary target group Linguistics

Secondary target group
Classification of the service
Availability Web interface 24/7; APIs

To better support the proper handling of these moral and/or legal rights, TLA has implemented four levels of access:

1. Open resources can be accessed immediately.
2. Restricted open resources can be accessed by registered users, who may (as in the case of DOBES) have to agree to a Code of Conduct.
3. In addition to the conditions that hold for restricted resources, protected resources can be accessed on request only. The responsible parties (usually the depositors) will examine the request and, if they grant access, may do so for a specific use or a limited amount of time, which may have to be agreed upon in a usage declaration.
4. Some sensitive “closed” resources can be accessed only by the depositors (and, e.g., members of the respective speech community).
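The four levels above can be read as a simple decision rule. The following is a minimal sketch in Python, not TLA's implementation: the `User` fields, the `code_required` flag and the function name are hypothetical, and time-limited or use-specific grants as well as speech community membership are left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLevel(Enum):
    OPEN = 1        # accessible immediately
    RESTRICTED = 2  # registered users, possibly after accepting a Code of Conduct
    PROTECTED = 3   # restricted conditions plus a granted request
    CLOSED = 4      # depositors only (speech community members not modelled)


@dataclass
class User:
    # All fields are hypothetical; they only model the conditions listed above.
    registered: bool = False
    accepted_code_of_conduct: bool = False
    request_granted: bool = False
    is_depositor: bool = False


def can_access(level: AccessLevel, user: User, code_required: bool = True) -> bool:
    """Return True if the user meets the conditions for the given level."""
    if level is AccessLevel.OPEN:
        return True
    if level is AccessLevel.CLOSED:
        return user.is_depositor
    # Levels 2 and 3 share the "restricted" conditions.
    meets_restricted = user.registered and (
        user.accepted_code_of_conduct or not code_required
    )
    if level is AccessLevel.RESTRICTED:
        return meets_restricted
    # PROTECTED: restricted conditions plus a request granted by the depositor.
    return meets_restricted and user.request_granted
```

For example, a registered user who has accepted the Code of Conduct passes the check for a restricted resource, but not for a protected one until `request_granted` is set.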
Integrity The repository in principle makes the original deposited objects available in an unmodified way, if the objects were in one of the accepted file types and encodings. Additionally, lower quality distribution copies of audio and video recordings are made available. New versions of archived resources can be deposited, in which case the old versions will be moved to a version archive. Different versions of the same resource are not compared; we assume the depositor has good reasons for depositing a newer version. A new version of a resource will get a new persistent identifier; the old version will keep the original persistent identifier. Metadata can change if the depositor or archivist sees the need for that, in the case of errors or missing information. Changes to the metadata are currently not logged. All archived objects are linked to their metadata descriptions and are organized in hierarchical (or multi-rooted) tree structures to indicate relationships between objects and sets of objects. The tree structures can change if the depositors decide that this is necessary. The identities of the depositors are checked by means of a login and password when they deposit material online. Provenance metadata as to who made changes to the repository is currently only stored in log files and not shown to the data consumer.

Confidentiality Data sets can be embargoed

Accepted metadata formats IMDI, see
Accepted content types Text, Audio, Video, Images
Accepted preferred formats Archived resources preferentially make use of Unicode; XML; generic models such as ISO LMF, ISOcat DCIF; RDF; XML-EAF; IMDI/CMDI; MPEG 2/1/4; mJPEG2000; JPEG/TIFF/PNG; 48 kHz, 16-bit linear PCM
Accepted file formats
Maximum size of deposits No information was found

Version management Yes
Quality control Quality assurance is the responsibility of the depositor
Access requirements Metadata is openly available. Login is required to download data files. Scholars affiliated with federated organisations may log in using their institutional account. Others can create a guest account on the registration page.
Tools / Interfaces for access Online access. The repository provides various ways of utilizing the archived data via online tools as well as by downloading the data in formats commonly used by the research communities. An advanced metadata search utility is provided, as well as a deep search tool for textual content. All metadata can be harvested via the OAI-PMH protocol. Unique persistent identifiers according to the Handle system are provided for each archived object.
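As an illustration of metadata harvesting via OAI-PMH, the sketch below builds `ListRecords` request URLs and reads the resumption token that the protocol uses for paging. The base URL is a placeholder, since this sheet does not state the archive's actual endpoint; the `oai_dc` metadata prefix is the one every OAI-PMH repository must support, and the function names are hypothetical. No network request is made here.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint: the archive's real OAI-PMH base URL is not given here.
BASE_URL = "https://archive.example.org/oai"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: a follow-up
    request carries only the verb and the token.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()
```

A harvester would fetch `list_records_url(BASE_URL)`, pass the response body to `next_token`, and repeat with the returned token until it is `None`.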
(Persistent) identifiers DOI
Long term guarantees MD5 checksums are calculated for all objects and checked periodically. The availability of files on the file system is checked automatically daily. The availability of the archive access tools is checked automatically multiple times a day. The availability of file, web and application servers is monitored continuously. New versions of archived resources can be deposited, in which case the old versions will be moved to a version archive. In the future these old versions will also be made available to the end users but this is currently not yet the case.
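A periodic MD5 check of the kind described above can be sketched as follows. This is an assumption-laden illustration, not the archive's tooling: the manifest layout (relative path mapped to the expected hex digest) and the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: Dict[str, str], root: Path) -> List[str]:
    """Return the relative paths that are missing or whose checksum changed.

    The manifest (relative path -> expected MD5) is a hypothetical format.
    """
    failures = []
    for relative_path, expected in manifest.items():
        target = root / relative_path
        if not target.is_file() or md5_of(target) != expected:
            failures.append(relative_path)
    return failures
```

Reading in chunks keeps memory use constant for large audio and video files; an empty result from `verify` means every listed object is present and unchanged.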
Complies with international standards for trusted repositories DSA certified
Preservation strategy TLA has an explicit mission to archive language resources from all around the world, collected both by associated researchers and by researchers who are not affiliated with the federated organisations. The mission is supported by the official possibility to store full copies at two computer centers at different locations, for which the president of the Max Planck Society gives an institutional backing of 50 years of bit-stream preservation. We are working on duplicating the archive access framework in those backup locations as well, so that access to the data can be provided even if our institute were to cease to exist.
Storage No. No information was found
Access No
Preservation Yes. We urge all researchers, institutions and individuals who are in possession of such data to seriously consider the need for a long-term preservation plan, to ensure that these data will be available for future generations.
Special conditions
Agreements with Leiden University None. Not necessary; researchers from applicable fields may deposit their linguistics data