EAC-TM – Another freely available translation memory, in 26 languages


From: Ralf Steinberger 
Sent: Wednesday, February 06, 2013 1:26 AM
Subject: EAC-TM – Another freely available translation memory, in 26 languages

 

EAC-TM is a translation memory (sentences and their manually produced translations) in 26 languages. It is a multilingual parallel corpus covering 325 language pairs.

 

Size:       Up to 5,100 translation units per language; 78,000 in total.

 

Languages:  All 325 language pairs involving the following 26 languages:
Bulgarian, Czech, Danish, Dutch, English, Estonian, German,
Greek, Finnish, French, Croatian, Hungarian, Icelandic, Italian,
Latvian, Lithuanian, Maltese, Norwegian, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish and Turkish.
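
The figure of 325 follows from counting the unordered pairs that can be formed from 26 languages: 26 × 25 / 2 = 325. As a minimal check (the variable names are illustrative and not part of the release):

```python
from itertools import combinations

# The 26 languages as listed in this announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian", "German",
    "Greek", "Finnish", "French", "Croatian", "Hungarian", "Icelandic", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Norwegian", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish", "Turkish",
]

# Every unordered pair of distinct languages, e.g. ("Bulgarian", "Czech").
LANGUAGE_PAIRS = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES), "languages,", len(LANGUAGE_PAIRS), "language pairs")
```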

URL:        http://langtech.jrc.ec.europa.eu/EAC-TM.html  

Creator:    EC Directorate General for Education and Culture (EAC) and JRC

 

 

WHAT IS EAC-TM

 

EAC-TM was produced by translating the English-language form data for two programmes of the European Commission’s Directorate General for Education and Culture (EAC): the Lifelong Learning Programme (LLP) and the Youth in Action Programme. The translations were stored in 25 bilingual translation memories, each pairing English with one other language. DG EAC and the JRC post-processed these by cleaning the data and producing a single alignment across all 26 languages, which yields parallel data for 325 language pairs.
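
To illustrate how bilingual memories that share an English source can yield parallel data for pairs that do not involve English, here is a minimal sketch. It is not the procedure actually used by DG EAC and the JRC, which is not described here; it simply assumes that each bilingual memory maps an English sentence to its translation and merges the memories on the shared English text. The function names, language codes and sample sentences are hypothetical.

```python
def merge_on_pivot(bilingual_memories, pivot="en"):
    """Merge bilingual memories that share a pivot language into one alignment.

    bilingual_memories: {target_lang: {pivot_sentence: translation}}
    Returns {pivot_sentence: {lang: sentence}}, including the pivot itself.
    """
    aligned = {}
    for lang, memory in bilingual_memories.items():
        for pivot_sentence, translation in memory.items():
            unit = aligned.setdefault(pivot_sentence, {pivot: pivot_sentence})
            unit[lang] = translation
    return aligned


def extract_pair(aligned, src, tgt):
    """Return (src, tgt) sentence pairs for units that contain both languages."""
    return [(u[src], u[tgt]) for u in aligned.values() if src in u and tgt in u]


# Hypothetical toy data: two English-based memories (assumed, not from EAC-TM).
MEMORIES = {
    "de": {"First name": "Vorname", "Date of birth": "Geburtsdatum"},
    "fr": {"First name": "Prénom", "Date of birth": "Date de naissance",
           "Country": "Pays"},
}

ALIGNED = merge_on_pivot(MEMORIES)
print(extract_pair(ALIGNED, "de", "fr"))
```

Units that lack a translation in either language are simply skipped for that pair, which is one reason the number of units can differ from language to language.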

 

The underlying documents are thus form data in the field of education and culture.

 

The EAC Translation Memory is much smaller than the other multilingual resources distributed in the past by the European Commission’s Joint Research Centre (JRC). Its main advantages are that (a) it covers even more languages and (b) it is based on texts from a very different domain (education and culture).

 

 

MOTIVATION FOR THIS RELEASE

 

The public data release is in line with the general effort of the European Commission to support multilingualism, language diversity and the re-use of Commission information. It follows the release of the JRC-Acquis parallel corpus in 2006 (over 1 billion words in 22 languages), the DGT-TM Translation Memory in 2007 and 2011, the multilingual named entity resource JRC-Names in 2011, the multi-label classification software JRC EuroVoc Indexer JEX in 22 languages in 2012, the ECDC-TM Translation Memory in 25 languages in 2012, the DGT-Acquis parallel corpus in 23 languages in 2012, and further smaller multilingual resources. See http://ipsc.jrc.ec.europa.eu/?id=61 for more information on these resources.

 

 

WHAT EAC-TM CAN BE USED FOR

               

EAC-TM can be fed into translation memory software to support human translators in their work. As it is a multilingual parallel corpus in electronic form, it can furthermore be used by specialists in computational linguistics to train statistical machine translation software, to generate multilingual dictionaries, to train and test multilingual information extraction software, and more.
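
As a rough sketch of how sentence pairs for one language pair might be pulled out of a translation memory for such uses: the example below assumes the data is available in TMX, a common exchange format for translation memories. This announcement does not state the distribution format, so please check the download page before relying on this. The sample document, language codes and function name are hypothetical; only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical miniature TMX document, not taken from EAC-TM.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Date of birth</seg></tuv>
      <tuv xml:lang="de"><seg>Geburtsdatum</seg></tuv>
      <tuv xml:lang="fr"><seg>Date de naissance</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Country</seg></tuv>
      <tuv xml:lang="fr"><seg>Pays</seg></tuv>
    </tu>
  </body>
</tmx>"""


def sentence_pairs(tmx_text, src, tgt):
    """Return (src, tgt) segment pairs from translation units holding both."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


print(sentence_pairs(SAMPLE_TMX, "de", "fr"))
```

Real files may use region-qualified codes (for example "EN-GB"), in which case the codes passed to the function need to match the lower-cased form found in the file.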

 

 

WHAT NEXT?

 

The JRC and collaborating services of the European Commission hope to release further large-scale linguistic resources in the future.

 

 

Ralf Steinberger & Mohamed Ebrahim
European Commission – Joint Research Centre (JRC)
21027 Ispra (VA), Italy

URL – Applications: http://emm.newsbrief.eu/overview.html

URL – Publications on the science behind them: http://langtech.jrc.ec.europa.eu/JRC_Publications.html