The DGT Multilingual Translation Memory
of the Acquis Communautaire: DGT-TM

 

Summary

 

Since November 2007 the European Commission's Directorate-General for Translation has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to foster the European Commission’s general effort to support multilingualism, language diversity and the re-use of Commission information. In 2011 it released a second, even larger version.

 

These Translation Memories are huge in both size and scope, but are of limited use to translators who work from/into more than 2 official EU languages. DGT’s tools can only extract language pairs from these memories; they do not allow for the selective deletion of segments in unwanted languages. For example, if you work into Bulgarian, English, German, Hungarian, and Slovenian, you can extract each of the 10 language pairs among them, but cannot reduce the source files from their original 22 languages to your 5 languages only, for use in dtSearch or similar powerful search programs.

 

Below I provide a means for you to do just that, so you can cull unwanted languages and end up with a much slimmer subset of the database files which takes up much less space on your hard drive and is much easier to use.

 

The Details

 

My procedure is based on the UltraEdit Professional Text/HEX Editor. This is another utility that professional translators need anyway.

 

UE has a powerful feature called Replace in Files:

[Screenshot: UE_Replace.jpg, the UltraEdit Replace in Files dialog]

 

What it does is Search and Replace in any number of files in a directory structure without your opening them. For example, if you unzip your EU DGT-TM files into the M:\EU\ directory, it will look into each of its subdirectories and automatically perform the S&R operation on the files found there. All you need to do is feed it the regular expressions which eliminate the unwanted languages.

 

The screenshot above shows the expression which removes Swedish. Hit Replace All and you’re set.

 

Here are the expressions you need to write one after the other into the Find What field to delete your unwanted languages:

 

%<tuv lang="EN-*^p</tuv>

%<tuv lang="BG-*^p</tuv>

%<tuv lang="CS-*^p</tuv>

%<tuv lang="DA-*^p</tuv>

%<tuv lang="DE-*^p</tuv>

%<tuv lang="EL-*^p</tuv>

%<tuv lang="ES-*^p</tuv>

%<tuv lang="ET-*^p</tuv>

%<tuv lang="FI-*^p</tuv>

%<tuv lang="FR-*^p</tuv>

%<tuv lang="HU-*^p</tuv>

%<tuv lang="IT-*^p</tuv>

%<tuv lang="LT-*^p</tuv>

%<tuv lang="LV-*^p</tuv>

%<tuv lang="MT-*^p</tuv>

%<tuv lang="NL-*^p</tuv>

%<tuv lang="PL-*^p</tuv>

%<tuv lang="PT-*^p</tuv>

%<tuv lang="RO-*^p</tuv>

%<tuv lang="SK-*^p</tuv>

%<tuv lang="SL-*^p</tuv>

%<tuv lang="SV-*^p</tuv>

 

Obviously, leave out the expressions for the languages you wish to keep, so that only the unwanted languages are deleted.
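If you would rather not edit the list by hand, the expressions can be generated. The following is a small Python sketch, not part of the original procedure: it prints the Find What expressions for every language you do not keep. The function name `find_what_expressions` is hypothetical, and the code list simply mirrors the 22 codes shown above.

```python
# The 22 language codes from the list above.
ALL_CODES = ["EN", "BG", "CS", "DA", "DE", "EL", "ES", "ET", "FI", "FR", "HU",
             "IT", "LT", "LV", "MT", "NL", "PL", "PT", "RO", "SK", "SL", "SV"]


def find_what_expressions(keep):
    """Return the UltraEdit expressions for every language NOT in `keep`."""
    keep = {code.upper() for code in keep}
    unknown = keep - set(ALL_CODES)
    if unknown:
        raise ValueError(f"unknown language codes: {sorted(unknown)}")
    # Same expression shape as in the list above, one per unwanted language.
    return [f'%<tuv lang="{code}-*^p</tuv>' for code in ALL_CODES if code not in keep]


if __name__ == "__main__":
    # Example: keep Bulgarian, English, German, Hungarian and Slovenian.
    for expression in find_what_expressions(["BG", "EN", "DE", "HU", "SL"]):
        print(expression)
```

Each printed line can then be pasted into the Find What field in turn.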

 

You can automate this deletion sequence in AutoHotkey, MacroExpress or similar macro programs.
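A scripting language can also do the whole deletion in one pass instead of replaying the dialog. Below is a minimal Python sketch under stated assumptions, not a tested replacement for the UltraEdit procedure: it assumes the TMX files are UTF-16 encoded and that each unwanted block runs from `<tuv lang="XX-` to `</tuv>` followed by a line break, which is the shape targeted by the Notepad++ expression quoted further down. The names `build_pattern`, `strip_languages` and `strip_tree` are hypothetical.

```python
import re
from pathlib import Path


def build_pattern(remove_codes):
    """Compile a regex matching whole <tuv> blocks for the given language codes."""
    codes = list(remove_codes)
    if not codes:
        raise ValueError("no language codes given")
    alternation = "|".join(re.escape(code) for code in codes)
    # Non-greedy match; DOTALL lets "." cross the line breaks inside a block.
    return re.compile(r'<tuv lang="(?:%s)-.+?</tuv>\r?\n' % alternation, re.DOTALL)


def strip_languages(text, remove_codes):
    """Return `text` without the <tuv> blocks of the unwanted languages."""
    return build_pattern(remove_codes).sub("", text)


def strip_tree(root, remove_codes, encoding="utf-16"):
    """Rewrite every .tmx file under `root` in place (assumption: UTF-16 files)."""
    pattern = build_pattern(remove_codes)
    for path in Path(root).rglob("*.tmx"):
        # newline="" keeps the original line endings untouched.
        with open(path, encoding=encoding, newline="") as handle:
            text = handle.read()
        cleaned = pattern.sub("", text)
        if cleaned != text:
            with open(path, "w", encoding=encoding, newline="") as handle:
                handle.write(cleaned)
```

A call such as `strip_tree(r"M:\EU", ["SV", "FI"])` would then process the whole directory structure. Since the files are overwritten, it is safer to run this on a copy first.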

 

When you’re done with all your mass deletions, you can rezip your subdirectories to reduce them to about an eighth of their size. Then index your zipped EU directory structure in dtSearch and you’re set to sail with another powerful tool for your future translations.
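The rezipping step can be scripted as well. Here is a short Python sketch using only the standard library; the function name `zip_subdirectories` is hypothetical, and it assumes you want one zip archive per immediate subdirectory, placed next to that subdirectory. It does not delete the unzipped files.

```python
import shutil
from pathlib import Path


def zip_subdirectories(root):
    """Create <subdir>.zip next to each immediate subdirectory of `root`."""
    archives = []
    for subdir in sorted(Path(root).iterdir()):
        if subdir.is_dir():
            # make_archive appends ".zip" to the base name by itself and
            # stores the contents of `subdir` relative to that directory.
            archive = shutil.make_archive(str(subdir), "zip", root_dir=str(subdir))
            archives.append(Path(archive))
    return archives
```

For example, `zip_subdirectories(r"M:\EU")` would return the list of archives it created.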

 

Good luck!

 

Laszlo Gabris

Skype: laszlo.gabris

laszlo@huntrans124.com

laszlo8089@gmail.com

www.huntrans124.com

 

 

After I published this Application Note, Maxime Boisset (max4lists@boisset.eu) published three more solutions on the memoQ@yahoogroups.com mailing list. His solutions are better, because they run on free software and are much faster. I republish them here with his permission.

 

Thank you very much for this procedure Laszlo.

I followed your instructions, but used Notepad++ instead of UltraEdit.

 

In Notepad++, when you hit Ctrl+F, you can see a tab "Find in Files".

In it you can search and replace in many files at the same time.

 

I use the following regular expression:

<tuv lang="(BG|CS|DA|EL|ES|ET|FI|HU|IT|LT|LV|MT|NL|PL|PT|RO|SK|SL|SV)-.+?</tuv>\r\n

to remove all languages except German, English and French

with the box ". matches newlines" checked.
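To see why the non-greedy `.+?` and the ". matches newlines" box both matter, here is a small Python demonstration. It is only an illustration: Python's `re` module is not the Notepad++ engine, although these particular constructs behave the same way in both, and the sample text is made up for the example.

```python
import re

# Made-up sample with Windows line endings, one <tuv> block per language.
SAMPLE = (
    '<tuv lang="EN-GB">\r\n<seg>Hello</seg>\r\n</tuv>\r\n'
    '<tuv lang="IT-IT">\r\n<seg>Ciao</seg>\r\n</tuv>\r\n'
    '<tuv lang="FR-FR">\r\n<seg>Bonjour</seg>\r\n</tuv>\r\n'
    '<tuv lang="SV-SE">\r\n<seg>Hej</seg>\r\n</tuv>\r\n'
)

CODES = "BG|CS|DA|EL|ES|ET|FI|HU|IT|LT|LV|MT|NL|PL|PT|RO|SK|SL|SV"

# The expression above: non-greedy, with "." allowed to match newlines.
lazy_dotall = re.compile(r'<tuv lang="(%s)-.+?</tuv>\r\n' % CODES, re.DOTALL)
# Greedy variant: runs on to the LAST </tuv>, swallowing wanted blocks too.
greedy_dotall = re.compile(r'<tuv lang="(%s)-.+</tuv>\r\n' % CODES, re.DOTALL)
# Without ". matches newlines" the expression cannot span the three lines.
lazy_no_dotall = re.compile(r'<tuv lang="(%s)-.+?</tuv>\r\n' % CODES)

print(lazy_dotall.sub("", SAMPLE))               # EN and FR blocks survive
print(greedy_dotall.sub("", SAMPLE))             # FR block is lost as well
print(lazy_no_dotall.sub("", SAMPLE) == SAMPLE)  # nothing was removed
```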

 

-------------

<Comment by László>

 

Use the following expression to clean the latest ECDC.tmx file of unwanted languages:

 

\s+<tuv xml:lang="(BG|EN|HU|DE|SL|IS|NO|CS|DA|EL|ES|ET|FI|FR|GA|IT|LT|LV|MT|NL|PL|PT|RO|SK|SV).+?</tuv>

 

(delete the codes of those you do not want culled; check Wrap around, Regular expression, and . matches newlines)

</Comment by László >

-------------

 

This works only with the most recent version of Notepad++, which has much better support for regular expressions.

http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions

 

>>>> 

 

Just for kicks, I tried doing it in gvim too.

 

You have to type this in command mode:

:args *.tmx

:arg

:argdo %s/<tuv lang="\(BG\|CS\|DA\|EL\|ES\|ET\|FI\|HU\|IT\|LT\|LV\|MT\|NL\|PL\|PT\|RO\|SK\|SL\|SV\)-\_.\{-}<\/tuv>\n//ge | update

 

 

Notes about the regex:

\_.  = any character including new lines

\{-} = non greedy operator

\( and \) = group delimiter

\| = alternative (or) operator

\/ = literal "/"

\n = linefeed

| = concatenate commands

update = save changed files

 

This worked well on the test files, but there is a fly in the ointment: gvim does not process all files automatically as expected. Instead, it processes a certain number of files and displays a list of processed files. The user has to hit <space> after each page to start the processing of the other files. (Notepad++ slurped all the DGT TMX files at one fell swoop.)

 

This makes the procedure inapplicable for large numbers of files.

 

>>>> 

 

I have found a solution that does the whole job in 3 min 01 seconds!

It is a bash script that you can run under Linux. I have tested it only under Ubuntu, but it probably runs under Cygwin too.

 

The result is a plain text file that looks like this:

 

----------

THE EEA JOINT COMMITTEE,

 

---

DER GEMEINSAME EWR-AUSSCHUSS —

 

---

LE COMITÉ MIXTE DE l'EEE,

 

 

 

Here is the script:

--- start of file ---

# Create the directory for the new files
mkdir -p txt

# With "for" start a loop that ends with "done".
for filename in *.tmx ; do
    # Convert the file to UTF-8 encoding with iconv,
    # since grep and sed do not support UTF-16 encoding
    # and UTF-8 encoded files are about half as big.
    # With grep, extract the lines containing the tags for your languages
    # and the line below each of them (with the context option "-A 1").
    # Then remove the tags and other bits with sed.
    # Write the output to the "txt" directory.
    iconv -f UTF-16 -t UTF-8 "$filename"                    \
      | grep -A 1 -G '<tuv lang="\(EN-GB\|DE-DE\|FR-FR\)">' \
      | sed  -e 's/<tuv lang="\(DE-DE\|FR-FR\)">/---/g'     \
             -e 's/<tuv lang="\(EN-GB\)">/----------/g'     \
             -e 's:</*seg>::g'                              \
             -e 's/^--$//g'                                 \
      > "txt/$(basename "$filename" .tmx)_de-fr-en.txt"
done

--- end of file ---
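For readers without a Linux or Cygwin shell, roughly the same extraction can be sketched in Python. This is an approximation of the script above, not a line-for-line port: it assumes UTF-16 input and that each `<tuv lang="...">` tag is followed by its `<seg>` element, and the names `tmx_to_text` and `convert_directory` are hypothetical. The output layout follows the sample shown earlier (a separator line, the segment, a blank line).

```python
import re
from pathlib import Path

# Separator written before each segment, as in the sample output.
SEPARATORS = {"EN-GB": "----------", "DE-DE": "---", "FR-FR": "---"}

# One <tuv> opening tag followed by its <seg> element.
TUV = re.compile(r'<tuv lang="([^"]+)">\s*<seg>(.*?)</seg>', re.DOTALL)


def tmx_to_text(tmx_text):
    """Return the wanted segments of one TMX document as plain text."""
    lines = []
    for lang, segment in TUV.findall(tmx_text):
        if lang in SEPARATORS:
            lines.extend([SEPARATORS[lang], segment, ""])
    return "\n".join(lines)


def convert_directory(source=".", target="txt", encoding="utf-16"):
    """Write one UTF-8 text file per .tmx file found in `source`."""
    out_dir = Path(target)
    out_dir.mkdir(exist_ok=True)
    for path in sorted(Path(source).glob("*.tmx")):
        text = path.read_text(encoding=encoding)
        out_file = out_dir / f"{path.stem}_de-fr-en.txt"
        out_file.write_text(tmx_to_text(text), encoding="utf-8")
```

Calling `convert_directory()` in the directory holding the TMX files fills a `txt` subdirectory, as the bash script does. To pick other languages, change the keys of `SEPARATORS`.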