The DGT Multilingual Translation Memory of the Acquis Communautaire: DGT-TM
Summary
Since November 2007 the European Commission's Directorate-General for Translation has made its multilingual Translation Memory for the Acquis Communautaire, DGT-TM, publicly accessible in order to foster the European Commission’s general effort to support multilingualism, language diversity and the re-use of Commission information. In 2011 it released a second, even larger version.
These Translation Memories are huge in both size and scope, but are of limited use to translators who work from or into more than two official EU languages. The reason is that DGT’s tools can only extract language pairs from these memories; they do not allow the selective deletion of segments in unwanted languages. For example, if you work into Bulgarian, English, German, Hungarian and Slovenian, you can extract your 14 language pairs, but you cannot reduce the source files from their original 22 languages to your 5 languages for use in dtSearch or similar powerful search programs.
Below I provide a means to do just that, so you can cull unwanted languages and end up with a much slimmer subset of the database files, which takes up much less space on your hard drive and is much easier to use.
The Details
My procedure is based on the UltraEdit Professional Text/HEX Editor, a utility that professional translators need anyway.
UltraEdit (UE) has a powerful feature called Replace in Files:
It performs a search and replace across any number of files in a directory structure without your having to open them. For example, if you unzip your EU DGT-TM files into the M:\EU\ directory, it will look into each of its subdirectories and automatically perform the search-and-replace operation on the files found there. All you need to do is feed it the regular expressions which eliminate the unwanted languages.
The screenshot above shows the expression which removes Swedish. Hit Replace All and you’re set.
Here are the expressions to enter, one after the other, into the Find What field to delete your unwanted languages:
%<tuv
lang="EN-*^p</tuv>
%<tuv
lang="BG-*^p</tuv>
%<tuv
lang="CS-*^p</tuv>
%<tuv
lang="DA-*^p</tuv>
%<tuv
lang="DE-*^p</tuv>
%<tuv
lang="EL-*^p</tuv>
%<tuv
lang="ES-*^p</tuv>
%<tuv
lang="ET-*^p</tuv>
%<tuv
lang="FI-*^p</tuv>
%<tuv
lang="FR-*^p</tuv>
%<tuv
lang="HU-*^p</tuv>
%<tuv
lang="IT-*^p</tuv>
%<tuv
lang="LT-*^p</tuv>
%<tuv
lang="LV-*^p</tuv>
%<tuv
lang="MT-*^p</tuv>
%<tuv
lang="NL-*^p</tuv>
%<tuv
lang="PL-*^p</tuv>
%<tuv
lang="PT-*^p</tuv>
%<tuv
lang="RO-*^p</tuv>
%<tuv
lang="SK-*^p</tuv>
%<tuv
lang="SL-*^p</tuv>
%<tuv
lang="SV-*^p</tuv>
Obviously, you should strike from this list the expressions for the languages you wish to keep, and run only the remaining ones.
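Instead of editing the list by hand, the expressions for the languages to be removed can be generated. The following is only a minimal Python sketch: the function name `find_what_expressions` and its `keep` parameter are illustrative names of my own, not part of UltraEdit, and the expression format is simply copied from the list above.

```python
# The 22 language codes that appear in the list above.
ALL_CODES = ["EN", "BG", "CS", "DA", "DE", "EL", "ES", "ET", "FI", "FR", "HU",
             "IT", "LT", "LV", "MT", "NL", "PL", "PT", "RO", "SK", "SL", "SV"]


def find_what_expressions(keep):
    """Return one Find What expression per language that is NOT in `keep`.

    `keep` is an iterable of two-letter codes (hypothetical helper; the
    expression text itself is taken unchanged from the list above).
    """
    keep = {code.upper() for code in keep}
    return ['%<tuv lang="' + code + '-*^p</tuv>'
            for code in ALL_CODES if code not in keep]


if __name__ == "__main__":
    # Example: keep Bulgarian, English, German, Hungarian and Slovenian.
    for expression in find_what_expressions(["BG", "EN", "DE", "HU", "SL"]):
        print(expression)
```

With five languages kept, this prints the 17 remaining expressions, ready to be pasted one by one or fed to a macro program.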
You can automate this deletion sequence in AutoHotkey, MacroExpress or similar macro programs.
When you’re done with all your mass deletions, you can rezip your subdirectories to reduce them to about an eighth of their size. Then index your zipped EU directory structure in dtSearch and you’re set to sail with another powerful tool for your future translations.
Good luck,
Laszlo Gabris
Skype: laszlo.gabris
laszlo@huntrans124.com
laszlo8089@gmail.com
After I published this Application Note, Maxime Boisset (max4lists@boisset.eu) published three more solutions on the memoQ@yahoogroups.com mailing list. His solutions are better, because they run on free software and are much faster. I republish them here with his permission.
Thank you very much for this procedure, Laszlo.
I followed your instructions, but used Notepad++ instead of UltraEdit.
In Notepad++, when you hit Ctrl+F, you can see a tab "Find in Files".
In it you can search and replace in many files at the same time.
I use the following regular expression:
<tuv lang="(BG|CS|DA|EL|ES|ET|FI|HU|IT|LT|LV|MT|NL|PL|PT|RO|SK|SL|SV)-.+?</tuv>\r\n
to remove all languages except German, English and French, with the box ". matches newline" checked.
-------------
<Comment by László>
Use the following expression to clean the latest ECDC.tmx file of unwanted languages:
\s+<tuv xml:lang="(BG|EN|HU|DE|SL|IS|NO|CS|DA|EL|ES|ET|FI|FR|GA|IT|LT|LV|MT|NL|PL|PT|RO|SK|SV).+?</tuv>
(Delete from the expression the codes of the languages you want to keep, so that they are not culled; check Wrap around, Regular expression, and ". matches newline".)
</Comment by László>
-------------
This works only with the most recent version of Notepad++, which has much better support for regular expressions.
http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
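For readers who want to try the Notepad++ expression outside the editor, here is a rough Python equivalent. It is a sketch under stated assumptions: `strip_languages` is a hypothetical helper name, the sample fragment is invented for illustration, and `re.DOTALL` stands in for the ". matches newline" box. The only deliberate change is `\r?\n` instead of `\r\n`, so that Unix line endings are accepted too.

```python
import re

# Same alternation as in the Notepad++ expression above.
UNWANTED = "BG|CS|DA|EL|ES|ET|FI|HU|IT|LT|LV|MT|NL|PL|PT|RO|SK|SL|SV"

# re.DOTALL lets "." match line breaks; ".+?" is non-greedy, so each match
# stops at the first closing </tuv> instead of running to the last one.
TUV_PATTERN = re.compile(
    r'<tuv lang="(?:' + UNWANTED + r')-.+?</tuv>\r?\n', re.DOTALL)


def strip_languages(tmx_text):
    """Return tmx_text with the <tuv> elements of unwanted languages removed."""
    return TUV_PATTERN.sub("", tmx_text)


if __name__ == "__main__":
    # Invented sample; the three-line <tuv> layout is an assumption.
    sample = (
        '<tu>\r\n'
        '<tuv lang="EN-GB">\r\n<seg>THE EEA JOINT COMMITTEE,</seg>\r\n</tuv>\r\n'
        '<tuv lang="SV-SE">\r\n<seg>(Swedish segment)</seg>\r\n</tuv>\r\n'
        '</tu>\r\n'
    )
    print(strip_languages(sample))
```

The non-greedy quantifier is the important design choice: with a greedy `.+` and dot-matches-newline, a single match would swallow everything up to the last `</tuv>` in the file, including the languages you want to keep.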
>>>>
Just for kicks, I tried doing it in gvim too.
You have to type this in command mode:
:args *.tmx
:arg
:argdo %s/<tuv lang="\(BG\|CS\|DA\|EL\|ES\|ET\|FI\|HU\|IT\|LT\|LV\|MT\|NL\|PL\|PT\|RO\|SK\|SL\|SV\)-\_.\{-}<\/tuv>\n//ge | update
Notes about the regex:
\_. = any character including newlines
\{-} = non-greedy operator
\( and \) = group delimiters
\| = alternative (or) operator
\/ = literal "/"
\n = linefeed
| = concatenate commands
update = save changed files
This worked well on the test files, but there is a fly in the ointment: gvim does not process all files automatically as expected. Instead, it processes a certain number of files and displays a list of the processed files, and the user has to hit <space> after each page to start the processing of the next files. (Notepad++ slurped all the DGT TMX files in one fell swoop.)
This makes the procedure impractical for large numbers of files.
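The same loop-and-save idea as the `:argdo ... | update` command can be sketched in Python, which runs through all files without any paging prompt. This is a sketch only and rests on assumptions: `clean_tmx_files` is a hypothetical name, the files are assumed to be UTF-16 with a byte order mark, and unlike `:args *.tmx` it also descends into subdirectories. Try it on a copy of the files first, since it rewrites them in place.

```python
import re
import sys
from pathlib import Path

UNWANTED = "BG|CS|DA|EL|ES|ET|FI|HU|IT|LT|LV|MT|NL|PL|PT|RO|SK|SL|SV"
# ".*?" with re.DOTALL plays the role of "\_.\{-}" in the gvim expression.
TUV_PATTERN = re.compile(
    r'<tuv lang="(?:' + UNWANTED + r')-.*?</tuv>\r?\n', re.DOTALL)


def clean_tmx_files(directory, encoding="utf-16"):
    """Strip unwanted <tuv> elements from every .tmx file below `directory`.

    Returns the list of files that were changed and rewritten.
    """
    changed = []
    for path in sorted(Path(directory).rglob("*.tmx")):
        # newline="" keeps the original line endings untouched.
        with open(path, encoding=encoding, newline="") as handle:
            text = handle.read()
        cleaned = TUV_PATTERN.sub("", text)
        if cleaned != text:  # like ":update", write only if something changed
            with open(path, "w", encoding=encoding, newline="") as handle:
                handle.write(cleaned)
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Directories to clean are given on the command line.
    for directory in sys.argv[1:]:
        for path in clean_tmx_files(directory):
            print("cleaned", path)
```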
>>>>
I have found a solution that does the whole job in 3 min 01 seconds!
It is a bash script that you can run under Linux. I have tested it only under Ubuntu, but it probably runs under Cygwin too.
The result is a plain text file that looks like this:
----------
THE EEA JOINT COMMITTEE,
---
DER GEMEINSAME EWR-AUSSCHUSS —
---
LE COMITÉ MIXTE DE l'EEE,
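If the text output is to be processed further, the separator lines make it easy to regroup the segments. The following Python sketch assumes the layout shown in the sample: a line of ten dashes opens a new unit and is followed by the English segment, and a line of three dashes precedes each further segment. The name `parse_units` is hypothetical.

```python
import sys


def parse_units(lines):
    """Group the text output into translation units.

    Returns a list of units, each a list of segment strings in file order.
    Text that appears before the first ten-dash line is ignored.
    """
    units = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "---":
            continue  # blank lines and inner separators carry no text
        if line == "----------":
            units.append([])  # a ten-dash line starts a new unit
        elif units:
            units[-1].append(line)
    return units


if __name__ == "__main__":
    # File names are given on the command line; UTF-8 input is assumed.
    for name in sys.argv[1:]:
        with open(name, encoding="utf-8") as handle:
            for unit in parse_units(handle):
                print(" | ".join(unit))
```

The parser does not label the segments with language codes, because the three-dash separator is the same for German and French; it relies on the order in which the segments appear in the file.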
Here is the script:
--- start of file ---
#!/bin/bash
# Create directory for new files
mkdir -p txt
# Convert each file to UTF-8 encoding with iconv,
# since grep and sed do not support UTF-16 encoding
# and UTF-8 encoded files are about half as big.
# With grep extract the lines containing the tags for your languages and the
# line below each of them (with the context option "-A 1").
# Then remove tags and other bits with sed.
# Write the output to the "txt" directory.
# "for" starts a loop that ends with "done".
for filename in *.tmx ; do
    iconv -f UTF-16 -t UTF-8 "$filename" \
    | grep -A 1 -G '<tuv lang="\(EN-GB\|DE-DE\|FR-FR\)">' \
    | sed \
        -e 's/<tuv lang="\(DE-DE\|FR-FR\)">/---/g' \
        -e 's/<tuv lang="\(EN-GB\)">/----------/g' \
        -e 's:</*seg>::g' \
        -e 's/^--$//g' \
    > txt/"$(basename "$filename" .tmx)_de-fr-en.txt"
done
--- end of file ---
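For readers without Linux or Cygwin, the same extraction can be approximated in Python. This is a sketch rather than a faithful port: `tmx_to_text` is a hypothetical name, it assumes that each `<tuv lang="...">` tag sits on its own line with the `<seg>` line directly below it (which is what the `grep -A 1` step relies on), and it does not reproduce the empty lines that the sed step leaves behind.

```python
import re
import sys
from pathlib import Path

TUV = re.compile(r'<tuv lang="([^"]+)">')
SEG_TAGS = re.compile(r"</?seg>")


def tmx_to_text(lines, first="EN-GB", others=("DE-DE", "FR-FR")):
    """Return marker lines and segment texts for the wanted languages.

    `first` gets the ten-dash marker, every language in `others` the
    three-dash marker; the line after a wanted tag is taken as the segment.
    """
    out = []
    take_next = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if take_next:
            out.append(SEG_TAGS.sub("", line))  # strip <seg> and </seg>
            take_next = False
            continue
        match = TUV.search(line)
        if match and match.group(1) == first:
            out.append("----------")
            take_next = True
        elif match and match.group(1) in others:
            out.append("---")
            take_next = True
    return out


if __name__ == "__main__":
    # TMX file names are given on the command line; UTF-16 input is assumed.
    for name in sys.argv[1:]:
        Path("txt").mkdir(exist_ok=True)
        with open(name, encoding="utf-16") as source:
            lines = tmx_to_text(source)
        target = Path("txt") / (Path(name).stem + "_de-fr-en.txt")
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
```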