GILT Ninjas

Ninja Power in Globalization, Internationalization, Localization and Translation

Use of metadata in conjunction with LLMs to improve MT output

During our time off this summer, I’ve been diving into upcoming trends in the LLM (Large Language Models) space and how they’re gradually approaching MT (Machine Translation) quality. However, one aspect that surprises me is the lack of discussion around using metadata in translation resources to enhance MT/LLM output.

As I see it, there are three key areas where LLMs can enhance traditional MT engines:

Key / String ID

Certain languages require translations to be adapted based on the UI component where the content is used. If the associated key provides this information, the right prompt can significantly improve translation accuracy.

For example, in Spanish the translation of a verb on a button or checkbox should generally use the infinitive form of the verb. However, if the verb is used in a description or in a full sentence, it should be conjugated.

Translating the content "Accept the cookies before proceeding" with MT will always produce the same translation: "Acepte las cookies antes de continuar". With LLMs, however, we can provide a prompt that takes this context into account.

Adding the key/string ID, like "Accept.cookies.body" = "Accept the cookies before proceeding", will generate the translation "Accept.cookies.body" = "Acepta las cookies antes de continuar". However, if the same content is given with a different key, "Accept.cookies.checkbox", then the translation will be "Aceptar las cookies antes de continuar".

An example of the prompt can be:

The strings I am sharing are for translation into Spanish and have an associated key. This key may contain information about where the content will be used in the UI. The format is ‘key’ = ‘Content to be translated’. 
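As a rough illustration of how this could be wired up, here is a minimal Python sketch, assuming the OpenAI Python SDK and a gpt-4o model (the helper translate_with_key and the exact prompt wording are our own illustration), that sends the key together with the content:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "The strings I am sharing are for translation into Spanish and have an "
    "associated key. This key may contain information about where the content "
    "will be used in the UI. The format is 'key' = 'Content to be translated'. "
    "Return only the translation, in the same format."
)

def translate_with_key(key, content, model="gpt-4o"):
    # Send the key together with the content so the model can adapt the
    # translation to the UI component (button, checkbox, body text...).
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f'"{key}" = "{content}"'},
        ],
    )
    return response.choices[0].message.content

# Same source string, different keys -> potentially different translations.
print(translate_with_key("Accept.cookies.body", "Accept the cookies before proceeding"))
print(translate_with_key("Accept.cookies.checkbox", "Accept the cookies before proceeding"))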

Placeholders 

Placeholders in strings can provide essential context that can dramatically change the required translation. If properly utilized, LLMs can improve upon MT translations by taking placeholders into account.

For instance, the correct translation of "Sent by {{1}}" can vary depending on whether the placeholder is a date or a name. A more descriptive placeholder name leads to better translation accuracy.

However, Machine Translation won’t take that into consideration and will provide the same translation no matter how descriptive the placeholder is. 

Let’s look at the previous example: 

If we translate "Sent by {{due_date_full_format}}" and "Sent by {{requester_name}}" using MT¹ we get "Enviado por {{due_date_full_format}}" and "Enviado por {{requester_name}}": the same rendering in both cases.

However, if using a prompt similar to:

I will provide you with a string to translate into Spanish, which may include a placeholder. The placeholder will be enclosed in double curly braces, {{ }}. Inside the placeholder, there may be information about the type of variable that will be used in the UI. 

Then we will obtain two different translations: “Enviado antes del {{due_date_full_format}}.” and “Enviado por {{requester_name}}.”
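Whichever engine produces the output, it is also worth checking programmatically that the placeholders themselves survive the translation untouched. A minimal sketch of such a check (the helper name is our own):

import re

PLACEHOLDER_PATTERN = re.compile(r"\{\{.*?\}\}")

def placeholders_preserved(source, translation):
    # The placeholders found in the source and in the translation must match
    # exactly (same names, same count); otherwise the string needs review.
    return sorted(PLACEHOLDER_PATTERN.findall(source)) == sorted(
        PLACEHOLDER_PATTERN.findall(translation)
    )

print(placeholders_preserved("Sent by {{requester_name}}",
                             "Enviado por {{requester_name}}"))  # True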

Instructions at segment level

In one of our previous posts we talked about how adding instructions at segment level can help linguists and translators produce better translations and avoid mistranslations, for example by establishing length limitations or glossary instructions.

If a word is a product name but there are also situations where that same word has to be translated, it is a very complex scenario for regular MT engines. For example:

"description_dialog_notary_enrollment": {

"String": "Notary ensures all your documents are authenticated and securely stored."

"Instructions": "Notary refers to our product. Please don’t translate that word."

}

"description_dialog_notary_enrollment": {

"String": "Notary ensures all your documents are authenticated and securely stored."

"Instructions": "This time Notary refers to the role. Please translate that word."

}

If we use MT to translate this content into Spanish, this is the translation we will get in both cases: “El notario garantiza que todos sus documentos estén autenticados y almacenados de forma segura.” As you can see, we would need a post-editing step to make sure that “Notary” does not get translated in the first sentence.

If we use a prompt similar to:

I am going to provide you with a string to translate into Spanish, and it will have instructions associated with it. The structure of the strings will be:

“Key”: {

“String”: “content to translate”

“Instructions”: “instructions on how to translate the content”

}

The translation obtained depends on the instructions provided.

For the first one (where Notary is a Do Not Translate term):

"description_dialog_notary_enrollment": {

"String": "Notary garantiza que todos tus documentos estén autenticados y almacenados de forma segura."

}

For the second one (where notary is to be translated):

"description_dialog_notary_enrollment": {

"String": "El notario garantiza que todos tus documentos estén autenticados y almacenados de forma segura."

}
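In practice, each segment can be serialized exactly as the prompt describes before it is sent. A minimal sketch, where json.dumps does the serialization and build_segment_payload is our own helper name:

import json

def build_segment_payload(key, string, instructions):
    # Serialize the segment exactly as the prompt describes, so the model can
    # apply the per-segment instructions (DNT terms, roles, length limits...).
    return json.dumps(
        {key: {"String": string, "Instructions": instructions}},
        ensure_ascii=False,
        indent=2,
    )

payload = build_segment_payload(
    "description_dialog_notary_enrollment",
    "Notary ensures all your documents are authenticated and securely stored.",
    "Notary refers to our product. Please don't translate that word.",
)
# `payload` is then sent as the user message, with the prompt above as the
# system message, in the same way as in the earlier sketch.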

If we additionally provide instructions to make the translation fit within a particular character length, we will still obtain the desired string. For example:

"description_dialog_notary_enrollment": {

"String": "Notary ensures all your documents are authenticated and securely stored."

"Instructions": "This time Notary refers to the role. Please translate that word. Keep in mind that the maximum string length is 90 characters."

}

Will return:

"description_dialog_notary_enrollment": {

"String": "El notario asegura que tus documentos estén autenticados y almacenados de forma segura."

}

The result is 87 characters, 3 under the 90-character limit.
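Because LLMs do not always respect length limits reliably, it is worth verifying the limit programmatically and re-prompting (or flagging the segment for review) when it is exceeded. A minimal sketch of that check:

MAX_LENGTH = 90  # comes from the instructions attached to the segment

def within_limit(translation, limit=MAX_LENGTH):
    # Simple safety net: if the model overshoots the limit, re-prompt it or
    # send the segment to human review instead of trusting the output.
    return len(translation) <= limit

candidate = ("El notario asegura que tus documentos estén autenticados "
             "y almacenados de forma segura.")
print(len(candidate), within_limit(candidate))  # 87 True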

Three key aspects to consider:

  1. Language Pair Quality: Not all LLMs are equally effective across language pairs. For specific cases where MT engines are well trained, consider using MT first, followed by an LLM check (a sketch of that flow follows this list).
  2. Human Review: Product content often has high visibility, so even with LLM improvements, a human review is still recommended. Even so, the process should be more efficient than traditional MT + human post-editing.
  3. Future Trends: The speed at which LLMs are improving suggests they may soon outperform traditional MT engines, especially when trained with glossaries and style guides.
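As a rough sketch of the MT-first, LLM-check flow mentioned in point 1, assuming the same OpenAI SDK and gpt-4o model as in the earlier sketches (the MT draft comes from whichever engine is already in place, and the prompt wording is our own illustration):

from openai import OpenAI

client = OpenAI()

REVIEW_PROMPT = (
    "You will receive an English source string, its metadata (key, "
    "placeholders, instructions) and a Spanish machine translation. Correct "
    "the translation only where the metadata requires it; otherwise return "
    "it unchanged."
)

def llm_check(source, metadata, mt_draft, model="gpt-4o"):
    # Takes an existing MT draft and asks the LLM to fix only what the
    # metadata (key, placeholders, instructions) calls for.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": (
                f"Source: {source}\nMetadata: {metadata}\nMT draft: {mt_draft}"
            )},
        ],
    )
    return response.choices[0].message.content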

Thank you very much for taking the time to read this. I hope you found it informative. As you’ve seen today, using LLMs for translation and review can provide many more advantages if we focus not just on the content itself but also on the metadata around it. Take a look at our next post if you are interested in learning how to automate this process using a JSON script.

  1. It seems that Google Translate has made some improvements: depending on the string I provided, it returned a translation in accordance with the content of the placeholder. However, other providers like DeepL still do not take placeholders into consideration. ↩︎

3 responses to “Use of metadata in conjunction with LLMs to improve MT output”

  1. Mar

    Hello,

    I found the article very interesting, but I have two questions:
    – did you actually test providing these prompts to an LLM, or are your comments based on expected output?
    – if you did test the prompts, which LLM did you use?

    Thank you!

    1. Carlos Barbero-Cortés

      Hello Mar! Thank you for your questions! Please see my answers below:
      1. Yes, I actually tested these prompts, and they worked quite well! I didn’t do very deep testing, though.
      2. I used ChatGPT 4o.

      Please let us know if you have more questions, and don’t forget to follow us here or on LinkedIn!

    2. Carlos Barbero-Cortés

      Hello Mar, if you are interested in automating the process, you can take a look at our next post: https://gilt-ninjas.com/2024/12/01/automating-translations-using-json-scripts/
