Those who have a passing knowledge of Danish can attest to just how wrong Google Translate can get it at times. Sure, the technology is great for turning text in a foreign language into something understandable, but the output is often filled with mistakes that only someone familiar with the language in question can catch.
According to one University of Copenhagen researcher, translation technology has a built-in bias against smaller languages like Danish. The problem, according to associate professor Anders Søgaard, is that there just isn’t enough Danish text to feed into the computers.
“Most machine translations use statistics software to find correlations,” Søgaard told The Local. “The problem with that approach is that most of the translated data available comes from very specific types of text. There are a lot of news articles and political text available, for example, but there are a lot of other text types that haven’t been translated in large volumes.”
Søgaard said that translation technology’s weaknesses are more evident when applied to social media.
“We don’t have a lot of human translated Danish tweets, so the machines are just not as good at translating tweets as they are at translating something like parliamentary transcripts,” he said.
On social media, people write in a more informal manner that is in sharp contrast to official transcripts and news articles.
“Twitter is interesting for a lot of different reasons. You have spelling variations due to space constraints and you see a lot of dialect and slang,” Søgaard said. “This variation is challenging for translation technology.”
Errors often occur when a user enters a word that Google Translate has not encountered before.
“There are basically two types of errors: the first is when you have words the machines haven’t seen before and the second has to do with more structural problems,” Søgaard said. “When you’re translating from English to French, that is relatively straightforward, but if you want to translate from English to Japanese, that’s a lot more complicated.”
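The first error type, words the system has not seen before, can be illustrated with a short sketch. This is not how Google Translate works internally; the function name `unseen_words` and the sample sentences are made up for illustration, and the sketch simply checks which words of a new sentence are missing from a small set of training sentences.

```python
def unseen_words(training_sentences, new_sentence):
    """Return the words in new_sentence that never occur in the training
    sentences, together with the share of the sentence they make up."""
    # Build the vocabulary from whitespace-separated, lowercased words.
    vocabulary = {
        word
        for sentence in training_sentences
        for word in sentence.lower().split()
    }
    tokens = new_sentence.lower().split()
    unseen = [token for token in tokens if token not in vocabulary]
    rate = len(unseen) / len(tokens) if tokens else 0.0
    return unseen, rate


# Hypothetical training data in a formal register.
formal_text = [
    "Folketinget vedtog loven i dag",
    "ministeren talte i Folketinget",
]

# A hypothetical informal message.
unseen, rate = unseen_words(formal_text, "hva sker der i dag")
print(unseen, rate)
```

The more informal the input, the larger the share of words a system trained mostly on formal text will never have seen.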
The relatively low number of Danish speakers on a global scale means that the available translation technology simply hasn’t seen enough text to be as good at translating a smaller language like Danish as it is at translating more widely spoken languages like English and French.
“There is of course a pretty straightforward correlation between the number of speakers of a language and the quality of technology like Google Translate,” Søgaard said. “But as Google Translate gets more and more data by the hour, you start to see fewer errors.”
Søgaard said that it could lead to “a major democratic problem” if technology doesn’t keep up with smaller languages like Danish.
“If the technology only supports English, those of us who speak several languages will pick the language for which the technology is better. Thus, the lesser-spoken languages will have less technology geared toward them.”
In his thesis, Learning Linguistics Models under Bias, Søgaard tested different statistical methods for improving translation technology’s handling of Danish. One method gave higher statistical weight to text in standard news articles that closely resembled the informal language of social media. Another purposely ‘corrupted’ articles by introducing spelling and grammar mistakes.
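The two ideas described above can be sketched roughly in code. This is a minimal illustration under stated assumptions, not the method from the thesis: the character-trigram cosine similarity, the function names `similarity_weights` and `corrupt`, the three kinds of spelling noise, and the sample sentences are all hypothetical choices made for the example.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_weights(news_sentences, informal_sentences):
    """Weight each news sentence by how closely it resembles the informal sample."""
    informal_profile = Counter()
    for sentence in informal_sentences:
        informal_profile.update(char_ngrams(sentence))
    return [cosine(char_ngrams(s), informal_profile) for s in news_sentences]


def corrupt(sentence, rate, rng):
    """Introduce spelling noise: drop, double, or swap letters at the given rate."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("drop", "double", "swap"))
            if op == "drop":
                i += 1
                continue
            if op == "double":
                out.extend((ch, ch))
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.extend((chars[i + 1], ch))
                i += 2
                continue
        # No noise applied (or a swap requested on the last character).
        out.append(ch)
        i += 1
    return "".join(out)


# Hypothetical data: formal news-style text and an informal sample.
news = ["ses vi i aften", "Folketinget vedtog finansloven"]
informal = ["hva så, ses i aften lol", "ses i aften"]

weights = similarity_weights(news, informal)
noisy = [corrupt(sentence, 0.1, random.Random(0)) for sentence in news]
print(weights)
print(noisy)
```

In a real system the weights would feed into the training of the statistical model, and the corrupted sentences would be added to the training text alongside the originals.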
If you can understand Danish (without Google Translate, that is), you can see Søgaard explain his methods in this video.