Looking for automatic language identification/detection/recognition tool? (General technical issues)

Technical forums » General technical issues »
Looking for automatic language identification/detection/recognition tool?
Track this topic

Looking for automatic language identification/detection/recognition tool?

Thread poster: Michael Beijer

Michael Beijer

United Kingdom
Local time: 21:59
Member (2009)
Dutch to English
+ ...

Sep 17, 2021

OK, I have a bit of a difficult question. I am looking for an easy way to automatically detect, and mark in some way, lines in a text file. An example might help:

Let's say I have the following 4 lines:

AU = allergy unit(s) AUB = Abnormale Uterine Bloeding AUC = area under the ROC (receiver operating characteristic) curve AUC = area under curve

I am looking for a tool (or set of tools) that would allow me to end up with something like:

AU = allergy unit(s) = ENGLISH AUB = Abnormale Uterine Bloeding = DUTCH AUC = area under the ROC (receiver operating characteristic) curve = ENGLISH AUC = area under curve = ENGLISH

In case you're wondering what I'm up to, I have a large amount of messy glossaries, where the entries switch from language to language at random, and would like to sort the various languages into piles, so to speak.

Michael ▲ Collapse

Jean Lachaud

United States
Local time: 16:59
English to French
+ ...

Try MS Office

Sep 17, 2021

If it were me, I'd try Word's (Maybe it also works in Excel) Automatic Language Recognition, but I would separate target and source terms, maybe in columns.
Worth trying on a few samples, then using global replace to assign specific font, or color, to each language.

But, unless there are more than, say 1500 terms to sort, I'd do it manually. Experience has proven that going the trial and error macro or global change methods usually requires more time than sorting manually, for... See more

Jennifer Levey

Chile
Local time: 16:59
Spanish to English
+ ...

The task is not trivial...

Sep 18, 2021

This link will perhaps give you an idea of what you’re up against here: https://stackoverflow.com/questions/5579974/language-detection-for-very-short-text

I have a simple home-brew language detection app, based on MS file formats, which can detect seven languages with a reliability of around 90% in texts of flowing prose of more than 20 words. While programming this, I ran up against a number of problems – all of which will be exacerbated in your case because your strings are very short.

One potential solution (not implemented in my app) would involve an analysis of digrams or trigrams in the lines of text, for comparison with a representation of the frequency of occurrence of those di- or trigrams in a known sub-set of potential languages. I wouldn't expect this to be very reliable when working with English and Dutch, owing to the similarity of spelling in those languages.

Another approach might involve a word-by-word test against a batch of spell-check dictionaries (such as those available in Word or from Hunspell). However this will again be unreliable if the possible languages are rather similar, or if the strings contain many specialist words that are not in standard spell-check dictionaries. Some of the problems might be mitigated by an intelligent system that learns from corrective user intervention, or through the use of a large multilingual corpus of matched texts in the appropriate languages alongside the basic dictionaries.

The bottom line is: This is not a trivial task. If you're not a reasonably proficient programmer, you'd do well to abstain - and sort out your files manually. ▲ Collapse

Michael Beijer

United Kingdom
Local time: 21:59
Member (2009)
Dutch to English
+ ...

TOPIC STARTER

Thanks guys!

Sep 18, 2021

Yes, I was afraid I'd hear something like this. It's basically just a hobby project so I'll probably just have to do it manually, if and when I have a bit of time to waste.

Michael

Kay-Viktor Stegemann
Germany
Local time: 22:59
English to German

In memoriam

There are APIs for this

Sep 18, 2021

You could try to integrate web services that do this, for example https://languagelayer.com/

But of course the accuracy of this kind of APIs will be below 100% and might be particularly low for specialized content. Not sure if the result would be worth the effort.

Jennifer Levey

Chile
Local time: 16:59
Spanish to English
+ ...

languagelayer

Sep 18, 2021

A quick test of languagelayer using the four sample expressions listed by Michael gives:

AU = allergy unit(s) --> ITALIAN (100%)
AUB = Abnormale Uterine Bloeding --> NORWEGIAN (100%)
AUC = area under the ROC (receiver operating characteristic) curve --> English (100%)
AUC = area under curve --> English (100%) / ROMANIAN (98.27%) / LOGODURESE SARDINIAN (91.07%) / GALICIAN (90.07%)

Oops! Back to the drawing-board...

Philippe Locquet

Portugal
Local time: 21:59
English to French
+ ...

Human wins

Sep 22, 2021

Michael Beijer wrote:

Yes, I was afraid I'd hear something like this. It's basically just a hobby project so I'll probably just have to do it manually, if and when I have a bit of time to waste.

Michael

Hello Michael,
This kind of tasks is very hard to automate, especially for single-words or two-words entries an automated process will get so much of this wrong that the time fixing it won’t be worth it. Even big tech don’t bother do that kind of cleaning in data, they add a layer to their models to fix output, that tells you about the confidence in it being effective…
So human is the way to go. DIY it if that’s a hobby project as you said. For anyone else in the same situation, I would suggest outsourcing that job to someone they trust with their data. Sure someone will be happy to get to work on something like this.
All the best

Login to reply/comment

To report site rules violations or get help, contact a site moderator:

Moderator(s) of this forum
Laureana Pavon	[Call to this topic]

You can also contact site staff by submitting a support request »

Looking for automatic language identification/detection/recognition tool?

Forum rules

Help and orientation

Anycount & Translation Office 3000
Translation Office 3000 Translation Office 3000 is an advanced accounting tool for freelance translators and small agencies. TO3000 easily and seamlessly integrates with the business life of professional freelance translators. More info »

Protemos translation business management system
Create your account in minutes, and start working! 3-month trial for agencies, and free for freelancers! The system lets you keep client/vendor database, with contacts and rates, manage projects and assign jobs to vendors, issue invoices, track payments, store and manage project files, generate business reports on turnover profit per client/manager etc. More info »

Recent posts | FAQ | Rules | Moderators | Article knowledgebase

Your current localization setting

English

Select a language

More languages...

Looking for automatic language identification/detection/recognition tool?

Looking for automatic language identification/detection/recognition tool?

You have native languages that can be verified

Your current localization setting

Select a language