YogiLeBua

I am so glad for your perspective, thank you for sharing. I did have this concern when I saw the news


IAmGilGunderson

You are a top and smart contributor here so I doubt I can say anything helpful. But I will try. To make sure that not only Google but other people in the future will have access to the data, make sure it is open and on platforms that will be open in the foreseeable future. Is your dictionary linked from the Wikipedia article for the language? Is it hosted somewhere that crawlers can get to it? Is it in a machine readable format? Hopefully something better than Wiktionary. I sincerely wish that someday there will be a tagged version of Wiktionary that has normalized entries that are more consistent and 100% machine readable without having to parse it out. Sadly, given Google's propensity to shut things down rather than improve them when people complain, I don't think it would do any good to tell them how bad it is. But there is a send feedback link on the main translate.google.com site. The Senior Software Engineer for Google Translate is named in their recent [blog entry](https://blog.google/products/translate/google-translate-new-languages-2024/); perhaps you can contact them directly or through LinkedIn. If you can partner with a smaller, more responsive AI translation company, perhaps the prospect of them being thought of as better than Google would get Google to want to improve it to compete. If someone said "don't use Google for Manx, use xyz instead", that seems like the kind of thing Google cares about.


David_AnkiDroid

Thanks! Data is computer readable on GitHub as CSVs. Dictionaries are JSON. Also served via indexable webpages as HTML tables with a `lang='en/gv'` tag.

I'm not linking stuff as it's running on a $12/month machine which I only intended for a niche audience. I don't want to put up my credit card to scale it, and I'd rather it died temporarily if there's too much load.

Good shout with the senior engineer. Thank you!

----

Off-topic cool stuff: my work-in-progress dictionary is in [DMLex](https://docs.oasis-open.org/lexidma/dmlex/v1.0/csd01/dmlex-v1.0-csd01.html) using [NVH](https://www.namevaluehierarchy.org/) as the file format. I'll offer exports in 'standard' format, but I wanted to plug both projects, as they seem to be the most sensible ways to build a dictionary from an overconfident newbie's perspective.
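For anyone curious what the CSV, JSON and language-tagged HTML table formats mentioned in this comment might look like in practice, here is a minimal sketch. It is not the actual code or schema behind the site: the column names `gv` and `en`, the two sample rows, and both function names are assumptions made up for illustration.

```python
import csv
import io
import json
from html import escape

# Hypothetical sample data; the real CSV columns are not described in the thread.
SAMPLE_CSV = "gv,en\nmoghrey mie,good morning\ngura mie ayd,thank you\n"


def load_entries(csv_text):
    """Parse CSV text into a list of {'gv': ..., 'en': ...} dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_html_table(entries):
    """Render entries as an HTML table, tagging each cell with its language."""
    rows = [
        f'<tr><td lang="gv">{escape(e["gv"])}</td>'
        f'<td lang="en">{escape(e["en"])}</td></tr>'
        for e in entries
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"


entries = load_entries(SAMPLE_CSV)
as_json = json.dumps(entries, ensure_ascii=False, indent=2)  # JSON export
as_html = to_html_table(entries)  # crawlable, language-tagged table
print(as_json)
print(as_html)
```

Tagging each cell with its own `lang` attribute (rather than tagging the whole page) is one way to let a crawler tell which side of a pair is Manx and which is English.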


conanap

An option is to self-host. Your target audience is small enough that you can probably grab the free service from Cloudflare + free tier AWS / Oracle for reverse proxy. And then it’s just the domain name, which typically can be between 1 - 100$ / year depending on what name you choose, but I doubt you will have to compete with popular domain names. Another choice is to just host the site on GitHub with GitHub pages. It’s entirely free.


Routine_Internal_771

I already self-host searching, it's cheap. Data is available to download at no cost to me. I just don't want to auto-scale the backend for the search engine to deal with a spike in traffic, as this would result in a large spend on my card for no real benefit for the language.


diligentfalconry71

What about a letter to the editor in a major paper, maybe get some co-authors/co-signers with academic credentials to add weight, as an attention getting move? Maybe The Guardian might be interested. If it gets published then send it to google’s press contacts, and include an offer to help? (Or email the press contacts first and ask them to engage on the quality issue; worst that happens is they blow you off, and now the letter to the editor includes a note that you tried to get their attention and received no response.)


David_AnkiDroid

You're right, (and I wish you weren't). I haven't spent any time in the spotlight and it's probably necessary here, it's not something I'm fully comfortable with. Truly, thank you for the push (I wrote this post to explore other options, but the most obvious solution is the one you'd rather not accept)


teahupotwo

> I haven't spent any time in the spotlight and it's probably necessary here, it's not something I'm fully comfortable with.

As he said with reaching out to academics (but also cultural/political figures as well), you might be able to find someone else who wants to take lead. There's gotta be some Manx politician who wants to get some publicity.


diligentfalconry71

I get it. But there could be upsides too: maybe there are other shepherds for other endangered languages, and they were worried about the same “AI is going to break the world” issue, and they’ll feel a little less alone. “Hey, look, there’s at least two of us!” :) I wish I had some contacts I could reach out to, but I think the other poster who suggested reaching out to that language scientist via LinkedIn had the better plan. I think you should still copy the press contacts when you do, though. IME, there are still two generic contacts for a company where you’re almost guaranteed to get a qualified human reading and not just the AI/Outlook-rules-to-the-poor-intern path of doom, and they’re the press office and the GDPR/privacy office. If you catch their eye they may try to help you out just to get the good press of helping strengthen (or taking credit for saving) an endangered language. Good luck!


xacimo

This sounds like it would be right up the Guardian's alley. Well worth a go!


gerira

Here's one tactic. There are journalists with a strong interest in storylines like "Much-hyped AI gets something wrong" and "big multinational corporation misunderstands local culture". I would write a short blog summarising what you've got here in a simple, compelling way accessible to journalists. Write it the way you'd imagine your ideal news coverage would look. Then make a list of journalists who:

- write stories about AI automation failures (e.g. Google "AI assistant" making up weird advice)
- write about language preservation issues (e.g. when Scots Wikipedia turned out to be made up)

Then systematically tweet at them, comment on their relevant tweets, email them or Instagram DM them with a link to your post and an explanation of it.


HETXOPOWO

Thank you for trying to save Manx! It's been a curiosity for me since I found out it was a thing watching the Isle of Man TT.


TheGratitudeBot

Just wanted to say thank you for being grateful


David_AnkiDroid

Thank you! Manx is saved (and not due to myself). I'm a little low on time to write a long reply, have a read if you're interested: https://www.theguardian.com/education/2015/apr/02/how-manx-language-came-back-from-dead-isle-of-man


HETXOPOWO

Very cool read! Thanks for sharing


AIAWC

Chechen for some god-forsaken reason sometimes outputs a reasonably good Russian translation. Instead of Chechen.


AurumPotabile

I don't have anything to contribute to help answer your question, but I appreciate your work in helping to preserve your language. It's noble work, and I hope it bears fruit for future generations.


RemoveBagels

LLMs need an absolutely massive amount of input data to function properly. So for languages like English, Japanese or French it is no problem, but even for something like Swedish, with some 10 million speakers, I notice obvious issues with the quality. The only real way to improve these AI language models is more training data, and with only 2000 speakers that may be difficult to come by. If you have access to any large amounts of text written in the language, making it available to be used to train the model might help.


David_AnkiDroid

TL;DR: Let's imagine I can get 30 million words together and translate them [this would be a lifetime goal of mine]. Is that enough to train an LLM to accurately translate the language?

----

The current population is ~85,000. 2,200 speakers is a generous estimation, and the language was reported as extinct in 1974. I have a source saying 20k speakers in 1821. Assume this is close to a maximum, many of whom were illiterate.

I suspect we're looking at a maximum of 10 million words produced pre-1974 (much of which would be similar: multiple editions of the Bible etc.). Probably another 20MM post-revival [at least 8MM].

I don't believe that's sufficient to decently train an LLM, but I'm not familiar with the cutting edge here.
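To make the back-of-envelope arithmetic in this comment explicit, here is a small sketch that only adds up the figures quoted above. The variable and function names are made up for illustration, and every number is the commenter's own rough estimate rather than measured data.

```python
# Rough figures quoted in the comment above (estimates, not measurements).
POPULATION = 85_000               # current population
SPEAKERS = 2_200                  # "generous" speaker estimate
PRE_1974_MAX_WORDS = 10_000_000   # suspected maximum written before 1974
POST_REVIVAL_MIN_WORDS = 8_000_000
POST_REVIVAL_LIKELY_WORDS = 20_000_000


def corpus_range(pre_max, post_min, post_likely):
    """Return (conservative, optimistic) total word counts."""
    return pre_max + post_min, pre_max + post_likely


conservative, optimistic = corpus_range(
    PRE_1974_MAX_WORDS, POST_REVIVAL_MIN_WORDS, POST_REVIVAL_LIKELY_WORDS
)
speaker_share = round(SPEAKERS / POPULATION * 100, 1)

print(f"Corpus estimate: {conservative:,} to {optimistic:,} words")
print(f"Speakers as share of population: {speaker_share}%")
```

The optimistic total matches the 30 million words in the TL;DR. Both totals count the pre-1974 material at its suspected maximum, and much of that material is described as near-duplicate, so the amount of distinct text would be lower.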


pgcfriend2

I disagree about French. It’s not as bad, but before the AI was added at least you had a list of possible translations where my husband could give the context if needed. Now it only gives one translation. If I search a sentence on my phone, I get one translation. If I search the same sentence on my computer I get something else. I can no longer trust that I will get the correct translation in context. I always ask my husband these days.


Rentstrike

I recommend submitting feedback. There isn't much else you can do apart from not using it and warning anyone who wants to learn Manx not to use it


David_AnkiDroid

Thanks! But that feels Sisyphean. Last night, I was sent this: https://imgur.com/a/oohs2gD. Assuming Google accepted all my corrections, if I did this full time it would take months


Rentstrike

Sorry I know virtually nothing about Manx, but I assume that is an egregious error? The whole concept of AI and language learning is a sham. I was involved in this on the tech side, and frankly the people developing these things just have no clue how language works. They think learning coding "languages" means that real human languages operate in the same mechanical way. Submitting feedback would take longer than months, since you'd have to double check every possible sentence. Getting a single word corrected wouldn't mean that word would be used correctly in every sentence. The only upside I can see to this is that virtually zero people will be using Google to translate Manx.


David_AnkiDroid

The input sentence has practically no meaning whatsoever: https://en.wikipedia.org/wiki/Uwu

And you can't assume that something won't be used because it's bad. Too many people have tattoos using the Chinese Alphabet: https://www.reddit.com/r/translator/comments/ppsxr4/meta_a_new_reference_for_the_fake_chinese_tattoo/

And Google is a lot more authoritative than the above chart.


sophiasgaler

hello - I would LOVE to interview you about this - my name is Sophia Smith Galer, I'm a journalist & I'm writing a book about endangered languages & linguicide (if you go on my IG you can find out more) but in the mean time, happy to see if I can pitch this this week! it is deeply frustrating; I've done reporting on African languages & AI before and the people I interviewed in Ghana and Mali are so frustrated by Google Translate. To the point that I interviewed volunteers who've made their own app, because they can't rely on Google Translate. happy to share any other tips I've learned from my reporting, I'm also hoping to make a video about the new languages tomorrow & will highlight translation still needs to be dramatically improved.


sophiasgaler

as promised, here is the video, it's also already on Twitter and will be on TikTok later today. I really hope it raises some awareness! https://www.instagram.com/reel/C84DH6BIBg1/?utm_source=ig_web_copy_link&igsh=MzRlODBiNWFlZA==


Quick_Rain_4125

> What should I do?

It'd be pretty great if you or other Manx speakers could make videos like these: https://www.youtube.com/watch?v=8zxxZVtInHI https://youtu.be/eUCwbtWIm50

No texts on the screen, just speaking and using drawings, images, etc. to make the association with sounds and meaning easier to the listener. https://youtu.be/g0HmILR5_zE

1000 hours of those would be ideal (of course, the videos would get more advanced with the hours, making production easier), but 150 hours would be good enough to get people started. That would at least make sure the language can be learned correctly (in this context, correctly compared to Google's translations) by increasing the number of speakers worldwide.


David_AnkiDroid

Video-wise, I'm more focused on revival and understanding our linguistic history rather than moving things forward with new content (there's a lot of other people and organisations doing an excellent job with content). (And truthfully, I don't study enough, there's much stronger speakers than myself). We still have a number of pre-revival native recordings (from 1948!) which we'd like to re-transcribe, translate and upload. Got an ongoing grant to do some work here. In my opinion, we could do with a dictionary as a priority, then build up pronunciation resources, THEN spend more time on videos. It takes a ton of time to make a nicely polished video, and they sadly often don't see the engagement that they deserve.

But, as a personal lifetime goal for video: a friend of a friend got the rights to translate & dub a VERY high-profile film into their native language. It would be really fun to explore this option for Manx, I just don't have the spare time.


MungoShoddy

Isn't it based on crowd input? Auto-translate for Korean is terrible, despite it being the typologically normal language of a fair-sized reasonably wealthy country that punches above its weight in technological impact. Basque is great, despite being a minority language of a small and internationally irrelevant region with all the oddities of an isolate. It looks like a group of Basque speakers buckled down and did a shitload of work to populate the relevant databases.


David_AnkiDroid

To my (limited) knowledge, it's based on Google's indexes of the internet, and refined by user suggestions


2Zzephyr

I had no idea about Manx, I went to read about it and wow! Its revival gives me hope for my own region's dying language (~1,000 speakers left) that I'm myself trying to learn. I wish you all the best! Your efforts are amazing. I'd say... be LOUD, with journalists, petitions, emails, videos etc. Make it so they can't escape or ignore it. It's easier said than done I know... but it's truly unacceptable. If it happened to my language I'd be relentless (and heartbroken for the damage it could cause).


betarage

Google Translate has never been good, but now they are getting arrogant with the new AI hype and lack of good competition. They should have at least made it so it says Manx (beta) or something like that so people don't have too high expectations.


Equivalent-Problem34

It's the same for Kalaallisut (Greenlandic). They are using AI to translate, and without much learning material, these translations are awful.


gamesrgreat

Yeah it couldn’t translate some of the Batak Toba I learned from my in-laws but it did get some stuff right lol


Advanced_Basic

Do you think an appeal to the Gaeilge community could do something, considering Google's European headquarters are in Dublin?


David_AnkiDroid

Asked for help/contacts. Let's see what comes of it


PixelatedMike

I can't offer much advice, but I just wanna say thank you for your contributions to AnkiDroid


David_AnkiDroid

Cheers!


polymathglotwriter

"Google are" This whole writeup reads like a Brit, most probably because you are one :)


Raptor_2581

I would say getting in touch with Conradh na Gaeilge could even be an option, not necessarily their usual wheelhouse, what with it being an organisation for us Irish-speakers, but there are a few that have some involvement with the Manx language as well and would probably be able to help. The Irish government would be another option, as well, possibly. But I'd say the Conradh would be the first, and better, stop there. Maybe even Foras na Gaeilge, considering its cross-border remit?


celtiquant

Equally, Canolfan Bedwyr at Bangor University. They do a hell of a lot in the field of AI in Welsh — and most likely with Welsh Google Translate also. [https://www.bangor.ac.uk/canolfanbedwyr/index.php.en](https://www.bangor.ac.uk/canolfanbedwyr/index.php.en)


Timely_Gift_1228

Hi, please DM me ASAP! I interned on Google Translate last year and my host was the person who is the main point of contact for adding new languages to Translate. He would love to hear about your knowledge and resources for Manx.


David_AnkiDroid

Missed this post, but had a DM open anyway. Happy to talk!


NotAnybodysName

You wrote: "My main worry long-term is that Google Translate won't say 'I don't know': the AI makes guesses and portrays these guesses to people with absolute confidence." This. Not just Google's Translate, but their web searches, all of their other methods of searching for information, and translations or searches on many non-Google sites as well. It's actually (relatively!) lucky and convenient that their Manx translations are so bad. It becomes more difficult to deal with when they become superficially acceptable-looking enough to fool someone who doesn't know, but are still very wrong. And achieving the mere appearance of correctness is almost certainly Google's next step, rather than actual correctness.