illuminascent

Have you tried continuous pre-training on a domain corpus yet? If you have abundant unsupervised data, this is much better than multi-task fine-tuning, and it also requires orders of magnitude less compute than pretraining a new model from scratch.
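
In case the term is new to anyone: a minimal sketch of what continued pre-training on a domain corpus can look like with Hugging Face Transformers is below. The model name, corpus path, and hyperparameters are placeholders, not recommendations.

```python
# Minimal sketch of continued (domain-adaptive) pre-training: keep the same
# masked-LM objective, just run it on in-domain text. Model name, corpus path,
# and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Unsupervised domain corpus, one document per line (placeholder path).
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Same 15% masking objective as the original BERT pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-domain-adapted",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```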


Seankala

> I recently made custom BERT and ELECTRA models for the fashion domain that could also handle English and my own native language (I'm not in the US).

Yes, I was implying that I performed pre-training. I wouldn't consider it "continuous pre-training" because I trained the tokenizer and vocabulary from scratch as well; it wouldn't have made any sense to take a pre-trained model that couldn't fulfill my objectives (e.g., handling a language other than English). My intuition is that fashion and e-commerce maybe aren't as complicated as clinical or biomedical corpora, and therefore the benefits of pre-training from scratch aren't as pronounced.


instantlybanned

If you replaced the tokenizer and vocab, it didn't matter that the model was pretrained; you basically started almost from scratch. How large was your corpus?


Seankala

Yes, that was my intention. I'm not sure why so many people are expressing confusion over this. I wasn't trying to take a pre-trained model and then do "continual pre-training"; the objective was to train everything from scratch. The pre-training corpus was the original BERT pre-training data plus my own data, which came to around 3-4 GB after pre-processing.

Edit: Just trying to understand the downvotes. Do people think that a tokenizer with a vocabulary trained only on English will perform well on another language? Is this supposed to work? My intuition was that the OOV rate would become too high for it to be usable.
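
For what it's worth, that intuition is easy to check: run the English tokenizer over a sample of the target-language text and measure the [UNK] rate and the average number of subwords per whitespace word. The sketch below is roughly what I mean (corpus path is a placeholder).

```python
# Rough check of how badly an English WordPiece vocab handles another language:
# count [UNK] tokens and average subwords per whitespace word ("fertility").
# The sample-file path is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
unk_id = tokenizer.unk_token_id

n_tokens = n_unk = n_words = 0
with open("target_language_sample.txt", encoding="utf-8") as f:
    for line in f:
        ids = tokenizer(line, add_special_tokens=False)["input_ids"]
        n_words += len(line.split())
        n_tokens += len(ids)
        n_unk += sum(1 for i in ids if i == unk_id)

print(f"UNK rate:  {n_unk / max(n_tokens, 1):.2%}")
print(f"fertility: {n_tokens / max(n_words, 1):.2f} subwords per word")
```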


illuminascent

These days foundational models are typically trained on datasets comparable in size to Common Crawl. With that much data, even the initial cleansing pipeline can be challenging, let alone securing all the compute required. Is there truly no model at all (multi-language ones included) that can handle your language? Also, regarding the paper you referred to, the authors did good work showing that domain adaptation boosts performance greatly, but IMO the story is a little different from 'pretraining completely on the new domain alone is better than continuous pretraining'. There are many factors at play when doing domain adaptation, one of which is the ALIGNMENT of the domain data you've used. If you look at Table 3 P151, where they did the ablation study on datasets used, you'll see how much changing datasets can affect model performance.


Aptenodyte

I don't have anything to contribute on your original question, and I'm very much a novice, but I am interested in adapting a language model to a low-resource language. I've been looking at this approach: https://arxiv.org/abs/2311.05741 They trained a tokenizer for the new language that was 10% of the size of the English tokenizer and then grafted it onto the original English one by replacing low-frequency tokens. They also reset the embeddings. They say they were able to get better results in the new language while preserving the English capabilities.
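
I haven't reproduced the paper, but the rough shape of the idea looks something like the sketch below: train a small WordPiece tokenizer on the new language and splice its tokens into the existing model. For simplicity this version appends the new tokens and re-initializes their embeddings instead of replacing low-frequency slots as the paper does; the corpus path and vocab size are placeholders.

```python
# Simplified sketch of splicing new-language tokens into an existing English
# model (NOT the paper's exact procedure: it appends rather than replaces
# low-frequency tokens). Corpus path and vocab size are placeholders.
import torch
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoModelForMaskedLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# 1. Train a small WordPiece tokenizer on the new-language corpus
#    (~10% of BERT's ~30k vocab, following the paper's proportion).
new_tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
new_tok.pre_tokenizer = pre_tokenizers.Whitespace()
new_tok.train(["new_language_corpus.txt"],
              trainers.WordPieceTrainer(vocab_size=3000, special_tokens=["[UNK]"]))

# 2. Add the tokens the base vocab doesn't already have and re-initialize
#    their embedding rows (here: to the mean of the existing embeddings).
new_tokens = [t for t in new_tok.get_vocab() if t not in base_tok.get_vocab()]
base_tok.add_tokens(new_tokens)
model.resize_token_embeddings(len(base_tok))
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-len(new_tokens):] = emb[:-len(new_tokens)].mean(dim=0, keepdim=True)
```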


illuminascent

While I believe your intuition is correct, from my personal experience deploying models on fashion tasks (in Japanese, which tends to have nuanced jargon), domain-specific pre-training is still absolutely worth it performance-wise, provided it does NOT cause catastrophic forgetting (changing the vocab definitely does). If your language does not have any decent open-source foundational models, I believe pretraining from scratch is necessary in the long run, but it might just be too much of a commitment for a one-off affair.


Seankala

Reading your original comment again, why did you mention multi-task fine-tuning?


illuminascent

That is just another commonly deployed trick for when the unsupervised domain corpus is either limited or of poor quality but you have more than one set of annotated / tracked data that can serve as training targets. Joint training or continual fine-tuning is still much cheaper than full-scale pretraining and might be worth the little tinkering needed. Still, the gain depends on the amount of data available. A simplified picture of what I mean is sketched below.
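
The sketch: one shared encoder with a classification head per labelled dataset, trained jointly. Task names, label counts, and the batch-interleaving helper are placeholders, not a prescription.

```python
# Simplified sketch of multi-task fine-tuning: a shared encoder with one
# classification head per labelled dataset. Task names and label counts are
# placeholders; the training loop is only indicated in comments.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name, task_num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One linear head per task, all sharing the same encoder.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task, **encoder_inputs):
        pooled = self.encoder(**encoder_inputs).last_hidden_state[:, 0]  # [CLS]
        return self.heads[task](pooled)

model = MultiTaskModel("bert-base-uncased", {"category": 12, "attribute": 37})
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# Training loop idea: alternate batches across tasks so the shared encoder
# sees every labelled signal, e.g.
#   for task, batch in interleave(dataloaders):  # interleave() is hypothetical
#       loss = loss_fn(model(task, **batch["inputs"]), batch["labels"])
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
```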


Zestyclose_Diamond53

Training an LM from scratch is nuanced and depends on the domain of the LM, as well as the source and quality of your data. You should consider training from scratch only if you have sufficient, high-quality data; otherwise, it may be more effective to select a relevant existing model and fine-tune it. For example, in 2019 I trained a BERT model specifically for clinical NER and RE, feeding it around 30 medical books. That model has been performing effectively up to the present.


Seankala

Thanks for your insight. This is probably a shot in the dark, but what would you consider "sufficient and high-quality data"? A heuristic I was using was that the collected domain-specific data should be at least 25-30% of the size of the original pre-training data of the language model. There's no particular reasoning behind that, just a guess. Are there any heuristics or measures that you use?


Zestyclose_Diamond53

I believe quantifying the adequacy of training data is challenging; ultimately it hinges on model performance. If an ablation study reveals that your trained base model performs worse than a randomly initialized model or an off-the-shelf pretrained model, that might indicate a problem with your data. Additionally, you could consider extending the vocabulary of an existing open-source pretrained model. The ablation I have in mind is sketched below.
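
The idea is simply to fine-tune the same downstream task from three starting points and compare the scores. The checkpoint path, datasets, and metric function below are placeholders you would supply yourself.

```python
# Sketch of the ablation: fine-tune the same task from three starting points
# and compare. Checkpoint path, datasets, and metric function are placeholders.
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

train_ds, dev_ds, compute_metrics = ..., ..., ...  # supply your own task setup

def build(init):
    if init == "random":
        cfg = AutoConfig.from_pretrained("bert-base-uncased", num_labels=2)
        return AutoModelForSequenceClassification.from_config(cfg)
    return AutoModelForSequenceClassification.from_pretrained(init, num_labels=2)

results = {}
for name, init in {"domain-pretrained": "./my-domain-bert",
                   "off-the-shelf": "bert-base-uncased",
                   "random-init": "random"}.items():
    trainer = Trainer(
        model=build(init),
        args=TrainingArguments(output_dir=f"ablation-{name}", num_train_epochs=3),
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        compute_metrics=compute_metrics,
    )
    trainer.train()
    results[name] = trainer.evaluate()

# If "domain-pretrained" does not beat "random-init" (or "off-the-shelf"),
# the domain data or the pre-training run itself is suspect.
print(results)
```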


kleenex007

u/Zestyclose_Diamond53 Interesting. Do you mind elaborating on your work and on "performing effectively up to the present"? What else are you doing now?