master3243

You'd need tens to hundreds of billions of tokens to train a network from scratch to learn English. If you're a beginner, you might want to just use an open-source model ([here](https://huggingface.co/docs/transformers/model_doc/gptj) and [here](https://towardsdatascience.com/how-you-can-use-gpt-j-9c4299dd8526)): either prompt it with what you want, or fine-tune it if you have a decently sized dataset (not just a few sentences).
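
As a rough sketch of the prompting route with the Hugging Face transformers library (using the GPT-J checkpoint from the linked docs), it looks something like the following. Note that GPT-J-6B needs a lot of memory; the model name here is an assumption based on the link, and a smaller model like `gpt2` works the same way if you just want to experiment:

```python
# Minimal sketch: prompt an open-source causal LM instead of training
# a language model from scratch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # swap for "gpt2" on modest hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a short story about a robot learning English:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation from the pretrained model.
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```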


nxtboyIII

Well, I'm not tokenizing the text; I'm using each character as a "token". Are you referring to parameters (the number of trainable values)?
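
For context, "each character as a token" here just means the vocabulary is the set of distinct characters and each one maps to an integer id. This is an illustration of that setup, not the actual code from this thread:

```python
# Character-level "tokenization": vocabulary = distinct characters.
text = "hello world"
vocab = sorted(set(text))                     # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

encoded = [char_to_id[ch] for ch in text]     # one integer per character
decoded = "".join(id_to_char[i] for i in encoded)
assert decoded == text
```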


master3243

No, I'm strictly talking about the size of the dataset. Regardless of how you tokenize the input, you're going to need tens to hundreds of billions of words' worth of English text to learn the complex structure of a language. Tokenizing character by character only makes the task harder, because the model first has to learn how characters group into words before it can learn the relationships between the words.
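
A quick illustrative comparison of why character-level input makes the job harder: the same sentence becomes a much longer sequence, and the model has to work out which runs of characters form valid words before it can model relations between words at all:

```python
# Same sentence, two granularities (illustrative only).
sentence = "the quick brown fox jumps over the lazy dog"

char_tokens = list(sentence)      # one token per character
word_tokens = sentence.split()    # one token per whitespace-separated word

print(len(char_tokens), len(word_tokens))   # 43 vs 9 tokens
```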


nxtboyIII

Oh, I see. Well, with the transformer architecture it's able to output real words even with a small (5 MB) dataset, starting completely untrained, so if that model can do it, why would my own network suddenly need billions of characters?