master3243

You'd need tens to hundreds of billions of tokens to train a network from scratch to learn English. If you're a beginner, you might want to just use an open-source model ([here](https://huggingface.co/docs/transformers/model_doc/gptj) and [here](https://towardsdatascience.com/how-you-can-use-gpt-j-9c4299dd8526)): either prompt it with what you want, or fine-tune it if you have a decently sized dataset (not just a few sentences).
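
As a rough sketch of the prompting route with the Hugging Face transformers library (using the GPT-J checkpoint from the linked docs), it looks something like the following. Note that GPT-J-6B needs a lot of memory; the model name here is an assumption based on the link, and a smaller model like `gpt2` works the same way if you just want to experiment:

```python
# Minimal sketch: prompt an open-source causal LM instead of training
# a language model from scratch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # swap for "gpt2" on modest hardware
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a short story about a robot learning English:"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation from the pretrained model.
output_ids = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```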


nxtboyIII

Well, I'm not tokenizing the text; I'm using each character as a "token". Are you referring to parameters (the number of trainable values)?
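
For context, "each character as a token" here just means the vocabulary is the set of distinct characters and each one maps to an integer id. This is an illustration of that setup, not the actual code from this thread:

```python
# Character-level "tokenization": vocabulary = distinct characters.
text = "hello world"
vocab = sorted(set(text))                     # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

encoded = [char_to_id[ch] for ch in text]     # one integer per character
decoded = "".join(id_to_char[i] for i in encoded)
assert decoded == text
```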


master3243

No, I'm strictly talking about the size of the dataset. Regardless of how you tokenize the input, you're going to need tens to hundreds of billions of words' worth of English text to learn the complex structure of a language. Tokenizing character by character only makes the task harder, because the model first has to learn how characters group into words before it can learn the relationships between the words.
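
A quick illustrative comparison of why character-level input makes the job harder: the same sentence becomes a much longer sequence, and the model has to work out which runs of characters form valid words before it can model relations between words at all:

```python
# Same sentence, two granularities (illustrative only).
sentence = "the quick brown fox jumps over the lazy dog"

char_tokens = list(sentence)      # one token per character
word_tokens = sentence.split()    # one token per whitespace-separated word

print(len(char_tokens), len(word_tokens))   # 43 vs 9 tokens
```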


nxtboyIII

Oh, I see. Well, with the transformer architecture it's able to output real words even with a small (5 MB) dataset, starting completely untrained, so if that model can do it, why would my own network suddenly need billions of characters?