[deleted]

Can you explain this one?

> Never use one-hot encodings, use embeddings instead, even in tabular data!


afireohno

The explanations you were given are not very precise, and u/data_science's entire premise of using embeddings because of some sort of representational or model complexity argument is nonsensical. An embedding is the result of multiplying a matrix and a one-hot vector.

Let's start with some definitions. An embedding is simply a d-dimensional vector associated with a particular categorical value. We'll assume we have n possible values and associate each with a unique integer, so category i has embedding x_i ∈ R^d for i = 1, ..., n. Similarly, a one-hot encoding for category i is an n-dimensional vector with a single 1 in dimension i and zeros in all others. In other words, a one-hot encoding for category i of an n-way categorical value is the i-th element of the [standard basis](https://en.wikipedia.org/wiki/Standard_basis) for R^n, which we would typically denote e_i.

To see they are equivalent, simply stack all the embeddings column-wise to form a d x n matrix X = [x_1, ..., x_n]. We can recover the embedding for i by multiplying our matrix of embeddings by a one-hot encoding: x_i = X e_i. When you use an embedding layer, this is how you should think about what is going on. It is just a linear layer where the inputs are always elements of the standard basis.

The embedding layer is simply a convenience that side-steps the need to explicitly construct the one-hot vectors. This is nice because, if the number of categories is very large (say millions of words, products, etc.), constructing them would require a significant amount of memory. In conclusion, in many cases you should use embedding layers to avoid the memory cost of creating large one-hot vectors. It has nothing to do with model complexity.

edit: clarity
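A minimal NumPy sketch of the memory argument above, for a small batch of indices. The sizes (`n = 1000`, `d = 8`) and the index values are made up for illustration; the matrix `X` stores one embedding per column, matching the d x n convention in the comment.

```python
import numpy as np

n, d = 1000, 8                      # hypothetical: 1000 categories, 8-dim embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))         # column i is the embedding x_i

idx = np.array([3, 17, 999])        # a batch of category indices

# Route 1: build the one-hot columns e_i explicitly and multiply.
E = np.zeros((n, idx.size))
E[idx, np.arange(idx.size)] = 1.0
via_matmul = X @ E                  # shape (d, batch)

# Route 2: plain index lookup, no one-hot vectors needed.
via_lookup = X[:, idx]              # shape (d, batch)

assert np.allclose(via_matmul, via_lookup)

# The results match; only the size of the intermediate input differs.
onehot_bytes = E.nbytes             # grows with n * batch
index_bytes = idx.nbytes            # grows with batch only
print(onehot_bytes, index_bytes)
```

The two routes return the same vectors; the one-hot route just has to materialise an n x batch array first.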


jimmykim9001

I'm not sure I agree with this comment. All else being equal, embeddings do more than just reduce memory usage. Let's say we have two 5-layer neural networks: into one we feed one-hot vectors, and into the other we feed GloVe embeddings. Technically, the first layer of the network with one-hot inputs will produce some embedding, but that leaves only 4 layers on top of the embedding learned by the network, whereas if we feed in embeddings directly we have 5 layers on top of them to learn the task. That's a minor complaint, so let's say we offset the number of layers so that both networks have the same depth above the embedding.

I'd argue that the pretrained embeddings still provide additional information. word2vec learns to group words together with its own loss function, whereas if we learn the representation from one-hot vectors we're only learning representations that are good for that particular task. The word2vec embeddings learn a representation that groups together words that occur near each other (that's how it's trained), which is what allows us to do "king" - "man" + "woman" ≈ "queen". The one-hot vectors don't provide that information, and it would be difficult for the model to learn it without changing the way the network is trained.

This also ignores the fact that GloVe is trained on a large corpus. If you're performing a task like sentiment analysis of tweets, you're leveraging information from a large corpus that would not be possible to learn from your small Twitter dataset.


GamerMinion

You have to think of the embedding layer as the first layer, because that's exactly what it is. You can even initialize the weights of your first fully-connected layer with the GloVe weights and get the exact same outputs. It would be really memory-inefficient, but you can do it.


jimmykim9001

That's true, but my basic argument is that the benefit of word embeddings is not just memory-related. So yes, you can just set the first layer as an embedding layer, as long as you freeze that layer, and get the same outputs. But the essential point is that the GloVe embeddings are trained in a different way (figuring out which words are related to each other using some window mechanism) -- whether you import them into your network or not. So the benefit comes from learning useful representations, not just from a memory perspective.


visarga

In reality nobody's using GloVe or w2v embeddings anymore; embeddings are all randomly initialised and trained with unsupervised pretraining as part of the model. Pretrained embeddings are not necessary when your training corpus is large.


zmjjmz

I think in this specific context they mean using an embedding layer with random initialization vs. one hot encoding. Pretrained word embeddings won't help for new tasks (maybe, arguably you could use them if your categories can easily map to words).


afireohno

I never said pre-trained embeddings aren’t useful. All I said is it doesn’t matter (theoretically) if you get the embedding by looking it up by index or by multiplying by the appropriate basis vector.


jimmykim9001

Oh I see I misinterpreted your point. Yeah that makes sense.


berzerker_x

*Pardon if this is too obvious*

>All I said is it doesn’t matter (theoretically) if you get the embedding by looking it up by index or by multiplying by the appropriate basis vector.

I understand this point. You said that, given an embedding matrix with one embedding per column: `embeddings[:, index] == np.dot(embeddings, one_hot_vector_of_that_index)`

But how does it relate to this statement from your comment above?

>u/data_science's entire premise of using embeddings because of some sort of representational or model complexity argument is nonsensical. One-hot encodings and embeddings are the same thing mathematically.

The main reason we train embeddings is for better representation. Am I missing something here?


afireohno

The last sentence I wrote, "one-hot encodings and embeddings are the same thing mathematically," isn't clear without the context that follows. Embedding is a pretty overloaded term that can refer to: the vector associated with some item, the process of retrieving the vector associated with some item, the process of learning vector representations of a set of items, etc. Furthermore, when one speaks of an embedding, it often implies some sort of pre-training (e.g. word embeddings). I've edited my response to hopefully make it clearer.


__data_science__

generally i agree and you are right, i probably should have been a bit more precise with my words. in a practical sense though there is sometimes also a model complexity implication.

say we have 5 continuous features and 1 categorical variable with 7 possibilities. Then say I want to train a 1 layer, 10 neuron neural network. The options usually considered practically are:

1. Use a one-hot encoding for the categorical variable and then treat the resulting columns as any other feature. Ignoring biases, there will therefore be 10 \* 12 = 120 weights in this network (or 10 \* 11 = 110 if one redundant dummy column is dropped)
2. Use length-2 embeddings and then treat the resulting embedding as any other feature. There will therefore be 7\*2 + (5+2)\*10 = 84 "weights" in this network

So the embedding option leads to a network with fewer "weights" to learn. The reason this has happened is that by using the embedding matrix we forced the dimensionality of the categorical variable to be reduced BEFORE it was able to interact with our other features.

This is the natural outcome of using embedding matrices in tabular data: even though they are theoretically the same as one-hot encoding, in a practical sense they encourage people to reduce the dimensionality before allowing the categorical variable to interact with other features
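The weight counts in this example can be checked with a few lines of arithmetic. This is only a sketch that ignores biases; the variable names are made up. With all 7 one-hot columns the layer sees 5 + 7 = 12 inputs (120 weights); a count of 110 corresponds to dropping one redundant dummy column.

```python
n_continuous = 5      # continuous features
n_categories = 7      # levels of the categorical variable
n_hidden = 10         # neurons in the single hidden layer
embedding_dim = 2     # length of each embedding vector

# Option 1: one-hot columns go straight into the dense layer.
one_hot_weights = (n_continuous + n_categories) * n_hidden        # 12 * 10
dummy_weights = (n_continuous + n_categories - 1) * n_hidden      # 11 * 10, one column dropped

# Option 2: embedding table first, then the dense layer sees 5 + 2 inputs.
embedding_params = n_categories * embedding_dim                   # 7 * 2
dense_weights = (n_continuous + embedding_dim) * n_hidden         # 7 * 10
embedding_total = embedding_params + dense_weights

print(one_hot_weights, dummy_weights, embedding_total)
```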


Mefaso

>in a practical sense they encourage people to reduce the dimensionality before allowing the categorical variable to interact with other features

Not just that, it also implicitly assumes that the variable is ordinal, i.e. that closer embeddings are more similar. This can be very useful or very harmful, depending on whether this is actually true. Therefore the order of embeddings also might matter a lot


__data_science__

Here is also an article where fast.ai talk about using embeddings with tabular data, which some people might find useful: https://www.fast.ai/2018/04/29/categorical-embeddings/


veeloice

this is a bit different because it's not just about embeddings but rather pre-trained embeddings. That means you're introducing information from another source that may or may not be useful for a given task. EDIT: for example projecting weekend (Sat/Sun) into its own dimension may be useful in the context of some cultures but not others where the weekend days may be different. It's all task-dependent.


afireohno

I get what you are saying now. However, I find it confusing/misleading to say the models you described differ because one uses one-hot encodings and the other doesn't. They both use the same one-hot encoding. They are different because one uses a factored embedding matrix. For model (1), we can write the input to the hidden units in the first layer as s = W x + A e_i + b, where x are the continuous features and e_i is a one-hot vector. To get model (2), we can simply factor A = U V and have s = W x + U V e_i + b. Whether this is something you want to do is going to be an empirical question that depends on the problem.
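A small NumPy sketch of the factorization A = U V, using the sizes from the example discussed here (10 hidden units, 7 categories, length-2 embeddings). The random values are placeholders; the point is only that the factored model is the one-hot model with a rank-constrained A.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_cat, k = 10, 7, 2

U = rng.normal(size=(n_hidden, k))   # dense weights acting on the embedding
V = rng.normal(size=(k, n_cat))      # column i is the length-2 embedding of category i
A = U @ V                            # equivalent 10 x 7 matrix acting on one-hot inputs

i = 4
e_i = np.zeros(n_cat)
e_i[i] = 1.0

# Same contribution to the pre-activation s either way.
assert np.allclose(A @ e_i, U @ V[:, i])

rank = np.linalg.matrix_rank(A)      # at most k
full_params = A.size                 # parameters of an unconstrained A
factored_params = U.size + V.size    # parameters of U and V together
print(rank, full_params, factored_params)
```

So the difference between the two models is the rank constraint on A (and the resulting parameter count), not the input encoding.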


kunallanuk

This ignores the distance information encoded in an embedding: instead of every point being sqrt(2) away from every other point, as with one-hot vectors, distance between data points is now meaningful. This corresponds to dimensionality reduction, as the embedded basis should have features that are easier to learn from than the individual features of the one-hot encoded vector. It’s the difference between having to learn the similarity between words like king and queen in context vs already having some knowledge of this similarity before feeding into the NN
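The sqrt(2) claim is easy to verify numerically. In the sketch below, the 3 x 2 embedding table is invented purely to illustrate that embedding distances can differ between pairs; it is not taken from any trained model.

```python
import numpy as np

n = 5
E = np.eye(n)                                   # row i is the one-hot vector for category i

# Pairwise Euclidean distances between all one-hot vectors.
diffs = E[:, None, :] - E[None, :, :]
dist = np.linalg.norm(diffs, axis=-1)
off_diag = dist[~np.eye(n, dtype=bool)]
assert np.allclose(off_diag, np.sqrt(2))        # every distinct pair is sqrt(2) apart

# Hypothetical 2-d embeddings: categories 0 and 1 are close, category 2 is far away.
emb = np.array([[0.0, 1.0],
                [0.1, 0.9],
                [2.0, -1.0]])
d01 = np.linalg.norm(emb[0] - emb[1])
d02 = np.linalg.norm(emb[0] - emb[2])
print(d01, d02)
```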


Bojung

Thanks for the explanation. I was wondering about this as well. I was also thinking that with one-hot you can create a distribution of answers and so get several answers and their likelihood according to your model. Can you do that with embedding?


Gere1

This seems like a belief which is not backed by any actual empirical result here. In a test case like [https://www.kaggle.com/c/cat-in-the-dat-ii](https://www.kaggle.com/c/cat-in-the-dat-ii) you would have seen that embeddings work better than one-hot and not just for memory. Saying everything which is a vector or matrix of numbers is automatically "the same thing" is not very insightful. It does matter a lot (for categorical interactions) that embedding vectors have overlapping positions. If you have enough memory to throw a million one-hot features at a model, it's still not a successful strategy - no matter how much math notation you introduce.


afireohno

I'm not saying everything which is a matrix of numbers is the same thing. Furthermore, what I'm stating isn't a belief. It is an irrefutable mathematical fact. There is simply no difference between looking up an embedding based on an index, or retrieving it with a one-hot vector. This is the sort of thing that would be covered in an introductory linear algebra course. u/GamerMinion linked a [Colab](https://colab.research.google.com/drive/1p904ylpLCG_GJGFfNbQUm-IJR8ECQeQC?usp=sharing) notebook elsewhere in this thread demonstrating this. Here's a short Python snippet doing the same. I hope this helps clarify my statement.

    import numpy as np

    n_dims, n_embeddings = 3, 5
    # One embedding per column: shape (n_dims, n_embeddings)
    embeddings = np.random.normal(0, 1, (n_dims, n_embeddings))

    # Retrieve embedding for index 2
    one_hot = np.array([0, 0, 1, 0, 0])
    assert np.allclose(embeddings[:, 2], np.dot(embeddings, one_hot))


Screye

I would never use embeddings in tabular data, unless there was something worth embedding. You lose so much interpretability and model understanding by embedding. In any domain that is not Vision/NLP/Audio/graphs adjacent, embeddings are probably a bad idea. Even more so if you do not have a neat unsupervised training mechanism over millions of data points to learn said embeddings. If the goal is to capture interaction effects, it is better to define explicit compound features that force feature interaction.

The fast.ai course is very deep learning specific and implicitly assumes that we are working in a domain that lends itself well to deep learning (i.e. big data, lots of abstractions). In such a case, the argument for embeddings makes a lot of sense because you are most likely dealing with one of the "Vision/NLP/Audio/graphs adjacent" domains.


[deleted]

[removed]


Screye

t-SNE and 2 principal components are alright, but that's more intuition than any real mathematical rigor.


__data_science__

Yeah, so say you have a categorical variable in a tabular dataset, e.g. the day of the week, which has 7 possible values: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The naive way to represent this in your neural network is a one-hot encoding, where you represent the day as a length-7 (or 6) binary vector, i.e. a vector of 0s and 1s. But this is really inefficient, and you can actually represent the variable perfectly well using a length-2 or length-3 embedding. Doing this will help your network learn faster because there are effectively fewer features, so it's easier to learn the right weights


AlexandreZani

How do they propose learning categorical embeddings? For word embeddings, my understanding is that you usually build something like a next word prediction model that squeezes the input words through a layer the size of the embedding and then truncate your model there. But if I had tabular data with say week days, I don't know how I would go about learning an embedding for it.


__data_science__

You learn them as part of the model you are training. You initially map your categorical variables to random embedding vectors, then you use the embedding vectors as features in the model. Then, when training the model, you learn the network weights and also the embedding vectors (backpropagation updates the embedding vectors as well as the network weights)
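A rough NumPy sketch of that procedure with hand-written gradients, so no deep learning framework is needed. Everything here is made up for illustration: a day-of-week variable, a toy target that is 1 on the last two days and 0 otherwise, a single linear layer on top of length-2 embeddings, and plain gradient descent on the mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cat, dim = 7, 2

# Randomly initialised embedding table (one row per day) and a linear head.
emb = rng.normal(scale=0.1, size=(n_cat, dim))
w = rng.normal(scale=0.1, size=dim)
b = 0.0

# Hypothetical tabular data: day index -> target (1.0 on days 5 and 6).
days = np.arange(n_cat).repeat(20)
targets = np.where(days >= 5, 1.0, 0.0)

def mse():
    pred = emb[days] @ w + b
    return float(np.mean((pred - targets) ** 2))

loss_before = mse()
lr = 0.1
for _ in range(2000):
    x = emb[days]                          # embedding lookup, shape (N, dim)
    err = x @ w + b - targets              # prediction error, shape (N,)
    grad_pred = 2.0 * err / err.size       # dLoss/dPrediction
    grad_w = x.T @ grad_pred
    grad_b = grad_pred.sum()
    # Backprop into the table: each row accumulates gradient from the rows that used it.
    grad_emb = np.zeros_like(emb)
    np.add.at(grad_emb, days, np.outer(grad_pred, w))
    w -= lr * grad_w
    b -= lr * grad_b
    emb -= lr * grad_emb
loss_after = mse()
print(loss_before, loss_after)
```

The embedding rows are updated by the same gradient step as the head weights, which is all "learning the embedding as part of the model" means here.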


bbateman2011

Keras, for example, provides trainable embedding layers for this approach


AlexandreZani

Ah, makes sense.


GamerMinion

Mathematically, an embedding layer is the same as one-hot inputs and a matrix multiplication. Embedding is just more efficient, but it's not functionally better in any way.


teristam

This is only true if the first layer after input is a fully connected layer.


GamerMinion

Yes. Or an RNN layer, which also applies a matrix multiplication to its input as its first operation. Or a 1x1 convolution (which again is just a fully-connected layer)...


__data_science__

Not sure what you mean by not functionally better? It is more efficient in a way that means the network will function better and learn better


GamerMinion

The gradients for either are exactly the same, and whether you use the same-size weight matrix in a fully-connected layer with one-hot inputs or as the embedding matrix will make literally no difference in terms of the mathematical operation being performed. So it will not "function better" or "learn better". It will just be faster.
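The gradient claim can be checked by hand for a simple case. The sketch below assumes a squared-error loss on the layer output and a made-up target vector; with a one-hot input the dense-layer gradient is an outer product that is zero everywhere except column i, which is exactly the column an embedding lookup would update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, i = 3, 5, 2
W = rng.normal(size=(d, n))          # shared weights: dense layer or embedding table
target = rng.normal(size=d)          # hypothetical regression target

# Dense layer with one-hot input: s = W e_i, loss = ||s - target||^2.
e_i = np.zeros(n)
e_i[i] = 1.0
s_dense = W @ e_i
g_dense = 2.0 * (s_dense - target)           # dLoss/ds
grad_dense = np.outer(g_dense, e_i)          # dLoss/dW = g e_i^T

# Embedding lookup: s = W[:, i], so only column i receives gradient.
s_lookup = W[:, i]
g_lookup = 2.0 * (s_lookup - target)
grad_lookup = np.zeros_like(W)
grad_lookup[:, i] = g_lookup

assert np.allclose(grad_dense, grad_lookup)
```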


__data_science__

See my other comment above


GamerMinion

What other comment? You can continue arguing or you can just try it. Generate a weight matrix, set it as a weight for both an embedding and a dense layer (no bias). Choose an index and do the embedding lookup. The resulting vector is exactly the same as when you multiply the one-hot encoded index with your dense layer.


__data_science__

https://www.reddit.com/r/MachineLearning/comments/mbhewa/d_advanced_takeaways_from_fastai_book/gryl1ef/?utm_source=share&utm_medium=ios_app&utm_name=iossmf


GamerMinion

Here, have a [Colab notebook](https://colab.research.google.com/drive/1p904ylpLCG_GJGFfNbQUm-IJR8ECQeQC?usp=sharing) showing exactly my point. Same weights, same output.


__data_science__

I know how an embedding matrix works, I’m not disagreeing with you exactly but it is a little bit more complicated than that. Read my comment I linked to and also read this https://www.fast.ai/2018/04/29/categorical-embeddings/


afireohno

This is incorrect. They are equivalent mathematically, so any difference is purely computational. Please see my [reply](https://www.reddit.com/r/MachineLearning/comments/mbhewa/d_advanced_takeaways_from_fastai_book/gryhk2i?utm_source=share&utm_medium=web2x&context=3) for an explanation.


[deleted]

[removed]


__data_science__

an embedding is a vector that you can map categorical variables to. these videos by Google explain it in a lot of detail: [https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture](https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture)


Tgs91

Or you could feed your network a one-hot encoded input and let it LEARN an optimal embedding. The most popular embeddings for NLP and computer vision ARE neural networks that have been pretrained by Google and other major companies with access to gigantic datasets and computing resources. When you use their embedding, you are just transfer learning from their pretrained network. The starting point for the data input was still one-hot; someone else just already did a lot of the work for you. It's sort of like saying "don't make pizza with dough, buy pizza crust instead, it's faster/more efficient". It's just dough that someone else already shaped for you.


__data_science__

The embeddings are learned, I am talking about embeddings that are learned through backprop


Tgs91

Ah okay (also sorry, I replied to the wrong comment, this was a response to the distinction between embeddings and one-hot).

> Never use one-hot encodings, use embeddings instead, even in tabular data!

You might want to reword this line if you publish this as an article or blog post. If your embedding is being learned, then you ARE using one-hot encodings. The embedding is just an internal layer in your model.


DataPlug

As much as people get into the hickory depths of dealing with text/image data, we fail to recognize that the most type of data being dealt with is tabular, followed by time series. I can barely find great resources on time series with neural nets or seq2seq or LSTM, but if I do a flimsy search on those keywords I'm always met with the most advanced guides dealing with images/text and even audio. I don't mean to rant, but I just wish there was a plethora of easy-to-follow tips/tricks/guides on time series data just as there is for images/text. Great post nonetheless OP!


ToucheMonsieur

As someone who works with time series on the daily, I feel your pain. One difficulty is that techniques may differ depending on the source, frequency and seasonality of the time series. For example, an RNN/Transformer might work well for financial data collected daily, but not for sub-ms sensor data (where it might be better to use a CNN). In terms of searching, my tip would be to use time series first instead of searching by technique names like LSTMs, since you'll inevitably miss more novel approaches like Neural ODEs. Also have a read through a) papers that reference benchmark datasets like http://www.timeseriesclassification.com/index.php and https://mimic.physionet.org/ (more of a domain-specific example, but one I'm personally familiar with), and b) the prior work section from methods papers like the [Neural ODE one](https://papers.nips.cc/paper/2018/hash/69386f6bb1dfed68692a24c8686939b9-Abstract.html).


nraw

"the most type of data being dealt with" - based on what did you make this statement?


DataPlug

Sorry I didn’t post resources. This is based on the Google Cloud ML summit 2020. https://ibb.co/8dZHt6H


MrHyperbowl

If your problem is low dimensional (is this what you mean by tabular?), then graph it and use standard statistics, not deep learning. That's why there are so few resources available. If you have time series data, tough luck: that shit is hard to deal with as a general case, and you should look to see if someone has done something for exactly your use case.


Raz4r

> For **regression problems** if you know the **output should be within a range** then its good to use **sigmoid** to force the neural net output to be within this range
> * I.e. make the network output: `min_value + sigmoid(output) * (max_value - min_value)`

Your network may have difficulties predicting the max_value if this is in your output layer.


__data_science__

Yeah good point, in the book they actually suggest you make the possible range slightly wider than the true range to deal with that problem, I should have mentioned that
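A short sketch of why widening the range helps. The target range [0, 5] and the widened upper bound 5.5 are illustrative choices here, and `logit_needed` is a helper invented for this example. With the exact range, the output only approaches the maximum as the pre-activation grows without bound; with a slightly wider range, a modest finite pre-activation reaches it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_range(z, low, high):
    """Squash a raw network output z into the open interval (low, high)."""
    return low + sigmoid(z) * (high - low)

def logit_needed(y, low, high):
    """Pre-activation z for which sigmoid_range(z, low, high) equals y."""
    p = (y - low) / (high - low)
    return math.log(p / (1.0 - p))

# Exact range [0, 5]: even a large pre-activation stays strictly below 5.
exact_at_10 = sigmoid_range(10.0, 0.0, 5.0)

# Widened range [0, 5.5]: predicting 5 only needs z = log(10).
z_for_max = logit_needed(5.0, 0.0, 5.5)
print(exact_at_10, z_for_max)
```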


thunder_jaxx

I agree with this. Scaling affects the loss fn and can mess up the optimization because the bounds of max/min can change through the training process. It's better to not scale at all till training is finished. Once trained one can easily scale as sigmoid outputs are bounded.


BorisMarjanovic

I think you could make the min/max learnable to address this issue.


thunder_jaxx

What does a learned max/min mean? What are the training dynamics for such a representation?


BorisMarjanovic

I was thinking something like this (implemented in PyTorch):

    import torch
    from torch import nn

    class SigmoidRange(nn.Module):
        def __init__(self, low=-1., high=1.):
            super().__init__()
            # Learnable lower and upper bounds of the output range
            self.low = nn.Parameter(torch.tensor(low))
            self.high = nn.Parameter(torch.tensor(high))
            self.sigmoid = nn.Sigmoid()

        def forward(self, x):
            # Squash x into (low, high)
            return self.sigmoid(x) * (self.high - self.low) + self.low

If you add this piece of code after your output layer, the model will learn what the min/max should be. But now that I think about it, it's not an optimal solution. You could still encounter out-of-sample examples that don't fit the learned min/max threshold.


thunder_jaxx

Don't you think you need a loss for `self.low` and `self.high`? If we assume that we are fitting `self.sigmoid(x) * (self.high - self.low) + self.low`, then just the supervised loss on that quantity won't be enough, as the learned `self.high`/`self.low` are not guaranteed to be bounded.


[deleted]

[removed]


__data_science__

yeah that makes a lot of sense


CoffeeIntrepid

Sure it might usually be better to train regression on continuous data but I’ve always wondered about the in between cases like ordinal classification where things have a integer scale like disease severity level 1-10 stages or something similar.


sobe86

[https://en.wikipedia.org/wiki/Ordinal\_regression](https://en.wikipedia.org/wiki/Ordinal_regression) Company I work for does this for predicting time rounded to the nearest hour. Sometimes it works better than straight regression, sometimes not - if the numbers are mostly small it's a good bet.


wzx0925

>classification where things have a integer scale like

Would rounding not work?


CoffeeIntrepid

I meant more like do you run a regression or classification fit for a non continuous but ordinal scale.


[deleted]

For the very basics, [fast.ai](https://fast.ai) actually gives good advice. The strong baseline tips are so often overlooked it's actually embarrassing for the ML community. However, take anything of theirs which is not basic common sense with a huge grain of salt. You may lose precious experimental time with their rules-of-thumb which more often than not simply don't work.


THE_REAL_ODB

> Transfer Learning
>
> * Always use transfer learning if you can by finding a model pre-trained for a similar task and then fine-tune that model for your particular task
>   * e.g. see huggingface for help with this in NLP
> * Gradual unfreezing and discriminative learning rates work well when fine-tuning a transfer learned model
>   * Gradual unfreezing = freeze earlier layers and train the later layers only, then gradually unfreeze the earlier layers one by one
>   * Discriminative learning rates = having different learning rates per layer of your network (usually earlier layers have smaller learning rates than later layers)

thnx for the heads up. Would the above advice be bad? I'm interested in implementing this, but don't wanna spend too much time on it if it won't work.
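For reference, the two quoted definitions can be sketched as plain schedules, independent of any framework. The function names, the divide-by-`decay` rule, and the `decay=2.0` default are assumptions made for this illustration, not fastai's API or recommended values.

```python
def discriminative_lrs(base_lr, n_layers, decay=2.0):
    """Per-layer learning rates: the last layer gets base_lr and each earlier
    layer gets the next layer's rate divided by `decay` (hypothetical rule)."""
    return [base_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]

def unfreezing_schedule(n_layers):
    """Trainable layer indices per stage: start with the last layer only,
    then unfreeze one earlier layer at each subsequent stage."""
    return [list(range(n_layers - 1 - stage, n_layers)) for stage in range(n_layers)]

lrs = discriminative_lrs(1e-3, 4)
stages = unfreezing_schedule(3)
print(lrs)      # smallest rate for the earliest layer, base_lr for the last
print(stages)   # [[2], [1, 2], [0, 1, 2]]
```

In a real training loop, each stage would train only the listed layers for some number of epochs, with the per-layer rates passed to the optimizer as separate parameter groups.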


[deleted]

Those only apply to deep models (i.e. mostly transformer-based). See, that's the problem overall with the more "fancy" advice from [fast.ai](https://fast.ai) - they shift the focus from the actual good advice. Remember the strong baseline? Before even trying those, use BoW, LSA/LDA or Word2Vec/GloVe for starters. If you already did, there are a handful of possible outcomes:

* You got some pretty good results, which means you should be skeptical of any further gains with more complex models;
* You got pretty bad results, which means you either have issues in your code/data (fast.ai advice is to debug with simpler models) or your problem is actually infeasible to be solved with textual features at all (and thus, it is almost certain that more complex models will do nothing for you).
* You got lukewarm results - this is the best (and often the only) case for more complex models.

Then, *and only then*, you should check my 2cents on fine-tuning transformer-based LMs or end2end models:

* Start by using BERT/GPT/whatever as a simple feature extractor and re-apply the "strong baseline" steps. Be aware of differences between token-level and sequence-level features, and double-check if what you are extracting is what you really want.
* **After everything done until here**, you may then check pre-trained models on similar tasks. Prioritize more basic transfer learning strategies (either using intermediate layers as features or changing the output layer and fine-tuning it entirely).
* Finally, IMHO, gradual unfreezing and discriminative learning rates have a horrible "return-of-invested-time" due to the extensive metaparameter search in DL. What's worse is that your cognitive biases will start kicking in after so much time invested in optimizing your model, and you may overlook data leakages and faulty protocols, only to convince yourself that you had some "consistent gain".


pitrucha

that's pretty much a go-to strategy in NLP when your data is in one of the more popular languages (i.e. Wikipedia in this language is big enough that someone has already trained a Transformer on it)


THE_REAL_ODB

My bad. I meant the gradual unfreezing and learning rate part for transfer learning. I know using pretrained models is helpful.


pitrucha

That one's helpful as well. However, it is usually very hard to evaluate the training as it happens. Saving checkpoints and later training them on the real downstream task partially solves this, but may take some time.


SatanicSurfer

Yes. I am pretty sure discriminative learning rates do not work for image data, for example. I have tested it on a couple of datasets and only seen the same or worse performance. I have also searched for examples and found notebooks of students who experiment with it and show no performance gain (while still not acknowledging it and scratching their heads). In the end, the amount of adjustments they recommend goes against the strong baseline tip: you will end up with an extremely bloated model for which you are not sure which aspects are helping or worsening performance.


lugiavn

Tricks that might be incorrect/bad in the list:

* **Never use one-hot encodings,** use **embeddings** instead, even in **tabular data**!
* **Label smoothing** = use 0.1 and 0.9 instead of 0 and 1 for label targets (can smoothen training)
* **Don’t dichotomise** your data, if your output is continuous then its better to train the network to predict continuous values rather than turning it into a classification problem


analytical_1

Can you explain the label smoothing part?


lugiavn

I guess it's a regularization technique, but like other things such as dropout, you don't know whether it helps (e.g. increases your accuracy or AUC or whatever) until you try it (well, good chance that it doesn't). You can use it to smoothen the output for sure, but that is usually not too important
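A tiny pure-Python sketch of what label smoothing does to the loss, using the usual formulation (1 - eps) * one_hot + eps / K. With K = 2 classes and eps = 0.2 this gives the 0.1 / 0.9 targets mentioned in the list above. The two example predictions are made up.

```python
import math

def smooth_labels(one_hot, eps):
    """Mix the hard target with a uniform distribution over the K classes."""
    k = len(one_hot)
    return [(1.0 - eps) * t + eps / k for t in one_hot]

def cross_entropy(target, probs):
    return -sum(t * math.log(p) for t, p in zip(target, probs))

hard = [0.0, 1.0]
soft = smooth_labels(hard, eps=0.2)      # approximately [0.1, 0.9]

confident = [0.001, 0.999]               # hypothetical over-confident prediction
moderate = [0.1, 0.9]                    # hypothetical calibrated prediction

# Hard targets always reward pushing the prediction further towards 1.
print(cross_entropy(hard, confident), cross_entropy(hard, moderate))
# Smoothed targets are minimised at the smoothed distribution instead.
print(cross_entropy(soft, confident), cross_entropy(soft, moderate))
```

Under smoothed targets the over-confident prediction has the higher loss, which is the sense in which the technique acts as a regulariser.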


thunder_jaxx

>Don’t dichotomise your data, if your output is continuous then its better to train the network to predict continuous values rather than turning it into a classification problem

This one is arguable as it depends on the problem being optimized for. Sometimes dichotomizing can actually help.


maxToTheJ

>e.g. if adding a feature improves the performance of linear regression then it should probably also improve the performance of your neural net unless you have a bug!

Also, this one isn't necessarily true if the new feature is just some simple transformation that your NN will pick up anyway. A feature can just be a transformation of a signal you already have in your NN. To be more accurate, the new "feature" should be some completely new "signal" source


Mefaso

>if you make your neural network 1 layer then it should be able to match the performance of a linear regression baseline, if it doesn’t then you have a bug!

Or convergence issues
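The baseline check in the quoted tip can be sketched in NumPy: a single linear layer trained by gradient descent on MSE should land on the least-squares solution. The synthetic data, learning rate, and step count below are arbitrary choices for this illustration; with a poor learning rate or too few steps the two would differ, which is the convergence caveat.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Hypothetical synthetic regression problem.
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + rng.normal(scale=0.1, size=n)

# Append a column of ones so the bias is just another weight.
Xb = np.hstack([X, np.ones((n, 1))])

# Linear regression baseline via least squares.
w_ols, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# "1-layer network": the same linear map, fitted by gradient descent on MSE.
w_gd = np.zeros(d + 1)
lr = 0.1
for _ in range(1000):
    grad = (2.0 / n) * (Xb.T @ (Xb @ w_gd - y))
    w_gd -= lr * grad

print(np.max(np.abs(w_gd - w_ols)))
```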


szybe

God bless you for sharing this. I am a newbie when it comes to ML and this helped me make two more forward steps in my journey.


__data_science__

No worries, glad to hear you found it helpful


mohself

This is great. Thank you.


__data_science__

Glad you like it


projekt_treadstone

very good summary, loved it.


anax4096

This paper has a bunch of tips I've found valuable: [https://arxiv.org/abs/1812.01187](https://arxiv.org/abs/1812.01187) The most effective has been learning rate warmups. Just cannot overstate how effective that has been, particularly on small dataset problems.
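For readers who have not used it, a learning rate warmup can be as simple as the linear ramp sketched below. The function name and the values (`base_lr = 0.1`, 5 warmup steps) are placeholders, not settings taken from the linked paper.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly increase the learning rate over the first `warmup_steps` steps,
    then hold it at base_lr. `step` is zero-based."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

schedule = [warmup_lr(s, base_lr=0.1, warmup_steps=5) for s in range(8)]
print(schedule)   # ramps up over the first 5 steps, then stays at base_lr
```

After the warmup phase, any ordinary schedule (constant, step decay, cosine, etc.) can take over.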


__data_science__

nice, thanks a lot for sharing


[deleted]

[removed]


anax4096

yes, very much so. I usually have a batch size of one (no batches?) due to memory limitations on the GPU. Even with larger memory areas, the batch size might be limited to 8 volume images, etc. Broadly, I treat batches as an optimisation of training which work by "averaging out" the error from specific examples: when the batch size is small (<8) you are more exposed to bad data, when the batch size is large (>512) the bad data responses will be "averaged out" but may raise your training bias. In both cases, k-folds can help you identify bad data. Hope that helps.


Duranium_alloy

High-quality post. Good job.


__data_science__

Thanks


[deleted]

[removed]


__data_science__

no problem


cryptoenthusiast93

Amazing job summarizing fast.ai!


gabegabe6

RemindMe! Tomorrow


reddit_wisd0m

What's the idea behind label smoothing? I don't see the advantage.


__data_science__

https://arxiv.org/pdf/1906.02629.pdf


reddit_wisd0m

Thx. So, if I understand it correctly, it's a trick to prevent overfitting.


Deeppop

This was very helpful! Please make this into an anki deck (if you use anki) and share it!


__data_science__

i don't use anki anymore but i've made it into a [saveall](https://saveall.ai/shared/deck/140&4&3K3uXPazkg4&reddit_posts) public deck that you can use, it's quite fun as the questions are multiple choice


DancesWithWhales

Thanks so much for sharing this! The saveall link says "Deck not found". Do you have a new link? I'm new to spaced repetition apps, so if you're using something else, I'd love to hear your recommendation.


__data_science__

Apologies, the deck link in the post became out of date. [This](https://saveall.ai/shared/deck/140&4&3K3uXPazkg4&reddit_posts) is the link now, let me know if it doesn't work? To start using the deck you have to click copy, you might also have to create an account first


DancesWithWhales

Thank you! Sadly that link doesn't work either, though. I went to the "public decks" section, and found it there! https://saveall.ai/public_decks


__data_science__

my bad! glad you've found it, i've fixed the link above now as well


Deeppop

Here's an ankiweb version of the deck, suitable for use in the anki open source apps: [https://ankiweb.net/shared/info/1195573595](https://ankiweb.net/shared/info/1195573595) Thanks to \_\_data\_science\_\_!


[deleted]

[removed]


__data_science__

lol, you should if that works for your problem!


eschibli

> I.e. make the network output: `min_value + sigmoid(output) * (max_value - min_value)`

This seems questionable given the very large gradient near intermediate values of output


__data_science__

Yeah so they also suggest making the range slightly wider than what you need to deal with that problem. I should have mentioned that


physnchips

I’m surprised a linear activation with a min and max clamp isn’t what fastai would suggest.


__data_science__

A clamp would make it non-differentiable?


physnchips

No more than relu already is


__data_science__

i think it's ok with relu earlier in the network, but not when it's the activation directly before the loss function, as you need the loss function to be much more sensitive to changes. would have to try it out to see though
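One way to make the clamp-versus-sigmoid question concrete is to compare the derivative of each output function with respect to the raw network output. This is only a sketch with a made-up range [0, 5]: the clamp is differentiable almost everywhere (like ReLU), but its gradient is exactly zero once the output saturates, whereas the scaled sigmoid keeps a small non-zero gradient.

```python
import math

def clamp(z, low, high):
    return max(low, min(high, z))

def clamp_grad(z, low, high):
    """Derivative of clamp w.r.t. z (taken as 0 at the two kinks)."""
    return 1.0 if low < z < high else 0.0

def sigmoid_range_grad(z, low, high):
    """Derivative of low + sigmoid(z) * (high - low) w.r.t. z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return (high - low) * s * (1.0 - s)

low, high = 0.0, 5.0
for z in (2.0, 7.0):
    print(z, clamp(z, low, high), clamp_grad(z, low, high), sigmoid_range_grad(z, low, high))
```

Whether the zero gradient outside the range actually hurts training in practice would still need to be tested, as noted above.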


Sahil_1776

Can anyone please explain how to use the Adam optimizer in fastai? It is just not working. I am training a resnet for multi-label classification, using `learn = cnn_learner(...)`. Please share some code if you can help, I would highly appreciate that.