Week 1-5
Mainly focused on linear models (linear regression, logistic regression) and neural networks.
Gradient descent requires a simultaneous update: all parameters are updated at the same time, from the same set of current values and with the same learning rate, instead of updating one parameter first and then using its updated value to compute the update for another.
This is similar to how PyTorch's autograd is typically used: gradients for all parameters are accumulated over a batch, and the optimizer then applies a single update at the end.
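A minimal sketch of the simultaneous update for one-variable linear regression. The function name `gradient_step` and the parameter names are assumptions made for illustration, not taken from the course material.

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient descent step for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    # Prediction errors computed from the CURRENT parameter values.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]

    # Both gradients are computed before any parameter changes.
    grad0 = sum(errors) / m
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m

    # Simultaneous update: neither new value depends on the other new value.
    new_theta0 = theta0 - alpha * grad0
    new_theta1 = theta1 - alpha * grad1
    return new_theta0, new_theta1
```

The non-simultaneous (incorrect) variant would overwrite `theta0` first and then recompute `errors` with it before updating `theta1`.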
Logistic regression model
Logistic regression = feeding the output of a linear regression model into the sigmoid function
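As a rough sketch, the model can be written as a linear combination passed through the sigmoid. The names `sigmoid` and `predict_proba` are illustrative assumptions, and `x` is assumed to already include a leading 1 for the intercept term.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: maps any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    """Linear part (theta . x) followed by the sigmoid."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```

The output can be read as the estimated probability that the label is 1; a threshold of 0.5 corresponds to the linear part being zero.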
Cost function for the above model: I have found it really hard to memorize the cost function for logistic regression. Hopefully, the picture below helps capture the intuition behind how the cost function is designed.
h_θ(x) denotes the model's hypothesis about the relationship between x and y
Another post explaining loss function in logistic regression:
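One way to remember the logistic regression cost is through its two cases. The sketch below assumes the standard log-loss form; the function names are hypothetical.

```python
import math

def logistic_cost(h, y):
    """Cost for a single example.

    h: predicted probability that y == 1, strictly between 0 and 1.
    y: true label, 0 or 1.
    """
    if y == 1:
        return -math.log(h)        # near 0 when h is close to 1, large when h is close to 0
    return -math.log(1.0 - h)      # near 0 when h is close to 0, large when h is close to 1

def average_cost(hs, ys):
    """Mean cost over a set of predictions and labels."""
    return sum(logistic_cost(h, y) for h, y in zip(hs, ys)) / len(ys)
```

The intuition is that a confident wrong prediction is penalized heavily, while a confident correct one costs almost nothing.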
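In a multi-class classification problem, one-vs-all provides a brute-force way to solve the problem: train one binary classifier per class, then pick the class whose classifier is most confident.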
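A small sketch of the one-vs-all idea, assuming one already-trained parameter vector per class. Training is omitted, and `one_vs_all_labels`, `predict_one_vs_all`, and the `thetas` dictionary are hypothetical names used only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_vs_all_labels(ys, k):
    """Relabel the data for the classifier of class k: 1 for class k, 0 for the rest."""
    return [1 if y == k else 0 for y in ys]

def predict_one_vs_all(thetas, x):
    """thetas maps each class to its parameter vector; return the most confident class."""
    scores = {
        k: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for k, theta in thetas.items()
    }
    return max(scores, key=scores.get)
```

Each classifier only answers "class k or not", so a problem with K classes needs K separately trained logistic regression models.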
Calculate manually in logistic regression
Intuition in regularization: Cost Function | Coursera
Note that we generally don't know in advance which parameter or parameters to shrink, so we penalize all of them to reduce their overall influence. That is why the cost function includes a regularization parameter (λ in the course), which captures the extent of the penalization.
The whole cost function consists of the original cost function plus an additional regularization term.
So a larger λ gives the regularization term a higher weight in the final cost function, meaning regularization has more impact. In the extreme case of very large regularization, the model can underfit.
Likewise, a larger weight on the original (non-regularized) part of the cost function means the regularization has less impact. In the extreme case of no regularization, the model tries to fit the training data perfectly, even if that leads to overfitting.
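A minimal sketch of how the two parts combine, assuming the common L2 form where the penalty is lam / (2m) times the sum of squared parameters and the intercept `theta[0]` is left unpenalized. The function name and arguments are assumptions for illustration.

```python
def regularized_cost(base_cost, theta, lam, m):
    """Original cost plus an L2 penalty on all parameters except the intercept.

    base_cost: value of the unregularized cost function.
    theta:     parameter list, with theta[0] as the intercept.
    lam:       regularization strength (0 disables the penalty).
    m:         number of training examples.
    """
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return base_cost + penalty
```

With `lam = 0` the result equals the original cost. As `lam` grows, the penalty dominates and the optimizer is pushed toward small parameters, which is where underfitting comes from.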
The main motivation behind neural networks is that they introduce non-linearity, which lets them handle problems that linear models cannot solve.
Backpropagation - The following post, I believe, gives a better explanation: A Step by Step Backpropagation Example – Matt Mazur. The network usually starts with all weights randomly initialized.
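Backpropagation provides the "formal" way to compute the gradient using the chain rule, i.e. you know exactly which direction reduces the cost function before taking a step. Gradient descent then uses that gradient to update the weights. In contrast, estimating the gradient numerically by nudging each weight in turn is like testing all directions to find the best one.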
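To make the chain rule concrete, here is a sketch on a toy network with one input, one hidden unit, and one output (no bias terms), using a squared-error loss. The network shape, the loss, and all function names are assumptions chosen for illustration. The numerical version is included only as a cross-check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, x, t):
    """Forward pass: x -> hidden h -> output y, then squared error against target t."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    return 0.5 * (y - t) ** 2

def backprop_grads(w1, w2, x, t):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    delta2 = (y - t) * y * (1.0 - y)         # dL/dz2, z2 = w2 * h
    grad_w2 = delta2 * h
    delta1 = delta2 * w2 * h * (1.0 - h)     # dL/dz1, z1 = w1 * x
    grad_w1 = delta1 * x
    return grad_w1, grad_w2

def numerical_grads(w1, w2, x, t, eps=1e-6):
    """Central finite differences: nudge each weight in turn and measure the loss change."""
    g1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
    return g1, g2
```

The two functions should agree closely. Backpropagation gets all gradients from one forward and one backward pass, while the numerical check needs extra forward passes for every weight, which is why it is only practical for verification.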