The Magic of Machine Learning: Gradient Descent Explained Simply but With All Math, With Gradient Descent Code from Scratch

The Gradient Descent algorithm is the essence of many machine learning algorithms, especially neural networks, and of prediction tasks in general. It is used in many AI applications, from face recognition and other computer vision products to all kinds of predictions, both of a continuous target (regression) and of a categorical target (classification). The name has two parts: the gradient, which is, simply put, the relationship of change between two parameters, and descent, the move toward the point in this relationship that minimizes some desired quantity. But what do we want to minimize? The Cost Function. In a task with a continuous target the cost function is usually the Mean Squared Error (MSE), while in a task with a categorical target the loss function is cross-entropy.

With the model m, we get some prediction value y' for each row (data point). For each of our data points, we can see how this prediction differs from the true value (y). For MSE this is:

J = (1/n) * Σᵢ (y'ᵢ - yᵢ)²

where n is the number of data points (i.e. rows).

The Goal of Gradient Descent is to Find Optimal Weights

Let's say we have one feature (predictor), x. Our prediction value (y') depends on that feature, but we must find the weight (θ) that gives the prediction value:

y'ᵢ = θ * xᵢ

This formula is applied to each data point (i.e. if we want to predict student achievement based on hours of learning, then to each student). But the optimal θ will be the same for every data point (student, i), because we can rewrite our cost function as:

J(θ) = (1/n) * Σᵢ (θ * xᵢ - yᵢ)²

Because in this example we have only one feature, the derivative of the total cost function and the partial derivative of the cost function will be the same; in reality, the total derivative of the cost function is usually composed of all the partial derivatives.

How will we get to the optimal θ, the one that gives us the smallest possible error function (in this case MSE)? By the derivative! But what is the derivative? The gradient is basically the slope of some function, the measure of change in the function. To get to the gradient, we must look at how the cost function (J) changes with respect to the parameter (θ). Basically, we want to find the value of θ where the error function is smallest. The slope (gradient) of the function is the ratio between the change in J and the change in θ (ΔJ/Δθ). We want this gradient to be as small as possible (ideally zero), because at that spot we reach the minimum value of J, and we want the error function to be as small as possible.

In this simple example the derivative looks as follows, but the rule is similar even for more complicated cases; more generally, derivatives of linear and nonlinear functions can be calculated with the standard differentiation rules. For our cost function:

dJ(θ)/dθ = (2/n) * Σᵢ (θ * xᵢ - yᵢ) * xᵢ

This function dJ(θ)/dθ is the essence of our attention, because it takes a value of the parameter θ we want to optimize and returns the slope of the tangent of the cost function at that value of θ. The slope tells us the direction to take to minimize the cost.

Programming Gradient Descent from Scratch

Now we will make a simple function that implements all this for linear regression. Let's first simply write the calculation of the error, i.e. the derivative of the cost function for a single data point:

def calculate_error_j_(theta1, x_i, y_i):
    return ((theta1 * x_i) - y_i) * x_i

It takes some weight, a predictor feature and a target.

Ok, but now we must mention a couple of new concepts. This slope calculation can be done with different intervals, i.e. with a different 'jump' in θ, which is regulated by a parameter called the learning rate. This parameter basically regulates how big the jumps in θ will be when observing the change of the slope. It has its price: the higher the learning rate, the bigger those jumps are and the faster the computation, but it can overshoot and miss the optimal spot, the minimum of the cost function, so it can end with a not-so-accurate solution. Also, these updates with different jumps are done through iterations that are called epochs. Some advanced gradient descent algorithms, such as ADAM, RMSProp and others, vary the learning rate, making it smaller when the slope is smaller, thus fine-tuning around the optimal θ value(s).

Ok, then we are ready to create a function that will, in several steps, calculate and update the weights (this version also fits an intercept b0, and expects x and y as NumPy arrays so the arithmetic is element-wise):

def linear_regression(x, y, b=0, b0=0, epochs=1000, learning_rate=0.001):
    N = float(len(y))
    for i in range(epochs):
        y_predicted = b0 + (b * x)
        cost = sum((y - y_predicted) ** 2) / N              # MSE
        b_gradient = -(2 / N) * sum(x * (y - y_predicted))
        b0_gradient = -(2 / N) * sum(y - y_predicted)
        b = b - (learning_rate * b_gradient)                # update the slope
        b0 = b0 - (learning_rate * b0_gradient)             # update the intercept
        # optionally print b, b0 and cost here to watch the progress
    return b, b0, cost
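As a usage sketch (the data and numbers here are illustrative, not from the original post): assuming NumPy, a gradient-descent routine of the kind described above, fitting noiseless data y = 3x + 2, should recover the slope and intercept almost exactly.

```python
import numpy as np

def linear_regression(x, y, b=0.0, b0=0.0, epochs=5000, learning_rate=0.01):
    # Plain batch gradient descent for y' = b0 + b*x, minimizing MSE.
    N = float(len(y))
    for _ in range(epochs):
        y_predicted = b0 + b * x
        cost = np.sum((y - y_predicted) ** 2) / N
        b_gradient = -(2 / N) * np.sum(x * (y - y_predicted))
        b0_gradient = -(2 / N) * np.sum(y - y_predicted)
        b -= learning_rate * b_gradient
        b0 -= learning_rate * b0_gradient
    return b, b0, cost

x = np.arange(10, dtype=float)
y = 3 * x + 2                      # true slope 3, true intercept 2
b, b0, cost = linear_regression(x, y)
print(b, b0, cost)                 # b near 3, b0 near 2, cost near 0
```

Note that 5000 epochs are used here rather than 1000: with learning rate 0.01 the intercept direction converges slowly, so a few extra iterations are needed to get close to the exact solution.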
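The learning-rate trade-off can also be demonstrated numerically (an illustrative sketch; the rates and data are chosen for the demonstration, not taken from the post): a small learning rate takes careful steps and settles at the optimum, while an oversized one makes jumps that overshoot the minimum and grow without bound.

```python
import numpy as np

def gd_slope_only(x, y, lr, epochs):
    # Gradient descent on the slope b alone (no intercept), minimizing MSE.
    b = 0.0
    for _ in range(epochs):
        y_predicted = b * x
        b_gradient = -(2 / len(y)) * np.sum(x * (y - y_predicted))
        b -= lr * b_gradient
    return b

x = np.arange(10, dtype=float)
y = 2 * x                                             # true slope 2

b_small = gd_slope_only(x, y, lr=0.001, epochs=1000)  # careful steps
b_large = gd_slope_only(x, y, lr=0.2, epochs=50)      # oversized steps
print(b_small)        # converges close to 2
print(abs(b_large))   # blown up far past the optimum
```

Each oversized step lands on the other side of the minimum, farther away than it started, so the error is multiplied at every epoch instead of shrinking.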
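A small consistency check (illustrative values, not from the post) ties the per-point error function to the batch update: averaging the per-point derivative over the data and multiplying by 2 gives exactly the batch gradient -(2/N) * Σ x * (y - y'), as used inside the training loop.

```python
import numpy as np

def calculate_error_j_(theta1, x_i, y_i):
    # Per-point derivative of the squared error (without the factor of 2).
    return ((theta1 * x_i) - y_i) * x_i

theta = 0.5
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])      # true relation y = 2x

per_point = [calculate_error_j_(theta, xi, yi) for xi, yi in zip(x, y)]
avg_times_two = 2 * sum(per_point) / len(per_point)

# Batch gradient as used in the training loop (intercept fixed at 0):
batch = -(2 / len(y)) * np.sum(x * (y - theta * x))

print(avg_times_two, batch)        # the two values coincide
```

The gradient is negative here because θ = 0.5 underestimates the true slope 2, so the update θ - lr * gradient correctly pushes θ upward.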