
Polynomial Regression in Multiple Features

A linear regression model with one independent variable x and one dependent variable y is called Simple Linear Regression. Expressed as a formula: \(y = \theta_{1} x + \theta_{0}\)

This is easy to understand because it can simply be drawn as a graph with x and y axes. However, when building a linear regression model to predict the dependent variable y, there may be more than one independent variable. In this case, we need a different equation.

What is Polynomial Regression

Polynomial regression is a linear regression model that uses multiple features (such as powers of an independent variable) instead of just one.

This can be expressed in the following way.

\[\begin{align} y &= \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} \\ y &= \sum_{i=1}^{n} \theta_{i}x_{i} + \theta_{0} \\ \end{align}\]

The point to note here is that not all available features have to be used; only as many as necessary.
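As a minimal sketch (the parameter values and features below are made up for illustration), this hypothesis is just a dot product between a parameter vector and a feature vector plus a bias term:

```python
import numpy as np

# Hypothetical parameters: theta[0] is the bias theta_0, theta[1:] are the feature weights.
theta = np.array([1.0, 0.5, -0.2])

def predict(x, theta):
    """Hypothesis y = theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    return theta[0] + np.dot(theta[1:], x)

x = np.array([2.0, 3.0])   # two features x_1, x_2
print(predict(x, theta))   # 1.0 + 0.5 * 2.0 - 0.2 * 3.0 = 1.4
```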

image

For example, if data is given as above, it is sufficient to use only one feature.

image

Given this data, if you use two features \((x, x^{2})\), you get a model like the red curve. This is a model that does not represent the data well.

image

If three features \((x, x^{2}, x^{3})\) are used, the hypothesis becomes a cubic function, producing a model like the one above.

image

However, if the features are chosen differently \((x, x^{2}, \sqrt{x})\), the model increases gradually as above and expresses the data well.

In the case of the above data, even if many features \(x_{1}, x_{2}, ..., x_{n}\) are given, the data can be expressed using only some feature \(x_{k}\), and how well the model expresses the data depends on how the exponent applied to that feature is chosen.
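A rough sketch of this idea, assuming some synthetic, gradually increasing data (not the data from the figures above): build different feature sets from a single raw variable x and fit each with ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2 * np.sqrt(x) + rng.normal(0, 0.1, size=x.shape)  # gradually increasing synthetic data

def fit(features, y):
    """Least-squares fit of y = theta_0 + sum_k theta_k * feature_k."""
    X = np.column_stack([np.ones_like(y)] + features)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

theta_quadratic = fit([x, x ** 2], y)              # features x, x^2
theta_with_root = fit([x, x ** 2, np.sqrt(x)], y)  # features x, x^2, sqrt(x)
print(theta_quadratic, theta_with_root)
```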

Learning Polynomial Regression

After setting up the model, we need to train it. A polynomial regression model is also trained using gradient descent.

Gradient Descent in Multiple Variables

First, if we write the gradient descent updates of the existing linear regression as equations, they are as follows.

\[\begin{align} \theta_{0} &:= \theta_{0} - \alpha * {\partial J(\theta) \over \partial\theta_{0}} \\ \theta_{1} &:= \theta_{1} - \alpha * {\partial J(\theta) \over \partial\theta_{1}} \end{align}\]

In the case of polynomial regression, the number of parameters increases, so each of them is updated in the same way.

\[\begin{align} \theta_{0} &:= \theta_{0} - \alpha * {\partial J(\theta) \over \partial\theta_{0}} \\ \theta_{1} &:= \theta_{1} - \alpha * {\partial J(\theta) \over \partial\theta_{1}} \\ \theta_{2} &:= \theta_{2} - \alpha * {\partial J(\theta) \over \partial\theta_{2}} \\ \theta_{3} &:= \theta_{3} - \alpha * {\partial J(\theta) \over \partial\theta_{3}} \\ &... \end{align}\]

If we generalize this, we get the following equation:

\[\begin{align} \theta_{k} &:= \theta_{k} - \alpha * {\partial J(\theta) \over \partial\theta_{k}} \\ (k &= 0 ... n) \end{align}\]
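A minimal sketch of this update rule, assuming a mean-squared-error cost \(J(\theta)\) and simultaneous updates of all parameters (the learning rate and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Update every parameter simultaneously: theta_k := theta_k - alpha * dJ/dtheta_k."""
    m, n = X.shape
    Xb = np.column_stack([np.ones(m), X])  # prepend a column of ones for theta_0
    theta = np.zeros(n + 1)
    for _ in range(n_iters):
        error = Xb @ theta - y             # prediction error for each example
        grad = Xb.T @ error / m            # gradient of J = (1/2m) * sum(error^2)
        theta = theta - alpha * grad
    return theta
```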

Convergence Speed

We have seen how to train a polynomial regression model. You could start training right away, but the faster the model converges, the less time training takes. Convergence can be accelerated by a few choices, described below: scaling the variables and setting the learning rate appropriately.

Scaling Variables

image

If the scale difference between variables is large, the loss function takes the elongated shape shown above. Updating the parameter of a large-scale variable changes the loss by a relatively large amount, while updating the parameter of a small-scale variable changes it by a relatively small amount. Rescaling the variables to a similar range makes the loss contours rounder, so gradient descent converges faster.
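One common way to equalize the scales is standardization, i.e. subtracting each feature's mean and dividing by its standard deviation; the snippet below is a generic sketch with made-up data, not code from the course:

```python
import numpy as np

def standardize(X):
    """Rescale each feature (column) to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

# Hypothetical data: the first feature is on a much larger scale than the second.
X = np.array([[2000.0, 3.0],
              [1500.0, 2.0],
              [3000.0, 4.0]])
X_scaled, mean, std = standardize(X)
print(X_scaled)
```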

Learning Rate

\[\begin{align} \theta_{k} &:= \theta_{k} - \alpha * {\partial J(\theta) \over \partial\theta_{k}} \\ (k &= 0 ... n) \end{align}\]

When updating the model's weights, the gradient is multiplied by the term \(\alpha\). This value is called the learning rate. If the learning rate is too large, the loss function behaves as shown below: its value does not decrease uniformly and may fail to converge.

image

If the learning rate value is small, it will take a long time for the value of the loss function to converge.

image

If the learning rate is set to an appropriate value, the loss function converges well, as shown above.

image

In other words, an appropriate learning rate is one that decreases the loss function on every update, but is not so small that convergence takes too long.
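To see this effect, one can run the same gradient descent with a few different learning rates and compare how the loss decreases; the data and the learning-rate values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + 0.5

def loss_history(alpha, n_iters=100):
    """Run gradient descent with a given learning rate and record the loss per iteration."""
    m = X.shape[0]
    Xb = np.column_stack([np.ones(m), X])
    theta = np.zeros(Xb.shape[1])
    losses = []
    for _ in range(n_iters):
        error = Xb @ theta - y
        losses.append((error ** 2).mean() / 2)
        theta -= alpha * (Xb.T @ error / m)
    return losses

for alpha in (2.5, 0.001, 0.1):  # too large, too small, moderate (illustrative values)
    print(alpha, loss_history(alpha)[-1])
```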

References

Stanford Machine Learning (Coursera)
