# Logistic Regression

Logistic regression is a **special case of regression analysis** and is used when the
**dependent variable is nominally scaled**. This is the case, for example, with the
variable purchase decision with the two values *buys a product* and
*does not buy a product*.

Logistic regression analysis is thus the counterpart of linear regression, in which the dependent variable of the regression model must be at least interval-scaled.

With logistic regression, it is now possible to explain the dependent variable or estimate the probability of occurrence of the categories of the variable.

### Business example:

For an online retailer, you need to predict which product a particular customer is most likely to buy. For this, you receive a data set with past visitors and their purchases from the online retailer.

### Medical example:

You want to investigate whether a person is susceptible to a certain disease or not. For this purpose, you receive a data set with diseased and non-diseased persons as well as other medical parameters.

### Political example:

Would a person vote for *party A* if there were elections next weekend?

If you need to calculate a logistic regression, you can easily use the Regression Analysis calculator here on DATAtab.

## What is a logistic regression?

In the basic form of logistic regression,
**dichotomous variables (0 or 1)** can be predicted. For this purpose,
the probability of the occurrence of
**value 1 (= characteristic present)** is estimated.

In medicine, for example, a frequent application is to find out which variables have an
influence on a disease. In this case, *0* could stand for *not diseased* and
*1* for *diseased*. Subsequently, the influence of age, gender and smoking
status (smoker or not) on this particular disease could be examined.

## Logistic regression and probabilities

In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific value of the dependent variable (e.g., body weight).

In logistic regression, on the other hand, the dependent variable is dichotomous (*0*
or *1*) and the probability that the value *1* occurs is estimated. Returning
to the example above, this means: how likely is it that the disease is present if the
person under consideration has a certain age, sex and smoking status?

## Calculate logistic regression

To build a logistic regression model, the linear regression equation is used as the starting point.

However, if a linear regression were simply calculated for a dichotomous dependent variable, the fitted straight line would not be restricted to the range between 0 and 1:
**values between plus and minus infinity** can occur. The goal of logistic
regression, however, is to estimate the probability of occurrence and not the value of
the variable itself. Therefore, this equation must still be transformed.
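The effect can be reproduced with a small sketch. The data below are invented for illustration, and `fit_simple_linear` and `predict_linear` are hypothetical helper names: a least squares line is fitted to a 0/1 variable and then evaluated at a low and a high age.

```python
# Toy data (invented for illustration): age and a 0/1 disease indicator.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
diseased = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

def fit_simple_linear(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_simple_linear(ages, diseased)

def predict_linear(age):
    """Value of the fitted straight line at the given age."""
    return a + b * age

for age in (10, 40, 90):
    print(age, round(predict_linear(age), 3))
```

For ages far from the observed range the fitted line leaves the interval from 0 to 1, so its values cannot be read as probabilities.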

To do this, it is necessary to restrict the value range for the prediction to the range
between 0 and 1. To ensure that only values between 0 and 1 are possible, the
**logistic function f** is used.

### Logistic function

The logistic model is based on the logistic function. The special thing about the logistic function is that for values between minus and plus infinity, it always assumes only values between 0 and 1.
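As a minimal sketch, the logistic function can be written as f(z) = 1 / (1 + e^(-z)). The name `logistic` and the argument name `z` are choices made here for illustration.

```python
import math

def logistic(z):
    """Logistic function f(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Inputs far below and far above zero still map into the range between 0 and 1.
for z in (-10, -1, 0, 1, 10):
    print(z, logistic(z))
```

The two branches compute the same function; the split only prevents `math.exp` from overflowing for strongly negative inputs.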

So the logistic function is well suited to describing the
**probability P(y=1)**. If the logistic function is now applied to the regression equation above, the result
is:
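Written out, and assuming the regression equation has the form $b_1 x_1 + \dots + b_n x_n + a$ (coefficients $b_1, \dots, b_n$ and constant $a$; the symbols $x_1, \dots, x_n$ for the independent variables are a notation assumed here), this gives:

$$P(y=1) = \frac{1}{1 + e^{-(b_1 x_1 + \dots + b_n x_n + a)}}$$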

This now ensures that no matter in which range the *x* values are located, only
values between 0 and 1 will come out. Plotted against *x*, the result is now an S-shaped curve instead of a straight line.

The probability that for given values of the independent variable the dichotomous
dependent variable *y* is *0* or *1* is given by:
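Using $b_1, \dots, b_n$ for the coefficients, $a$ for the constant and $x_1, \dots, x_n$ for the given values of the independent variables (the $x_i$ notation is an assumption made here), the two probabilities can be written as:

$$P(y=1 \mid x_1, \dots, x_n) = \frac{1}{1 + e^{-(b_1 x_1 + \dots + b_n x_n + a)}}, \qquad P(y=0 \mid x_1, \dots, x_n) = 1 - P(y=1 \mid x_1, \dots, x_n)$$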

To calculate the probability of a person being sick or not using the logistic regression
for the example above, the model parameters $b_1$, $b_2$, $b_3$ and $a$ must first be determined. Once these have been determined, the equation for the example above is:
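With age, gender and smoking status as the three independent variables, and assuming they enter in that order (the order and the 0/1 coding of gender and smoking status are assumptions made here), the equation reads:

$$P(\text{diseased}) = \frac{1}{1 + e^{-(b_1 \cdot \text{age} + b_2 \cdot \text{gender} + b_3 \cdot \text{smoking status} + a)}}$$

A minimal Python sketch of this calculation follows. The parameter values are hypothetical placeholders, not estimates from any data set.

```python
import math

# Hypothetical parameter values, chosen only to make the sketch runnable.
B_AGE = 0.04      # b_1: coefficient for age in years
B_GENDER = 0.3    # b_2: coefficient for gender (coded 0 or 1)
B_SMOKER = 0.9    # b_3: coefficient for smoking status (0 = non-smoker, 1 = smoker)
A = -3.0          # a: constant

def probability_diseased(age, gender, smoker):
    """P(diseased) = 1 / (1 + exp(-(b_1*age + b_2*gender + b_3*smoker + a)))."""
    z = B_AGE * age + B_GENDER * gender + B_SMOKER * smoker + A
    return 1.0 / (1.0 + math.exp(-z))

print(probability_diseased(age=50, gender=1, smoker=1))
print(probability_diseased(age=50, gender=1, smoker=0))
```

With a positive coefficient for smoking status, the estimated probability for a smoker is higher than for a non-smoker of the same age and gender.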

## Maximum Likelihood Method

To determine the model parameters for the **logistic regression equation**, the
**Maximum Likelihood Method** is applied. The maximum likelihood method is one of
several methods used in statistics to estimate the parameters of a mathematical model.
Another well-known estimator is the least squares method, which is used in
linear regression.

### The Likelihood Function

To understand the **maximum likelihood method**, we introduce the
**likelihood function** *L*. *L* is a function of the unknown parameters in
the model; in the case of logistic regression these are $b_1, \dots, b_n, a$. Therefore we can also write $L(b_1, \dots, b_n, a)$, or *L(θ)* if the parameters are summarized in *θ*.

*L(θ)* now indicates how probable it is that the observed data occur. With
the change of *θ*, the probability that the data will occur as observed
changes.
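To make this concrete, the following sketch evaluates the likelihood for a tiny invented data set with one independent variable. It assumes the usual form of the likelihood for independent cases, the product of the probabilities of the observed outcomes; the data and the three parameter pairs are hypothetical.

```python
import math

# Toy data (invented): one independent variable x and a 0/1 outcome y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

def likelihood(b1, a):
    """L(b1, a): product over all cases of the probability of the observed outcome."""
    value = 1.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b1 * x + a)))  # P(y = 1) for this case
        value *= p if y == 1 else 1.0 - p
    return value

# Different parameter values make the observed data more or less probable.
for b1, a in [(0.0, 0.0), (1.0, -3.5), (-1.0, 3.5)]:
    print(b1, a, likelihood(b1, a))
```

Parameters under which larger *x* values make *y = 1* more probable fit these data better than a flat model, and a model with the opposite sign fits worst.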

## Maximum Likelihood Estimator

The **Maximum Likelihood Estimator** can be applied to the estimation of complex
nonlinear as well as linear models. In the case of logistic regression, the goal is to
estimate the parameters $b_1, \dots, b_n, a$ that maximize the so-called
**log likelihood function** *LL(θ)*. The log likelihood function is simply the logarithm of *L(θ)*.

Various algorithms have been established over the years for this nonlinear
optimization, for example **stochastic gradient descent**.
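The sketch below illustrates the idea on invented data. To stay short it uses plain full-batch gradient ascent on the log likelihood rather than the stochastic variant; the step count, the learning rate and all function names are assumptions chosen for this toy example.

```python
import math

# Toy data (invented): one independent variable x and a 0/1 outcome y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 1, 0, 1, 1]

def predict(b1, a, x):
    """P(y = 1) for a single case."""
    return 1.0 / (1.0 + math.exp(-(b1 * x + a)))

def log_likelihood(b1, a):
    """LL(b1, a) = sum of y*log(p) + (1 - y)*log(1 - p) over all cases."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(b1, a, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit(steps=10000, learning_rate=0.04):
    """Full-batch gradient ascent on LL, starting from b1 = a = 0."""
    b1, a = 0.0, 0.0
    for _ in range(steps):
        # Gradient of LL: sum of (y - p) * x for b1 and sum of (y - p) for a.
        grad_b1 = sum((y - predict(b1, a, x)) * x for x, y in zip(xs, ys))
        grad_a = sum(y - predict(b1, a, x) for x, y in zip(xs, ys))
        b1 += learning_rate * grad_b1
        a += learning_rate * grad_a
    return b1, a

b1_hat, a_hat = fit()
print(b1_hat, a_hat, log_likelihood(b1_hat, a_hat))
```

At the maximum both gradient components are zero, which is a simple way to check that the iteration has converged.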

## Multinomial logistic regression

As long as the dependent variable has two characteristics (e.g. *male*,
*female*), i.e. is dichotomous, **binary logistic regression** is used. However,
if the dependent variable has more than two instances, e.g. which mobility concept
describes a person's journey to work (*car*, *public transport*,
*bicycle*), **multinomial logistic regression** must be used.

Each expression of the mobility variable (*car*, *public transport*,
*bicycle*) is transformed into a new variable. The one variable mobility concept
becomes the three new variables:

- *car is used*
- *public transport is used*
- *bicycle is used*

Each of these new variables then has only the two expressions *yes* or *no*;
for example, the variable *car is used* has only the two answer options *yes* or
*no* (either it is used or not). Thus, for the one variable "mobility concept" with
three values, there are three new variables with two values each: *yes* and
*no* (*1* and *0*). Three logistic regression models are now created for
these three variables.
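The recoding step can be sketched as follows. The observations are invented for illustration, and only the transformation into 0/1 variables is shown; each resulting column would then serve as the dependent variable of its own binary logistic regression.

```python
# Toy observations of the "mobility concept" variable (invented for illustration).
mobility = ["car", "bicycle", "public transport", "car", "bicycle"]

categories = ["car", "public transport", "bicycle"]

# One new 0/1 variable per category: 1 = yes (category applies), 0 = no.
indicators = {
    f"{category} is used": [1 if value == category else 0 for value in mobility]
    for category in categories
}

for name, column in indicators.items():
    print(name, column)
```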

## Interpretation of the results

The relationship between dependent and independent variables in logistic regression is
not linear, hence the regression coefficients cannot be interpreted in the same way as in
linear regression. For this reason,
**odds** are interpreted in **logistic regression**.

### Linear regression:

An independent variable is considered "good" if it correlates strongly with the dependent variable.

### Logistic regression:

An independent variable is said to be "good" if it allows the groups of the dependent variable to be distinguished significantly from each other.

The odds are calculated by relating the two probabilities that y is "1" and that y is "not 1".
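Writing $p$ for the probability that y is "1" (a symbol introduced here for brevity), the odds are:

$$\text{odds} = \frac{p}{1 - p}$$

For example, a probability of 0.8 corresponds to odds of 0.8 / 0.2 = 4.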

This quotient can take any positive value. If this value is now logarithmized, values between minus and plus infinity are possible.

These logarithmic odds are usually referred to as "logits".
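A short sketch of both quantities, with `odds` and `logit` as illustrative function names and `p` standing for the probability that y is "1":

```python
import math

def odds(p):
    """Odds of y = 1: probability of "1" divided by probability of "not 1"."""
    return p / (1.0 - p)

def logit(p):
    """Logarithmized odds."""
    return math.log(odds(p))

for p in (0.1, 0.5, 0.8):
    print(p, odds(p), logit(p))
```

Probabilities below 0.5 give odds below 1 and a negative logit, a probability of exactly 0.5 gives odds of 1 and a logit of 0, and probabilities above 0.5 give odds above 1 and a positive logit.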

## Pseudo-R squared

In a linear regression, the coefficient of determination
$R^2$ indicates the proportion of the explained variance. In logistic
regression, the dependent variable is scaled nominally or ordinally and it is not
possible to calculate a variance, so the coefficient of determination cannot be
calculated in logistic regression.

However, in order to make a statement about the quality of the
**logistic regression model**, so-called pseudo coefficients of determination have
been established, also called pseudo-R squared.
**Pseudo coefficients of determination** are constructed in such a way that they lie
between 0 and 1 just like the original coefficient of determination. The best-known
pseudo coefficients of determination are the **Cox and Snell R-square** and the
**Nagelkerke R-square**.

### Null Model

For the calculation of the Cox and Snell R-square and the Nagelkerke R-square, the
likelihood of the so-called null model $L_0$ and the likelihood $L_1$ of the calculated model (full model) are needed. The null model is a model in which no independent variables are included; $L_1$ is the likelihood of the model with the independent variables.

### Cox and Snell R-square

In the **Cox and Snell R-square**, the likelihood of the null
model $L_0$ is set in relation to the likelihood $L_1$ of the full model. The better the model being fitted (full model) is compared to the null model, the lower the ratio between $L_0$ and $L_1$. The Cox and Snell R-square is obtained with:

$$R^2_{CS} = 1 - \left(\frac{L_0}{L_1}\right)^{2/N}$$

where $N$ is the number of cases.

### Nagelkerke's R-square

The Cox and Snell pseudo coefficient of determination cannot reach 1 even for a model with a
perfect prediction. This is corrected by the
**Nagelkerke R-square**, which
becomes 1 if the model being fitted gives a perfect prediction with a probability of 1.

### McFadden's R-square

McFadden's R-square also uses the null model and the model being fitted to calculate
$R^2$.
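The three measures can be sketched in a few lines, using their commonly cited definitions and working with log likelihoods instead of likelihoods to avoid numerical underflow. The function name, the argument names and the example values are hypothetical.

```python
import math

def pseudo_r_squared(ll_null, ll_full, n_cases):
    """Cox and Snell, Nagelkerke and McFadden R-square from two log likelihoods.

    ll_null: log likelihood LL_0 of the null model
    ll_full: log likelihood LL_1 of the fitted (full) model
    n_cases: number of cases used to fit the models
    """
    # (L_0 / L_1)^(2/n), written with log likelihoods to avoid underflow.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_full) / n_cases)
    # Largest value Cox and Snell can reach: 1 - L_0^(2/n).
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n_cases)
    nagelkerke = cox_snell / max_cox_snell
    mcfadden = 1.0 - ll_full / ll_null
    return {"cox_snell": cox_snell, "nagelkerke": nagelkerke, "mcfadden": mcfadden}

# Hypothetical log likelihoods for a data set with 100 cases.
result = pseudo_r_squared(ll_null=-68.0, ll_full=-55.0, n_cases=100)
print(result)
```

Dividing by the largest attainable Cox and Snell value is what lets the Nagelkerke measure reach 1 for a perfectly predicting model.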

## Chi-Square Test and Logistic Regression

In the case of logistic regression, the Chi-square test tells whether the model is overall significant or not.

Here two models are compared. In one model all independent variables are used and in the other model the independent variables are not used.

Now the Chi-square test compares how good the prediction is when the independent variables are used and how good it is when the independent variables are not used.

The Chi-square test now tells us if there is a significant difference between these two results. The null hypothesis is that both models are the same. If the p-value is less than 0.05, this null hypothesis is rejected.
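A minimal sketch of such a comparison, assuming it is carried out as a likelihood ratio test: the test statistic is twice the difference of the two log likelihoods, and the degrees of freedom equal the number of independent variables in the full model. The helper `chi2_sf` is written out here with the standard library only, and the log likelihood values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square distribution with df (positive integer) degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        k, tail = 1, math.erfc(math.sqrt(half))  # closed form for df = 1
    else:
        k, tail = 2, math.exp(-half)             # closed form for df = 2
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with a = k / 2.
    while k < df:
        tail += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

def likelihood_ratio_test(ll_null, ll_full, n_independent):
    """Chi-square statistic 2 * (LL_1 - LL_0) and its p-value."""
    statistic = 2.0 * (ll_full - ll_null)
    return statistic, chi2_sf(statistic, n_independent)

# Hypothetical log likelihoods for a model with 3 independent variables.
statistic, p_value = likelihood_ratio_test(ll_null=-68.0, ll_full=-55.0, n_independent=3)
print(statistic, p_value)
```

If the p-value printed here is below 0.05, the null hypothesis that both models are the same would be rejected.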

## Example logistic regression

As an **example** for the **logistic regression**, the purchasing behaviour in an
online shop is examined. The aim is to determine the influencing factors that lead a
person to buy *immediately*, *at a later time* or *not at all* from the
online shop after visiting the website. The online shop provides the data collected for
this purpose. The dependent variable therefore has the following three characteristics:

- Buy now
- Buy later
- Don't buy

Gender, age, income and time spent in the online shop are available as independent variables.

| Purchasing behaviour | Gender | Age | Time spent in online shop |
|---|---|---|---|
| Buy now | female | 22 | 40 |
| Buy now | female | 25 | 78 |
| Buy now | male | 18 | 65 |
| ... | ... | ... | ... |
| Buy later | female | 27 | 28 |
| Buy later | female | 27 | 15 |
| Buy later | male | 48 | 110 |
| ... | ... | ... | ... |
| Don't buy | female | 33 | 65 |
| Don't buy | female | 43 | 34 |

## Logistic regression result display

Logistic regressions, like linear regression models, can be calculated easily and
quickly with DATAtab. If you want to recalculate the example above, simply copy
the table on purchasing behaviour in the online shop into DATAtab's
statistics calculator. Then select the *Regression* tab and click on the desired
variables. The results are then displayed directly in table form.
