### LOGISTIC REGRESSION

The chi-square test allows us to determine whether pairs of categorical variables are related. But what if you want to test a model using two or more independent variables? Most of the inferential procedures we have discussed so far require that the dependent variable be continuous. The most common inferential statistics, such as t-tests, regression, and ANOVA, require that the residuals have a normal distribution and that the variance is equal across conditions. Both of these assumptions are likely to be seriously violated if the dependent variable is categorical. The answer is to use logistic regression, which does not make these assumptions and so can be used to determine the ability of a set of continuous or categorical independent variables to predict the value of a categorical dependent variable. However, standard logistic regression assumes that all of your observations are independent, so it cannot be directly used to test within-subject factors.

Logistic regression generates equations that tell you exactly how changes in your independent variables affect the probability that an observation is in a given level of your dependent variable. These equations are based on predicting the odds that a particular observation is in one of two groups. Let us say that you have two groups: a reference group and a comparison group. The odds that an observation is in the reference group are equal to the probability that the observation is in the reference group divided by the probability that it is in the comparison group. So, if there is a 75% chance that the observation is in the reference group, the odds of it being in the reference group would be .75/.25 = 3. We therefore talk about odds in the same way that people do when betting at a racetrack.
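
The arithmetic in the racetrack example can be checked with a short Python sketch. The function names `odds` and `probability` are illustrative and assume exactly two groups, so the comparison-group probability is one minus the reference-group probability.

```python
def odds(p_reference):
    """Odds of being in the reference group: P(reference) / P(comparison)."""
    if not 0.0 <= p_reference < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p_reference / (1.0 - p_reference)


def probability(odds_value):
    """Convert odds back into the probability of the reference group."""
    return odds_value / (1.0 + odds_value)


print(odds(0.75))        # 3.0, matching .75/.25 = 3
print(probability(3.0))  # 0.75
```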

In logistic regression, we build an equation that predicts the logarithm of the odds from the values of the independent variables (which is why it's called log-istic regression). For each independent variable in our model, we want to calculate a coefficient B that tells us how much the log odds would change if we increased the value of the variable by 1. These coefficients therefore parallel those found in a standard regression model. However, they are somewhat difficult to interpret because they relate the independent variables to the log odds. To make interpretation easier, people often transform the coefficients into odds ratios by raising the mathematical constant e to the power of the coefficient (e^B). The odds ratio directly tells you how the odds change when you change the value of the independent variable. Specifically, the odds of being in the reference group are multiplied by the odds ratio when the independent variable increases by 1.
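
As a rough illustration of how a coefficient relates to an odds ratio, the sketch below uses a made-up one-predictor equation. The intercept `A` and coefficient `B` are hypothetical values chosen for the example, not estimates from any data set.

```python
import math

# Hypothetical one-predictor model: log odds = A + B * x
A = -1.0  # intercept (made-up value)
B = 0.4   # coefficient for the independent variable (made-up value)


def log_odds(x):
    """Predicted log odds of being in the reference group."""
    return A + B * x


def odds(x):
    """Predicted odds, obtained by exponentiating the log odds."""
    return math.exp(log_odds(x))


def probability(x):
    """Predicted probability of being in the reference group."""
    o = odds(x)
    return o / (1.0 + o)


odds_ratio = math.exp(B)  # e raised to the power of the coefficient

for x in (0, 1, 2, 3):
    print(x, round(log_odds(x), 3), round(odds(x), 3), round(probability(x), 3))

# Increasing x by 1 multiplies the odds by the odds ratio, whatever x is.
print(odds(3) / odds(2), odds_ratio)
```

Note that the log odds change by a constant amount (B) and the odds by a constant factor (e^B), but the change in probability depends on where you start.
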

One obvious limitation of this procedure is that we can only compare two groups at a time. If we want to examine a dependent variable with three or more levels, we must create several different logistic regression equations. If your dependent variable has k levels, you will need a total of k - 1 logistic regression equations. What people typically do is designate a specific level of the dependent variable as the reference group, and then generate a set of equations that each compare one other level of the dependent variable to that group. You must then examine the behavior of your independent variables in each of your equations to determine what their influence is on your dependent variable.
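
To see how k - 1 equations jointly determine the probabilities of all k levels, here is a minimal sketch. It assumes each equation yields the log odds of the reference group relative to one other level, following the definition of odds used above; the level names and log-odds values are hypothetical. If your software reports the opposite direction (other level relative to reference), reverse the sign of each value first.

```python
import math


def category_probabilities(log_odds_ref_vs_level):
    """Turn k - 1 log odds, log(P(reference) / P(level)), into k probabilities."""
    # P(level) = P(reference) * exp(-log_odds), and all k probabilities sum to 1.
    relative = {level: math.exp(-eta) for level, eta in log_odds_ref_vs_level.items()}
    p_reference = 1.0 / (1.0 + sum(relative.values()))
    probs = {level: p_reference * r for level, r in relative.items()}
    probs["reference"] = p_reference
    return probs


# Hypothetical dependent variable with k = 3 levels, so 2 equations.
predicted = {"level_1": 0.7, "level_2": -0.2}
probs = category_probabilities(predicted)
for level, p in probs.items():
    print(level, round(p, 4))
```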

To test the overall success of your model, you can determine the probability that you can predict the category of the dependent variable from the values of your independent variables. The higher this probability is, the stronger the relationship is between the independent variables and your dependent variable. You can determine this probability iteratively using maximum likelihood estimation. If you multiply the logarithm of this probability by -2, you will obtain a statistic that has an approximate chi-square distribution, with degrees of freedom equal to the number of parameters in your model. This is referred to as -2LL (minus 2 log likelihood) and is commonly used to assess the fit of the model. Large values of -2LL indicate that the observed model has poor fit. This statistic can also be used to provide a statistical test of the relationship between each independent variable and your dependent variable. The importance of each term in the model can be assessed by examining the increase in -2LL when the term is dropped. This difference also has a chi-square distribution, and can be used as a statistical test of whether there is an independent relationship between each term and the dependent variable.
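
The following sketch shows one way these ideas fit together for the simplest case: a two-level outcome and a single continuous predictor. The data are invented for illustration, and the Newton-Raphson routine is a bare-bones stand-in for the iterative maximum likelihood estimation that statistical packages perform, not a description of how SPSS implements it. The p-value line relies on the fact that, for 1 degree of freedom, the chi-square tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

# Hypothetical data: predictor x and a two-level outcome y (1 = reference group).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]


def predict(b0, b1, x):
    """Probability of the reference group when log odds = b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def neg2ll(b0, b1):
    """-2 times the log likelihood of the observed outcomes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * total


def fit(iterations=25):
    """Newton-Raphson maximum likelihood estimates of intercept and slope."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [predict(b0, b1, x) for x in xs]
        # Gradient of the log likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Information matrix (2 x 2), inverted by hand.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit()

# Model with the predictor dropped: the intercept is the log odds of the overall proportion.
p_bar = sum(ys) / len(ys)
neg2ll_null = neg2ll(math.log(p_bar / (1.0 - p_bar)), 0.0)
neg2ll_full = neg2ll(b0, b1)

# Increase in -2LL when the term is dropped, tested on 1 degree of freedom.
lr_chi_square = neg2ll_null - neg2ll_full
p_value = math.erfc(math.sqrt(lr_chi_square / 2.0))

print("B =", round(b1, 4))
print("-2LL without x:", round(neg2ll_null, 4))
print("-2LL with x:", round(neg2ll_full, 4))
print("chi-square =", round(lr_chi_square, 4), "p =", round(p_value, 4))
```

With only eight invented observations the test has little power; the point is the mechanics, not the result.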

To perform a logistic regression in SPSS:

- Choose Analyze → Regression → Multinomial Logistic.
- Move the categorical DV to the Dependent box.
- Move your categorical IVs to the Factor(s) box.
- Move your continuous independent variables to the Covariate(s) box.
- By default, SPSS does not include any interaction terms in your model. You will need to click the Model button and manually build your model if you want to include any interactions.
- When you are finished, click the OK button to tell SPSS to perform the analysis.

If your dependent variable has only two groups, you have the option of selecting Analyze → Regression → Binary Logistic. Though this performs the same basic analysis, this procedure is primarily designed for model building. It organizes the output in a less straightforward way and does not provide you with the likelihood ratio test for each of your predictors. You are therefore better off using this selection only if you are specifically interested in the model-building procedures that it offers.

NOTE: The results from a binary logistic analysis in SPSS will actually produce coefficients that are opposite in sign when compared to the results of a multinomial logistic regression performed on exactly the same data. This is because the binary procedure chooses to predict the probability of choosing the category with the largest indicator variable, while the multinomial procedure chooses to predict the probability of choosing the category with the smallest indicator variable.
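
The sign reversal is a general property of log odds rather than anything peculiar to the software: swapping which category is being predicted negates the log odds and inverts the odds ratio. A minimal sketch, using a hypothetical probability and coefficient:

```python
import math

# Hypothetical probability of the category with the larger indicator value.
p_high = 0.7
p_low = 1.0 - p_high

log_odds_high = math.log(p_high / p_low)  # predicting the larger-valued category
log_odds_low = math.log(p_low / p_high)   # predicting the smaller-valued category
print(round(log_odds_high, 4), round(log_odds_low, 4))  # same size, opposite sign

# The same holds for a coefficient: negating B inverts the odds ratio.
B = 0.8  # hypothetical coefficient
odds_ratio_high = math.exp(B)
odds_ratio_low = math.exp(-B)
print(round(odds_ratio_high, 4), round(odds_ratio_low, 4))
```

So an odds ratio above 1 from one procedure corresponds to its reciprocal, below 1, from the other.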

The Multinomial Logistic procedure will produce output with the following sections.

- Case Processing Summary. Describes the levels of the dependent variable and any categorical independent variables.
- Model Fitting Information. Tells you the -2LL of both a null model containing only the intercept and the full model being tested. Recall that this statistic follows a chi-square distribution and that significant values indicate that there is a significant amount of variability in your DV that is not accounted for by your model.
- Pseudo R-Square. Provides a number of statistics that researchers have developed to represent the ability of a logistic regression model to account for variability in the dependent variable. Logistic regression does not have a true R-square statistic because the amount of variance is partly determined by the distribution of the dependent variable. The more evenly the observations are distributed among the levels of the dependent variable, the greater the variance in the observations. This means that the R-square values for models that have different distributions are not directly comparable. However, these statistics can be useful for comparing the fit of different models predicting the same response variable. The most commonly reported pseudo R-square estimate is Nagelkerke's R-square, which is provided by SPSS in this section.
- Likelihood Ratio Tests. Provides the likelihood ratio tests for the IVs. The first column of the table contains the -2LL (a measurement of model error having a chi-square distribution) of a model that does not include the factor listed in the row. The value in the first row (labeled Intercept) is actually the -2LL for the full model. The second column is the difference between the -2LL for the full model and the -2LL for the model that excludes the factor listed in the row. This is a measure of the amount of variability that is accounted for by the factor. This difference parallels the Type III SS in a regression model, and follows a chi-square distribution with degrees of freedom equal to the number of parameters it takes to code the factor. The final column provides the p-value for the test of the null hypothesis that the amount of error in the model that excludes the factor is the same as the amount of error in the full model. A significant statistic indicates that the factor does account for a significant amount of the variability in the dependent variable that is not captured by other variables in the model.
- Parameter Estimates. Provides the specific coefficients of the logistic regression equations. You will have a number of equations equal to the number of levels in your dependent variable minus 1. Each equation predicts the log odds of your observations being in the highest numbered level of your dependent variable compared to another level (which is listed in the leftmost column of the chart). Within each equation, you will see estimates of the standardized logistic regression coefficient for each variable in the model. These coefficients tell you the increase in the log odds when the variable increases by 1 (assuming everything else is held constant). The next column contains the standard errors of those coefficients. The Wald statistic provides another test of the significance of the individual coefficients, and is based on the relationship between the coefficient and its standard error. However, there is a flaw in this statistic such that large coefficients may have inappropriately large standard errors, so researchers typically prefer to use the likelihood ratio test to determine the importance of individual factors in the model. SPSS provides the odds ratio for the parameter under the column Exp(B). The last two columns in the table provide the lower and upper bounds for a 95% confidence interval around the odds ratio.
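
Several of the quantities in this output can be approximated by hand, which is a useful sanity check when reading the tables. The sketch below uses the textbook formulas: Nagelkerke's R-square as the Cox and Snell value rescaled by its maximum, the Wald statistic as the squared ratio of a coefficient to its standard error (1 degree of freedom), and a confidence interval for the odds ratio obtained by exponentiating B plus or minus 1.96 standard errors. All input numbers are hypothetical, and the assumption that the -2LL values come from a likelihood computed over n individual cases may not match every SPSS configuration.

```python
import math


def nagelkerke_r2(neg2ll_null, neg2ll_full, n):
    """Nagelkerke's pseudo R-square from the two -2LL values and the sample size."""
    cox_snell = 1.0 - math.exp(-(neg2ll_null - neg2ll_full) / n)
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)
    return cox_snell / max_cox_snell


def parameter_row(b, se, z=1.96):
    """Wald test, odds ratio, and 95% confidence interval for one coefficient."""
    wald = (b / se) ** 2
    return {
        "B": b,
        "SE": se,
        "Wald": wald,
        # Chi-square tail probability for 1 degree of freedom.
        "p": math.erfc(math.sqrt(wald / 2.0)),
        "Exp(B)": math.exp(b),
        "lower": math.exp(b - z * se),
        "upper": math.exp(b + z * se),
    }


# Hypothetical values, not taken from any real output.
r2 = nagelkerke_r2(neg2ll_null=11.09, neg2ll_full=9.0, n=8)
row = parameter_row(b=0.5, se=0.25)

print("Nagelkerke R-square:", round(r2, 4))
for name, value in row.items():
    print(name, round(value, 4))
```

If the confidence interval for Exp(B) excludes 1, the Wald test for that coefficient is significant at the .05 level, since both are built from the same B and standard error.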