
### RELIABILITY

Ideally, the measurements that we take with a scale would always replicate perfectly. In the real world, however, a number of random external factors can affect the way that respondents answer a scale. A particular measurement taken with the scale is therefore composed of two parts: the theoretical "true score" of the scale and the variation caused by random factors. Reliability is a measure of how much of the variability in the observed scores actually represents variability in the underlying true score. Reliability ranges from 0 to 1, and in psychology scales with a reliability greater than .7 are generally preferred.

The reliability of a scale is heavily dependent on the number of items composing the scale. Even using items with poor internal consistency, you can get a reliable scale if your scale is long enough. For example, 10 items that have an average inter-item correlation of only .2 will produce a scale with a reliability of .714. However, the benefit of adding additional items decreases as the scale grows larger, and mostly disappears after 20 items. One consequence of this is that adding extra items to a scale will generally increase the scale’s reliability, even if the new items are not particularly good. An item will have to significantly lower the average inter-item correlation for it to have a negative impact on reliability.
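The relationship between scale length, average inter-item correlation, and reliability described above can be sketched with the standardized formula N·r / (1 + (N − 1)·r). The numbers below reproduce the .714 figure from the text and illustrate the diminishing returns of added items; this is a sketch, not SPSS output.

```python
# Reliability of a scale from its length and average inter-item correlation,
# using the standardized formula N*r / (1 + (N - 1)*r).

def scale_reliability(n_items, mean_r):
    """Reliability of an n_items-item scale with mean inter-item correlation mean_r."""
    return (n_items * mean_r) / (1 + (n_items - 1) * mean_r)

# 10 items with an average inter-item correlation of only .2:
print(round(scale_reliability(10, 0.2), 3))  # 0.714

# Diminishing returns as the scale grows:
for n in (5, 10, 20, 40):
    print(n, round(scale_reliability(n, 0.2), 3))
```

Doubling the scale from 10 to 20 items adds about .12 to reliability, while doubling again from 20 to 40 adds less than .08.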

Reliability has specific implications for the utility of your scale. The highest correlation that responses to your scale can have with any other variable is equal to the square root of the scale's reliability; the measurement error in your scale prevents anything higher. Therefore, the higher the reliability of your scale, the easier it is to obtain significant findings. This ceiling is probably the best thing to consider when deciding whether your scale's reliability is high enough.
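The square-root ceiling just described can be made concrete with a quick calculation (a sketch of the rule above, not part of any SPSS procedure): a scale with reliability .7 cannot correlate above roughly .84 with anything.

```python
import math

# Upper bound on the correlation between scale scores and any other variable,
# given the scale's reliability (the square-root rule described above).
def max_observed_correlation(reliability):
    return math.sqrt(reliability)

print(round(max_observed_correlation(0.7), 3))  # 0.837
print(round(max_observed_correlation(0.9), 3))  # 0.949
```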

It should also be noted that low reliability does not call into question results obtained using a scale. Low reliability only hurts your chances of finding significant results. It cannot cause you to obtain false significance. If anything, finding significant results with an unreliable scale indicates that you have discovered a particularly strong effect, since it was able to overcome the hindrances of your unreliable scale. In this way, using a scale with low reliability is analogous to conducting an experiment with a small number of participants.

#### Calculating reliability from parallel measurements
One way to calculate reliability is to correlate the scores on parallel measurements of the scale. Two measurements are defined as parallel if they are distinct (are based on different data) but equivalent (such that you expect responses to the two measurements to have the same true score). The two measurements must be performed on the same (or matched) respondents so that the correlation can be performed. There are a number of different ways to measure reliability using parallel measurements. Below are several examples.

Test-Retest method. In this method, you have respondents complete the scale at two different points in time. The reliability of the scale can then be estimated by the correlation between the two scores. The accuracy of this method rests on the assumption that the participants are fundamentally the same (i.e., possess the same true score on your scale) during your two test periods. One common problem is that completing the scale the first time can change the way that respondents complete the scale the second time. If they remember any of their specific responses from the first period, for example, it could artificially inflate the reliability estimate. When using this method, you should present evidence that this is not an issue.

Alternate Forms method. This method, also referred to as parallel forms, is basically the same as the Test-Retest method, but with the use of different versions of the scale during each session. The use of different versions reduces the likelihood that the first administration of the scale influences responses to the second. The reliability of the scale can then be estimated by the correlation between the two scores. When using alternate forms, you should show that the administration of the first scale did not affect responses to the second and that the two versions of your scale are essentially the same. The use of this method is generally preferred to the Test-Retest method.

Split-Halves method. One difficulty with both the Test-Retest and the Alternate Forms methods is that the scale responses must be collected at two different points in time. This requires more work and introduces the possibility that some natural event might change the actual true score between the two administrations of the scale. In the Split-Halves method you only have respondents fill out your scale one time. You then divide your scale items into two sections (such as the even-numbered items and the odd-numbered items) and calculate a score for each half. You then determine the correlation between these two scores. Unlike the other methods, this correlation does not estimate your scale’s reliability. Instead, you get your estimate using the formula:

ρ = 2r / (1 + r)

where ρ is the reliability estimate and r is the correlation that you obtain.

Note that if you split your scale in different ways, you will obtain different reliability estimates. Assuming that there are no confounding variables, all split-half estimates should be centered on the true reliability. In general it is best not to use a first half/second half split of the questionnaire, since respondents may become tired as they work through the scale. This would mean that you would expect greater variability in the score from the second half than in the score from the first half. In this case, your two measurements are not actually parallel, making your reliability estimate invalid. A more acceptable method is to divide your scale into sections of odd-numbered and even-numbered items.

#### Calculating reliability from internal consistency
The other way to calculate reliability is to use a measure of internal consistency. The most popular of these reliability estimates is Cronbach's alpha. Cronbach’s alpha can be obtained using the equation:

α = Nr / (1 + r(N – 1))

where α is Cronbach’s alpha, N is the number of items in the scale, and r is the mean inter-item correlation. From the equation we can see that α increases both with increasing r and with increasing N. Calculating Cronbach’s alpha is the most commonly used procedure to estimate reliability. It is highly accurate and has the advantage of requiring only a single administration of the scale. The only real disadvantage is that it is difficult to calculate by hand, as it requires you to calculate the correlation between every pair of items in your scale. This is rarely an issue, however, since SPSS will calculate it for you automatically.
To obtain the α of a set of items in SPSS:

1. Choose Analyze, then Scale, then Reliability Analysis.
2. Move all of the items in the scale to the Items box.
3. Click the Statistics button.
4. Check the box next to Scale if item deleted.
5. Click the Continue button.
6. Click the OK button.

Note: Before performing this analysis, make sure all items are coded in the same direction. That is, for every item, larger values should consistently indicate either more of the construct or less of the construct.
The output from this analysis will include a single section titled Reliability. The reliability of your scale will appear at the bottom of the output next to the word Alpha. The top of this section contains information about the consistency of each item with the scale as a whole. You use this to determine whether there are any "bad items" in your scale (i.e., ones that are not representing the construct you are trying to measure). The column labeled Corrected Item-Total Correlation tells you the correlation between each item and the average of the other items in your scale. The column labeled Alpha if Item Deleted tells you what the reliability of your scale would be if the given item were deleted. You will generally want to remove any items whose deletion would increase the reliability of the scale, and keep any items whose deletion would lower it. If any of your items have a negative item-total correlation, it may mean that you forgot to reverse code the item.

#### Inter-rater reliability
A final type of reliability that is commonly assessed in psychological research is inter-rater reliability. Inter-rater reliability is used when judges are asked to code some stimuli and the analyst wants to know how much those judges agree. If the judges are making continuous ratings, the analyst can simply calculate a correlation between the judges’ ratings. More commonly, judges are asked to make categorical decisions about stimuli. In this case, reliability is assessed via Cohen’s kappa.

To obtain Cohen’s kappa in SPSS, you first must set up your data file in the appropriate manner. The codes from each judge should be represented as separate variables in the data set. For example, suppose a researcher asked participants to list their thoughts about a persuasive message. Each judge was given a spreadsheet with one thought per row. The two judges were then asked to code each thought as: 1 = neutral response to the message, 2 = positive response to the message, 3 = negative response to the message, or 4 = irrelevant thought. Once both judges have rendered their codes, the analyst should create an SPSS data file with two columns, one for each judge’s codes.

To obtain Cohen’s kappa in SPSS:

1. Choose Analyze, then Descriptive Statistics, then Crosstabs.
2. Place Judge A’s responses in the Row(s) box.
3. Place Judge B’s responses in the Column(s) box.
4. Click the Statistics button.
5. Check the box next to Kappa.
6. Click the Continue button.
7. Click the OK button.

The output from this analysis will contain the following sections.
Case Processing Summary. Reports the number of observations on which you have ratings from both of your judges.
Crosstabulation. This table lists all the reported values from each judge and the number of times each combination of codes was rendered. For example, assuming that each judge used all the codes in the thought-listing example (e.g., code values 1 - 4), the output would contain a cross-tabulation table like this:
Judge A * Judge B Crosstabulation (counts):

| Judge A \ Judge B | 1.00 | 2.00 | 3.00 | 4.00 | Total |
|---|---|---|---|---|---|
| 1.00 | 5 | 1 |  |  | 6 |
| 2.00 |  | 5 | 1 |  | 6 |
| 3.00 |  | 1 | 7 |  | 8 |
| 4.00 |  |  |  | 7 | 7 |
| Total | 5 | 7 | 8 | 7 | 27 |

The counts on the diagonal represent agreements. That is, these counts represent the number of times both Judges A and B coded a thought with a 1, 2, 3, or 4. The more agreements, the better the inter-rater reliability. Values off the diagonal represent disagreements. In this example, we can see that there was one occasion when Judge A coded a thought in category 1 but Judge B coded that same thought in category 2.

Symmetric Measures. The value of kappa can be found in this section at the intersection of the Kappa row and the Value column. This section also reports a p-value for the kappa, but this is not typically used in reliability analysis.
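Kappa can also be computed by hand from such a table (a sketch; SPSS does this for you): κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement (the diagonal) and p_e is the agreement expected by chance from the row and column totals. Using the counts from the example:

```python
# Cohen's kappa sketch, computed from the example crosstabulation:
# kappa = (p_o - p_e) / (1 - p_e).

table = [  # Judge A (rows) x Judge B (columns), counts from the example
    [5, 1, 0, 0],
    [0, 5, 1, 0],
    [0, 1, 7, 0],
    [0, 0, 0, 7],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

p_o = sum(table[i][i] for i in range(len(table))) / n          # observed agreement
p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2  # chance agreement

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))  # 0.851
```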

Note that a kappa cannot be computed on a non-symmetric table. For instance, if Judge A used codes 1 - 4 but Judge B never used code 1 at all, the table would not be symmetric: there would be 4 rows for Judge A but only 3 columns for Judge B. If you have this situation, first determine which values were not used by both judges. Then change each instance of those codes to some other value that was not the value chosen by the opposite judge. Since the original code was a mismatch, changing it to a different mismatch preserves the original amount of agreement. This way you can remove the unbalanced code from your scheme while retaining the information from every observation. You can then use the kappa obtained from this revised data set as an accurate measure of the reliability of the original codes.