ISSN: 1738-1460


| March 2008 home | PDF Full Journal | SWF Full Journal |

Volume 10. Issue 1

Article 8.


Article Title
Another Look at the C-Test:
A Validation Study with Iranian EFL Learners

Author
Mahmood Rouhani
Mashad, Iran

Biography:
Mahmood Rouhani holds a master's degree in ELT from the University of Isfahan, Iran. He has been teaching for seven years in high schools and private English schools in Iran. He currently teaches English in high schools in Mashad, Iran.

Abstract
This study probes into the validity and discrimination power of the C-Test for the assessment of overall language proficiency. A total of 144 university students participated in this study. A Michigan Test of English Language Proficiency (MTELP) and a C-Test developed by the researcher were administered. The results indicated that the C-Test enjoyed high reliability and acceptable content relevance. The C-Test also proved to have fairly high criterion-related validity. The extracts used in the C-Test turned out to measure, to a large extent, the same underlying trait as the MTELP, which is significant evidence of construct validity for the C-Test. However, the C-Test texts did not behave consistently with examinees of different proficiency levels. Nor could the C-Test consistently classify the subjects into their appropriate proficiency levels. This finding was further affirmed by an ANOVA whose results demonstrated that the C-Test had difficulty discriminating between participants of lower and upper intermediate levels.

Key words: language proficiency, reliability, discriminatory / discrimination power, construct validity, criterion-related validity, content validity, cloze test, C-Test.

1. Introduction
The cloze test is now a well-known and widely used integrative language test. Wilson Taylor (1953) first introduced the cloze procedure as a device for estimating the readability of a text. However, what brought the cloze procedure widespread popularity was research on the cloze test as a measure of ESL proficiency (Jonz, 1976, 1990; Hinofotis, 1980; Bachman, 1982, 1985; Brown, 1983, 1993; Laesch & van Kleek 1987; Chapelle & Abraham 1990; see also Oller, 1979 for an overview). The results of the substantial volume of research on the cloze test have been extremely varied. Furthermore, major technical defects have been found with the procedure. Alderson (1979, 1980, 1983), for instance, showed that changes in the starting point or deletion rate affect reliability and validity coefficients. Other researchers like Carroll (1980), Klein-Braley and Raatz (1984), Klein-Braley (1983, 1985), Farhady (1983b), and Brown (1993) have questioned the reliability and different aspects of validity of cloze tests. In view of all the criticisms made against the cloze procedure, Klein-Braley and Raatz proposed the C-Test as a modified form of the cloze test.

The C-Test consists of four or five short texts in each of which the first sentence is left intact, then the C-principle (or the rule of two) is applied: the second half of every second word is deleted, beginning with the second word of the second sentence. If a word has an odd number of letters, the ‘larger’ half is omitted. Numbers, proper names, abbreviations, and one-letter words such as ‘I’ are ignored in the counting. In the canonical C-Test each text will have either 20 or 25 blanks. The students’ task is to restore the missing parts. Only entirely correct restorations are counted as correct (i.e., spelling problems are considered errors). The testees have roughly five minutes to answer each text, so that a test with five parts takes twenty-five minutes to complete.

The C-Test is believed to have a number of advantages over the cloze test (Klein-Braley & Raatz, 1984; Klein-Braley, 1997). Some of the most important advantages of the C-Test are as follows:

1. The use of a variety of passages allows for a better sampling and representation of the language and content. Also, a person with special knowledge in a certain field cannot have an unfair advantage all through the test.
2. Since every second word is damaged, it is possible to obtain a better sampling of all the different language elements in a text.
3. C-Tests are very easy for native speakers. But someone who doesn’t know the language at all normally scores zero or close to zero.
4. C-Tests are easy to construct, administer, and score.
5. As there is only one acceptable solution in most cases, the scoring is more objective.

Ever since it was introduced, the C-Test has been the subject of many research studies and scholarly controversies. On the one hand, some researchers have found the C-Test a highly integrative, reliable and valid measure of overall language ability (Klein-Braley & Raatz, 1984; Cohen, Segal & Weiss, 1984; Klein-Braley, 1985, 1997; Chapelle & Abraham, 1990; Dörnyei & Katona, 1992; Huhta, 1996; Connelly, 1997; Ikeguchi, 1998; Babaii & Ansary, 2001; Eckes & Grotjahn, 2006; see Sigott, 2004 for an extensive review). More specifically, Klein-Braley (1997) empirically compared the C-Test with a group of other reduced redundancy tests: classical cloze test, cloze elide test, multiple-choice cloze test, and standard dictation. She found that the best test to represent general language proficiency was the C-Test. Also, Eckes and Grotjahn (2006), using a Rasch model and confirmatory factor analysis, found clear evidence that the C-Test is a highly reliable and unidimensional measure of general language proficiency. They found that “lexis and grammar are important components of general language proficiency as measured by C-tests” (p. 316).

However, research findings have not always been very consistent. Cohen et al. (1984), for instance, reported acceptable reliability and validity indices for their Hebrew C-Test, but they could not find any clear pattern for macro-level processing in the C-Test, though they found indications of micro-level processing (i.e., language processing at or below sentence level).

Dörnyei and Katona (1992) validated a C-Test against four different language tests including an oral interview, and a TOEIC (the Test of English for International Communication). Their results confirmed that the C-Test is a reliable and valid instrument. They also reported that their C-Test was a random and representative sample of the original text. Nevertheless, they noted that the C-Test was less efficient in testing grammar.

Analysing retrospective verbal protocols of 32 C-Test takers, Babaii and Ansary (2001) found four categories of cues used by the participants: automatic processing (16.6%), lexical adjacency (54.9%), sentential cues (22.4%), and top-down cues (6.1%). They reported that the test takers did exploit macro-level cues to restore the mutilations, though less frequently (28.5%) than micro-level cues (54.9%), and concluded that the C-Test taps various aspects of language proficiency to varying degrees and, as such, is a valid operationalization of the reduced redundancy principle. Nevertheless, they maintained that their subjects mostly relied on their grammatical judgments to restore the mutilations, a finding which contradicts that of Dörnyei and Katona (1992).

On the other hand, researchers like McBeath (1989, 1990), Hughes (1989), Weir (1990), and Jafarpur (1995, 1999a, 1999b, 2002) have doubted some of the claims made on behalf of the C-Test. In more specific terms, Hughes (2003, p. 195) referred to the puzzle-like nature of the C-Test as a disadvantage: “It is harder to read than a cloze passage, and correct responses can often be found in the surrounding text. Thus the candidate who adopts the right puzzle-solving strategy may be at an advantage over a candidate of similar foreign language ability.” Along the same lines, Weir (1990) believes that the face validity of the procedure is low as it is irritating for students to have to process heavily mutilated texts.

Jafarpur’s (1995) findings showed that C-testing suffers from the same shortcomings as the cloze procedure. He found that “the rule of two” is not a proper tool to obtain a representative sample of the basic elements of a text. He was able to show that different deletion starts and deletion ratios produce different tests with different results, which he interpreted as suggestive of the invalidity of the procedure (but see Hastings, 2002 for counter arguments). More interestingly, his analysis of his subjects’ answers to 10 attitudinal questions after taking the C-Test led him to the conclusion that “C-Tests do not possess face validity” (p. 209). The subjects on the whole believed that the C-Test is more of an IQ test or a test of spelling than a test of overall language ability. They believed it is more like a puzzle and is basically good for children.

In much the same vein, Jafarpur (1999b) substantiated that there is nothing magical about the rule of two. He was able to empirically demonstrate that other deletion rates and deletion starts yield more or less similar results.

In another study, Jafarpur (1999a) pretested a C-Test comprising 5 texts and 126 items with 146 subjects. On the basis of a classical item analysis, he discarded unsatisfactory items and developed a ‘tailored’ C-Test version with 100 items and tried it with 60 other subjects. The results indicated that classical item analysis does not improve the psychometric and statistical characteristics of the C-Test.

Furthermore, Jafarpur (2002) compared the performance of a C-Test and a cloze test against a standardized criterion measure. The results showed that the C-Test enjoys a high reliability and concurrent validity and the deletions in a C-Test represent a more comprehensive coverage of different language elements than the cloze test. Yet he concluded that (a) the C-Test is not an easily constructed, automatically reliable and valid measure of language competence, (b) the application of the ‘rule of two’ does not guarantee acceptable discrimination power for all items, and (c) scoring does not offer any advantage over the cloze.

In the light of the variability and inconsistency of the results obtained with the C-Test, it seemed to the researcher that replicative investigations of the qualities of this testing device are in order before definitive decisions can be made as to its credibility as a measure of overall language ability. Therefore, the current study set out to empirically explore aspects of validity and discriminatory power of the C-Test among Iranian EFL learners.

2. Method      
2.1. Instrumentation
a. The C-Test: To construct the C-Test, thirteen texts were chosen from various EFL/ESL materials. The excerpts were authentic and self-contained and they varied in subject matter. The texts were of different levels of difficulty as judged by the Flesch Reading Ease readability scale (Microsoft Word, 1983–99) and a group of eight university EFL instructors. Every first sentence of each passage was left intact to provide a complete context. Beginning from word two of sentence two, the second half of every other word was deleted. In each mutilation, exactly half of the word was omitted, but if the number of letters was uneven, one extra letter was left out. Numbers, proper names and one-letter words were ignored in the counting and thus were not mutilated either. In this way, thirteen mutilated texts were produced with each one containing 20 gaps.
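
The mutilation procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the procedure actually used to build the test: the function names (mutilate, make_c_text) are hypothetical, the first sentence is detected with a naive punctuation rule, proper names must be supplied by the caller, and words containing apostrophes or hyphens are split at those characters.

```python
import re

def mutilate(word):
    """Delete the second half of a word; with an odd letter count the larger half goes."""
    keep = len(word) // 2
    return word[:keep] + "_" * (len(word) - keep)

def make_c_text(text, max_gaps=20, proper_names=frozenset()):
    """Return (mutilated_text, answer_key) for one passage."""
    # Naive split: the first sentence ends at the first '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    if len(parts) < 2:
        return text, []  # a single sentence is left intact
    first, rest = parts
    # Alternate runs of letters and non-letters, so digits and punctuation pass through unchanged.
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", rest)
    out, answers, counted = [], [], 0
    for tok in tokens:
        # Numbers, one-letter words and listed proper names are ignored in the counting.
        countable = tok.isalpha() and len(tok) > 1 and tok not in proper_names
        if countable:
            counted += 1
            # Every second countable word is mutilated, starting with the second one.
            if counted % 2 == 0 and len(answers) < max_gaps:
                answers.append(tok)
                tok = mutilate(tok)
        out.append(tok)
    return first + " " + "".join(out), answers

demo, key = make_c_text("Bees are useful. They make sweet honey for people.")
print(demo)
print(key)
```

In the demonstration sentence, "make", "honey" and "people" are the second, fourth and sixth countable words of the second sentence, so they are the ones truncated; the answer key keeps the full words for exact-word scoring.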

To facilitate pretesting, the extracts were randomly divided into two C-Tests which were, then, randomly given to 49 Iranian foreign language learners of English, 6 Iranian EFL teachers, and 3 native speakers. The completed test papers were scored giving one point for each exact restoration. The scores were item analyzed and five texts with superior discriminability and facility values were chosen. These texts were about culture, education, listening, bees, and underwater discoveries. They varied in difficulty with Flesch Reading Ease values of 62, 40, 75, 82, and 64, respectively. Dörnyei and Katona (1992) recommend the use of extracts with various difficulties in order to obtain equal measurement accuracy in both tails of a sample distribution.

The C-Test thus prepared comprised 100 gaps, fulfilling the recommended minimum number of mutilations (Klein-Braley, 1997; Raatz & Klein-Braley, 1995). The instructions were given in Persian along with a short English C-Test example and its restored answer. The final version of the C-Test can be found in Appendix I.

b. The criterion measure: Form Q of the Michigan Test of English Language Proficiency (MTELP) (Corrigan, Dobson, Kellman, Spaan, & Tyma, 1979) was used as the criterion for determining concurrent validity coefficients. This test is a retired component of the Michigan English Language Assessment Battery (MELAB), which is a discrete point language proficiency measure. The MTELP takes 75 minutes to administer and comprises three subtests: ‘Grammar’, ‘Vocabulary’, and ‘Reading comprehension’. The subtests contain 40, 40, and 20 four-choice items, respectively. The total score is the sum of the subtest scores. The manual reports reliability estimates of over .90 for the test and its subtests.

2.2. Participants
A total of 144 university students participated in this study. Of these, 101 subjects took both the C-Test and the MTELP. They included (a) 14 freshman, 22 sophomore, 23 junior, and 31 senior English majors studying at Khurasgan Azad University and the University of Isfahan, and (b) 11 engineering majors enrolled in an ESP course at Isfahan University of Technology. The other 43 subjects were all MA students of TEFL: 23 students at Najafabad Azad University, 14 students at Khurasgan Azad University, and 6 students at the University of Isfahan. These examinees took the C-Test only. The participants (mostly in their twenties) were of both sexes and represented different levels of proficiency.

2.3. Test administration and scoring

The participants had not been informed of either test beforehand, so there was no preparation of any kind for the exams. The MTELP was administered first, within the time limit of 75 minutes. The subjects were told that they would be informed of their grades, that high scores on the test would affect their final term grades, and that high-ranking students would receive a prize. They were all informed that marks would be deducted for wrong answers. The answer sheets were scored by the researcher. The MTELP scores were corrected for guessing in order to reduce the effect of chance (cf. Harris, 1969; Jafarpur, 1997). To limit any practice effect, the subjects were not told that they were going to be tested again.
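
The article does not spell out the correction formula. A minimal sketch, assuming the standard correction for guessing (right answers minus wrong answers divided by the number of options less one, with omitted items ignored), is given below; the function name is hypothetical. With four-choice items this yields scores in thirds of a point, which fits the fractional and negative MTELP values reported in Table 1.

```python
def corrected_score(right, wrong, n_choices=4):
    """Standard correction for guessing: R - W / (k - 1). Omitted items are not penalized."""
    return right - wrong / (n_choices - 1)

# An examinee with 10 right and 30 wrong four-choice answers is at chance level.
print(corrected_score(10, 30))
# Ten wrong answers and nothing right gives a negative score.
print(round(corrected_score(0, 10), 2))
```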

The C-Test was then administered to the same subjects. Because the subjects studied at different universities and some practical limitations had to be accommodated, the C-Test was given between 10 and 14 days after the MTELP. It was assumed that the examinees’ level of language proficiency had not changed significantly over the period. The completed C-Test papers were scored using the more convenient exact word scoring, with spelling mistakes counted as incorrect. Alternative scoring procedures (acceptable word scoring, and tolerating spelling mistakes) have been shown to produce practically the same results as the one adopted in this study (Dörnyei & Katona, 1992; Huhta, 1996).

3. Results and discussions
The scores of the participants on all the tests and subtests were processed using the Statistical Package for the Social Sciences, Release 9.0.0 (SPSS, 1989-99). Table 1 shows descriptive statistics obtained from the C-Test, the MTELP, and their respective subparts, along with item facility (IF) and item discrimination (ID) indices of each C-Test text (C-Text, hereafter). In computing item facility and item discrimination indices each C-Text was considered a ‘super-item’ (see below). A sample separation procedure was adopted for computing item discrimination indices (Henning, 1987; Farhady, Jafarpur, & Birjandi, 1994).

Table 1 - Descriptive statistics for the scores of the subjects on all measures

Test                 No. of Items   Mean    SD      Min.    Max.    IF    ID
C-Test (N = 144)     100            54.69   14.70   16      93
  C-Text 1           20             14.25   3.09    5       20      .70   .29
  C-Text 2           20             11.79   3.49    3       20      .58   .30
  C-Text 3           20             12.65   3.52    3       20      .62   .32
  C-Text 4           20             9.45    4.36    0       20      .49   .47
  C-Text 5           20             6.60    4.35    0       18      .35   .42
Michigan (N = 101)   100            28.08   14.96   -3.33   70.33
  Grammar            40             15.94   8.84    -4.33   36
  Vocabulary         40             7.39    5.87    -.33    33.33
  Reading            20             4.75    4.18    -2.33   18.66

The item discrimination values are in the range of .29 to .47, with a mean value of .36 for the whole test. Jafarpur (1997, 2002) believes that item discrimination indices higher than .20 are acceptable. On this basis, the texts in our C-Test demonstrate fairly low, yet acceptable, item discriminability indices. C-Text 4 combines the most attractive item facility (.49) with the highest item discrimination (.47).

The item facility indices for the five texts of the C-Test are in the acceptable range of .35 to .70 (cf. Raatz & Klein-Braley, 1995). The mean item facility for the whole C-Test is thus .55, which is very desirable (Henning, 1987). As far as item facility is concerned, except for C-Text 3, the other texts are arranged in an ascending order of difficulty.

As another index of relative difficulty, mean scores of the participants on each extract show the same pattern. They drop from 14.25 on C-Text 1, to 11.79 on C-Text 2, and after a slight increase to 12.65 on C-Text 3 continue a descending route to 9.45 on C-Text 4 and then 6.60 on C-Text 5.
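
To make the two indices concrete, the sketch below treats one C-Text as a super-item scored out of 20. It is an illustration only: the function names are hypothetical, and the sample separation shown here (comparing the top and bottom 27% of examinees ranked by total score) is an assumed variant, since the article does not state which group fractions were used.

```python
def item_facility(scores, max_points):
    """Proportion of the available points actually earned on the super-item."""
    return sum(scores) / (len(scores) * max_points)

def item_discrimination(text_scores, total_scores, max_points, fraction=0.27):
    """Sample separation: (mean of high group - mean of low group) / max points.

    Examinees are ranked by total test score; `fraction` of them form each group.
    The 27% default is an assumption, not a value taken from the article.
    """
    n = len(total_scores)
    size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:size], order[-size:]
    mean_low = sum(text_scores[i] for i in low) / size
    mean_high = sum(text_scores[i] for i in high) / size
    return (mean_high - mean_low) / max_points

# Hypothetical scores of ten examinees on one 20-gap text and on the whole test.
text = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
total = [5 * s for s in text]
print(item_facility(text, 20))
print(item_discrimination(text, total, 20, fraction=0.3))
```

On this reading, item facility is simply the mean text score divided by 20, which is why the mean scores and the IF values follow the same pattern across the five texts.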

3.1. Reliability
In order to allow better comparison, reliability coefficients for all the tests and subtests were estimated by the Kuder-Richardson Formula 21 (KR-21). The reliability estimate for the C-Test was also computed by the Cronbach’s alpha formula. Both these formulas are measures of internal consistency.

Raatz and Klein-Braley (1995) suggest that it is possible to perform an internal consistency analysis on C-Tests. They agree that it is not permissible to define the individual blanks in the C-Test as items, since they are dependent on each other as a result of text structure and content. But they propose a practical solution: to consider each C-Test text as a super-item and then enter these four or five super-items into the Cronbach’s alpha formula to estimate the reliability. Raatz (1985, p. 64) states:

Assuming that all the parts are independent of each other, but are equivalent and measure the same thing, then the total test score is the sum of the part scores. These parts can be viewed as super items. In this case one can calculate inter correlations and discrimination indices for the super items without going inside the test parts. The reliability of the whole test can be calculated using Cronbach’s alpha.
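
The two internal consistency estimates can be sketched as follows. KR-21 needs only the number of items, the mean and the standard deviation; Cronbach's alpha is computed here over super-items, one score list per text. The function names and the sample data are hypothetical, and population variances are used throughout as a simplifying assumption, so the output is not expected to reproduce the published coefficients exactly.

```python
from statistics import pvariance

def kr21(k, mean, sd):
    """Kuder-Richardson Formula 21 from the number of items, mean and SD."""
    return (k / (k - 1)) * (1 - mean * (k - mean) / (k * sd ** 2))

def cronbach_alpha(parts):
    """Cronbach's alpha over super-items.

    parts: one list of scores per text, all in the same examinee order.
    """
    k = len(parts)
    totals = [sum(scores) for scores in zip(*parts)]
    part_variance = sum(pvariance(p) for p in parts)
    return (k / (k - 1)) * (1 - part_variance / pvariance(totals))

# Whole-test KR-21 from 100 items, mean 54.69 and SD 14.70.
print(round(kr21(100, 54.69, 14.70), 3))

# Hypothetical super-item scores of four examinees on three texts.
print(cronbach_alpha([[12, 15, 9, 18], [10, 14, 8, 17], [7, 11, 6, 15]]))
```

Treating each text as one super-item sidesteps the dependence between neighbouring gaps, at the cost of estimating alpha from only four or five parts.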

Table 2. Reliability indices for all tests

Test            KR-21   Cronbach’s Alpha
C-Test          .90     .85
  C-Text 1      .65
  C-Text 2      .63
  C-Text 3      .68
  C-Text 4      .78
  C-Text 5      .79
Michigan        .92
  Grammar       .90
  Vocabulary    .85
  Reading       .83

Table 2 shows reliability coefficients of the two tests and their subparts computed by KR-21 formula. It also shows the reliability estimate for the C-Test computed by the Cronbach’s Alpha formula. In doing so, each C-Test text was regarded as a ‘super-item’ and accordingly the alpha coefficient was calculated with five items.

Scores from both the C-Test and the MTELP show very high KR-21 reliability coefficients (.90 and .92, respectively). The reliability of the C-Test as estimated by the Cronbach’s Alpha formula is also reasonably high (.85). The reliability coefficients of the scores obtained from the components of the MTELP are acceptably high too, each being .83 or higher. However, only two subparts of the C-Test show satisfactory reliability indices, namely C-Text 4 (.78) and C-Text 5 (.79). The other three C-Texts demonstrate only moderately acceptable reliability, with coefficients of .65 for C-Text 1, .63 for C-Text 2, and .68 for C-Text 3.

The fact that the whole C-Test is almost as reliable as the criterion (MTELP) appears to support claims concerning the high reliability of the C-Test (e.g. Klein-Braley & Raatz, 1984; Klein-Braley, 1985, 1997; Dörnyei & Katona, 1992; Connelly, 1997, to name a few).

3.2. Validity
The primary concern for any test is that the interpretations and the uses we make from the test scores are valid. The evidence that we collect in support of the validity of a particular test can be of three general types: content relevance, criterion relatedness, and meaningfulness of construct (Bachman 1990). These categories have been separately discussed below with regard to the data presented in this study and the interpretations that can be legitimately made on their basis.

3.2.1. Content validity
A necessary stage in test validation is to investigate whether the test is relevant to a given area of content or ability. In the case of language tests, one principal concern of content validity is with the extent to which a test measures a representative sample of the language in question (Weir, 1990).

Table 3 presents the number and percentage of content and function words in the whole C-Test and in each of its texts. In addition, it shows the number, percentage, and type of words mutilated in the same texts. In this analysis, auxiliary verbs, prepositions, conjunctions, pronouns, determiners, numbers, and adverbs (other than manner adverbs) have been counted as function words. The other words in the texts belong to the categories of nouns, verbs, adjectives, and adverbs of manner, which are typically considered content words.

Table 3 - Number and percentage of content and function words (mutilated) in each C-Text and in the whole C-Test

            Total (343 words)            Mutilated (100 words)
            Content       Function       Content       Function
            Freq.   %     Freq.   %      Freq.   %     Freq.   %
C-Test      160     47    183     53     46      29    54      30
C-Text 1    27      42    37      58     8       30    12      32
C-Text 2    39      53    35      47     12      31    8       23
C-Text 3    25      42    35      58     7       28    13      37
C-Text 4    29      37    50      63     10      34    10      20
C-Text 5    40      61    26      39     9       23    11      42

The truncated words in each C-Text represent different parts of speech. In C-Text 1, as an example, four prepositions, three adverbs, two determiners, one pronoun, one auxiliary verb, and one numeric expression are mutilated. As for content words, five nouns, two verbs, and one adjective are mutilated.

As is evident from the table, the percentage of content words mutilated in the whole C-Test (29%) is almost equal to the percentage of function words mutilated (30%). Hence, the truncated words in the C-Test conform to the demands of content validity as they represent ‘a slice of reality’ (Raatz 1985, p. 63). Although this finding does not accord with Jafarpur’s (1995) results, it compares very favorably with those of Dörnyei and Katona (1992) and Klein-Braley (1985), for it reveals that the C-principle is capable of obtaining a reasonably representative sample of all the word classes in a text.

3.2.2. Criterion validity
Exploring the validity of a test by means of external criteria is seen as essential by many scholars (Weir, 1990; Bachman, 1990). Criterion-related evidence demonstrates a relationship between test scores and some criterion which is believed to be also an indicator of the ability tested. Concurrent validity is a kind of criterion-related validity which is obtained through concurrent administration of a newly developed test with another well-known standardized test of which the validity is already established (Hatch & Farhady, 1982; Brown, 1988).

Table 4 - Correlation coefficients among the scores of the two measures

Test            MTELP (N = 101)   C-Test (N = 144)
C-Test          .72
  C-Text 1      .54               .71
  C-Text 2      .63               .73
  C-Text 3      .63               .80
  C-Text 4      .63               .85
  C-Text 5      .45               .77

Test            MTELP (N = 101)   C-Test (N = 101)
MTELP                             .72
  Grammar       .88               .70
  Vocabulary    .81               .47
  Reading       .59               .46

All correlations are significant at p<.01 level (2-tailed).


Table 4 provides product-moment correlations among the scores from the C-Test and the MTELP. The table shows that total C-Test scores correlate comparatively highly with total scores from the criterion (.72). The correlation coefficients between the C-Test and each of its C-Texts are quite high (.71, .73, .80, .85, and .77, respectively). There is also considerable correlation between the MTELP and the five C-Texts (.54, .63, .63, .63, and .45, respectively).

The C-Test shows a reasonably high correlation with the grammar subtest (.70). However, its correlations with the vocabulary and reading subtests are much less promising (.47 and .46, respectively). These coefficients seem to contradict Dörnyei and Katona (1992), who found that the C-Test is less efficient in testing grammar. By contrast, these results are comparable with those of Chapelle and Abraham (1990), who concluded that the C-Test is more of a grammatically based test. Babaii and Ansary’s (2001) finding that their subjects mostly utilized their grammatical judgments to reconstruct the text is also supported here.

Notice that the correlation of the C-Test with the MTELP was only moderately high (.72). A reasonable hypothesis is that the low face validity associated with the C-Test (Hughes, 2003; Weir, 1990; Jafarpur 1995) could most probably have affected the subjects’ performance. If a test does not appear to the testees as face valid, then their adverse reaction to it results in a performance which is not a true reflection of their abilities. Weir (1990, p. 26) quotes Anastasi (1982, p. 136) who has argued:

Certainly if test content appears irrelevant, inappropriate, silly or childish, the result will be poor co-operation, regardless of the actual validity of the test. Especially in adult testing, it is not sufficient for a test to be objectively valid. It also needs face validity to function effectively in practical situations.

3.2.3. Construct validity
The main concern of language test makers is whether test performance truly reflects language abilities. Construct validation helps to substantiate the extent to which a testee’s performance on a particular test can be indicative of his/her underlying competence. Construct validity, as characterized by Bachman (1990, p. 254), refers to ‘the extent to which performance on tests is consistent with predictions that we make on the basis of a theory of abilities, or constructs’. In investigations of construct validity, therefore, we are concerned with empirically testing hypotheses about the relationships between test scores and underlying traits. Below there are reports on several analytical procedures conducted on the data obtained in this study to examine the construct validity of the C-Test.

3.2.3.1. C-Test and staged development of L2 competence
One theory in second language learning holds that there is an orderly progress in L2 learning and learners go through a number of developmental stages, “from very primitive and deviant versions of the L2, to progressively more elaborate and target-like versions” (Mitchell & Myles, 1998, p. 10). In an attempt to establish the construct validity of the C-Test, Klein-Braley (1985) provides evidence that C-Tests support the theory of a regular progression in language learning. That is, since language competence increases progressively, if “the same C-Test is administered to the subjects at different stages of language development, then the C-Test scores will become successively higher as the subjects become more proficient in the language” (Klein-Braley, 1985, p. 84).

To investigate the plausibility of this claim a special kind of subject grouping was required. Therefore, the undergraduate subjects were first classified into four proficiency groups based on the distance of their MTELP scores from the mean of the whole sample (the MA students were not included, for they had not taken the MTELP). The subjects whose scores were more than 2/3 SD below the mean were operationally classified as elementary level. Similarly, the lower intermediate level comprised examinees with scores between the mean and 2/3 SD below it. Those whose scores were between the mean and 2/3 SD above it were placed in the upper intermediate level. Finally, the advanced level contained examinees with scores more than 2/3 SD above the mean.
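
The grouping rule can be written down as a small function. This is a sketch with a hypothetical function name; how scores falling exactly on a cut-off were treated is not stated in the article, so the boundary handling below is an assumption. With the MTELP mean of 28.08 and SD of 14.96 from Table 1, the cut-offs fall at roughly 18.1, 28.1 and 38.1.

```python
def classify(score, mean, sd):
    """Assign a proficiency level from the distance of a score to the sample mean."""
    cut = 2 * sd / 3
    if score < mean - cut:
        return "elementary"
    if score < mean:
        return "lower intermediate"
    if score <= mean + cut:  # boundary handling is an assumption
        return "upper intermediate"
    return "advanced"

MTELP_MEAN, MTELP_SD = 28.08, 14.96  # whole-sample values from Table 1
for score in (10.40, 23.41, 32.36, 48.58):
    print(score, classify(score, MTELP_MEAN, MTELP_SD))
```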

 

 

Table 5 - Raw means and standard deviations for four proficiency groups

            Elementary       Lower Intermediate   Upper Intermediate   Advanced
            (N = 26)         (N = 28)             (N = 23)             (N = 24)
Test        Mean    SD       Mean    SD           Mean    SD           Mean    SD
C-Test      36.38   10.59    51.79   7.78         56.39   8.84         65.21   15.44
C-Text 1    11.08   3.46     13.75   3.92         14.35   2.60         16.04   3.16
C-Text 2    7.54    2.49     11.25   2.82         12.74   2.70         13.00   3.18
C-Text 3    8.88    3.22     12.11   2.48         13.83   2.55         14.83   3.41
C-Text 4    5.35    2.86     8.43    2.53         9.74    2.83         12.88   5.30
C-Text 5    3.38    2.84     6.25    2.82         6.22    4.28         8.50    4.86
MTELP       10.40   4.51     23.41   2.91         32.36   3.03         48.58   9.81

Table 5 presents means and standard deviations for the scores of the undergraduate subjects. As can be seen, the mean total scores on both the criterion and the experimental measure increase progressively across the four groups. Specifically, the mean scores on the C-Test rise from 36.38 to 51.79 to 56.39 to 65.21. The mean scores on each C-Text behave in much the same fashion, i.e., they become successively higher as the level of proficiency increases, except on C-Text 5, where the lower and upper intermediate means are virtually identical (6.25 and 6.22). Though these means speak of validity for the C-Test, they should be subjected to further scrutiny to ensure their credibility. One way to do this is to examine the differences among the means of the four proficiency groups through an analysis of variance (ANOVA).

Table 6 - ANOVA results for the differences among means of four proficiency groups on the C-Test

Source of Variance   Sum of Squares   df   Mean Square   F        Sig.
Between Groups       10971.339        3    3657.113      30.480   .000
Within Groups        11638.305        97   119.983

Table 6 shows the ANOVA results for the test of differences among the means obtained by the four proficiency groups on the C-Test. The obtained F ratio is significant at the p < .001 level, suggesting that there is a difference among the means. However, the significance of the F ratio in an analysis of variance merely indicates that there is a significant difference among the means of the compared groups as a whole; that is, it indicates that there is a significant difference between the means of at least one pair of the groups compared (Brown, 1988). It does not tell us where exactly this difference lies, i.e., exactly which two means are different. In order to determine exactly which means differ one has to resort to pairwise multiple comparisons, which are considered post hoc or follow-up tests (Hatch & Farhady, 1982). The only requirement for these tests is that the overall F in the ANOVA is statistically significant.
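
As a rough check on the arithmetic, a one-way ANOVA can be recomputed from group sizes, means and standard deviations alone. The sketch below uses the C-Test row of Table 5 and a hypothetical function name; it assumes the reported SDs are sample standard deviations (n - 1 in the denominator). Because the inputs are rounded to two decimals, the result only approximates the F ratio in Table 6 (about 30.5 against the reported 30.480).

```python
def anova_from_summary(groups):
    """One-way ANOVA from (n, mean, sd) tuples; returns (F, df_between, df_within)."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# (N, mean, SD) on the C-Test for the four proficiency groups in Table 5.
table5 = [(26, 36.38, 10.59), (28, 51.79, 7.78), (23, 56.39, 8.84), (24, 65.21, 15.44)]
f_ratio, df_b, df_w = anova_from_summary(table5)
print(round(f_ratio, 2), df_b, df_w)
```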

Table 7 presents the results of a Tukey’s honestly significant difference (HSD) test conducted on the means of the four proficiency groups. Tukey’s HSD test is a commonly used multiple comparison test which reveals the precise location of differences by analyzing every pair of means separately (Brown, 1988; Delavar, 2002). Table 7 shows a significant difference between the means of every pair of proficiency groups except one: the upper and the lower intermediate groups. That is, the performances of the upper and the lower intermediate groups on the C-Test do not differ to a statistically significant degree.

Table 7 - Results of Tukey’s HSD multiple comparisons on the means of the four proficiency groups

| Group | Elementary | Lower Inter. | Upper Inter. | Advanced |
| Elementary | ----- |  |  |  |
| Lower Inter. | 7.31* | ----- |  |  |
| Upper Inter. | 9.02* | 2.12 | ----- |  |
| Advanced | 13.16* | 6.25* | 3.90* | ----- |

* Significant mean difference at p<.05 level
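The pairwise comparisons can be illustrated with a short Python sketch. It assumes, as a reading of the table rather than something the text states, that the entries in Table 7 are studentized range statistics computed with the Tukey-Kramer adjustment for unequal group sizes. The group means are those quoted for the C-Test, the group sizes are those given in Table 9, and the within-groups mean square comes from Table 6. The critical value of roughly 3.70 is an approximation read from a standard studentized range table and is likewise an assumption.

```python
from itertools import combinations
from math import sqrt

# Group means on the C-Test (from the text), group sizes (Table 9),
# and the within-groups mean square (Table 6).
means = {"Elementary": 36.38, "Lower Inter.": 51.79,
         "Upper Inter.": 56.39, "Advanced": 65.21}
sizes = {"Elementary": 26, "Lower Inter.": 28,
         "Upper Inter.": 23, "Advanced": 24}
MS_WITHIN = 119.983

# ASSUMPTION: approximate critical studentized range value for alpha = .05,
# four groups and about 97 error degrees of freedom.
Q_CRITICAL = 3.70


def tukey_kramer_q(group_a, group_b):
    """Studentized range statistic for one pair of unequal-sized groups."""
    standard_error = sqrt(
        MS_WITHIN / 2 * (1 / sizes[group_a] + 1 / sizes[group_b])
    )
    return abs(means[group_a] - means[group_b]) / standard_error


q_values = {pair: tukey_kramer_q(*pair) for pair in combinations(means, 2)}
for (group_a, group_b), q in q_values.items():
    flag = "*" if q > Q_CRITICAL else ""
    print(f"{group_a} vs {group_b}: q = {q:.2f}{flag}")
```

Under these assumptions the statistics come out within a few hundredths of the tabled values, and only the lower versus upper intermediate comparison falls below the critical value.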

The fact that the C-Test failed to distinguish significantly between the two middle groups in this study points to a clear shortcoming of the C-Test, namely low classification power. These results not only stand in clear contrast to claims about the measurement accuracy of the C-Test (Dörnyei & Katona, 1992) but also challenge the dependability of C-Tests for placement purposes (Klein-Braley, 1997). This interpretation is further supported by the investigation of decision consistency described below.

3.2.3.2. Decision consistency
The scores from the C-Test were also studied for decision consistency. Decision consistency refers to the agreement between the classifications of the same examinees based on two tests of the same ability (Livingston & Lewis, 1995). In more practical terms, decision consistency is “the percent classifications of subjects by the experimental test that correspond correctly to those by the criterion” (Jafarpur, 2002, p. 42). Table 8 shows the percentage of correct classifications that would be made if the C-Test were used as the criterion for placement. As can be observed, the C-Test can on average correctly place just over fifty percent of the subjects in their appropriate proficiency groups. It is by no means a promising quality for a test to fail to classify almost half of the examinees in their proper levels.

Table 8 - Percent of correct classification predicted by the C-Test

| Criterion for Placement | Elementary | Lower Inter. | Upper Inter. | Advanced | Average |
| C-Test | 69% | 39% | 35% | 62.5% | 51.5% |
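To make the notion of decision consistency concrete, here is a minimal Python sketch of how percentages of this kind can be computed. It assumes each examinee has already been assigned one level by the criterion test and one by the C-Test; how the C-Test cut-off scores were set is not shown here. The function name, the unweighted averaging, and the eight placements are all hypothetical and are not data from the study.

```python
from collections import Counter

LEVELS = ["Elementary", "Lower Inter.", "Upper Inter.", "Advanced"]


def decision_consistency(criterion_levels, test_levels):
    """Percent of examinees at each criterion level whom the test places
    at that same level, plus the unweighted average across levels."""
    totals = Counter(criterion_levels)
    agreed = Counter(
        c for c, t in zip(criterion_levels, test_levels) if c == t
    )
    per_level = {
        level: 100 * agreed[level] / totals[level]
        for level in LEVELS if totals[level]
    }
    average = sum(per_level.values()) / len(per_level)
    return per_level, average


# HYPOTHETICAL placements for eight examinees (not data from the study).
criterion = ["Elementary", "Elementary", "Lower Inter.", "Lower Inter.",
             "Upper Inter.", "Upper Inter.", "Advanced", "Advanced"]
c_test = ["Elementary", "Elementary", "Upper Inter.", "Lower Inter.",
          "Lower Inter.", "Lower Inter.", "Advanced", "Upper Inter."]

per_level, average = decision_consistency(criterion, c_test)
print(per_level, average)
```

In this toy example the two extreme levels are classified more consistently than the middle ones, which is the kind of pattern the percentages in Table 8 describe.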

 

 

 

3.2.3.3. C-Test and text difficulty
In an attempt to establish the construct validity of the C-Test, Klein-Braley (1985, p. 88) claims that:
It is possible to show that while their empirically measured difficulty (as C-Test texts) varies according to the subject group involved, the group of texts used in any one C-Test remains more or less constant in terms of relative difficulty.

Therefore, one construct validity concern is to see whether C-Test texts (or C-Texts) function similarly across proficiency levels. In order to explore how similarly subjects from different levels of proficiency perform on each C-Text, an ANOVA was performed on the scores obtained from the five C-Texts for each of the four proficiency groups and the MA students. It was assumed that the mean performance of a group of subjects on a C-Text can be a good index of the difficulty of that C-Text for that particular group.
Table 9 provides the outcome statistics of the ANOVA. The significance of the F ratio found for each group (reported as .000, i.e., p < .001) indicates that there are statistically meaningful differences among the means (i.e., average performances) of each group on the five C-Texts. Again, a Tukey’s HSD test was carried out to specify exactly which C-Texts the performances of each group differ on. Table 10 shows the significant mean differences found among the five C-Texts for the four proficiency groups and the MAs.
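The computation behind each row of such an analysis can be sketched in Python as follows. The degrees of freedom reported in Table 9 (4 between texts and 5N - 5 residual, e.g. 125 for the 26 elementary subjects) suggest that the five sets of C-Text scores were treated as five independent samples, so the sketch follows that layout; this is an inference, and a repeated-measures analysis would partition the variance differently. The scores below are hypothetical.

```python
def one_way_anova(groups):
    """One-way ANOVA on a list of score lists (one list per C-Text).

    Returns (SS between, SS residual, df between, df residual, F).
    """
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    ss_residual = sum(
        (score - mean) ** 2
        for group, mean in zip(groups, group_means)
        for score in group
    )
    df_between = len(groups) - 1
    df_residual = len(all_scores) - len(groups)
    f = (ss_between / df_between) / (ss_residual / df_residual)
    return ss_between, ss_residual, df_between, df_residual, f


# HYPOTHETICAL scores of four examinees on three texts (not study data).
texts = [[14, 15, 16, 15], [13, 15, 14, 14], [9, 10, 11, 10]]
print(one_way_anova(texts))
```

With five texts and 26 examinees the same function yields 4 and 125 degrees of freedom, matching the elementary group's row in Table 9.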

Table 9 - ANOVA results for differences among means of five C-Texts for five proficiency groups

| Proficiency Group | Source of Variance | Sum of Squares | df | Mean Square | F | Sig. (p) |
| Elementary (N = 26) | Between texts | 935.123 | 4 | 233.781 | 26.115 | .000 |
| Elementary (N = 26) | Residual | 1119.000 | 125 | 8.952 |  |  |
| Lower Inter. (N = 28) | Between texts | 1006.857 | 4 | 251.714 | 37.871 | .000 |
| Lower Inter. (N = 28) | Residual | 897.286 | 135 | 6.647 |  |  |
| Upper Inter. (N = 23) | Between texts | 1056.617 | 4 | 264.404 | 28.147 | .000 |
| Upper Inter. (N = 23) | Residual | 1033.304 | 110 | 9.394 |  |  |
| Advanced (N = 24) | Between texts | 788.783 | 4 | 197.196 | 11.818 | .000 |
| Advanced (N = 24) | Residual | 1918.917 | 115 | 16.686 |  |  |
| MA (N = 43) | Between texts | 1506.400 | 4 | 376.600 | 31.887 | .000 |
| MA (N = 43) | Residual | 2480.233 | 210 | 11.811 |  |  |

As Table 10 shows, the mean performances of both the upper intermediate and the MA groups on the first three C-Texts are not significantly different. However, their means on C-Text 4 and C-Text 5 differ significantly not only from those on the other three C-Texts but also from each other. The table indicates a similar pattern for the lower intermediate group, except that the performance of this group also changed significantly from C-Text 1 to C-Text 2. The results for the advanced group, on the other hand, represent a completely different pattern: for this group, only C-Text 5 differs significantly from the other four C-Texts. The pattern of mean differences for the elementary group, however, is so complicated that it is almost impossible to interpret.

What is evident is that there are no fewer than four patterns of mean differences among these five groups. The fact that the five groups performed differently on the five C-Texts can be interpreted as counter-evidence to Klein-Braley’s (1985) claim that the relative difficulty of C-Test texts remains constant regardless of the subjects’ proficiency level. These results suggest that the C-Test suffers from one of the same problems as the cloze test, namely the unpredictably variable nature of the cloze procedure (cf. Brown, 1993; see also Alderson, 1983; Klein-Braley, 1983). Jafarpur (1995) arrived at a similar conclusion after comparing 20 C-Test versions developed from the same text.

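Table 10 - Results of Tukey’s HSD for differences among the means of each proficiency group on the five C-Texts

| C-Text | Compared with | Elementary | Lower Inter. | Upper Inter. | Advanced | MA |
| C-Text 1 | C-Text 2 | * | * |  |  |  |
| C-Text 1 | C-Text 3 |  |  |  |  |  |
| C-Text 1 | C-Text 4 | * | * | * |  | * |
| C-Text 1 | C-Text 5 | * | * | * | * | * |
| C-Text 2 | C-Text 1 | * | * |  |  |  |
| C-Text 2 | C-Text 3 |  |  |  |  |  |
| C-Text 2 | C-Text 4 |  | * | * |  | * |
| C-Text 2 | C-Text 5 | * | * | * | * | * |
| C-Text 3 | C-Text 1 |  |  |  |  |  |
| C-Text 3 | C-Text 2 |  |  |  |  |  |
| C-Text 3 | C-Text 4 | * | * | * |  | * |
| C-Text 3 | C-Text 5 | * | * | * | * | * |
| C-Text 4 | C-Text 1 | * | * | * |  | * |
| C-Text 4 | C-Text 2 |  | * | * |  | * |
| C-Text 4 | C-Text 3 | * | * | * |  | * |
| C-Text 4 | C-Text 5 |  | * | * | * | * |
| C-Text 5 | C-Text 1 | * | * | * | * | * |
| C-Text 5 | C-Text 2 | * | * | * | * | * |
| C-Text 5 | C-Text 3 | * | * | * | * | * |
| C-Text 5 | C-Text 4 |  | * | * | * | * |

* Significant mean difference at p<.05 level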

3.2.3.4. Factorial validity
One of the most extensively used approaches in construct validation of language tests is factor analysis (Bachman, 1990). Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables (Farhady, 1983a; see also Oller & Hinofotis, 1980). Therefore, in order to further investigate the construct validity of the C-Test, the scores of the subjects on the two measures were subjected to a factor analysis. To ensure higher precision, principal axis factoring (PAF), as opposed to principal components factoring (PCF), was employed to extract the initial factors (cf. Sharma, 1996; see also Carroll, 1983; Farhady, 1983a; Baker, 1989).

In order to determine the number of factors to be extracted, the eigenvalue-greater-than-one rule was utilized (Sharma, 1996). This rule suggests that those factors whose eigenvalues (sums of squared loadings) are less than unity be excluded from the analysis. Only the eigenvalue for the first factor exceeded unity. Accordingly, the one-factor solution was adopted as the most reasonable.

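Table 11 - Results of factor analysis (subtests only)

| Test | Subtest | Factor 1 |
| C-Test | C-Text 1 | .67 |
| C-Test | C-Text 2 | .75 |
| C-Test | C-Text 3 | .80 |
| C-Test | C-Text 4 | .80 |
| C-Test | C-Text 5 | .60 |
| MTELP | Grammar | .77 |
| MTELP | Vocabulary | .57 |
| MTELP | Reading | .49 |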
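Since the text describes an eigenvalue as a sum of squared loadings, the Factor 1 column of Table 11 can be used for a rough illustrative check. The Python sketch below sums the squared loadings and expresses the result as a share of the total variance of the eight subtests. Note that under principal axis factoring this extraction sum is not identical to the initial eigenvalue to which the greater-than-one rule is normally applied, so the figure should be read as an approximation; the variable and function names are illustrative.

```python
# Factor 1 loadings as reported in Table 11.
loadings = {
    "C-Text 1": 0.67, "C-Text 2": 0.75, "C-Text 3": 0.80,
    "C-Text 4": 0.80, "C-Text 5": 0.60,
    "Grammar": 0.77, "Vocabulary": 0.57, "Reading": 0.49,
}


def sum_of_squared_loadings(values):
    """Sum of squared loadings of the observed variables on one factor."""
    return sum(value ** 2 for value in values)


ssl = sum_of_squared_loadings(loadings.values())
proportion = ssl / len(loadings)  # each standardized subtest has variance 1
print(f"Sum of squared loadings: {ssl:.2f}")
print(f"Share of total variance: {proportion:.1%}")
```

With the tabled loadings the sum is about 3.81, well above unity, which corresponds to roughly 48 percent of the variance in the eight subtests.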