ISSN: 1738-1460

Volume 12
Professional Teaching Articles
May 2006
Article 2


Article Title
Validating a Simulated Test of CET 4

Author
Yang Miao
Shantou University Medical College,
China


Abstract:

A simulated test of CET4 (College English Test, Band 4) was validated to check if it served the specific purposes of predicting and diagnosing. The study data came from a CET 4 simulated test sat by a class of sophomores who were to take a CET 4 test one month later. Based on Messick's framework of validation, the test's content coverage and representativeness were checked, and correlation analyses including inter-consistency reliability, item correlation, factor analysis and item analysis were computed. The analysis results show that the test is of modest reliability and validity, with the most serious problem in the reading section, which had too many misfit items and failed to effectively test the candidates' discourse reading ability. The contextual difficulties or inadequacy of efforts in other aspects of the validation framework implied a very unsatisfactory situation of simulated test practice in the Chinese context. It is stated that for a simulated test to effectively fulfill its purposes of predicting and diagnosing, trial tests and post hoc analysis are essential, and empirical investigations into the process of test taking, the effects of coaching and practice and the motivation problems should be advocated, and effective remedial support should be provided afterwards to ensure the positive washback of such a test.

Keywords: Messick's validation framework; Reliability; Validity; Test practice

Introduction
A simulated test paper of CET 4 (College English Test, Band 4) is evaluated in terms of its reliability and validity. The discussion is guided by the specific purposes of a simulated test and based on Messick's framework of validation. Possible approaches to validating the test paper are suggested and statistical measurements are conducted to get the related data. The introduction section begins with a brief introduction to CET and the simulated test, and moves to the explanation of Messick's framework of validation.

CET 4 and the simulated test
Put into practice in 1987, College English Test Band 4 & Band 6 (hereafter CET4 & CET6) is a national standardized English examination sponsored by the Higher Education Department of the Ministry of Education in China and administered by the National College English Testing Committee. It is a criterion-related norm-referenced test (Yang & Weir, 1998). The test criteria are based on the College English Syllabus, which was designed in 1985 by the Education Department to guide English teaching at university level. According to the Syllabus, the design of CET should strike a balance between linguistic knowledge and linguistic competence, between accuracy and fluency, between semantic level and discourse level, and between conceptual abilities and expressive abilities (College English Syllabus, 1985). It is maintained that reliability and validity are important indexes of test quality in a standardized examination and that increasing test validity is the pivot of modern language testing research (Yang & Weir, 1998). In order to ensure scientific, objective, unified and standardized testing, the design of CET strictly follows the procedures of question setting, initial examining, predicting, item analysis, further examining, test composing, testing, scoring, statistical analysis and item bank building.

To check the validity of CET, the National College English Test Committee conducted a 3-year project (from 1995 to 1998) with the British Council, in which the construct validity, content validity, concurrent validity and face validity of CET were studied through comparison tests and large-scale surveys. It is concluded that CET is of high reliability (0.90) and validity (92% of the teacher subjects believe CET reflects students' actual English proficiency levels, 86% think the test contents are reasonable) (For more details of the project results, please see Yang & Weir, 1998).

Candidates of CET are undergraduates and postgraduates who have completed a general English course based on the College English Syllabus. Formerly, the test was composed of five components: listening, reading, vocabulary & structure, cloze and writing. Except for writing, all test items were in objective multiple-choice format. Since 1996, new test tasks such as compound dictation (a combination of partial dictation and dicto-comp), short answers to questions and English-Chinese translation have been adopted to measure students' pragmatic English competence.

In the last two decades, CET has developed into one of the most important English exams in China. In 2005, as many as 11 million students participated in CET. Its results are regarded as authoritative evidence of English proficiency level and a pass in CET is one of the criteria for graduation in many institutions. To help students achieve higher scores in CET, simulated tests have become common practice. Candidates usually take several simulated tests before CET. But despite the ubiquity of simulated tests, their validity is seldom questioned or studied. In most cases, the test papers are ready-made, taken from CET preparation books published by different presses. As a result, the quality of simulated test papers is not guaranteed. Most of the time, the test designers give little or no explanation of how the test is designed, and the claim that the test papers follow the CET test specifications is treated as self-evident. More often than not, users of the test papers (the English teachers) just use the papers, score the results and arrange another test without statistical treatment and analysis. The little effort put into test validation contrasts with the great amount of time, energy and resources spent in preparing and managing the simulated tests, which indicates how important it is to validate CET simulated tests rigorously in a context as highly test-oriented as China.

Messick's framework of validation
The key to understanding Messick's framework is the concept of unitary validity. A conventional view of validity identifies different types of validity, i.e. face validity, content validity, criterion-related validity and construct validity (Hughes, 1989). But according to Messick, such a view is inadequate (Bachman, 1990; Wood, 2001). He distinguishes a number of complementary facets of validity within a unified theory of validity, in which the social nature of assessment (values and consequences of score use) is a key feature and construct validity is essential in each aspect (Bachman, 1990; Messick, 1996; McNamara, 2001; Chapelle, Jamieson & Hegelheimer, 2003). In this framework, six distinguishable aspects of validation are identified to provide 'an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores' (Messick, 1989, p13 cited in Bachman, 1990, p236).

Judgmental/logical analyses: involving the discovery of content relevance and representativeness, which demonstrates that a test is relevant to and covers a given area of content or ability.

Correlation analyses: involving the quantitative analyses to gather evidence in support of the test scores and its interpretation, such as inter-consistency reliability, item correlation, multitrait-multimethod design, factor analysis and item analysis.

Analyses of process: involving the qualitative analyses to investigate the processes of test taking themselves, employing approaches such as protocol analysis (concurrent or retrospective verbal reports), computer modeling, analysis of response time, analysis of reasons given by test takers for choosing a particular answer and analysis of systematic errors.

Analyses of group difference and change over time: involving cross-sectional and longitudinal studies to examine the extent to which score properties and interpretations generalize to and across population group, settings and tasks.

Manipulation of tests and test conditions: involving getting the empirical knowledge about the effects of test intervention such as instruction and coaching that alter test scores in theoretically predicted ways.

Test consequences: involving the evaluation of value and intended or unintended consequences of score interpretation, which concerns issues that are associated with bias in scoring and interpretation, with unfairness in test use, and with positive or negative washback effects on teaching and learning. (Bachman, 1990; Messick, 1996)


In view of the complexity of the validation process, the suggestion that a test's use or purpose should serve as a guide to validation is accepted (Worthen, Borg & White, 1993; Read & Chapelle, 2001). But this does not mean that the validation of low-stake tests is not as essential as that of high-stake tests such as entrance tests or selection tests. Many studies of low-stake tests, such as Chapelle, Jamieson & Hegelheimer's (2003) study of a web-based ESL test, Guerrero's (2000) study of a Spanish proficiency exam and Wall, Clapham & Alderson's (1994) and Fulcher's (1997) evaluation of placement tests, also conduct rigorous validation. As one of the low-stake tests, the validation of a simulated test is underlined by its specific purposes, i.e. to locate candidates' proficiency, to predict possible pass rate and to diagnose existing problems in teaching and learning so that remedial support will be provided. In light of these purposes and within Messick's framework, the validation of a simulated test paper ideally includes

a. the analyses of test content in terms of the test (in this case CET) specifications;
b. the quantitative analyses of test scores and interpretation;
c. the qualitative analyses of test processes to identify individuals' test strategies;
d. the examination of possible discrepancies between different groups of candidates or between different times of test taking among the same group of candidates;
e. the investigation of the effects of examination practice, and
f. the consideration of both positive and negative washback such as remedial help and motivation problems.

But simulated tests are relatively little mentioned in the research literature despite the fact that they are a popular and important test practice in many EFL (English as a Foreign Language) contexts, which means that this study is only a trial one in this respect. In later sections, possible ways of validation will be discussed, but unfortunately most of the analyses are just armchair strategies and deserve further research.

Methods
Subjects

The subjects of this study are a class of sophomores in a medical college in Southeast China who are to sit for CET4 one month later. Of the 43 students in this class, 5 fail to finish the writing component. So finally only 38 complete sets of data are collected and analyzed.

Procedures
The simulated test was organized as scheduled, using a paper taken from a collection of CET simulated test papers without adaptation (Luo, 2003). The test tasks are specified in Table 1. After the answer sheets were collected, the MC items were scored by optical mark reader (OMR). For Part V (writing), all essays were double marked by two independent raters using the CET rating criteria, which are holistic and based on scales ranging from 2 to 14 points. The average of the two raters' scores was taken as each candidate's final score for writing. Statistical analyses were conducted using the Statistical Package for Social Sciences (12.0) and Gitest 3+ System, a software package designed by GuangDong University of Foreign Studies in China for item analysis.

Table 1.

Analysis
The test paper was reviewed and checked against the test specifications to see its content coverage and representativeness. To evaluate the reliability of this test, Cronbach's alpha was calculated to examine the overall reliability. Subtest and inter-rater reliability coefficients were also provided. Product-moment correlation coefficients were computed between subtests as well as between subtests and the total score, and factor analysis was conducted, both of which provided validity evidence. In order to gather details of the test items, item analysis was done to find out the difficulty and discrimination of each item and to identify misfit items for further discussion.
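As an illustration of the overall reliability index mentioned above, the following is a minimal sketch of Cronbach's alpha computed from a candidates-by-items score table. It assumes the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores); the function name `cronbach_alpha` and the toy data are hypothetical and are not the study's data or the SPSS routine actually used.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table with one row per candidate
    and one column per item (hypothetical helper, standard formula)."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across candidates
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the candidates' total scores
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data (NOT from the study): 4 candidates, 3 dichotomous items
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2))  # 0.75 for this toy data
```

The same function can be applied to a single subtest by passing only that subtest's item columns.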

Results and discussion
Judgmental/logical Analyses

Because this test paper is to simulate CET 4, it is essential that it cover and represent the content and language abilities that are designated in the CET4 test specifications. Checked against the CET 4 test specifications provided by Yang & Weir (1998, pp. 198-200), this test paper appears to basically cover and represent the most important sub-skills of listening and reading comprehension that are identified in the specifications. As for the test points in vocabulary & structure, they are also representative to a large extent despite the fact that it is impossible to cover all of them in a single sample test. As the specifications require, some difficult grammatical points for Chinese learners are tested, such as verb forms including tenses and voices, non-finite verbs, adjective clauses, noun clauses and the subjunctive mood. The only problem is the proportion of vocabulary items to grammar items. According to the test specifications, only 40% of this subtest should be devoted to vocabulary items and the other 60% to grammar items (Yang & Weir, 1998). But a close analysis of this subtest shows that half of the 30 items test vocabulary. Examples of a vocabulary item and a grammar item from this test paper are given below. To ensure the accuracy of this analysis, the researcher consulted the English teacher who chose and scored the papers, and the two reached agreement. If adjustment is made to include more grammar items and fewer vocabulary items in this part, the overall content validity can be improved.

Example of vocabulary item:
47. I can't _________ him from his brother. They look very much alike.
A) keep B) separate C) distinguish D) prevent

Example of grammar item:
36. If this university ________ such a good reputation, I would not have come here.
A) didn't have B) doesn't have C) hasn't had D) hadn't had

Correlation Analyses
Inter-consistency Reliability
The reliability (Cronbach's alpha) of the whole test paper is 0.80. The subtest reliability coefficients (for MC items only) range from 0.04 to 0.75 (Table 2), among which the reliability coefficient of the reading component is the lowest (0.04) and that of the MC listening component is not satisfactory (0.46). To check the reliability of the writing scores, a Pearson product-moment correlation analysis was run on the two sets of writing scores provided by the two independent raters. The two sets of scores are significantly correlated (correlation coefficient = 0.854) at the 0.01 level (2-tailed), and the inter-rater reliability coefficient is 0.91, which shows the writing scores to be highly reliable. Both raters are experienced English teachers familiar with the marking system of CET. Every year when the simulated tests take place, they grade hundreds of essays of this kind. So it is unsurprising that the two raters achieve high consistency.
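The double-marking check described above can be sketched as follows: a Pearson product-moment correlation between two raters' scores, with the average of the two taken as the final writing score. The rater scores below are invented for illustration. The Spearman-Brown step-up at the end is only an assumption about how a reliability for the averaged scores might be estimated; the source does not say how its inter-rater coefficient of 0.91 was obtained.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def spearman_brown(r, n_raters=2):
    """Stepped-up reliability of the mean of n_raters ratings
    (ASSUMPTION: not confirmed as the method used in the study)."""
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical holistic scores on the 2-14 scale (NOT the study's data)
rater_a = [8, 10, 6, 12, 9]
rater_b = [9, 10, 7, 11, 8]

r = pearson_r(rater_a, rater_b)
final_scores = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
stepped_up = spearman_brown(0.854)  # applied to the reported correlation
print(round(r, 3), final_scores, round(stepped_up, 3))
```

Applied to the reported correlation of 0.854, the Spearman-Brown formula gives about 0.92, close to but not identical with the reported 0.91, so the study may well have used a different estimate.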

Basically, the whole test paper is reliable and is modestly adequate for low-stake tests. As for the subtests, listening (MC) is of low reliability and reading is extremely unreliable. More detailed discussion of these two parts is provided in the following sections.

Table 2

Item Analysis
Item analyses done with Gitest identified 19 misfit items, i.e. 23.8% of the 80 MC items, which is far from satisfactory because only 5% is allowed for a reliable and valid test (Li, 1997). Of the 19 misfit items, 3 belong to listening, 9 to reading, 6 to vocabulary & structure and 1 to cloze (Table 2). The reading component is the most problematic, with 9 out of 20 items misfit (45%), which is consistent with the reliability result (reliability coefficient = 0.04). The listening component is too easy (index of item facility = 0.89), with all 10 items falling on the easy and very easy scales. This explains its relatively low reliability (0.46).
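The two item statistics referred to above can be sketched in a few lines. Item facility is taken here as the proportion of candidates answering correctly. Discrimination is taken as the difference in facility between the top and bottom scoring groups, with a 27% group size; both the upper-lower method and the 27% fraction are assumptions, since the source does not describe how Gitest computes its indices. The response data are invented.

```python
def item_facility(responses, item):
    """Proportion of candidates who answered the item correctly (1 = correct)."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.27):
    """Upper-lower discrimination index: facility in the top group minus
    facility in the bottom group (ASSUMED method, not necessarily Gitest's)."""
    ranked = sorted(responses, key=sum, reverse=True)  # best total score first
    size = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:size], ranked[-size:]
    return item_facility(upper, item) - item_facility(lower, item)


# Hypothetical 0/1 responses (NOT the study's data): 6 candidates, 4 items
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

facilities = [item_facility(toy_responses, i) for i in range(4)]
discriminations = [item_discrimination(toy_responses, i) for i in range(4)]
print([round(f, 2) for f in facilities], discriminations)
```

In this toy data the first item is answered correctly by five of the six candidates and separates the top and bottom groups less sharply than the second, which mirrors the point made above about items that are too easy for the intended population.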

Several kinds of problems are discovered with the misfit items. First, some items are too easy for the intended population, so they show small discrimination figures and contribute little to the differentiation of candidates at different levels. They may be good items for candidates of lower ability, so retaining them for other tests or replacing them with more difficult ones are possible solutions. The second kind of problem lies in the given keys. Some items have more than one key. This problem is typical of reading comprehension items and is particularly warned of by Li (1997). Usually, reading comprehension questions are more controversial than other kinds, especially when higher levels of understanding, such as understanding implied meanings and making inferences, are concerned (ibid). As serious as the double-key problem is the wrong-key problem. To get rid of these misfit items, the extra keys should be changed to distracters and the wrong keys should be corrected. The final problem lies in the distracters. Some distracters are so strong that they attract unreasonably many candidates. These distracters should be carefully examined and rewritten.

Item Correlation
The correlations between the subtests and the total test score are all significant at the 0.01 level, suggesting that every one of them reasonably contributes to the measurement of the whole test (Table 3). Since the subtests are intended to test different aspects of language, they are not expected to correlate very highly with one another. The intercorrelation coefficients are supposed to fall between 0.3 and 0.7 (Yang & Weir, 1998). But some coefficients fall outside this range, which indicates that some subtest intercorrelations are not satisfactory. Listening (MC) fails to correlate significantly with the other subtests except reading, and cloze correlates significantly only with dictation. Moreover, reading does not sufficiently correlate with dictation, cloze and writing.
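The range check described above is simple to express in code. The sketch below flags any pair of subtests whose intercorrelation falls outside the 0.3 to 0.7 band cited from Yang & Weir (1998). The coefficients are invented placeholders, not the values of Table 3, and the names are hypothetical.

```python
LOW, HIGH = 0.3, 0.7  # expected band for subtest intercorrelations


def flag_intercorrelations(corr, low=LOW, high=HIGH):
    """Return the (subtest, subtest, r) triples that fall outside the band."""
    return [(a, b, r) for (a, b), r in corr.items() if not (low <= r <= high)]


# Hypothetical coefficients (NOT the values reported in Table 3)
toy_corr = {
    ("listening_mc", "dictation"): 0.12,
    ("dictation", "writing"): 0.55,
    ("vocabulary_structure", "cloze"): 0.78,
}

flagged = flag_intercorrelations(toy_corr)
print(flagged)
```

A coefficient below the band suggests the two subtests share too little, and one above it suggests they overlap too much to be treated as separate measures.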

Table 3.

According to Oller (1979), low correlations between different tests or measures are sometimes too simply taken to mean that they are measuring different skills. For example, the intercorrelation between listening (MC) and compound dictation is low, but they are both intended to test listening skills. Possible reasons for low intercorrelation may be found in Oller's explanation:

A low correlation may result from the fact that one of the tests is too easy or too hard for the population tested. It may mean that one of the tests is unreliable. Or that both of them are unreliable or a low correlation may result from the fact that one or both tests do not measure what they are supposed to measure (i.e., are not valid), or merely that one of them (or both) has (or have) a low degree of validity. (Oller, 1979, p56)

Item analyses show that the listening (MC) component is too easy (index of item facility=0.89, see Table 2). This may explain the low intercorrelation between this part and the others. Meanwhile, the results that the reading component is unreliable (reliability coefficient=0.04) may also be the reason why reading fails to sufficiently correlate with dictation, cloze and writing. At this point, two assumptions about validity are concerned: One is whether the test scores accurately reflect the trait they are intended to measure; the other is whether the differences in the scores obtained by various students represent different degrees of possession of that trait (Worthen, Borg & White, 1993). If these two assumptions are confirmed, the inferences or interpretations drawn from the test scores are accurate, or we can say that the test is valid. Since the listening (MC) component is too easy for the students, it fails to represent the differences of their ability though it may reflect the trait it is intended to measure. In this sense, the listening (MC) component in this test paper is of low validity.

As for reading, with 45% of misfit items in this part (Table 2), its discrimination ability is rather low, so it also fails to represent the differences of students' ability. Moreover, it does not accurately measure what it is designed to measure. According to Yang & Weir (1998), the reading comprehension abilities CET is expected to test should include three levels of processing: syntactical level, discourse level and inference level; and items involved in these three levels should be well-proportioned. A close analysis of the reading items reveals that students can obtain correct answers for 11 items of the total 20 (55%) by only using syntactical processing, contrary to 25% and 35% of the cases in which discourse and inference levels of reading processing are needed (Table 4). This reduces the degree of validity of this part because it does not accurately measure the reading comprehension abilities that are expected of CET.


Table 4

Normally, a correlation between reading and cloze is expected. Many studies of cloze tests (e.g. Bachman, 1981; Hanania & Shikhani, 1986) show that cloze tests can be reliable and valid measures of second language proficiency. In the studies of Streiff (1977, cited in Oller, 1979) and Hofman (1974, cited in Oller, 1979), cloze tests are even used as measures of reading proficiency. To confirm that it is the problematic reading component that leads to the low intercorrelation between reading and cloze, the validity of the cloze component deserves closer examination. Li (1997) proposes a method of analyzing different levels of test points in a cloze test, in which the levels of test points are identified as word, phrase, sentence and discourse. Accordingly, three categories of test point factors are recognized: grammar, collocation and meaning. According to Li (1997), the higher the level of test points is, the higher the degree of validity the cloze test achieves. Following her method, the cloze subtest is analyzed and, as shown in Figure 1, 12 items (60%) require discourse level of processing, which is in accord with the CET test specifications that the cloze component is aimed to test the candidates' comprehensive language abilities and should include substantial items that involve discourse comprehension (Yang & Weir, 1998). So the validity of the cloze component is of a high degree. Combined with its reliability index (0.73), it can be safely claimed that the reason for the low intercorrelation between cloze and reading does not lie in the cloze component.

Figure 1

Factor Analysis
The correlation matrix in Table 3 was then subjected to factor analysis. As a result, 2 factors with eigenvalues larger than 1 were extracted, accounting for 59.436% of the variance (Table 5). The loadings of each subtest on the 2 factors are shown in Table 6, where dictation, vocabulary & structure, cloze and writing are found to contribute to factor 1, and listening (MC), reading and vocabulary & structure to factor 2.
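The rule of retaining factors with eigenvalues larger than 1 can be sketched as below. The sketch assumes a principal-components style extraction from the correlation matrix; the source does not state which extraction method was selected in SPSS. The 3 × 3 matrix is a toy example rather than Table 3, and the Jacobi rotation routine is used only to keep the example free of external libraries.

```python
import math


def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                if abs(diff) < 1e-15:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def retained_factors(eigenvalues, threshold=1.0):
    """Keep eigenvalues above the threshold; report % of variance they explain."""
    kept = [v for v in eigenvalues if v > threshold]
    return kept, 100 * sum(kept) / sum(eigenvalues)


# Toy correlation matrix (NOT Table 3): three subtests, all correlating 0.5
toy_corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
eigenvalues = symmetric_eigenvalues(toy_corr)
kept, explained = retained_factors(eigenvalues)
print(len(kept), round(explained, 1))  # 1 factor, 66.7% for this toy matrix
```

Because the eigenvalues of a correlation matrix sum to the number of variables, the percentage of variance is simply the retained eigenvalues divided by that number.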

Table 5

The above analyses of reading and cloze also help to explain the results of factor analyses. As shown in Table 6, the strongest loader on factor 2 is reading (0.890), followed by listening (MC) (0.672) and vocabulary & structure (0.532). The commonalities between listening (MC), reading and vocabulary & structure help to explain that they are testing more or less the same trait. The listening (MC) part is made up of 10 short conversations, in which syntactical level of listening processing is mostly involved to achieve successful understanding. And as discussed above, the reading part fails to include an adequate portion of items that require discourse level of processing, so it is testing more syntactical ability than discourse ability. Finally, vocabulary & structure is designed with an obvious aim to test the usage of words, phrases, collocations and grammatical structures (Yang & Weir, 1998). In this way, factor 2 can be interpreted as a specific factor that accounts for lower level of language ability such as words, grammar and sentence structures.

Meanwhile, the first factor extracted receives its highest loading from dictation (0.880), the second highest from writing (0.700), followed by two modest but still significant ones: cloze (0.571) and vocabulary & structure (0.560). This factor can be explained as a global one. Of the four strongest loaders on this factor, three (dictation, cloze and writing) are integrative and all invoke discourse processing skills. This is consistent with the results of some studies of second language proficiency that 'the global factor seems to be best measured by tests that are highly integrative in nature---especially discourse oriented tasks' (Oller & Khan, 1981, p14). The factor analyses results confirm the weak global factor hypothesis that 'there exists a general factor accounting for a large portion of the variance in all valid measures of language proficiency' (pp. 5-6) and at the same time, this general factor is complemented by 'various specific factors' (p. 16). Further support comes from Bachman and Palmer's (1981) construct validation study in which one general factor and two specific ones are found. Similarly, in their validation study of CET, Yang & Weir (1998) identify the most heavily loaded factor as 'general language ability'.

According to the quantitative analysis discussed above, the simulated test paper is modestly reliable and valid. But at the same time many problems are exposed. The listening subtest is too easy and not satisfactorily reliable. The reading subtest is extremely unreliable and not highly valid, with too few items testing the discourse level of processing and too many misfit items. If pilot tests and post hoc analysis can be conducted, the too-easy questions and the misfit items can be singled out and revised before the simulated test is launched so that its reliability and validity can be improved. And with higher reliability and validity, the test will be a better judge of the candidates' present language levels and a better predictor of their performance in the real CET. It will then help teachers to identify students' weaknesses in language learning and take remedial actions.

Unfortunately, these practices are seldom found in the Chinese context. The reasons are two-fold. For one thing, people involved in the construction and administration of an English test do not have adequate knowledge of reliable and valid language testing to take these practices into consideration. For another, launching pilot tests and conducting post hoc analysis consumes a lot of time and energy. So in most cases the only statistical technique used is counting the pass rate, which is enough to satisfy the administrators.

Other aspects of validation analyses
The other aspects of validation analyses in Messick's framework investigate the factors of different individuals, tasks, settings and test conditions, as well as the influences of the test as a social act. Though they are not performed in the present study because of contextual difficulties, the possibility and significance of these analyses deserve thoughtful attention and systematic research in further studies of simulated tests.

Analysis of process
In Yang & Weir's (1998) validation study of CET, retrospective verbal reports are utilized to explore the reading strategies employed by candidates in different score bands. Yang & Weir find that contributory reading strategies are most often employed by candidates of higher scores, whereas non-contributory reading strategies are more frequently used by those of lower scores but fail to help in choosing correct answers, implying that contributory reading strategies deserve more training and learning. The time constraints of this study do not allow such an analysis but its significance is obvious. If it can be conducted, different test strategies can be identified and those contributing to better test performance will be encouraged among students. In this way, a simulated test can greatly help to guide the post test language teaching and learning.

Analyses of Group Difference and Change over Time
If a reliable and valid simulated test paper is used by both sophomores (to sit for CET 4 very soon) and freshmen (to sit for CET 4 at least one year later), discrepancies of language abilities can be discovered. Combined with the analysis of test process, a great deal of information will be drawn to guide curriculum planning for the freshmen group. Furthermore, checking the test against some external criteria, concurrent or predictive, can provide further proofs of reliability and validity. In the case of a simulated test, this analysis involves the seeking of other forms of assessment, such as teacher assessment or results of classroom tests. But disappointingly, these forms of assessment seldom exist in China's highly test-oriented context. Teachers here seldom assess their students in other ways than CET-format standardized tests. Furthermore, the confirmation of predictive validity is possible after the results of CET are reported. The correlation study between the simulated test and CET sat by the same group of students will show how well the scores of the simulated test predict students' performance in a real CET. Regrettably this study is not considered in this context either. And again the present study fails to include it due to the time constraints. But the necessity of doing it is suggested here for later researchers.

Manipulation of Tests and Test Conditions
Test preparation practice or coaching is ubiquitous in China. Its emphasis on test familiarization and anxiety reduction may improve validity, but the testwiseness strategies that are encouraged in coaching correspondingly lower validity (Messick, 1996). Hamp-Lyons (1998) uses the case of TOEFL preparation as a general example of the problems of practice in this area. But discussions of coaching for CET are few and far between. Whether test preparation is ethical or not is a big concern and deserves more empirical work.

Test Consequences
Among the issues concerned with test consequences, the washback issue is of special importance to a simulated test. As Messick (1996) points out, less valid tests could precipitate bad educational practices (negative washback) while more valid tests could facilitate good educational practices (positive washback). A simulated test of high reliability and validity will accurately reflect candidates' present level and existing weaknesses and positively result in proper remedial support in teaching. To this end, post hoc analysis of test items and follow-up revision are essential. Equally essential is the investigation of the teaching/learning context and of the responses of the persons involved (teachers and students). Although formal surveys or interviews were not conducted, inquiries into the reasons why some students (5 out of 43 in this case) gave up writing do reveal some problems of demotivation. The most important reason is that they felt it was useless to write the composition since they had done badly in the previous parts (their total scores except writing range from 30 to 41). The more simulated tests they take, the more frustrated they feel. If no remedial support is supplied by teachers after the tests, simulated tests are repetitions of frustration and failure for less proficient students. The suggestion of the present study is that the investigation of test consequences should be conducted and remedial actions should subsequently be taken in a context where simulated tests regularly take place.

Conclusion
The validation of this simulated test paper is discussed under the six aspects of Messick's framework, with detailed analyses conducted mainly for the first two. Generally speaking, this simulated test paper is of modest reliability and validity. The most serious problem lies in the reading component, in which as many as 45% of the items are misfitting and deserve revision. Moreover, it fails to test candidates' discourse reading ability effectively. A secondary problem lies in the listening component, which is too easy for the intended candidates. These problems decrease the test's reliability and validity and hinder it from fulfilling its functions of predicting candidates' performance in future exams and diagnosing current problems in language learning. Failure to conduct analyses in the other aspects due to contextual difficulties indicates a very unsatisfactory situation of simulated test practice in the Chinese context: studies of the test-taking process, comparative, concurrent or predictive tests, and investigations of test preparation practice and washback effects are seldom seen. Yet they are of great significance if a simulated test is to serve as a remedial tool in language teaching rather than as preparation practice that leads to misjudgment and demotivation. For such a large-scale exam as CET, simulated tests cost a great deal of time, energy and resources. Their validation deserves thoughtful consideration and research.

The implications of this study are several. First, simulated test papers should undergo careful examination, and only those proved to be reliable and valid should be kept for further use. Trial tests and post hoc analyses are essential. Second, empirical investigations into the process of test taking, the effects of coaching and practice, and motivation problems should be advocated. Furthermore, establishing a variety of assessments in daily teaching activities is important for triangulating test results. Last but not least, effective remedial support can become part of normal teaching and lead to positive washback. Only when the validation study is done in this way can a simulated test be improved and reused, serving its purposes of diagnosis and prediction. Although the findings of this study may not contribute anything new to language testing theory, using the simulated test's purposes as a guide to validation indicates that a thoroughly validated simulated test can serve as a tool for improving language teaching and learning in addition to its function of assessment.

References
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Bachman, L. & Palmer, A. (1981). Basic concerns in language test validation. In J. Read (Ed.), Directions in language testing (pp. 41-71). Singapore: Singapore University Press.

Chapelle, C. A., Jamieson, J. & Hegelheimer, V. (2003). Validation of a web-based ESL test. Language Testing, 20(4), 409-439.

College English Syllabus. (1985). Shanghai: Shanghai Foreign Language Education Press.

Fulcher, G. (1997). An English language placement test: issues in reliability and validity. Language Testing, 14(2), 113-138.

Guerrero, M.D. (2000). The unified validity of the four skills exam: applying Messick's framework. Language Testing, 17(4), 397-421.

Hamp-Lyons, L. (1998). Ethical test preparation practice: The case of the TOEFL. TESOL Quarterly, 32(2), 329-337.

Hanania, E. & Shikhani, M. (1986). Interrelationships among three tests of language proficiency: standardized ESL, cloze and writing. TESOL Quarterly, 20(1), 97-109.

Hughes, A. (1989). Testing for language teachers. Cambridge: Cambridge University Press.

Li, X. (1997). The science and art of language testing. Hunan: Hunan Education Press.

Luo, L. (2003). Tsinghua version guidebooks to CET 4: simulated test papers. Beijing: Tsinghua University Press.

Messick, S. (1996). Validity and washback in language testing. Language Testing, 13(3), 241-256.

McNamara, T. (2001). Language assessment as social practice: challenges for research. Language Testing, 18(4), 333-349.

Oller, J. (1979). Language tests at school: A pragmatic approach. London: Longman Ltd.

Oller, J. W. & Khan, F. (1981). Is there a global factor of language proficiency? In J. Read (Ed.), Directions in language testing (pp. 3-40). Singapore: Singapore University Press.

Read, J. & Chapelle, C. A. (2001). A framework for second language vocabulary assessment. Language Testing, 18(1), 1-32.

Wall, D., Clapham, C. & Alderson, J.C. (1994). Evaluating a placement test. Language Testing, 11(3), 321-344.

Worthen, B. R., Borg, W. R. & White, K. R. (1993). Measurement & evaluation in the schools. London: Longman.

Wood, R. (2001). Assessment and testing: A survey of research. Beijing: Foreign Language Teaching and Research Press.

Yang, H. & Weir, C. (1998). Validation study of the national college English test. Shanghai: Shanghai Foreign Language Education Press.


Copyright © 1999-2008 Asian EFL Journal. Last updated 20 July 2008.