CEP 920
Final Paper













Computerized Adaptive Testing

1. What is CAT?

Adaptive testing means that the sequence of test questions presented to each student, and the questions themselves, vary because they are based on responses to prior test questions. The same examinee taking the same test twice in succession will almost always receive different questions. Each question is automatically chosen to yield maximum information about the examinee, based on the skill level indicated by the examinee's answers to previous questions. Although fewer questions are presented than would be given in a paper-and-pencil test, accuracy is maintained. This process achieves several positive results. Examinees are tested more quickly, even though the tests are untimed, and are not frustrated or bored by questions that are too hard or too easy: the difficulty of the questions is quickly and automatically adapted to the capability of the individual examinee, so each examinee always receives a challenging test matched to his or her skill level. Because the tests are untimed, examinees may work at their own pace. Both examinees and administrators benefit because test results can be displayed right away; test administrators and guidance staff also gain greatly reduced test security problems and decreased paperwork. Because of the adaptive nature of the tests, the questions presented on successive tests vary, which greatly reduces the effects of repeated practice; the elimination of repeated questions becomes even more marked as time passes and the student's skills change.

As the availability of interactive computers increased in the early 1970s, adaptive testing became computerized adaptive testing (CAT), in which test items are administered by interactive computers on terminals, and examinees respond on the terminal keyboard.
The computer is used as a means of selecting the next item to be administered, and early research was based on mechanical branching rules not using item response theory (Betz & Weiss, 1973). As procedures for item response theory (IRT) became practical, CAT and IRT merged into the current IRT-based strategies. The objective of CAT is to construct an optimal test for each examinee. To achieve this, an examinee's trait level (θ) is estimated during test administration, and items appropriate to the examinee's θ are selected from an item bank. Items are selected to match the examinee's estimated θ according to an IRT model that is assumed to describe the examinee's response behavior. Unlike paper-and-pencil tests, different examinees can receive different tests of differing length. In the United States, several tests have an operational CAT version, e.g., the Graduate Record Examination (Educational Testing Service, 1996) and the Computerized Placement Test (College Board, 1993). CAT is also becoming increasingly popular outside the U.S. For example, in the Netherlands, the National Institute for Educational Measurement has released two CATs, one for assigning examinees to different levels of a mathematics course and one for assessing achievement in a specific mathematics course.
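A minimal sketch may make the selection step concrete. Under the two-parameter logistic (2PL) model, the next item is typically the one with maximum Fisher information at the current trait estimate. The item bank, parameter values, and function names below are invented for illustration, not taken from any operational CAT:

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability of a correct response at ability theta,
    for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information the item contributes at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta_hat, item_bank, administered):
    """Return the index of the unadministered item that is most
    informative at the current ability estimate theta_hat."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *item_bank[i]))

# Hypothetical bank of (a, b) pairs; item 2 has already been given.
bank = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]
print(select_next_item(0.6, bank, administered={2}))  # → 3
```

After each response, θ̂ would be re-estimated and the loop repeated; real systems add content-balancing and exposure-control constraints on top of this basic rule.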

2. Important Theoretical and Practical Contributions in the History of Item Response Theory and Computerized Adaptive Testing

1916    Binet and Simon were the first to plot performance levels against an independent variable and to use such plots in test development.

1936    Richardson derived relationships between IRT model parameters and classical item parameters, which provided an initial way of obtaining IRT parameter estimates.

1943, 1944    Lawley produced some new procedures for parameter estimation.

1952               Lord described the two-parameter normal ogive model, derived model parameter estimates, and considered applications of the model.

1957, 1958    Birnbaum substituted the more tractable logistic models for the normal ogive models and developed the statistical foundations for these new models.

1960    Rasch developed three item response models and described them in his book, Probabilistic Models for Some Intelligence and Attainment Tests. His work influenced Wright in the United States and psychologists such as Andersen and Fischer in Europe.

1967    Wright was the leader and catalyst for most of the Rasch model research in the United States through the 1970s. His presentation at the ETS Invitational Conference on Testing Problems served as a major stimulus for work in IRT, especially with the Rasch model. Later, his highly successful AERA Rasch model training programs contributed substantially to the understanding of the Rasch model by many researchers.

1968    Lord and Novick provided five chapters on the theory of latent traits (four of the chapters were prepared by Birnbaum). The authors' endorsement of IRT stimulated a considerable amount of research.

1969               Wright and Panchapakesan described parameter estimation methods for the Rasch model and the computer program BICAL, which utilized the procedures described in the paper. BICAL was of immense importance because it facilitated applications of the Rasch model.

1972    Bock contributed several important new ideas about parameter estimation.

1974    Lord described his new parameter estimation methods, which were utilized in a computer program called LOGIST.

1974                 Fischer described his extensive research program with linear logistic models.

1976    Lord et al. made available LOGIST, a computer program for carrying out parameter estimation with logistic test models. LOGIST is one of the two most commonly used programs today (the other is BICAL).

1977               Baker provided a comprehensive review of parameter estimation methods.

1977    In a Journal of Educational Measurement special issue on IRT applications, researchers such as Bashaw, Lord, Marco, Rentz, Urry, and Wright described many important measurement breakthroughs.

1979    Wright and Stone, in Best Test Design, described the theory underlying the Rasch model and many promising applications.

1980                  Lord in Applications of Item Response Theory to Practical Testing Problems provided an up-to-date review of theoretical developments and applications of the three-parameter model.

1980    Weiss edited the Proceedings of the 1979 Computerized Adaptive Testing Conference, an up-to-date collection of papers on adaptive testing, one of the main practical uses of IRT.

1982    Lord and his staff at ETS made available the second edition of LOGIST. This updated program was faster, somewhat easier to set up, and produced additional worthwhile output compared with the 1976 edition.

3. Advantages of CAT

In general, computerized testing greatly increases the flexibility of test management. Tests are given "on demand" and scores are available immediately. Neither answer sheets nor trained test administrators are needed, and administration is consistent: test administrator differences are eliminated as a factor in measurement error. Tests are individually paced, so a student does not have to wait for others to finish before going on to the next section. Self-paced administration also offers extra time for students who need it, potentially reducing one source of test anxiety. Test security is increased because hardcopy test booklets are never compromised. Computerized testing also offers a number of options for timing and formatting, ranging from self-paced administration to item-by-item timing, and different formats can be developed to take advantage of graphics and timing capabilities. For example, perceptual and psychomotor skills that are nearly impossible to assess with a paper-and-pencil test can be readily tested on a computer.

In addition to the general advantages of computerized testing, CATs increase efficiency. Significantly less time is needed to administer a CAT than a fixed-item test, since fewer items are needed to achieve acceptable accuracy: CATs can reduce testing time by more than 50% while maintaining the same level of reliability. Shorter testing times also reduce fatigue, which can be a significant factor in students' test results. CATs can also provide accurate scores over a wide range of abilities: traditional tests are usually most accurate for average students, whereas CATs, by including more relatively easy and more relatively difficult items in the item pool, can maintain a high level of accuracy for both stronger and weaker students.
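The efficiency claim can be illustrated with a back-of-the-envelope calculation. Under the 2PL model, the standard error of the ability estimate is roughly 1/√(total test information), and items matched to the examinee's ability contribute more information than items pitched at the average examinee. All numbers below (discrimination, target standard error, pool sizes) are invented for illustration:

```python
import math

def info(theta, a, b):
    """Fisher information of a 2PL item (discrimination a, difficulty b) at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def items_to_reach(se_target, theta, difficulties, a=1.5):
    """Count items administered before the standard error 1/sqrt(total info)
    falls to se_target; returns None if the pool runs out first."""
    total = 0.0
    for n, b in enumerate(difficulties, start=1):
        total += info(theta, a, b)
        if 1.0 / math.sqrt(total) <= se_target:
            return n
    return None

theta = 1.5                 # an able examinee
matched = [theta] * 100     # adaptive: items pitched at the examinee
fixed = [0.0] * 100         # conventional: items pitched at the average examinee
print(items_to_reach(0.3, theta, matched), items_to_reach(0.3, theta, fixed))  # → 20 58
```

In this toy setup the matched test reaches the target precision with roughly a third as many items, consistent with the 50%-plus time savings reported for operational CATs.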

4. Limitations of CAT

CATs should not be used for some subjects and skills. Most CATs are based on an item response theory model, which assumes that all the information needed to select items can be summarized in one to three parameters describing the item's difficulty for students of different abilities. Many tests, however, cover a number of different skills or topics, and specifications for traditional tests seek to ensure an even coverage across them; most common CAT strategies do not accommodate such additional considerations. Hardware limitations further restrict the types of items that can be administered by computer: items involving detailed artwork and graphs or extensive reading passages, for example, are hard to present using the types of computers found in most schools. Another limitation of CATs stems from the need for careful item calibration. Since each student takes a different set of items, comparable scores depend heavily on precise estimates of item characteristics, so relatively large calibration samples must be used: 1,000 students is a practical minimum, and 2,000 is more common. Such sample size requirements are prohibitive for most locally developed tests. Finally, for CATs to be manageable, a facility must have enough computers for a large number of students, and the students must be at least partially computer-literate. While the number of computers in schools continues to grow, many schools simply do not have the resources to use CATs as a standard practice.

5. New Directions:

The future possibilities of CAT are many. New item formats could be developed, and new media could be used. New cognitive characteristics could be identified that could be better understood using computers. Specifically, using CAT with polytomous items is an area that demands more research (Dodd, De Ayala, & Koch, 1995). CAT could even be used to constrain item selection to those items that refer to specific cognitive tasks, or to select items on the basis of error patterns. Although many tests are aimed at obtaining a θ estimate to be used for placement or hiring decisions, diagnosis of the learning process is also important in some applications, such as school psychology. Multicomponent IRT models or partial credit models can be used to select items from a bank of items that measure specific components of interest. CAT can also be used with personality tests to detect faking through inconsistencies in item responses, and additional items can be administered to adjust for or identify those inconsistencies. In on-line testing, faking can be identified, verified, and corrected for; examinees can even be confronted with their inconsistencies during test administration, which could improve the usefulness of these scales. Finally, innovative item types could exploit the full potential of CAT. Instead of the limited number of item types used in paper-and-pencil tests, items that give immediate feedback can be used. For example, in editing tasks in which examinees are presented with an incorrect text passage, their typed corrections could be compared with a table of correct solutions (Davey, Godwin, & Mittelholz, 1997). The possibilities of graphics, audio, and animation should also be explored.

6. My Interest: Methods of Analyzing Mathematics Test Items as Measures of Mathematics Achievement in Junior High School

Since the foundation of mathematics lies in judging statements to be true or false, its abstract nature makes it difficult for students to understand. Learning relies heavily upon students' innate cognitive operations, but it is not easy to observe their cognitive processes; we must make those operations explicit by means of suitable activities. Generally speaking, I want to use CAT to analyze mathematics test items in order to understand a student's learning situation. Recently, developments in cognitive theory and progress in computer technology have enriched the methods of analyzing mathematics tests, among which the following three are most frequently employed by researchers: Item Response Theory, Item Relational Structure Analysis, and Item Rating Level Analysis. The analysis is conducted in the following procedure: (a) the non-parametric item response model from IRT is used to generate item characteristic curves depicting the features of individual test items; (b) the comprehension levels in test items are determined using Item Rating Level analysis; (c) the inter-connected diagram of the Item Relational Structure is constructed and presented, focusing on information the item characteristic curves leave incomplete; and (d) subjects' multi-faceted learning information is uncovered by applying the three methods coherently.

IRT: Any theory of item response supposes that examinee performance on a test can be predicted (or explained) by defining examinee characteristics, referred to as traits or abilities; estimating scores for examinees on these traits (called "ability scores"); and using those scores to predict or explain item and test performance (Lord & Novick, 1968).

Non-parametric Item Response Model: A statistical model that allows us to estimate item characteristic curves without assuming a particular parametric form. An equating model based on non-parametric IRT then applies the estimated ICCs to place all ability parameters on a common scale (Jeng, 1991).
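Jeng's estimator is more sophisticated, but the basic idea of tracing an ICC without a parametric model can be sketched very simply: bin examinees by an ability proxy (such as total score) and plot the proportion answering the item correctly in each bin. The function name and data below are hypothetical:

```python
def empirical_icc(scores, responses, n_bins=5):
    """Rough nonparametric ICC: proportion answering one item correctly
    within equal-width bins of an ability proxy (e.g. total test score).
    Returns (bin midpoint, proportion correct) pairs."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_bins or 1.0
    tallies = [[0, 0] for _ in range(n_bins)]   # [correct, attempted] per bin
    for s, r in zip(scores, responses):
        k = min(int((s - lo) / width), n_bins - 1)
        tallies[k][0] += r
        tallies[k][1] += 1
    return [(lo + (k + 0.5) * width, c / n if n else None)
            for k, (c, n) in enumerate(tallies)]

# Hypothetical data: total scores and 0/1 responses to a single item.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
responses = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
curve = empirical_icc(scores, responses)
```

For a well-behaved item, the resulting curve rises with the ability proxy; smoothing methods (e.g. kernel regression) refine this crude binned estimate.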

Item Rating Level: A method developed by the CSMS team in England to analyze students' cognitive levels.

Item Relational Structure Analysis: A system of order analysis, developed by Takeya and called Item Relational Structure Analysis (IRSA), used for examining the structural relations among a set of items, for example on the addition and subtraction of fractions. The method generates a digraph showing chains of items that have discernibly common features (Tatsuoka, 1981).

Item Characteristic Curve (ICC): A mathematical function that gives the probability of answering an item correctly for examinees at different points on the ability scale (Hambleton, 1985).
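As a concrete instance, the widely used three-parameter logistic (3PL) model gives the ICC in closed form; the parameter values below are illustrative only:

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: P(correct | theta) for an item with
    discrimination a, difficulty b, and pseudo-guessing asymptote c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The curve rises from the guessing floor c toward 1 as ability increases.
for theta in (-3, 0, 3):
    print(round(icc_3pl(theta, a=1.0, b=0.0, c=0.2), 3))  # prints 0.238, 0.6, 0.962
```

Setting c = 0 recovers the 2PL model, and additionally fixing a = 1 recovers the Rasch model discussed in the timeline above.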



Betz, N. E. & Weiss, D. J. (1973). An empirical study of computer-administered two-stage ability testing. Research Report 73-4. Minneapolis: University of Minnesota.

College Board (1993). User’s notebook. New York: Author.

Educational Testing Service (1996). Graduate Record Examinations 1996-1997: Information and registration bulletin. Princeton NJ: Author.

Davey, T., Godwin, J., & Mittelholz, D. (1997). Developing and scoring an innovative computerized writing assessment. Journal of Educational Measurement, 34, 21-41.

Dodd, B. G., De Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with polytomous items. Applied Psychological Measurement, 19, 5-22.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Jeng, F. S. (1991). Least squares estimation for latent variables with dichotomous item response data. Unpublished doctoral dissertation, University of Illinois, Urbana, IL.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Tatsuoka, K. K. (1981). Item relational structure analysis method. Urbana, IL: University of Illinois, Computer-based Education Research Laboratory.