Data Science

  • Behrouz Minaei-Bidgoli, Deborah A. Kashy, Gerd Kortemeyer, and William F. Punch, Predicting Student Performance: An Application of Data Mining Methods with the Educational Web-Based System LON-CAPA, 33rd ASEE/IEEE Frontiers in Education Conference (2003)
    Newly developed web-based educational technologies offer researchers unique opportunities to study how students learn and what approaches to learning lead to success. Web-based systems routinely collect vast quantities of data on user patterns, and data mining methods can be applied to these databases. This paper presents an approach to classifying students in order to predict their final grade based on features extracted from logged data in an educational web-based system. We design, implement, and evaluate a series of pattern classifiers and compare their performance on an online course dataset.


  • Behrouz Minaei-Bidgoli, Gerd Kortemeyer, William F. Punch, Mining Feature Importance: Applying Evolutionary Algorithms within a Web-Based Educational System, International Conference on Cybernetics and Information Technologies, Systems and Applications: CITSA 2004 (2004)
    A key objective of data mining is to uncover the hidden relationships among the objects in a given data set. This paper presents an approach for classifying students in order to predict their final grades based on features extracted from logged data in an educational web-based system. By weighting feature vectors according to feature importance using a Genetic Algorithm, we can optimize the prediction accuracy and obtain significant improvement over raw classification. This work represents a rigorous application of known classifiers as a means of analyzing and comparing use and performance of students who have taken a technical course that was partially/completely administered via the web.


  • Behrouz Minaei-Bidgoli, Gerd Kortemeyer, and William F. Punch, Optimizing Classification Ensembles via a Genetic Algorithm for a Web-based Educational System, Proc. of Joint International Association for Pattern Recognition (IAPR) Workshops on Syntactical and Structural Pattern Recognition (SSPR 2004) and Statistical Pattern Recognition (SPR 2004), Lisbon (2004)
    Classification fusion combines multiple classifications of data into a single classification solution of greater accuracy. Feature extraction aims to reduce the computational cost of feature measurement, increase classifier efficiency, and allow greater classification accuracy based on the process of deriving new features from the original features. This paper presents an approach for classifying students in order to predict their final grades based on features extracted from logged data in an educational web-based system. A combination of multiple classifiers leads to a significant improvement in classification performance. By weighting feature vectors according to feature importance using a Genetic Algorithm (GA), we can optimize the prediction accuracy and obtain a marked improvement over raw classification. We further show that when the number of features is small, feature weighting and transformation into a new space work efficiently compared to feature subset selection. This approach is easily adaptable to different types of courses and different population sizes, and allows for different features to be analyzed.


  • Behrouz Minaei-Bidgoli, Gerd Kortemeyer, William F. Punch, Enhancing Online Learning Performance: An Application of Data Mining Methods, The 7th IASTED International Conference on Computers and Advanced Technology in Education (CATE 2004), Kauai (2004)
    The main purpose of data mining is to discover the hidden relationships among the data points within given data sets. Classification has emerged as a popular data mining task to find a model for grouping the data points based on extracted features of the training samples. This paper proposes a model for feature importance mining within a web-based educational system and presents an approach for classifying students in order to predict their final grades based on features extracted from logged data in the online educational system. This work represents a rigorous application of known classifiers as a means of analyzing and comparing use and performance of students who have taken a technical course that was partially/completely administered via the web.


  • Behrouz Minaei-Bidgoli, Pang-Ning Tan, Gerd Kortemeyer, and William F. Punch, Association analysis for a web-based educational system, in Data-Mining in E-Learning (C. Romero and S. Ventura (eds.)), WIT Press (Southampton, Boston), ISBN 1-84564-152-3 (2006)
    This research focuses on the discovery of interesting characteristics of different segments of a population. In the context of web-based educational systems, contrast rules help to identify attributes characterizing patterns of performance disparity between various groups of students. We propose a general formulation of contrast rules that can improve web-based educational systems for both teachers and students - allowing for greater learner improvement and more effective evaluation of the learning process.


  • Peng Han, Gerd Kortemeyer, Bernd J. Krämer, Christine von Prümmer, Exposure and Support of Latent Social Networks Among Learning Object Repository Users, Journal of Universal Computer Science (J.UCS), Volume 14, Issue 10 (2008)
    Although immense efforts have been invested in the construction of hundreds of learning object repositories, the degree of reuse of learning resources maintained in such repositories is still disappointingly low. As the reasons for this observation are not well understood, we carried out an empirical investigation with the objectives of identifying recurring patterns in the retrieval and (re-)use of learning resources and of designing and testing social networking functionality supporting communities of practice. The outcomes of this project, which are reported here, aim to affect the design of a new generation of learning object repositories, like CampusContent, that tries to eliminate deficits of current repositories and involve recent contributions in the area of social software. The object of our investigation was LON-CAPA, a cross-institutional learning content management and assessment system used since 2000. We analyzed hundreds of thousands of log records collected over a period of three years and detected various kinds of latent relationships among LON-CAPA users, such as the co-occurrence of learning resources from independent authors in instructional materials.


  • Gerd Kortemeyer, Gender differences in the use of an online homework system in an introductory physics course, Phys. Rev. ST Phys. Educ. Res. 5, 010107 [8 pages] (2009)
    The two genders make different use of being allowed multiple tries to solve online homework problems: male students frequently attempt to immediately solve the problem, while female students are more likely to first interact with peers and teaching assistants before entering answers.


  • Yoav Bergner, Stefan Dröschler, Gerd Kortemeyer, Saif Rayyan, Daniel Seaton, and David Pritchard, Model-Based Collaborative Filtering Analysis of Student Response Data: Machine-Learning Item Response Theory, The 5th International Conference on Educational Data Mining, 95-102 (2012)
    We apply collaborative filtering (CF) to dichotomously scored student response data (right, wrong, or no interaction), finding optimal parameters for each student and item based on cross-validated prediction accuracy. The approach is naturally suited to comparing different models, both unidimensional and multidimensional in ability, including a widely used subset of Item Response Theory (IRT) models which obtain as specific instances of the CF: the one-parameter logistic (Rasch) model, Birnbaum's 2PL model, and Reckase's multidimensional generalization M2PL. We find that IRT models perform well relative to generalized alternatives, and thus this method offers a fast and stable alternate approach to IRT parameter estimation. Using both real and simulated data we examine cases where one- or two-dimensional IRT models prevail and are not improved by increasing the number of features. Model selection is based on prediction accuracy of the CF, though it is shown to be consistent with factor analysis. In multidimensional cases the item parameterizations can be used in conjunction with cluster analysis to identify groups of items which measure different ability dimensions.


  • Gerd Kortemeyer, Stefan Dröschler, and Dave Pritchard, Harvesting Latent and Usage-based Metadata in a Course Management System to Enrich the Underlying Educational Digital Library, International Journal on Digital Libraries, 10.1007/s00799-013-0107-6 (2013)
    In this case study, we demonstrate how in an integrated digital library and course management system, metadata can be generated using a bootstrapping mechanism. The integration encompasses sequencing of content by teachers and deployment of content to learners. We show that taxonomy term assignments and a recommender system can be based almost solely on usage data (especially correlations on what teachers have put in the same course or assignment). In particular, we show that with minimal human intervention, taxonomy terms, quality measures, and an association ruleset can be established for a large pool of fine-granular educational assets.


  • Daniel T. Seaton, Yoav Bergner, Gerd Kortemeyer, Saif Rayyan, Isaac Chuang, and David E. Pritchard, The Impact of Course Structure on eText Use in Large-Lecture Introductory-Physics Courses, Physics Education Research Conference (PERC), Portland, OR (2013)
    Course structure - the types and frequency of learning activities - impacts how students interact with electronic textbooks. We analyze student-tracking logs generated by the LON-CAPA learning management system from nearly a decade of blended large-lecture introductory-physics courses at Michigan State University, as well as one on-campus course from MIT. Data mining provides estimates of the overall amount and temporal regularity of eText use, i.e., weekly reading versus review immediately before exams. For all courses studied, we compare student use of eTexts as it varies with course structure, e.g., from traditional (three or four exams, eText assigned as supplementary) to reformed (frequent exams, embedded assessment in the assigned eText). Traditional format courses are accompanied by little eText use, while high reading levels persist throughout reformed courses.


  • Gerd Kortemeyer, Extending Item Response Theory to Online Homework, Phys. Rev. ST Phys. Educ. Res. 10, 010118 (2014)
    Item response theory (IRT) is becoming an increasingly important tool when analyzing "big data" gathered from online educational venues. However, the mechanism was originally developed in traditional exam settings, and several of its assumptions are infringed upon when deployed in the online realm. For a large-enrollment physics course for scientists and engineers, the study compares outcomes from IRT analyses of exam and homework data, and then proceeds to investigate the effects of each confounding factor introduced in the online realm. It is found that IRT yields the correct trends for learner ability and meaningful item parameters, yet overall agreement with exam data is moderate. It is also found that learner ability and item discrimination are robust over a wide range with respect to model assumptions and introduced noise. Item difficulty is also robust, but over a narrower range.


  • Gerd Kortemeyer and Wolfgang Bauer, System and method to facilitate creation of educational information, US Patent No. 8,831,997 (issued September 2014)
    The computer-implemented system to facilitate creation of educational information employs a networked computer system that stores at least one resource in association with a first electronic file, storing metadata information about usage of the resource. A resource assembly tool implemented by a computer is programmed to access the networked computer system to display information to an instructor about at least one resource, including the metadata information. This aids the instructor in selecting resources for inclusion in educational information being created. The resource assembly tool is configured to assemble the educational information to include resources selected for inclusion being created for dissemination to learners via said networked computer system. The networked computer system is further configured to capture information about usage of the resource and to update the stored metadata information to reflect said captured information.


  • Daniel T. Seaton, Gerd Kortemeyer, Yoav Bergner, Saif Rayyan, and David E. Pritchard, Analyzing the Impact of Course Structure on eText Use in Blended Introductory Physics Courses, American Journal of Physics 82, 1186-1197 (2014)
    We investigate how elements of course structure (i.e., the frequency of assessments as well as the sequencing and weight of course resources) influence the usage patterns of electronic textbooks (e-texts) in introductory physics courses. Specifically, we analyze the access logs of courses at Michigan State University and the Massachusetts Institute of Technology, each of which deploys e-texts as primary or secondary texts in combination with different formative assessments (e.g., embedded reading questions) and different summative assessment (exam) schedules. As such studies are frequently marred by arguments over what constitutes a "meaningful" interaction with a particular page (usually judged by how long the page remains on the screen), we consider a set of different definitions of "meaningful" interactions. We find that course structure has a strong influence on how much of the e-texts students actually read, and when they do so. In particular, courses that deviate strongly from traditional structures, most notably by more frequent exams, show consistently high usage of the materials with far less "cramming" before exams.


  • Gerd Kortemeyer, An Empirical Study of the Effect of Granting Multiple Tries for Online Homework, American Journal of Physics 83, 646-653 (2015)
    When deploying online homework in physics courses, an important consideration is how many tries learners should be allowed to solve numerical free-response problems. On the one hand, this number should be large enough to allow learners mastery of concepts and avoid copying; on the other hand, granting too many allowed tries encourages counter-productive behavior. We investigate data from an introductory calculus-based physics course that allowed different numbers of tries in different semesters. It turns out that the probabilities of successfully completing or abandoning problems during a particular try are independent of the number of tries already made, which indicates that students do not learn from their earlier tries. We also find that the probability of successfully completing a problem during a particular try decreases with the number of allowed tries, likely due to increased carelessness or guessing, while the probability of giving up on a problem after a particular try is largely independent of the number of allowed tries. These findings lead to a mathematical model for learner usage of multiple tries, which predicts an optimum number of five allowed tries.


  • Gerd Kortemeyer, Scalable Continual Quality Control of Formative Assessment Items in an Educational Digital Library, International Journal on Digital Libraries (accepted)
    An essential component of any library of online learning objects is assessment items, for example, homework, quizzes, and self-study questions. As opposed to exams, these items are formative in nature, as they help the learner to assess his or her own progress through the material. When it comes to quality control of these items, their formative nature poses additional challenges: for example, there is no particular time interval in which learners interact with these items, learners come to these items with very different levels of preparation and seriousness, guessing generates noise in the data, and the numbers of items and learners can be several orders of magnitude larger than in summative settings. This empirical study aims to find a highly scalable mechanism for continual quality control of this class of digital content with a minimal amount of additional metadata and transactional data, while also taking into account characteristics of the learners. In a subsequent evaluation of the model on a limited set of transactions, we find that taking into account the learner characteristic of ability improves the quality of item metadata, and in a comparison to Item Response Theory (IRT), we find that the developed model in fact performs slightly better in terms of predicting the outcome of formative assessment transactions, while never matching the performance of IRT on predicting the outcome of summative assessment.


  • Gerd Kortemeyer, The Psychometric Properties of Classroom Response System Data: A Case Study, Journal of Science Education and Technology (accepted)
    Classroom Response Systems (CRSs, often referred to as "clickers") have slowly gained adoption over the past decade; however, critics frequently doubt their pedagogical value, starting with the validity of the gathered responses: there is concern that students simply "click" random answers. This case study looks at different measures of response reliability, starting from a global look at correlations between formative clicker responses and summative exam performance to how clicker questions are used in context. It was found that clicker performance is a moderate indicator of course performance as a whole, and that while the psychometric properties of clicker items are more erratic than those of exam data, they still have acceptable internal consistency and include items with high discrimination. It was also found that clicker responses and item properties do provide highly meaningful feedback within a lecture context, i.e., when their position and function within lecture sessions are taken into consideration. Within this framework, conceptual questions provide measurably more meaningful feedback than items that require calculations.


  • Emre Gönülateş and Gerd Kortemeyer, Modeling Unproductive Behavior in Online Homework in Terms of Latent Student Traits: An Approach Based on Item Response Theory, Journal of Science Education and Technology (accepted)
    Homework is an important component of most physics courses. One of the functions it serves is to provide meaningful formative assessment in preparation for exams. However, correlations between homework and exam scores tend to be low, likely due to unproductive student behavior such as copying and random guessing of answers. In this study, we attempt to model these two counterproductive learner behaviors within the framework of Item Response Theory in order to provide an ability measurement that strongly correlates with exam scores. We find that introducing additional item parameters leads to worse predictions of exam grades, while introducing additional learner traits is a more promising approach.
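

Several of the abstracts above refer to logistic Item Response Theory models (the one-parameter Rasch model, the 2PL model, and variants with additional item parameters). As a minimal illustrative sketch, not taken from any of the listed papers, the standard logistic item response function can be written as follows. The function name `irt_probability` and its parameter names are assumptions chosen for this example, and the optional guessing parameter `c` is the textbook three-parameter extension rather than the specific model used in any of these studies.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: learner ability
    b:     item difficulty
    a:     item discrimination (a = 1 gives the one-parameter Rasch form)
    c:     lower asymptote for guessing (c = 0 gives the 2PL form)

    All names are illustrative assumptions, not the papers' notation.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


if __name__ == "__main__":
    # A learner whose ability equals the item difficulty has a 50% chance
    # of success when there is no guessing.
    print(irt_probability(theta=0.0, b=0.0))
    # A nonzero guessing parameter raises that chance.
    print(irt_probability(theta=0.0, b=0.0, c=0.25))
```

With `a` fixed at 1 and `c` at 0 this reduces to the Rasch form; letting `a` vary per item gives the 2PL form. Estimating these parameters from response data is a separate fitting step that this sketch does not attempt.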