Establishing GPA and GMAT MBA Admissions Criteria

The MBA program staff believes that students admitted to the program with either a low GPA or a low GMAT score will likely have difficulty completing the program. The program aims to fill its capacity (a maximum of 40 students) with students who are likely to be successful. Students who do not complete the program, or who have difficulty completing it, not only degrade the reputation and quality of the program but may also have taken an admission slot that could have gone to a more successful student. The admissions committee takes a number of factors into account in evaluating applicants; however, it is uncertain whether its current baseline minimums of a 3.0 undergraduate GPA and a 450 GMAT score are appropriate and effective for meeting the program’s admission objectives:

Admit the maximum number of students likely to succeed
Minimize admitting students who are likely to have challenges
Provide competitive admissions criteria

The committee also has great difficulty making “nuanced” admit decisions for applicants who fall short of one of these minimums.

In an effort to improve this, the program staff has compiled GPA and GMAT data from the last four years of graduated students and placed each student into one of two groups: group 1 is “satisfactory” and group 2 is “challenged.” A student was labeled challenged if they dropped out of the program, did not complete the program with their cohort, or were on academic probation for unsatisfactory progress or performance. Students with exceptional circumstances causing any of these conditions were excluded from the data. The staff then randomly selected 35 students from group 1 and 35 from group 2.

Learning Objectives:

Understand the statistical classification problem and common supervised machine learning methods for addressing classification problems. 

Interpret and apply classification rules.

Evaluate the performance and utility tradeoffs of different classification methods. Recognize the value and limitations of classification methods.

Understand classification-based decision models and the use of classification to support decision making and policy.

Statistics Needed:

Naive Bayes
Box Plot
Density Plot
K-nearest neighbor 
Decision Tree method (CART)
Logistic regression

Data Sources:

MBAStudents.xlsm

Background and Getting Started:

Use the 6-step model for business analytics as a framework to determine how the MBA program can improve their admit decisions utilizing GPA and GMAT scores. For each of the steps, write a brief response to the questions asked:

Step 1. Recognizing the problem

What are the objectives for using GPA and GMAT for admissions?
Describe the current problems using scores for admissions.
What are the limitations of the data?

Step 2. Defining the problem

What questions do I need to ask to describe the scores-success relationship, and how would the answers be used in addressing admissions decisions?
What would sufficient answers look like?
What considerations and assumptions need to be accounted for?

Step 3. Structuring the problem

How can we predict “challenged” and “satisfactory” applicants with certain scores?
Is it more important to accurately predict challenged or satisfactory?
How can we structure the problem to align with the goals/objectives?
How much can we generalize from the data set?
What are our constraints?

Step 4. Analyzing the problem

What models and techniques are needed to address the questions?
What are the assumptions we need to make?
Are the assumptions reasonable?
How will they affect our models?

Step 5. Interpreting Results and Making a Decision

How confident can we be in our results? What confidence matters here?
What assumptions were made and how do they affect these results?
Should we accept/deny students based on the results alone?

Step 6. Implementing the solution

What is the reasoning behind the choice of accepting certain students or not?
What is a useful decision model(s) to address the decision problems?
What resources or limitations do we need to consider?

Analytics to perform:

1. Create a scatterplot of GPA and GMAT scores, using different colors and 95% confidence ellipses for the two groups. Sketch different boundaries that approximately separate the two groups.

a. Draw a line that approximately separates the two groups. What is the classification decision rule for this?

b. Draw a smooth curve that approximately separates the two groups. What is the classification decision rule for this?

c. Draw any number of segments that approximately separate the two groups. What is the classification decision rule for this?

d. What is the difference between these rules? Why would you choose one over another?
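
As a rough illustration of what a classification decision rule looks like for each boundary shape, the three cases can be written as small functions. This is a minimal Python sketch (the assignment itself points to R packages). The line and curve coefficients and the example applicant are made-up placeholders, not values estimated from MBAStudents.xlsm; only the 3.0 and 450 minimums come from the program's current baseline.

```python
SATISFACTORY, CHALLENGED = 1, 2  # group labels used in the case

def line_rule(gpa, gmat):
    # Straight-line boundary: gpa + gmat / 200 = 5.75 (hypothetical coefficients).
    return SATISFACTORY if gpa + gmat / 200 >= 5.75 else CHALLENGED

def curve_rule(gpa, gmat):
    # Smooth curved boundary: a hyperbola (hypothetical shape and constants).
    ok = gpa > 2.0 and gmat > 350 and (gpa - 2.0) * (gmat - 350) >= 120
    return SATISFACTORY if ok else CHALLENGED

def segment_rule(gpa, gmat):
    # Two segments meeting at a corner: the program's current baseline minimums.
    return SATISFACTORY if gpa >= 3.0 and gmat >= 450 else CHALLENGED

applicant = (2.9, 640)  # hypothetical applicant just short of the GPA minimum
for rule in (line_rule, curve_rule, segment_rule):
    print(rule.__name__, rule(*applicant))
```

With these placeholder constants, the line and the curve let a strong GMAT score offset a slightly low GPA, while the segment rule rejects the same applicant outright.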

2. Use Naive Bayes classification and create a rule where Prob(Rating | GMAT, GPA) > 1/2. The e1071 and caret R packages are helpful here.

a. How does the rule compare to your estimates from question 1? What is the accuracy and reliability of this classification?

b. Use the model to predict what groups the applicants in the New Applicants tab would be in. How confident can you be for each student that they will be in the group predicted?

c. Plot the decision boundary and compare it with your estimates in 1a and 1b.
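
The posterior-probability rule can be sketched in a few lines. The following is a minimal Python illustration, not the e1071 or caret workflow. It assumes GPA and GMAT are each normally distributed within a group and independent given the group (the "naive" assumption), and it uses a small hypothetical training set rather than the MBAStudents.xlsm data.

```python
import math
from statistics import mean, stdev

# Hypothetical toy records (GPA, GMAT, group); NOT the MBAStudents.xlsm data.
# Group 1 = satisfactory, group 2 = challenged.
train = [
    (3.6, 620, 1), (3.4, 580, 1), (3.8, 650, 1), (3.2, 600, 1), (3.5, 560, 1),
    (2.6, 450, 2), (2.9, 480, 2), (2.4, 500, 2), (3.0, 430, 2), (2.7, 470, 2),
]

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def fit_naive_bayes(rows):
    """Estimate a prior and per-feature (mean, sd) for each group."""
    model = {}
    for g in sorted({r[2] for r in rows}):
        gpas = [r[0] for r in rows if r[2] == g]
        gmats = [r[1] for r in rows if r[2] == g]
        model[g] = {
            "prior": len(gpas) / len(rows),
            "gpa": (mean(gpas), stdev(gpas)),
            "gmat": (mean(gmats), stdev(gmats)),
        }
    return model

def posterior(model, gpa, gmat):
    """Return Prob(group | GPA, GMAT) for every group via Bayes' rule."""
    joint = {
        g: p["prior"] * normal_pdf(gpa, *p["gpa"]) * normal_pdf(gmat, *p["gmat"])
        for g, p in model.items()
    }
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

model = fit_naive_bayes(train)
post = posterior(model, 3.5, 600)      # hypothetical new applicant
predicted = 1 if post[1] > 0.5 else 2  # the "> 1/2" rule
print(predicted, post)
```

The posterior itself is one way to express how confident the model is about an individual applicant, subject to the normality and independence assumptions holding.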

3. Use linear regression with a dummy variable for group 2 as the dependent variable and GPA and GMAT as the independent variables.

a. Look at a boxplot and a density plot of the predicted values (discriminant scores) for each group. Do there seem to be clear differences between the groups? What would you estimate the cutoff discriminant score (discriminator) value to be? Can you reasonably interpret the discriminant score as a probability of being in a certain group?

b. Determine a discriminant cutoff value by averaging the mean discriminant scores for each group to create a classification rule. How does the rule compare to your estimate from before? What is the accuracy and reliability of this classification? Explain.

c. Plot the decision boundary and discuss the implications of the decision rule.

d. Use the linear discriminant model (LDA) to predict what groups the new applicants would be in. Start by plotting the applicants’ points on a scatterplot and compare with your earlier results. How confident can you be for each student that they will be in the group predicted? Hint: change the prediction interval confidence level until the interval no longer contains the discriminant cutoff value.

e. If you believe outliers are affecting the discriminant, determine a discriminant cutoff by averaging the median discriminant scores for each group to create a classification rule. How does the rule compare? What is the accuracy and reliability of this classification?

f. Perform a non-linear discriminant analysis (DA) and compare it to LDA.

g. Compare the predictions for the new applicants to the results of the other approaches used. What are the advantages and disadvantages of this approach?
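
The regression-based discriminant and its cutoff can be sketched as follows. This is a minimal Python illustration under stated assumptions: the training records are hypothetical, ordinary least squares is solved through the normal equations with a small hand-written solver, and the cutoff is the average of the two group mean scores, as the exercise describes.

```python
from statistics import mean

# Hypothetical toy records (GPA, GMAT, group); NOT the MBAStudents.xlsm data.
train = [
    (3.6, 620, 1), (3.4, 580, 1), (3.8, 650, 1), (3.2, 600, 1), (3.5, 560, 1),
    (2.6, 450, 2), (2.9, 480, 2), (2.4, 500, 2), (3.0, 430, 2), (2.7, 470, 2),
]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design matrix with an intercept; dummy response is 1 for group 2 (challenged).
X = [[1.0, gpa, gmat] for gpa, gmat, _ in train]
y = [1.0 if g == 2 else 0.0 for _, _, g in train]

# Normal equations: (X'X) coef = X'y
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(3)]
coef = solve(xtx, xty)

def score(gpa, gmat):
    """Discriminant score: the fitted value of the dummy-variable regression."""
    return coef[0] + coef[1] * gpa + coef[2] * gmat

scores = {g: [score(a, b) for a, b, grp in train if grp == g] for g in (1, 2)}
cutoff = (mean(scores[1]) + mean(scores[2])) / 2  # average of the group means

def classify(gpa, gmat):
    return 2 if score(gpa, gmat) > cutoff else 1

print(round(cutoff, 3), classify(3.5, 600), classify(2.5, 450))
```

Because the two toy groups are the same size, the cutoff lands at 0.5. The scores are not constrained to lie between 0 and 1, which is worth keeping in mind when asking whether they can be read as probabilities.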

4. Use a nearest-centroid approach to classify the new applicants: assign each applicant to the group whose centroid is closest. This resembles k-means clustering in which the clusters (the centers) are given rather than discovered. Start with a plot of the centroids and the distances from the points to be classified. What is the classification decision rule for this? Compare the results to the other approaches used. What are the advantages and disadvantages of this method?
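
A nearest-centroid rule might be sketched like this in Python. The training records are hypothetical, and the z-score scaling is an assumption added here so that GMAT, which has a much larger numeric range, does not dominate the Euclidean distance.

```python
import math
from statistics import mean, pstdev

# Hypothetical toy records (GPA, GMAT, group); NOT the MBAStudents.xlsm data.
train = [
    (3.6, 620, 1), (3.4, 580, 1), (3.8, 650, 1), (3.2, 600, 1), (3.5, 560, 1),
    (2.6, 450, 2), (2.9, 480, 2), (2.4, 500, 2), (3.0, 430, 2), (2.7, 470, 2),
]

gpas = [r[0] for r in train]
gmats = [r[1] for r in train]
center = (mean(gpas), mean(gmats))
spread = (pstdev(gpas), pstdev(gmats))

def scale(gpa, gmat):
    """Standardize both scores so they contribute comparably to distance."""
    return ((gpa - center[0]) / spread[0], (gmat - center[1]) / spread[1])

# Centroid of each group in the standardized space.
centroids = {}
for g in (1, 2):
    pts = [scale(a, b) for a, b, grp in train if grp == g]
    centroids[g] = (mean([p[0] for p in pts]), mean([p[1] for p in pts]))

def classify(gpa, gmat):
    """Assign the applicant to the group with the nearest centroid."""
    z = scale(gpa, gmat)
    return min(centroids, key=lambda g: math.dist(z, centroids[g]))

print(classify(3.5, 600), classify(2.5, 450))
```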

5. Use a k-nearest neighbors (KNN) approach to classify the new applicants. What is the classification decision rule for this model? How does it compare with any of your earlier answers? How confident can you be for each student that they will be in the group predicted? Compare the results to the other approaches used. What are the advantages and disadvantages of this method?
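
The KNN rule can be illustrated with a short Python sketch. The data are hypothetical, k = 3 is an arbitrary choice that would need tuning, and the share of neighbor votes is used here as a crude stand-in for confidence.

```python
import math
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy records (GPA, GMAT, group); NOT the MBAStudents.xlsm data.
train = [
    (3.6, 620, 1), (3.4, 580, 1), (3.8, 650, 1), (3.2, 600, 1), (3.5, 560, 1),
    (2.6, 450, 2), (2.9, 480, 2), (2.4, 500, 2), (3.0, 430, 2), (2.7, 470, 2),
]

gpas = [r[0] for r in train]
gmats = [r[1] for r in train]
center = (mean(gpas), mean(gmats))
spread = (pstdev(gpas), pstdev(gmats))

def scale(gpa, gmat):
    """Standardize so neither score dominates the distance measure."""
    return ((gpa - center[0]) / spread[0], (gmat - center[1]) / spread[1])

train_scaled = [(scale(a, b), g) for a, b, g in train]

def knn_predict(gpa, gmat, k=3):
    """Return (majority group among the k nearest students, share of votes)."""
    z = scale(gpa, gmat)
    nearest = sorted(train_scaled, key=lambda r: math.dist(z, r[0]))[:k]
    votes = Counter(g for _, g in nearest)
    group, count = votes.most_common(1)[0]
    return group, count / k

print(knn_predict(3.5, 600), knn_predict(2.5, 450))
```

An odd k avoids tied votes when there are only two groups.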

6. Use a decision tree (classification and regression tree, CART) approach to classify the new applicants. What is the classification decision rule for this model? How does it compare with any of your earlier answers? How confident can you be for each student that they will be in the group predicted? Compare the results to the other approaches used. What are the advantages and disadvantages of this method?
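
The core step of a classification tree, choosing the single split that leaves the purest groups, can be sketched in Python as below. This shows only one split chosen by Gini impurity on hypothetical data; a full CART fit would repeat the search recursively within each side and then prune.

```python
# Hypothetical toy records (GPA, GMAT, group); NOT the MBAStudents.xlsm data.
train = [
    (3.6, 620, 1), (3.4, 580, 1), (3.8, 650, 1), (3.2, 600, 1), (3.5, 560, 1),
    (2.6, 450, 2), (2.9, 480, 2), (2.4, 500, 2), (3.0, 430, 2), (2.7, 470, 2),
]

def gini(labels):
    """Gini impurity for two groups: 0 when the node holds one group only."""
    if not labels:
        return 0.0
    p = sum(1 for g in labels if g == 1) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Search midpoints of both features for the lowest weighted impurity."""
    best = None
    for f, name in enumerate(("GPA", "GMAT")):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[2] for r in rows if r[f] <= t]
            right = [r[2] for r in rows if r[f] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or impurity < best[0]:
                best = (impurity, name, t)
    return best

impurity, feature, threshold = best_split(train)
print(f"split on {feature} at {threshold:.2f} (weighted Gini {impurity:.3f})")
```

The resulting rule reads as a plain statement ("classify as challenged if the score is at or below the threshold"), which is the main interpretability appeal of tree methods.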

7. Use a logistic regression model to classify the new applicants. What is the classification decision rule for this model? How does it compare with any of your earlier answers? How confident can you be for each student that they will be in the group predicted? Compare the results to the other approaches used. What are the advantages and disadvantages of this method?
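
A logistic regression rule might be sketched as follows in Python. The fit uses plain gradient descent on standardized scores rather than the iteratively reweighted least squares a statistics package would typically use, and the learning rate and iteration count are arbitrary choices. The toy data are hypothetical and deliberately overlap, because perfectly separated groups have no finite maximum-likelihood fit.

```python
import math
from statistics import mean, pstdev

# Hypothetical, deliberately overlapping toy records (GPA, GMAT, group);
# NOT the MBAStudents.xlsm data. Group 2 = challenged.
train = [
    (3.6, 620, 1), (3.4, 580, 1), (3.8, 650, 1), (3.2, 600, 1),
    (3.5, 560, 1), (2.9, 500, 1),
    (2.6, 450, 2), (2.9, 480, 2), (2.4, 500, 2), (3.0, 430, 2),
    (2.7, 470, 2), (3.3, 590, 2),
]

gpas = [r[0] for r in train]
gmats = [r[1] for r in train]
center = (mean(gpas), mean(gmats))
spread = (pstdev(gpas), pstdev(gmats))

def scale(gpa, gmat):
    return ((gpa - center[0]) / spread[0], (gmat - center[1]) / spread[1])

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

X = [scale(a, b) for a, b, _ in train]
y = [1.0 if g == 2 else 0.0 for _, _, g in train]

w = [0.0, 0.0, 0.0]  # intercept, GPA weight, GMAT weight
for _ in range(5000):  # gradient descent on the average log-loss
    grad = [0.0, 0.0, 0.0]
    for (z1, z2), target in zip(X, y):
        err = sigmoid(w[0] + w[1] * z1 + w[2] * z2) - target
        grad[0] += err
        grad[1] += err * z1
        grad[2] += err * z2
    w = [wi - 0.5 * gi / len(X) for wi, gi in zip(w, grad)]

def prob_challenged(gpa, gmat):
    z1, z2 = scale(gpa, gmat)
    return sigmoid(w[0] + w[1] * z1 + w[2] * z2)

def classify(gpa, gmat, threshold=0.5):
    return 2 if prob_challenged(gpa, gmat) > threshold else 1

print(classify(3.8, 680), classify(2.3, 420))
```

Unlike the linear regression score, the fitted value here always lies between 0 and 1, and the threshold can be moved away from 0.5 if one kind of misclassification is considered more costly than the other.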

8. [Do this exercise on your own] Support Vector Machines (SVM) are a general approach to supervised classification and non-linear regression whose discriminant maximizes the separation of groups. Read about SVM (Ch. 9 in the ISLR book, Wikipedia, or the R libsvm package overview).

a. Perform an analysis of SVM using different kernels. Compare them on accuracy, reliability, and ability to interpret the classification decision rule.

b. Compare the SVM decision rules with the other methods used previously. How are they similar? Where do they differ?

c. Use your SVM models to make predictions of the classification for new applicants. How do these compare with predictions with the methods used previously?

d. Would you recommend using SVM for this classification problem? If so, which kernel, and why?

9. [Do this exercise on your own] You will need to recommend one or more classification methods for making admit decisions. To help you determine which methods to use, summarize each one on the following:

Predictive accuracy
Computational difficulty
Assumptions needed for the classification model (e.g. Normality, near neighbors share same classifications, etc.)
Limited to binary (2-groups) classification?
Interpretability of the classification rule
Ability to get reliable, meaningful impact of factors
Ability to get reliable, meaningful confidence in predictions
Tuning needed (e.g. number of neighbors to use)
Calibration needed (e.g. discriminant cut off value)
Data preparation needed (e.g. scale/normalize data for distance measure)
Complexity of method
Statistical inference capability
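
For the predictive accuracy criterion, a confusion matrix gives more detail than a single accuracy figure, because it separates the two kinds of error. A minimal Python sketch follows; the actual and predicted labels are hypothetical, and the per-group rates are one possible way to quantify the accuracy and reliability asked about in the exercises.

```python
from collections import Counter

# Hypothetical actual and predicted groups (1 = satisfactory, 2 = challenged).
actual =    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
predicted = [1, 1, 1, 2, 1, 2, 2, 1, 2, 2]

# counts[(a, p)] = number of students in actual group a predicted as group p
counts = Counter(zip(actual, predicted))

accuracy = sum(counts[(g, g)] for g in (1, 2)) / len(actual)
# Share of truly challenged students the rule flags as challenged.
challenged_recall = counts[(2, 2)] / (counts[(2, 2)] + counts[(2, 1)])
# Share of truly satisfactory students the rule would admit.
satisfactory_recall = counts[(1, 1)] / (counts[(1, 1)] + counts[(1, 2)])

for a in (1, 2):
    print(a, [counts[(a, p)] for p in (1, 2)])
print(accuracy, challenged_recall, satisfactory_recall)
```

Rates computed on the same students used to build the rule tend to be optimistic, so a held-out sample or cross-validation gives a fairer comparison across methods.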

Report

Write a professional report (as if you were a hired consultant or employee) for the college recruiting group. The report should be written at the “executive” level that quickly gets to the point and gives specific, actionable advice or solutions based on the data and analytics. Avoid technical aspects and terms that are non-essential and any speculations not substantiated by the data. This report should be concise without lengthy explanations being necessary to understand it.

There is no min or max page limit as charts and tables can take up a highly variable amount of space. However, any charts or tables included need to be understandable to a layman at first glance (labeled and captioned if needed). The particular models you use, interpretations, and advice given are your choice and you should be prepared to explain or defend this if needed!

Use this as an outline:

A. Description of the business problem

What are the key decisions that need to be made when sorting students into two groups? Indicate specifically what the options are (or give examples of options).
What are the overall goals, objectives and drivers for these decisions?
What are the important factors to consider for making the decisions?
What questions will be answered and how do these explicitly help address the decisions?

B. Data, methods, and models and results

At a high level, discuss the basic approach, the analytics used, the data, and any concerns about the integrity and quality of the data used. For example, “There is concern about students with x scores because they were not part of the data set. There is insufficient data to apply this assumption to other scores at x, y, and z.”
Describe the models used (put the formulas, tables, graphs, etc. here), indicating what they are used for (do not detail how they were developed or any technical details), e.g., “This calculation determines the probability that a student is in the ‘satisfactory’ group.” Provide detailed answers to the decision questions using the models.
Indicate all “important” assumptions made and why you think they are reasonable. An assumption is important if you cannot get a result without making it. For example, consider the assumption “The data includes unbiased representation of both successful and challenged students.” The concern here is that the data that was easy to get may have come only from students who were visible to the end of the program. If a student was challenged and dropped out, they may not have been included in the data set, and so there would be a bias toward successful students. Unfortunately, we don’t really have any way to prove the data doesn’t have this kind of “survivorship bias,” and so we must assume it’s not there. If this assumption were wrong or invalid, it would significantly affect our predictions of student success. Do not list technical assumptions used for statistical analysis, e.g., “We assume GMAT scores are normally distributed.”

C. Decision making

Explain specifically how the models and results are used to make the decisions indicated in Section A. This may be literal results such as “Never accept students with GPA and GMAT scores under 2.0 and 500, as we can be 99% confident they will be challenged.” Detail special considerations or issues to watch out for, e.g., “There is not enough student data to accurately predict students’ acceptance rate.”
Describe how the improvements or benefits from using the results for making the decisions can be measured or observed. That is, do past student GPA and GMAT data, and will future data, support the recommendations made?