When creating predictive models, it’s important to measure accuracy so that you can clearly state how good the model is. This article covers two mistakes that are commonly made when measuring accuracy.
1. Measuring Accuracy on the Same Data Used for Training
One common mistake is measuring accuracy on the same data the model was trained on. For example, say you have customer churn data from 2017 and 2018. You feed all of that data to the model for training, then use the same data to make predictions and compare those predictions with the actual results. That is like being given the question paper to study at home the day before the exam, and then getting the exact same question paper in the exam. Obviously, you are going to do great in the exam, but the score says little about how well you actually know the subject.
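To make the exam analogy concrete, here is a minimal sketch of a deliberately naive "model" that only memorizes its training rows. Everything in it is a hypothetical illustration rather than a real churn model: the `Memorizer` class, the `make_rows` helper, the customer IDs, and the random labels are all assumptions made for this example.

```python
import random


def make_rows(n, seed):
    """Build hypothetical (customer_id, churned) rows with random labels."""
    rng = random.Random(seed)
    return [(f"c{seed}-{i}", rng.random() < 0.5) for i in range(n)]


class Memorizer:
    """Toy model that stores every training row and looks answers up."""

    def fit(self, rows):
        self.table = dict(rows)
        return self

    def predict(self, customer_id):
        # Customers never seen in training fall back to "did not churn".
        return self.table.get(customer_id, False)


def accuracy_on(model, rows):
    """Fraction of rows where the prediction matches the actual label."""
    correct = sum(model.predict(cid) == label for cid, label in rows)
    return correct / len(rows)


train = make_rows(1000, seed=2017)   # stands in for the 2017 data
unseen = make_rows(1000, seed=2018)  # stands in for the 2018 data

model = Memorizer().fit(train)
train_accuracy = accuracy_on(model, train)    # perfect: it saw the answers
unseen_accuracy = accuracy_on(model, unseen)  # roughly a coin flip

print(f"accuracy on training data: {train_accuracy:.0%}")
print(f"accuracy on unseen data:   {unseen_accuracy:.0%}")
```

The model has learned nothing, yet it scores perfectly on the rows it was trained on. Only the score on unseen rows reveals that.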
The right way to do it is to separate the data into two sections: training and test. Train the model on the training data and measure accuracy on the test data, which the algorithm never saw during training. There are other techniques, such as cross-validation, which are outside the scope of this article.
In the customer churn example, you might train on the 2017 data and predict on the 2018 data. Or say you have 1000 rows: train with 80% (800 rows) and test with 20% (200 rows). The key is that the data you measure accuracy on must not have been seen by the algorithm during training.
For those 200 rows, you know the actual outcome. You can also have the model predict the outcome for those 200 rows, which the algorithm has never seen before.
Now you have both the actual outcome and the predicted outcome for those 200 rows. You can compare the two to get the accuracy. Say 160 of the 200 rows were predicted correctly. That means the accuracy is 160/200 = 80%.
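The 80/20 split and the accuracy calculation above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `train_test_split` and `accuracy` helpers are written here for the example (they are not taken from any particular library), the rows are placeholder integers, and the predictions are made up so that exactly 160 of the 200 match.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


rows = list(range(1000))  # placeholder for 1000 data rows
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))

# Made-up outcomes for the 200 test rows: 160 predictions match, 40 do not.
actual = [True] * 200
predicted = [True] * 160 + [False] * 40
test_accuracy = accuracy(actual, predicted)
print(f"accuracy: {test_accuracy:.0%}")
```

A random shuffle suits the 1000-row case. For the 2017/2018 case, splitting by year instead keeps the test data strictly later than the training data.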
2. Not Recognizing the Imbalance in Your Data
Let’s take fraud detection as an example. Say 95% of your transactions are not fraud. If the algorithm marks every transaction as not fraud, it is right 95% of the time. So the accuracy is 95%, but the 5% it gets wrong are exactly the fraudulent transactions, and missing those can break the bank. This is an important part of the puzzle, and the standard accuracy measure does not capture it. That is where other metrics such as sensitivity and specificity come in, which we will cover in a subsequent article. In addition, there are several techniques for dealing with imbalanced data. Refer to this article for more details.
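As a rough illustration of the fraud example, the sketch below scores a "model" that labels every transaction as not fraud. The 1000 transactions are made up for this example (50 fraud, 950 not fraud), and sensitivity and specificity are computed with their usual definitions: the share of actual fraud that was caught, and the share of legitimate transactions that were correctly cleared.

```python
# Made-up data: True means fraud. 5% of the 1000 transactions are fraud.
actual = [True] * 50 + [False] * 950

# A "model" that marks every transaction as not fraud.
predicted = [False] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(a and p for a, p in pairs)            # fraud, flagged as fraud
tn = sum(not a and not p for a, p in pairs)    # not fraud, cleared
fp = sum(not a and p for a, p in pairs)        # not fraud, flagged as fraud
fn = sum(a and not p for a, p in pairs)        # fraud, missed

accuracy = (tp + tn) / len(actual)
sensitivity = tp / (tp + fn)  # share of actual fraud that was caught
specificity = tn / (tn + fp)  # share of non-fraud correctly cleared

print(f"accuracy:    {accuracy:.0%}")
print(f"sensitivity: {sensitivity:.0%}")
print(f"specificity: {specificity:.0%}")
```

Accuracy comes out at 95%, while sensitivity is 0%: the model never catches a single fraudulent transaction.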