Scikit-learn is known among Python users for its easily understandable API, while Machine Learning in R (mlr) emerged as an alternative to the popular caret package, offering a larger suite of algorithms and an easy way of tuning hyperparameters. The two packages are somewhat in competition because many people in analytics turn to Python for machine learning and to R for statistical analysis.
One reason people may prefer Python is that, in R, the machine learning algorithms live in separate packages. mlr calls these packages, but each one still has to be installed separately. Even feature selection relies on external libraries, which bring their own dependencies that must be satisfied as well.
Scikit-Learn, by contrast, is described as a unified API to a number of machine learning algorithms that does not require the user to load any additional libraries.
This by no means discredits R. R is still a major component in the data science world regardless of what an online poll might say. Anyone with a background in Statistics and/or Mathematics will know why you should use R (regardless of whether they use it themselves, they recognize the appeal).
Now we will take a look at how a user would go through a typical machine learning workflow. We will use Logistic Regression in Scikit-Learn and a Decision Tree in mlr.
Creating Your Training and Test Data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size). This is the simplest way to partition datasets in scikit-learn. Here x is the set of features and y is the target variable. test_size determines what fraction of the data goes into the test set (for example, 0.2 for 20%). train_test_split creates the train and test sets in one line of code.
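As a minimal sketch of the split described above, the following uses a synthetic dataset from make_classification as a stand-in for real data. The dataset, the 20% test fraction, and the random_state values are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 100 rows, 5 features, binary target
x, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)  # (80, 5) (20, 5)
```

Fixing random_state is optional, but it makes the partition reproducible between runs.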
train <- sample(1:nrow(data), 0.8 * nrow(data))
test <- setdiff(1:nrow(data), train)
- mlr does not have a built-in function to subset datasets, so users need to rely on other R functions for this. The example above creates the row indices for an 80/20 train/test split.
Choosing an Algorithm
LogisticRegression(). The classifier is chosen and initialized by calling a clearly named class, which makes it easy to identify.
makeLearner('classif.rpart'). The algorithm is called a learner, and this function is called to initialize it.
makeClassifTask(data=, target=). If we are doing classification, we need to make a call to initialize a classification task. This function will take two arguments: your training data and the name of the target variable.
In either package, there is a process to follow when tuning hyperparameters. You first specify which parameters you want to change and the search space for each of them. You then run either a grid search or a random search to find the combination of parameter values that gives the best outcome (i.e. minimizes error or maximizes accuracy).
penalty = ['l2']
C = np.logspace(0, 4, 10)
hyperparameters = dict(C=C, penalty=penalty, dual=dual, max_iter=max_iter)
GridSearchCV(logreg, hyperparameters, cv=5, verbose=0)
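The snippet above refers to logreg, dual, and max_iter without showing their definitions. The following is a self-contained sketch of the same grid search idea under some assumptions: it searches only over C, leaves penalty at its default of "l2", and uses synthetic data. It is not a reconstruction of the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption)
x, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The estimator whose hyperparameters are tuned
logreg = LogisticRegression(max_iter=1000)

# Search space: ten values of C, log-spaced between 1 and 10,000
hyperparameters = {"C": np.logspace(0, 4, 10)}

# Try every candidate with 5-fold cross-validation
clf = GridSearchCV(logreg, hyperparameters, cv=5, verbose=0)
clf.fit(x, y)

print(clf.best_params_)           # the best value of C found
print(round(clf.best_score_, 3))  # its mean cross-validated accuracy
```

After fitting, clf.best_estimator_ holds a model refit on all of the data with the winning parameters.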
# dt_prob is the learner and dt_task the classification task created earlier;
# dt_tuneparam and dtree are names chosen here for the tuning result and the tuned learner
dt_param <- makeParamSet(
  makeDiscreteParam("minsplit", values = seq(5, 10, 1)),
  makeDiscreteParam("minbucket", values = seq(round(5/3, 0), round(10/3, 0), 1)),
  makeNumericParam("cp", lower = 0.01, upper = 0.05),
  makeDiscreteParam("maxcompete", values = 6),
  makeDiscreteParam("usesurrogate", values = 0),
  makeDiscreteParam("maxdepth", values = 10)
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc("CV", iters = 3L, stratify = TRUE)
dt_tuneparam <- tuneParams(learner = dt_prob, resampling = rdesc, measures = list(tpr, auc, fnr, mmce, tnr, setAggregation(tpr, test.sd)), par.set = dt_param, control = ctrl, task = dt_task, show.info = TRUE)
dtree <- setHyperPars(dt_prob, par.vals = dt_tuneparam$x)
Both packages provide one-line code for training a model.
This is, arguably, one of the simpler steps in the process. The most arduous steps are hyperparameter tuning and feature selection.
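As a rough illustration of the scikit-learn side of this step, the sketch below fits a logistic regression with a single call to fit. The synthetic data and the max_iter value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and an 80/20 split (assumptions)
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# The one-line training step: fit() learns the coefficients and returns the estimator
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

print(model.coef_.shape)  # (1, 5): one coefficient per feature for a binary target
```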
Just like training the model, prediction can be done with one line of code.
predict(trained_model, newdata = test_data)
Scikit-learn returns an array of predicted labels, while mlr returns a prediction object that stores the predicted labels in a data frame.
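To make the scikit-learn return type concrete, here is a small sketch showing that predict gives back a NumPy array with one label per test row. The synthetic data and model settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and an 80/20 split (assumptions)
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# The one-line prediction step
predicted = model.predict(x_test)

print(type(predicted).__name__, predicted.shape)  # ndarray (40,)
```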
The most popular method for evaluating a supervised classifier is the confusion matrix, from which you can obtain accuracy, error, precision, recall, etc.
performance(prediction, measures = list(tpr, auc, mmce, acc, tnr))
Both packages offer more than one method of obtaining a confusion matrix. However, if you want an informative view with the least effort, Python is not as informative as R. The first Python method returns only a matrix with no labels, so the user has to go back to the documentation to work out which rows and columns correspond to which category. The second method has more informative output, but it reports only precision, recall, F1 score, and support. These are, however, also the more important performance measures in an imbalanced classification problem.
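The two kinds of scikit-learn output described above match what confusion_matrix and classification_report produce, so the sketch below uses those two functions on a small set of hand-written labels. The labels and the class names are made up for illustration.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-written labels (assumptions): 1 is the positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Unlabeled matrix: rows are true classes, columns are predicted classes,
# both in sorted label order
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[5 1]
           #  [1 3]]

# For a binary problem the four cells can be unpacked by name
tn, fp, fn, tp = cm.ravel()

# Labeled text report with precision, recall, F1 score, and support per class
print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
```

Unpacking the cells by name is one way to avoid going back to the documentation to decode the unlabeled matrix.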
Decision Thresholding (i.e. Changing the Classification Threshold)
A threshold in a classification problem is the probability cutoff used to assign each instance to a predicted category. The default threshold is 0.5 (i.e. 50%). This is a major point of difference between machine learning in Python and in R. R offers a one-line solution for changing the threshold to account for class imbalance. Scikit-learn has traditionally had no built-in function for this (version 1.5 added FixedThresholdClassifier), so users typically change the threshold programmatically with their own scripts or functions.
- There is no single standard way of thresholding in Scikit-learn. For one way to implement it yourself, see the article Fine-Tuning a Classifier in Scikit-Learn.
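One common do-it-yourself approach, sketched here under stated assumptions, is to compare the positive-class probability from predict_proba against a cutoff of your choosing. The helper predict_with_threshold is a hypothetical name, not part of scikit-learn, and the imbalanced synthetic data and the 0.2 cutoff are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data (an assumption): roughly 90% class 0, 10% class 1
x, y = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)
model = LogisticRegression(max_iter=1000).fit(x, y)


def predict_with_threshold(fitted_model, features, threshold=0.5):
    """Hypothetical helper: label an instance 1 when its positive-class
    probability is at least `threshold`, otherwise 0."""
    positive_probability = fitted_model.predict_proba(features)[:, 1]
    return (positive_probability >= threshold).astype(int)


default_labels = predict_with_threshold(model, x, threshold=0.5)
lenient_labels = predict_with_threshold(model, x, threshold=0.2)

# Lowering the cutoff can only keep or increase the number of predicted positives
print(default_labels.sum(), lenient_labels.sum())
```

The resulting labels can be passed to the usual metric functions to see how the confusion matrix shifts as the cutoff moves.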
setThreshold(prediction, threshold). This one line of code in mlr returns a prediction with the new threshold applied, which can then be passed as an argument to calculate your new performance metrics (e.g. the confusion matrix).
In the end, both mlr and Scikit-learn have their pros and cons for machine learning. This is a comparison of the two and is not meant as a reason to use one instead of the other. Knowing both is what can give someone in the field a true competitive advantage, and a conceptual understanding of the process makes either tool easier to use.