In this post we will analyze the data of the ADD HEALTH, a national longitudinal study that enrolled ~90000 adolescents and followed them up for 20 years. Using the R package caret
, we will train and evaluate different models (regression, decision tree, random Forest) in order to predict drug use behavior in adulthood from information (friends, social activities, parents and others) collected during adolescence. This will give us the opportunity to play and get familiarized with the most important caret
functions as well as to introduce seminal concepts of machine learning.
The bout: the data to analyze
ADD HEALTH is a unique study both in terms of the extent of information collected (“the largest, most comprehensive longitudinal survey of adolescents ever undertaken”) and duration (started in 1994 and is still on going study design).
We will analyze the publicly available data, a small fraction of the total subjects surveyed ~4%, but still a quite decent dataset of ~ 6000 observations X several thousand variables.
It is important to keep in mind that the current analysis will be performed on unweighted data (ADD sampling design is quite complex) and models will be evaluated in samples not necessaries representative of the population. This can affect the final estimates of the model effectiveness.
Users interested in the specific statistical procedures suggested for ADD Health weight adjustment can find info in this guide and this article.
Obtain the dataset and make it tidy
We start loading some libraries.


The number of variables collected for the ADD HEALTH is of several thousands. We will focus our analysis on most of the variables collected in the first wave (first survey in 1994) that include demographic information, friend and social network details, parental information and many others . Our classifier will be a response to a question of fourth wave (2006). A description of all the variables of the study can be found using the codebook web Explorer.
I downloaded the publicuse data from ICPSR and then saved all the files with dta (Stata) extension in a folder. At this point, we can i) import all the files, ii) select the columns that are in the vector var_select
and iii) join the dataframes together by the subject identification variable (AID
). Finally, iv) all the the columns that contain character data are transformed to factors.


Some columns are factors with only one level. We are going to remove them.


Some columns miss quite a lot of data. We can impute missing observations only if missing data does not represent a significant portion of the data of the column. We are going to drop all the columns that have more than 15% of missing data.


The H4TO117 column is our classifier, the response to the following question:
“Have you ever continued to use {favorite drug} after you realized using {favorite drug} was causing you any emotional problems (such as feeling depressed or empty, feeling irritable or aggressive, feeling paranoid or confused, feeling anxious or tense, being jumpy or easily startled) or causing you any health problems (such as heart pounding, headaches or dizziness, or sexual difficulties)?”
Let’s clean it up and see the responses.


Legitimate stands for Legitimate skips, that indicates those people that do not have a favorite drug.
The answers to this question not only allow to distinguish people that use (or had used drugs) from people that did not (legitimate skip), but also to identify those individuals that continue to take drugs despite negative consequences (yes). The class yes will be our class of interest because it indicates those people that are presumably more likely to develop health problems as a consequence of drug use patterns.
Therefore, we will try first to fit 2 class models, before trying something more complex. We replace Legitimate with No to obtain 2 classes: those that continued to take drugs despite negative consequences (Yes) and those that did not (No).


1 2 
## Yes No ## 245 4869 
Train (the Models) for the fight
Cross Validation, Class Imbalance
After some data exploration, we can finally develop (train) and evaluate (test) predictive models. In other words, we are going to find the best parameters for the different models and test them. Can we train and test models using exactly the same data? That would not be a good idea at all, as it would results in an overestimation of the accuracy of our models (we would end it up fitting also noise). Why?
Imagine to be Muay Thai fighter that is getting ready for his first fight. If you train and spar always with the same person, you might end it up knowing his/her strategies pretty well, and developing way to respond and exploit his/her way of attacking and defending. For an external observer you might look even very good when you spar with him/her. Does it mean that you are going to do well also in your fight? Not really. Your opponent might have a difference stance of your training partner or a completely different style. You might have not encountered those moves in training, and thus perform not well under those different conditions (and get your ass kicked). So what can you do to get ready for your fight and have a more accurate understanding of your potential? Don’t train and test your skill always with the same person. What you want to do is to train with as many different people as you can, and then occasionally test your skills with someone that you have never trained before. This will maximize your chances of having trained with someone similar to your actual fight opponent, and also give you a better idea if you are ready or not.
We are going to do the same thing with our data. We will divide our dataset in 10 samples, train our model with 9 of them and evaluate it with the 10th. Afte that we will select a different sample for testing, and reapply the same procedure until each one of the folds is used at least once for testing. In this way, we well never train and test our model with the same data and we will get an estimate of variability for the accuracy of the model. The entire procedure will than be repeated 5 times with reshuffled data (new partitions). This procedure is called repeated crossvalidation.
Unfortunately we have another problem. The people that keep taking drugs despite negative consequences are about 4% of the total number of people in our dataset. We have a severe class imbalance. Because of such a difference in the size of our 2 classes, it is more likely that the small class we will have much less weight in the parameter optimization of our model. There are several things that we can do to mitigate that problem. One possibility is to randomly upsample the smaller class to make the class distribution equal. We could run the caret function upSample
that does exactly that and than train and test our model on the resampled data. What is the problem in doing that? It’s that it is very likely that the same observations will be part of the data that used to train the model but also test its accuracy, that is because we are going to duplicate a lot of observations for the class “yes”. Here we are again, training and sparring with the same partners as before! We need to train our model on the up Sample data, but then test it on part of the not resampled data! Max Kuhn, the creator of caret
, made it easy for us, and not we can do that using an option when we invoke the function trainControl
for crossvalidating the data. Perfect!
TrainControl and Hyperparameters
There is another important detail to take in consideration when we partition data. We want to make sure to maintain as equal as possible the distribution of our class of interest in the train and test samples. In other words, we want to avoid that for example the “yes” responses to question H4TO117 will end up all in one of the training partitions but not the other. This is done automatically by caret::trainControl
.
However, our dataset has another relevant variable that we need to take in consideration: CLUSTER2. In the ADD HEALTH study, subjects were chosen in clusters (i.e. schools). It is likely that observations of same clusters will be more similar to each other than to those coming from a different cluster. Again, if the data that we use for training and testing the model are very similar, we might end up obtain overoptimistic estimates overoptimistic estimates of accuracy. This would be equivalent to test your fighting skills sparring with someone that you have never trained with before but that comes to your same gym. He/she has trained under the same coach, and thus it is very likely that is going to employ moves that you have seen before. Would you still perform well against someone with a completely different style and background? Difficult to say. It would be better to test your skills sparring with someone from a different gym. Similarly, we can limit this problem taking in account the variable CLUSTER2
when we partition our data and try to avoid as much as possible to train and test the model with data from exactly the same cluster. The function caret::groupKFold
will do it for us.


All the operations that we discussed in the last two paragraphs are taken care of by caret::trainControl
. You might have notice that in the code above we are not using a repeated crossvalidation but we have indicated instead method = "adaptive_cv
. What is that? It is an algorithm to adaptability look for the best tuning parameters for our model and, as Max Kuhn put it “concentrates on values that are the in the neighborhood of the optimal settings”.
Let’s clarify what are this tuning parameters. The tuning parameters (hyperparamters) are those parameters that are not derived by the data but are decided beforehand. For example, in order to increase the interpretability of regression and reduce colinearity, we might decide to limit the number of coefficients and/or their total magnitude, and thus develop a more parsimonious model. In the package glmnet
, we can do that setting the hyperparameters alpha
and lambda
. Similarly, in decision tree we need to decide how many variables to consider at each split (mtry
) or other parameters that control the tree size (cp
). Using method = "adaptive_cv
, combinations of hyperparmetes will be automatically searched and evaluated in each sample to obtain the best parameters and hyperparameters for the model (i.e. that maximize accuracy).
Preprocessing
Before training our first model we need to spend also some few words on preprocessing. This is getting long but who cares? No one will really read this post (maybe Alessandro will check it out). We have a lot of variables, more than 2000 and that put quite some computational burden on my personal laptop. We might no need all those variables, because some of them might have near zero variance (imagine a column made of only one identical value) or might be very correlated to each other. Moreover, we can employ principal component analysis (PCA) to use linear combination of variables that explain most of the variability of the data. Finaly, we can impute missing data using a knearest neighbors algorithm, that use the distance to the closest sample to calculate the values that will be substitute to the missing data. All this computations can be very conveniently done before training our models using the argument preProcess = c("nzv", "corr","knnImpute","pca")
, when calling the function caret::train
.
glmnet
We are ready! Let’s train our first model, “a generalized linear model via penalized maximum likelihood”, using the package above mentioned glmnet
. Because of the limited computational capability of my laptop we are going to try only 5 different combinations of hyperparameters for each resample (`tuneLength = 5
). This is definitely an inadequate number of combinations*_ (I would have tried at least 100 if I could) but will still give as the opportunity to see how to train a model an evaluate it.


All the info regarding our model, including the coefficients of the different predictors, are contained in mod_glmnet$finalModel
. A summary of the details regarding the preprocessing, resampling, hyperparmeter tuning can be easily retrieved `print
ing the model.
Notice that many variables, more than the number of columns, have been removed during the preProcessing
. This is because caret
automatically convert factors in
dummy variables.
Was the model any good? Was it able to predict if a person was going to take the favorite drug despite negative consequences?.
Some estimates of the model efficacy are reported above and we will discuss them in depth later in the post. Now, let’s train few other models before looking at them all together.
Classification Trees and Random Forest
Another type of very popular predictive model is classification tree. Classification trees are very easy to implement and interpret (if number of predictors are not too much) and employ recursive splitting the data in groups that are internally as homogeneous as possible in terms of measures of purity (i.e Gini index) of the target classifier. Here a good visual example of possible splits and of their relative level of purity. In other words, what the tree does is to create a concatenation of ifthen.
Continuing the analogy of Muay Thai, a branch of a classification tree might look like this : if fighter A trains 7 days a week if his/her opponent train only once a week and if fighter A has 150 fights more than his opponent and if fighter A is less than 30 years old and more than 18 than fighter A is going to win the fight.
We grow the first tree using the package rpart
.


Let’s grow also a second tree model using the package party
. The major difference with the preceding classification tree model is that party::ctree
implements inferential statistic to evaluate improvements in measures of purity, and thus to guide the recursive splitting.


Finaly, we will train and evaluate a Random Forest model, that is an essamble of multiple classification trees. Usually Random Forest perform better than classification trees but this increased accuracy comes at the expenses of the model complexity and interpreatability.


How did the models performed? We will look in dept at different measures of accuracy in the next section.
Fight night: Evaluation of the Models
Confusion Matrix
We can get a first feel of how the model is performing by looking at the confusion Matrix which indicates the correct prediction in each of the classes. Instead of using the caret::confusionMatrix
, I am going to create my own function for extracting that info from model$table
and arrange it in table with a different layout.
We will create 2 different method dispatches depending on the class of the object. This will allow us to easily apply the function to a single model or a list of models.
Of the 4 models, glmnet
is the one that has the highest rate of detection of Yes (True Positive), but also the highest rate of False Alarms (False Positive). The 2 treebased models detect only a very small fraction of True Positive, whereas the Random Forest does not get any hit and always bets for No.
Sensitivity (Recall), Specificity, Precision
The number of True and False predictions can be used to calculate other metrics of the performance of the models.
Sensitivity AKA Recall: True Positive Rate, TruePositives/ (TruePositive + FalseNegatives). It indicates the percentage of Positives that are detected by the model.
Specificity: True Negative Rate, TrueNegatives/ (TrueNegatives + FalsePositives). It indicates the percentage of Negatives identified by the model.
While the sensitivity informs us on how likely is that a positive case will be detected, it does not tell us how much confident we can be that a positive prediction will be indeed true. This is expressed by the precision.
Precision: TruePositives / (TruePositives+FalsePositives). It indicates how accurate are the positive predictions.
F: the harmonic mean of Precision and Recall.
We create the function summary_td
that calculates those metrics and then with get_metrics
we extract them from the different folds of the splitted dataset.


1 2 3 4 5 6 
## Observations: 28 ## Variables: 4 ## $ model <chr> "glmnet", "glmnet", "glmnet", "glmnet", "glmnet", "glmnet… ## $ metric <chr> "AUC", "F", "Precision", "Recall", "ROC", "Sensitivity", … ## $ mean <dbl> 0.12385818, 0.18717855, 0.11997301, 0.44701486, 0.7120873… ## $ sd <dbl> 0.03399199, 0.05527510, 0.03646493, 0.13523005, 0.0797699… 
We plot the results.


Again, the glmnet model has the highest sensitivity and *F*. The treebased models have very good specificity but poor sensitivity and precision. The performance of the Random Forest is even worse and, because no positive can be detected the precision and F cannot be calculated. Overall, all the 4 models have poor performances.
Thresholds, ROC curves and PrecisionRecall Curves
Thresholds
Let’s take a closer look at the predictions of one of the model.


1 2 3 4 5 6 7 8 9 
## Observations: 5,114 ## Variables: 7 ## $ mtry <dbl> 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, 157, … ## $ pred <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No,… ## $ obs <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No,… ## $ Yes <dbl> 0.046, 0.030, 0.048, 0.040, 0.032, 0.048, 0.052, 0.060,… ## $ No <dbl> 0.954, 0.970, 0.952, 0.960, 0.968, 0.952, 0.948, 0.940,… ## $ rowIndex <int> 1623, 1655, 1657, 1636, 1628, 1632, 1634, 1627, 4555, 1… ## $ Resample <chr> "Fold03", "Fold03", "Fold03", "Fold03", "Fold03", "Fold… 
You will notice that the model provides for each case a certain probability of Yes and of No. This probabilities are then used to generate the corresponding prediction, with values greater 0.5 cutoff flagged as Yes and smaller as No. However, there are scenarios in which **we might be willing to trade same precision and tolerate higher number of False Positives in exchange of an higher sensitivity **, for example when a False Negative and an error of the model might result in a big explosion in the real life out there (e.s. algorithm that identifies mines from sonar data).
The graph and the table above reported display metrics calculated using a cutoff probability of 0.5. A model like the RandomForest that did not predict any Yes might be able to get some positive hits, under specific thresholds conditions. This introduce another important parameter and makes everything a little bit more complicated, right? What can we do?
To have a better idea of how the models will perform under those different conditions, we can simply vary systematical the cutof probability for assigning the prediction values (Yes or No) and then calculate the metrics above indicated for each of the threshold. That is exactly the concept behind the ROC and PrecisionRecall Curves.
ROC curves
ROC curves are generated by varying the threshold probability and plotting the sensitivity of the model ( TruePositives/ (TruePositive + FalseNegatives) ) as a function of 1  Specificity ( TrueNegatives/ (TrueNegatives + FalsePositives) ). In other words, the ROC curve plots the rate of True Positives (over the total positive) as a function of the rate of True Negatives.
To calculate the data of the curves we will use the package PRROC
. The package has same very handy functions to calculate (and plot) ROC and PrecisionRecall curves. We create our own function that extracts the data from multiple `train
objects using the commands PRROC::roc.curve
and PRROC::pr_curve
, and then apply it to the same unpacked arguments (quasiquoted) with purrr:invoke_map
`.


We obtain the data of the ROC curves applying our new created get_rocpr
function to the list of models.


1 2 3 4 5 
## Classes 'tbl_df', 'tbl' and 'data.frame': 5332 obs. of 4 variables: ## $ FPR : num 0 0.000205 0.000411 0.000616 0.000616 ... ## $ Sensitivity: num 0 0 0 0 0.00408 ... ## $ Threshold : num 0.983 0.98 0.971 0.969 0.967 ... ## $ modname : chr "glmnet" "glmnet" "glmnet" "glmnet" ... 
A perfect model that is able to capture all the “Yes” and all the “No” will have Sensitivity = Specificity = 1, and thus will generate a curve that passes through the upleft corner, continues parallel to the xaxes and reaches the topright corner. On the other hand, a model that gives random predictions will generate a curve that will lie in the vicinity of diagonal of the graph. Of course we might have also a model that make predictions worse than random, in that case the curve will be to the right of the dotted line.
Before looking at the data, let’s slow down a little bit and spend few more worlds on the concept of threshold. At least initially, it is appealing to set the cutoff probability for assigning Yes or No equal to 0.5. What does that mean?
It means that all the cases with probability of Yes less than 0.5 would be flagged as No, while all the other with probability more than 0.5 would be assigned to Yes. Our RandomForest
model generates probabilities that are all less than 0.5, and thus, with that threshold, it always predicts No (as we saw in the first table and graph). Because in ROC curves (and precisionrecall curves) the threshold is systematically varied, **a model with Yes probabilities always less than 0.5 ** (as our RandomForest) can still have good sensitivity, as long as there is a consistent probability differential between Yes and No observations.
Here an example. We create 30 observation: the first 29 are “No” and the last one is “Yes”(lab
). The probabilities generated by the model for the Yes cases are all less than 0.5 (sc
), with the first 29 equal to 0.4 and the the last one, the Yes observation, equal to 0.41.
We can automatically plot
the data of class PRROC
.


The legend indicates the threshold at which sensitivity and 1specificity are calculated. When the threshold is set to 0.41, the model correctly recognizes the last observation as Yes and all the others as No. The predictions of the model are perfect, even if the probabilities for Yes are all minor than 0.5. Bearing this in mind we can look at our models.
Let’s plot the data of our models using ggplot2
.


The curve of the glmnet
model is positioned more to the upleft of the graph than those of all the other models. The RandomForest
ROC curve is immediately below the glmnet
, followed by the rpart
and
ctree
` curves. These last two curves lie in close proximity of diagonal of the graph.
Therefore, if we consider these 2 metrics, the glmnet
has the best performance and it is followed by the RandomForest
. As in the example above, varying the probability threshold of the RandomForest
allows us to capture more true positives and optimize the predictions of the model. On the other hand, the predictions of rpart
and ctree
models remain only marginally better than chance.
The best model (glmnet
) detects 75% of the true positives (sensitivity=0.75), but has 50% chances of incorrectly identifying a negative (1  Specificity = 0.5)…better than chance but still far from perfect.
PrecisionRecall Curves
A good argument has been made that the ROC curve can be misleading when dealing with very imbalanced classes. When the positive class is very small, the probability of making False Positive predictions increases. Because False Positives might still be a small percentage in comparison with True Negative, this different probability might only minimally affect the specificity of the model.
PrecisionRecall curves take that into account and plot the Precision (TruePositives / (TruePositives+FalsePositives)), instead of the specificity, as a function of the sensitivity (refereed as recall). In particular, the threshold probability is systematically varied, the 2 measures calculated and their maximal value plotted.
Let’s plot precisionrecall curves of our models.


The glmnet
model show the highest precision, followed by the RandomForest
, the ctree
and finally the rpart
model. Even our best model glmnet
, has a maximal precision of 0.5 (only one over 2 positive predictions are true) at a recall of less than 10%. This performance is better than pure chance but not as good as we would have hoped.
The Area Under the Curve
We can summarize and compare the data of the ROC and precisionrecall curves calculating the Area Under the Curve (AUC). The previously obtainedmetrics
tibble contains the AUC of ROCs and precisionrecall curves for each of the folds of the dataset. Let’s extract the data, summarize it and plot it.


The glmnet
has the larger AUCs of ROCs and precisionrecall curves, followed by the RandomForest
. The 2 classificationtree models have the smallest AUCs, with values that approximate to those expected for random choices.
The glmnet
model produces the most accurate predictions among the models, but its precision and sensitivity are still quite limited.
Highlights, judges score and result of the bout!
Let’s recap…
The aim of this post was to predict drug use behavior in adulthood using info (friends, social activities, parents and others) collected during adolescence. The dataset used, the public portion of the ADD HEALTH PROJECT consists in over 2000 observations of more than 5000 subjects. The classifier is the response (Yes or No) that indicates drug use despite negative consequences.
This is what we did:
 got the data and made it tidy.
 scaled, normalized, removed nearzero variance variables, applied PCA to reduce dimensionality and impute missing data using Knn.
 splitted the data using a 5fold 10repetition adaptative crossvalidation.
 trained and found the optimal hyperparameters of a generalized linear model via penalized maximum likelihood (
glmnet
), 2 classificationtree models (
rpart
and
ctree
) and a Forest (
RandomForest
`).  UpSample one of the class to manage a severe class imbalance (only ~4% of No).
 visualized confusion Matrix, calculated metrics of accuracy, ROC curves and precisionrecall curves to identify the best model.
…so what is the outcome? Who won?
The best model, glmnet
, give us a consistent advantage over pure change for predicting who in adulthood will experince some drug abuse problems. However, the perdictions were disappointingly very far from perfect.
…so I would say that it’s…
The performance of the best model was very modest but we had the opportunity to explore a lot of the fundamental concepts in machine learning and predictive analysis, and overall that was the most important aim.
Moreover there are several things that can be done to increase the performance of the models:
 The poor performance was at least partially due to the severe class imbalance and the limited number of positive (~4%: ~300). The public dataset that we analyzed is just a small portion of the data collected by ADD HEALTH. Analyzing the whole dataset would give us access to more positive cases and most likely increase the performance of the models.
 We have evaluated only an handful of hyperparameters. With higher computational power we would be able to find better hyperparameters and probably increase the performance of the models.
 This post was mainly focused on the use of
caret
for predictive modeling, and thus not enough attention was placed on data exploration. Know your data and you will be able to apply the most appropriate tools…
I hope that you liked the post. Next post will be on Shiny, Python and Reticulate or on PCA or something else. I still have to decide.