Smoker Propensity Model Using Deep Learning

Shanshan Li, Xiaomin Lin, Lei Niu, Zizhen Wu and Haimao Zhan
Tuesday 09 April 2019


In life insurance, accurate knowledge of an individual's tobacco use is critical to assessing mortality risk. Medically, it is well established that smoking increases the risk of cardiovascular and respiratory diseases, cancers, and other ailments. In an accelerated underwriting program, however, the lab tests that would detect nicotine metabolites are waived. Relying solely on self-reported smoking status therefore presents an obvious business risk: a smoking applicant may be motivated, by the prospect of a better price, to lie about using tobacco.

In this project, we aim to build a smoker propensity model to predict the probability that an applicant is a tobacco user.


We pull the historical records of MassMutual term product applicants. The inputs are applicants' family histories, obtained from the client medical interview, and their credit risk attributes and public records, obtained from LexisNexis. The data have been cleaned and preprocessed. The preprocessed dataset includes ~232,000 applicants, of whom ~8% are smokers. There are 345 predictors, most of which are categorical variables.

Our dataset has a few characteristics:

  • Sparsity: many predictors are categorical variables with near zero variance.
  • Noise: many attributes are self-reported, and missing values are imputed by draws of random values from the corresponding age-gender specific distribution.
  • Weak correlation between the predictors and the outcome.
  • Imbalanced outcomes: 92% non-smokers and 8% smokers.
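Sparsity of the first kind is often screened with a frequency-ratio check before modeling. A minimal pure-Python sketch, with an illustrative cutoff (the threshold is an assumption for illustration, not a value from our pipeline):

```python
from collections import Counter

def near_zero_variance(values, freq_ratio_cutoff=19.0):
    """Flag a categorical predictor whose most common level dominates.

    freq_ratio_cutoff: ratio of the most frequent level's count to the
    second most frequent level's count above which the predictor is
    considered near-zero-variance (19 corresponds to a ~95/5 split).
    """
    counts = Counter(values).most_common(2)
    if len(counts) < 2:          # constant column: zero variance
        return True
    return counts[0][1] / counts[1][1] > freq_ratio_cutoff

# A column where 97% of applicants share one level is flagged:
col = ["N"] * 97 + ["Y"] * 3
assert near_zero_variance(col) is True
assert near_zero_variance(["N", "Y"] * 50) is False
```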


General guidelines

The data are now ready, but before diving into modeling (and even after generating results), we carefully consider the following questions:

  • Is our data linearly separable?
  • Is a deep neural network suitable for our data?
  • If we decide to use a deep neural network (DNN), which for our problem is a feedforward network, what are the appropriate number of layers and the size of each layer?
  • Does the DNN perform better than other algorithms?

For the majority of practical binary classification problems, the data are not linearly separable unless mapped to a higher-dimensional space (e.g., via the Gaussian kernel in an SVM).
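XOR is the classic illustration: no line separates it in two dimensions, but adding the product feature x1*x2 makes it linearly separable, which is the same trick a kernel performs implicitly. A small self-contained sketch:

```python
from itertools import product

# XOR labels on the four corners of the unit square
points = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

def separates(w, b, data):
    """True if sign(w . x + b) matches every label (label 1 => positive)."""
    return all(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in data
    )

# In the original 2-D space, a brute-force scan over many lines fails:
grid = [i / 4 for i in range(-8, 9)]
assert not any(
    separates((w1, w2), b, points)
    for w1 in grid for w2 in grid for b in grid
)

# After mapping (x1, x2) -> (x1, x2, x1*x2), one hyperplane works:
lifted = [((x[0], x[1], x[0] * x[1]), y) for x, y in points]
assert separates((1, 1, -2), -0.5, lifted)
```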

A key factor in deciding whether a DNN model is worth considering for a dataset is the size of the data. In practice, deep learning achieves better performance than traditional machine learning algorithms given a sufficient amount of data (see Fig. 1), but this advantage fails to hold when insufficient data is available for training. However, there is no consensus on what the threshold of "sufficiency" is. Some discussions of this topic can be found here.

[Fig. 1: performance of deep learning vs. traditional machine learning algorithms as the amount of data grows]

When it comes to determining an appropriate "size" for the neural network, practitioners still follow a manual and cumbersome trial-and-error process, though some efficient search algorithms have been proposed to automate it [1]. There are, however, some empirically derived rules of thumb, for example:

  1. For the majority of practical problems, it is worth starting from a simple architecture: a single hidden layer with (input layer size + output layer size)/2 neurons. Without regularization, this sometimes performs no worse than more complex structures (MLPs). An extensive discussion of this issue can be found here.
  2. Although a complex structure tends to bring overfitting, regularization (including hidden-layer dropout, L1, and L2) helps reduce it by shrinking the weights on the input, hidden, and output neurons. Therefore, it is worthwhile to try an MLP with appropriate regularization through a hyperparameter tuning process, particularly when a single-layer perceptron (SLP) generates an unsatisfactory result. Nevertheless, this remains an open question; attempts to offer an algorithmic solution include [2] and [3].
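The first rule of thumb above is simple enough to state as code; for our 345 predictors and a single output neuron, it suggests starting around 173 hidden units (a starting point only, not a tuned value):

```python
def rule_of_thumb_hidden_size(n_inputs, n_outputs):
    """Starting hidden-layer width: (input size + output size) / 2."""
    return (n_inputs + n_outputs) // 2

# 345 predictors, 1 output neuron for binary classification
assert rule_of_thumb_hidden_size(345, 1) == 173
```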


At the early stage of our project, we found that when TensorFlow generated prediction probabilities from its trained model, it almost always returned the same value. In other words, the AUC was 0.5 and everyone was predicted to be a non-smoker. To troubleshoot the problem, we made a few attempts:

  • To mitigate the issue of imbalanced outcome, we tried to include a weight vector in the tf.estimator.DNNClassifier, where the weight is inversely proportional to the sample size of each group.
  • As an alternative approach, we also tried to downsample the non-smokers in the training dataset, such that the proportions of smokers and non-smokers are relatively balanced.
  • Because many variables are weakly correlated with the outcome, we ranked the variables by their information values and removed those with zero information value.
  • We carefully selected the weight initialization and the activation function of the NN.
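The first two mitigations above can be sketched in plain Python: inverse-frequency class weights and random downsampling of non-smokers. The 8%/92% split comes from the data description; with tf.estimator.DNNClassifier, such weights would be supplied via its weight_column argument:

```python
import random

labels = [1] * 8 + [0] * 92          # ~8% smokers, ~92% non-smokers

# Inverse-frequency weights: each class contributes equally in total.
n = len(labels)
n_pos = sum(labels)
weights = {1: n / (2 * n_pos), 0: n / (2 * (n - n_pos))}
weight_column = [weights[y] for y in labels]
assert abs(sum(w for w, y in zip(weight_column, labels) if y == 1)
           - sum(w for w, y in zip(weight_column, labels) if y == 0)) < 1e-9

# Downsampling: keep all smokers, sample an equal number of non-smokers.
random.seed(0)
smokers = [i for i, y in enumerate(labels) if y == 1]
non_smokers = [i for i, y in enumerate(labels) if y == 0]
balanced = smokers + random.sample(non_smokers, len(smokers))
assert len(balanced) == 2 * len(smokers)
```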

Hyperparameter Tuning:

Given the large number of hyperparameters, hyperparameter tuning has been a major barrier to applying NNs in broader areas. On the one hand, the whole process can be very time-consuming. On the other hand, in our experiments we found that the accuracy of the algorithm can be highly sensitive to the choice of hyperparameters, which makes tuning an imperative step in model development. Oftentimes, grid search is impractical due to limited computational capacity. Random search may work more efficiently, but the resulting set of hyperparameters may not be optimal. In our application, we hand-tuned the hyperparameters in a systematic order to strike a balance between accuracy and efficiency. The suggested order is as follows:

  1. weight initialization and activation function
    Although ReLU is the commonly used default activation function, we didn't find it very successful in our application, probably because its zero output for negative inputs kills neurons in half of the input region. We tried ELU and ReLU6, which alleviate the possibility of dead neurons. Weight initialization is important too: if the weights are too small, all activations collapse to zero; if too large, almost all neurons become completely saturated. We found that xavier_initializer and variance_scaling_initializer are good choices in our application.
  2. learning rate
    Typically, the suggested values to start with are {0.001, 0.0001, 0.00001}. In our case, we found that a small learning rate barely led to any updates over time. Therefore, we chose a larger value, learning rate = 0.01, and it works reasonably well.
  3. n_hidden_layers and n_neurons
    Although many researchers argue that a DNN stands out when the network is deep and the model is trained long enough, we found that simpler networks work better in our application, probably due to the relatively small sample size of our dataset.
  4. dropout_rate and L1 regularization
    Dropout serves both to accelerate computation and to regularize. In our application, we have 345 predictors. We think it is important to set different dropout rates for the input layer and the hidden layers, because dropping too many predictors in the input layer may result in a loss of information.
  5. optimizer class
  6. other parameters, such as batch_norm_momentum, batch_size
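The staged order above can be expressed as a simple coordinate-wise search: tune one hyperparameter group at a time, freezing the best value before moving to the next stage. A sketch with a stand-in score function (in practice, this would be validation AUC from a trained model; the candidate values are illustrative):

```python
# Tuning stages in the suggested order; candidate values are illustrative.
stages = [
    ("activation", ["elu", "relu6", "relu"]),
    ("learning_rate", [0.001, 0.01, 0.0001]),
    ("n_hidden_layers", [1, 2, 5]),
    ("dropout_rate", [None, 0.2, 0.5]),
]

def score(params):
    """Stand-in for validation AUC; in practice, train and evaluate here."""
    return -abs(params["learning_rate"] - 0.01)   # toy objective

# Coordinate-wise search: evaluate each candidate with all other
# hyperparameters frozen at their current best values.
best = {name: options[0] for name, options in stages}
for name, options in stages:
    trials = [(score(dict(best, **{name: value})), value) for value in options]
    best[name] = max(trials, key=lambda t: t[0])[1]

assert best["learning_rate"] == 0.01
```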


H2O is an open-source machine learning platform developed by the startup H2O.ai, and it has attracted increasing attention from industry and academia. H2O is competitive in the following aspects:

  • Provides a whole suite of supervised and unsupervised learning algorithms, and keeps adding more;
  • Easy to use for R and Python users, particularly with the help of this repo;
  • Notable accuracy;
  • Fast speed;
  • Efficient memory use;
  • Good scalability, with fine-grained distributed processing and parallelism; works well on AWS and with Spark;
  • Good flexibility: a model generated using H2O-Python can easily be retrieved and applied using H2O-R, and vice versa.
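H2O's Python API makes a baseline model a few lines of work. The sketch below is illustrative only: the CSV path and target column name are placeholders, and the h2o import is deferred inside the function so the sketch stays self-contained:

```python
def train_h2o_baseline(csv_path, target="smoker"):
    """Sketch: start H2O, load data, and fit a default deep learning model."""
    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()     # binary classification
    predictors = [c for c in frame.columns if c != target]

    model = H2ODeepLearningEstimator(distribution="bernoulli", nfolds=10)
    model.train(x=predictors, y=target, training_frame=frame)
    return model
```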

Deep learning is an important part of H2O. So far, however, only the feedforward neural network (FFNN) is implemented (no convolution, recurrence, etc.). From our limited experience, we noticed the following differences between H2O and TensorFlow:

  1. TensorFlow gives users a great deal of flexibility. On the one hand, this is good for detail-oriented users or those with more expertise in deep learning. On the other hand, it is not easy for non-expert users to implement a baseline working model. In comparison, H2O deep learning offers non-expert users a handy way to quickly build a baseline working model.
  2. Many TensorFlow-Python users turn to scikit-learn for help with hyperparameter tuning, which is not convenient for R users. In comparison, H2O has built-in hyperparameter tuning features in both its R and Python versions. It has come to our attention that Keras serves as a high-level API on top of TensorFlow; similar to H2O, it enables users to build a working deep learning model faster, without digging into as much detail as TensorFlow requires. Comparing Keras with H2O in terms of efficiency, flexibility, and usability is future work.
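H2O's built-in tuning is exposed through H2OGridSearch. A hedged sketch, with an illustrative hyperparameter grid (these are not the values we used), and the h2o imports deferred inside the function:

```python
def tune_h2o(frame, predictors, target):
    """Sketch of H2O's built-in grid search over a few hyperparameters."""
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator
    from h2o.grid.grid_search import H2OGridSearch

    hyper_params = {                      # illustrative grid only
        "hidden": [[173], [64, 128, 64]],
        "l1": [0.0, 1e-4],
        "input_dropout_ratio": [0.0, 0.2],
    }
    grid = H2OGridSearch(model=H2ODeepLearningEstimator,
                         hyper_params=hyper_params)
    grid.train(x=predictors, y=target, training_frame=frame)
    return grid.get_grid(sort_by="auc", decreasing=True)
```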



We did a random split of the original dataset into 80% training and 20% validation, with the following hyperparameter setting: 'activation': elu, 'batch_norm_momentum': 0.9, 'batch_size': 64, 'dropout_rate': None, 'learning_rate': 0.01, 'max_checks_without_progress': 20, 'n_hidden_layers': 5, 'n_neurons': 20. The AUC on the validation set is 0.60. In addition, using the full dataset, the average 10-fold cross-validated AUC is 0.64.


For comparison purposes, we again split the dataset into 80% training and 20% validation, and set the following hyperparameters for the H2O model: 'distribution' = bernoulli, 'activation' = RectifierWithDropout, 'hidden' = [64, 128, 64], 'input_dropout_ratio' = 0.2, 'hidden_dropout_ratios' = [0.5, 0.5, 0.5], 'l1' = 1e-4, 'epochs' = 100, 'rate' = 0.01, 'adaptive_rate' = False, 'rate_annealing' = 1e5, 'variable_importances' = True, 'nfolds' = 10, 'stopping_rounds' = 5, 'missing_values_handling' = Skip, 'balance_classes' = True. The AUC on the validation set is 0.69. In addition, using the full dataset, the average 10-fold cross-validated AUC is 0.70.
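The same setting can be written as an H2ODeepLearningEstimator call. This is a sketch that mirrors the parameters listed above, with the import deferred inside the function so the snippet stays self-contained:

```python
def build_h2o_model():
    """The hyperparameter setting reported above, as an estimator sketch."""
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    return H2ODeepLearningEstimator(
        distribution="bernoulli",
        activation="RectifierWithDropout",
        hidden=[64, 128, 64],
        input_dropout_ratio=0.2,
        hidden_dropout_ratios=[0.5, 0.5, 0.5],
        l1=1e-4,
        epochs=100,
        rate=0.01,
        adaptive_rate=False,
        rate_annealing=1e5,
        variable_importances=True,
        nfolds=10,
        stopping_rounds=5,
        missing_values_handling="Skip",
        balance_classes=True,
    )
```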


The DNN has performance similar to traditional machine learning algorithms in our application. Using the multi-layer perceptron in H2O, the 10-fold cross-validated AUC is 0.70; as a comparison, using random forest, the 10-fold cross-validated AUC is 0.68. These results are as expected, because deep learning achieves significantly better performance than traditional algorithms only when the amount of data is sufficiently large, while our dataset has a moderate sample size.
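The AUC used throughout has a helpful interpretation: it is the probability that a randomly chosen smoker is scored higher than a randomly chosen non-smoker (the Mann-Whitney statistic). A minimal pure-Python sketch, which also shows why a constant prediction (the failure mode we hit earlier) yields exactly 0.5:

```python
def auc(scores, labels):
    """AUC as the rank-based Mann-Whitney statistic; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A constant score (the degenerate case seen early in the project) gives 0.5:
assert auc([0.3, 0.3, 0.3, 0.3], [1, 0, 1, 0]) == 0.5
# A perfect ranking gives 1.0:
assert auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) == 1.0
```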

Both TensorFlow and H2O have been used for implementation. In our experience, H2O is more user-friendly than TensorFlow.


Below are some useful tutorials and references: