Shanshan Li, Xiaomin Lin, Lei Niu, Zizhen Wu and Haimao Zhan

Tuesday 09 April 2019

In life insurance, accurate knowledge of an individual's tobacco use is critical for assessing mortality risk. Medically, it is well established that smoking increases the risk of cardiovascular and respiratory diseases, cancers, and other ailments. In an accelerated underwriting program, lab tests that detect nicotine metabolites are forgone. Relying solely on self-reported smoking status presents an obvious business risk, because an applicant who smokes may be motivated by the prospect of a better price to lie about using tobacco.

In this project, we aim to build a smoker propensity model to predict the probability that an applicant is a tobacco user.

We pull the historical records of applicants for the MassMutual term product. The inputs are the applicants' family histories, obtained from the client medical interview, and their credit risks and public records, obtained from LexisNexis. The data have been cleaned and preprocessed. The preprocessed dataset includes ~232,000 applicants, of whom ~8% are smokers. There are 345 predictors, most of which are categorical variables.

Our dataset has a few characteristics:

- Sparsity: many predictors are categorical variables with near-zero variance.
- Noise: many attributes are self-reported, and missing values are imputed with random draws from the corresponding age- and gender-specific distribution.
- Weak correlation between the predictors and the outcome.
- Imbalanced outcomes: 92% non-smokers and 8% smokers.

The data are now ready, but before diving into modeling (and even after generating results), we think carefully about the following questions:

- Is our data linearly separable?
- Is a deep neural network suitable for our data?
- If we decide to use a deep neural network (DNN), which for our problem is a feedforward NN, what are the appropriate number of layers and the size of each layer?
- Does the DNN perform better than other algorithms?

For the majority of practical binary classification problems, the data are not linearly separable unless they are mapped to a higher-dimensional space (e.g., with a Gaussian kernel in an SVM).

A key factor in deciding whether a DNN is worth considering for a dataset is the size of the data. In practice, deep learning outperforms traditional machine learning algorithms when given a sufficient amount of data (see Fig. 1). This no longer holds when too little data is available for training. However, there is no consensus on where the threshold of "sufficiency" lies. Some discussion of this topic can be found here.

When it comes to choosing an appropriate "size" for the neural network, practitioners still follow a manual and cumbersome trial-and-error process, although some efficient search algorithms have been proposed to automate it [1]. There are, however, some empirically derived rules of thumb, for example:

- For the majority of practical problems, it is worth starting from a simple architecture with a single hidden layer of (input layer size + output layer size)/2 neurons. Sometimes, even without regularization, it performs no worse than more complex multi-layer perceptron (MLP) structures. An extensive discussion of this issue can be found here.
- Although a complex structure tends to overfit, regularization (including hidden-layer dropout, L1, and L2) helps reduce overfitting by shrinking the weights of the input, hidden, and output neurons. It is therefore worth trying an MLP with appropriate regularization, chosen through hyperparameter tuning, particularly when a single-layer perceptron (SLP) gives an unsatisfactory result. Nevertheless, choosing the architecture remains an open question. Attempts to offer an algorithmic solution include [2] and [3].
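
As a quick worked example of the first rule of thumb, the sketch below computes the suggested starting width of the hidden layer. It assumes that the 345 predictors enter the network as 345 input units and that the smoker probability is a single output unit; one-hot encoding the categorical predictors would widen the input layer and change the result. The function name is ours and is used for illustration only.

```python
def rule_of_thumb_hidden_units(n_inputs: int, n_outputs: int) -> int:
    """Starting width for a single hidden layer: (inputs + outputs) / 2."""
    return (n_inputs + n_outputs) // 2


# Assumption: 345 input units (one per predictor) and 1 output unit.
n_hidden = rule_of_thumb_hidden_units(345, 1)
print(n_hidden)  # 173
```

This gives only a starting point for the search; the rule says nothing about how many layers to use.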

At an early stage of the project, we found that when TensorFlow generated prediction probabilities from its trained model, it almost always returned the same value. In other words, the AUC was 0.5 and everyone was predicted to be a non-smoker. To troubleshoot the problem, we made a few attempts:

- To mitigate the imbalanced outcome, we tried including a weight vector in the tf.estimator.DNNClassifier, where each weight is inversely proportional to the sample size of the corresponding group.
- As an alternative, we also tried downsampling the non-smokers in the training dataset so that the proportions of smokers and non-smokers were roughly balanced.
- Because many variables are only weakly correlated with the outcome, we ranked the variables by their information values and removed a few with zero information value.
- We carefully selected the weight initialization and the activation function of the NN.
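
The first two attempts can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the project: the function names, the scaling of the weights, and the toy labels are our assumptions. The resulting per-example weights are the kind of vector one could supply to the estimator, for example through its `weight_column` argument.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-example weights inversely proportional to the size of each class.

    Scaled so that every class contributes the same total weight and the
    weights sum to len(labels).
    """
    counts = Counter(labels)
    n_classes = len(counts)
    n = len(labels)
    return [n / (n_classes * counts[y]) for y in labels]


def downsample_majority(labels, majority=0, ratio=1.0, seed=0):
    """Return sorted row indices that keep every minority row and a random
    subset of majority rows (about `ratio` majority rows per minority row)."""
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == majority]
    minor_idx = [i for i, y in enumerate(labels) if y != majority]
    n_keep = min(len(major_idx), int(round(ratio * len(minor_idx))))
    kept = rng.sample(major_idx, n_keep)
    return sorted(kept + minor_idx)


# Toy labels mirroring the 92% / 8% split (1 = smoker, 0 = non-smoker).
labels = [0] * 92 + [1] * 8
weights = inverse_frequency_weights(labels)
balanced = downsample_majority(labels)
```

Weighting keeps every record but makes each smoker count for more in the loss, whereas downsampling discards non-smoker records and should be applied to the training split only.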

Given the large number of hyperparameters, tuning them has been a major barrier to applying NNs in broader areas. On the one hand, the whole process can be very time-consuming. On the other hand, in our experiments the accuracy of the algorithm was highly sensitive to the choice of hyperparameters, which makes tuning an imperative step in model development. Grid search is often impractical because of limited computing capacity. Random search may be more efficient, but the resulting set of hyperparameters may not be optimal. In our application, we hand-tuned the hyperparameters in a systematic order to strike a balance between accuracy and efficiency. The suggested order is as follows:

- weight initialization and activation function

Although ReLU is the commonly used default activation function, we did not find it very successful in our application, probably because its output is zero for all negative inputs, which can kill neurons. We tried ELU and ReLU6, which alleviate the possibility of killed neurons. Weight initialization is important too: if the initial weights are too small, all activations collapse to zero; if they are too large, almost all neurons become completely saturated. We found that xavier_initializer and variance_scaling_initializer are good choices in our application.

- learning rate

Typically, the suggested starting values are {0.001, 0.0001, 0.00001}. In our case, we found that a small learning rate leads to barely any updates over time. We therefore chose a larger value, learning rate = 0.01, which works reasonably well.

- n_hidden_layers and n_neurons

Although many researchers argue that deep NNs stand out when the network is deep and the model is trained long enough, we found that simpler networks work better in our application, probably because of the relatively small sample size of our dataset.

- dropout_rate and L1 regularization

Dropout can serve both to accelerate computation and to regularize the model. In our application, we have 345 predictors. We think it is important to set different dropout rates for the input layer and the hidden layers, because dropping too many predictors in the input layer may result in a loss of information.

- optimizer class
- other parameters, such as batch_norm_momentum, batch_size
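
To make the first tuning step concrete, the sketch below writes out the three activation functions mentioned above and a Xavier-style uniform initializer in plain Python. It is an illustration only: the project used TensorFlow's own implementations, and the layer shape (345 inputs, 20 units), the seed, and the function names here are assumptions.

```python
import math
import random


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def relu6(x):
    """ReLU6: ReLU with the output capped at 6."""
    return min(max(0.0, x), 6.0)


def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise,
    so negative inputs still produce a non-zero output."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def xavier_uniform(fan_in, fan_out, seed=0):
    """fan_in x fan_out weight matrix drawn from U(-limit, limit) with
    limit = sqrt(6 / (fan_in + fan_out)), the Xavier (Glorot) uniform rule."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Assumed shape: 345 input units feeding a hidden layer of 20 units.
W = xavier_uniform(345, 20)
```

The bound shrinks as the layer widens, which keeps the variance of the activations roughly constant from layer to layer and avoids both of the failure modes described above.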

H2O is an open-source machine learning platform developed by the startup h2o.ai. It has attracted increasing attention from industry and academia. H2O is competitive in the following respects:

- Provides a whole suite of supervised and unsupervised learning algorithms, and keeps adding more;
- Easy to use for R and Python users, particularly with the help of this repo;
- Notable accuracy;
- Fast speed;
- Efficient memory use;
- Good scalability, with fine-grained distributed processing and parallelism; works well on AWS and with Spark;
- Good flexibility: a model generated using H2O-Python can easily be retrieved and applied using H2O-R, and vice versa.

Deep learning is an important part of H2O. So far, however, only the feedforward neural network (FFNN) is implemented (no convolution, recurrence, etc.). From our limited experience, we notice the following differences between TensorFlow and H2O:

- TensorFlow gives users a great deal of flexibility. On the one hand, this is good for detail-oriented users and for those with more expertise in deep learning. On the other hand, it is not easy for non-expert users to implement a working baseline model. In comparison, H2O deep learning offers non-expert users a handy way to build a working baseline model quickly.
- Many TensorFlow-Python users turn to scikit-learn for help with hyperparameter tuning, which is not convenient for R users. In comparison, H2O has built-in hyperparameter tuning features in both its R and Python versions.

It has come to our attention that Keras serves as a high-level API on top of TensorFlow. Like H2O, it enables users to build a working deep learning model faster, without digging into as much detail as TensorFlow requires. Comparing Keras with H2O in terms of efficiency, flexibility, and usability is left as future work.

We randomly split the original dataset into 80% training and 20% validation. With the following TensorFlow hyperparameter settings,
```
'activation': tf.nn.elu,
'batch_norm_momentum': 0.9,
'batch_size': 64,
'dropout_rate': None,
'learning_rate': 0.01,
'max_checks_without_progress': 20,
'n_hidden_layers': 5,
'n_neurons': 20
```

the AUC on the validation set is 0.60. In addition, using the full dataset, the average 10-fold cross-validated AUC is 0.64.
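
For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen smoker receives a higher predicted score than a randomly chosen non-smoker, with ties counted as one half. The sketch below implements that definition directly in plain Python. It is quadratic in the number of records, so it is meant for illustration rather than for a dataset of this size; the function name and toy data are ours.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 1]
print(roc_auc(y, [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(roc_auc(y, [0.5, 0.5, 0.5, 0.5]))   # 0.5: constant scores carry no ranking
```

The second call reproduces the early symptom described above: a model that returns the same probability for everyone has an AUC of exactly 0.5.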

For comparison, we again split the dataset into 80% training and 20% validation, and set the following hyperparameters in H2O,
```
distribution="bernoulli",
activation="RectifierWithDropout",
hidden=[64, 128, 64],
input_dropout_ratio=0.2,
hidden_dropout_ratios=[0.5, 0.5, 0.5],
l1=1e-4,
epochs=100,
rate=0.01,
adaptive_rate=False,
rate_annealing=1e5,
variable_importances=True,
nfolds=10,
stopping_rounds=5,
missing_values_handling="Skip",
balance_classes=True
```

the AUC on the validation set is 0.69. In addition, using the full dataset, the average 10-fold cross-validated AUC is 0.70.
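
A rough sketch of how settings like these could be passed to H2O's Python API is given below. It is not the authors' script. The hyperparameter values are copied from the listing above, while the CSV path, the outcome column name `smoker`, the seed, and the function name are hypothetical. Cross-validation here runs on the training split, which may differ from the procedure behind the figures quoted above.

```python
# Hyperparameter values copied from the listing above.
DL_PARAMS = {
    "distribution": "bernoulli",
    "activation": "RectifierWithDropout",
    "hidden": [64, 128, 64],
    "input_dropout_ratio": 0.2,
    "hidden_dropout_ratios": [0.5, 0.5, 0.5],
    "l1": 1e-4,
    "epochs": 100,
    "rate": 0.01,
    "adaptive_rate": False,
    "rate_annealing": 1e5,
    "variable_importances": True,
    "nfolds": 10,
    "stopping_rounds": 5,
    "missing_values_handling": "Skip",
    "balance_classes": True,
}


def train_h2o_dnn(csv_path, target="smoker", seed=42):
    """Train an H2O feedforward network and return (validation AUC,
    cross-validated AUC). `csv_path` and `target` are hypothetical."""
    # Imported lazily so the settings can be inspected without H2O installed.
    import h2o
    from h2o.estimators import H2ODeepLearningEstimator

    h2o.init()
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # treat the outcome as categorical
    train, valid = frame.split_frame(ratios=[0.8], seed=seed)  # 80% / 20%
    predictors = [c for c in frame.columns if c != target]

    model = H2ODeepLearningEstimator(seed=seed, **DL_PARAMS)
    model.train(x=predictors, y=target,
                training_frame=train, validation_frame=valid)
    return model.auc(valid=True), model.auc(xval=True)
```

Note that `hidden_dropout_ratios` needs one entry per hidden layer and takes effect only with a "WithDropout" activation, and that `rate` is used only when `adaptive_rate` is switched off.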

The DNN performs similarly to traditional machine learning algorithms in our application. Using the multi-layer perceptron in H2O, the 10-fold cross-validated AUC is 0.70; by comparison, using random forest, it is 0.68. These results are as expected: deep learning achieves significantly better performance than traditional algorithms when the amount of data is sufficiently large, whereas the dataset in our application is of moderate size.

We implemented the model in both TensorFlow and H2O. In our experience, H2O is more user-friendly than TensorFlow.

Below are some useful tutorials and references: