Autoencoder and Deep Features

Note: this post is a replication of our kernel published on Kaggle---Autoencoder and Deep Features.

In this tutorial-like kernel, using data from New York City Taxi Trip Duration, we use the autoencoder and deeplearning capability of the H2O framework to explore deep features. Such an approach provides a means of reducing features (Hinton & Salakhutdinov, 2006)—although here there is no need, for the number of features is already small. It also offers a way to leverage unlabelled data—a technique called semi-supervised learning. The kernel Diabolo Trick: Pre-train, Train and Solve gives a concrete example. Finally, it is perhaps worth noting that deep features are the backbone of some transfer learning techniques.

Features Tinkering

First, we remove the column dropoff_datetime, as otherwise there is no point to predict the duration of the trip… We could just precisely calculate it…

X[, dropoff_datetime := NULL]

Outliers

Pickup Location

There seems to be some outliers in the dataset. Let’s remove some of them (obviously, a better way should be devised here, for the current naive implementation misses some outliers while removing some non-outliers).

X <- X[-which(X$pickup_longitude > -74
             & X$pickup_latitude < 40.57), ]

Trip Duration

We rely on the Hampel’s test to compute the maximum threshold above which values are declared spurious. One can find further details about the Hampel’s test for outlier detection in Scrub data with scale-invariant nonlinear digital filters.

X <- X[-which(X$trip_duration > maxThreshold), ]

Locations and Distances

We assume that the earth’s surface forms a perfect sphere; so that we can conveniently use the haversine formula to estimate each trip distance as the crow flies. (This distance should constitute a rough estimation of the covered distance; at least it gives a minimum.)

X[, distance_as_the_crow_flies := distHaversine(
    data.table(pickup_longitude, pickup_latitude),
    data.table(dropoff_longitude, dropoff_latitude))]
                  
Xt[, distance_as_the_crow_flies := distHaversine(
    data.table(pickup_longitude, pickup_latitude),
    data.table(dropoff_longitude, dropoff_latitude))]

Pickup Date

We add the two new features wday and hour.

X[, pickup_datetime := fastPOSIXct (pickup_datetime, tz = "EST")]
X[, wday := wday(pickup_datetime)]
X[, hour := hour (pickup_datetime)]

Xt[, pickup_datetime := fastPOSIXct (pickup_datetime, tz = "EST")]
Xt[, wday := wday(pickup_datetime)]
Xt[, hour := hour (pickup_datetime)]

We can observe, on the left plot, the presence of outliers (red dots)—which markedly shift the mean (green spot) from the median—presumably due to the volatility of the traffic in New York. The right plot contrasts shorted distances (arbitrarily chosen as less than 4 Km) with longer distances. Outliers appear to merely plague the shorter distances (which is fairly intuitive, especially given that many of the shorter-distance trips appear to be within Manhattan).

The plot above suggests that trip durations are longer late at night during the weekend, whereas they tend to be longer early morning (4~8) during the weekdays (depicting the rush hours of commuters). That sounds rather unsurprising…

Convert Categorical Variables to Numerical

There is 1 categorical variables, and there are 9 numerical variables with non-null standard deviation.

X[, store_and_fwd_flag := ifelse (store_and_fwd_flag == "N", 0, 1)]

Deep Features

Convert to H2O Frames

if (kIsOnKaggle) {
  # we reduce the training set size to avoid time out
  set.seed(1)
  trainIndexes <- createDataPartition(
    y = X[["trip_duration"]], p = 0.1, list = FALSE) 
  X <- X[trainIndexes, ]
  rm(trainIndexes)
}
train <- as.h2o(X)
predictors <- setdiff (names(X), c(toIgnore))

Autoencoder (Diabolo Network)

if (kIsOnKaggle) {
  hyperParamsAutoencoder = list( 
    hidden = list(c(8, 4, 8), c(8, 3, 8), c(7, 3, 7), c(6, 3, 6)),
    activation = c("Tanh"))
} else {
  hyperParamsAutoencoder = list( 
    hidden = list(c(11, 8, 11), c(10, 8, 10), c(9, 5, 9), c(8, 5, 8), 
                  c(7, 5, 7), c(6, 5, 6), c(11, 7, 11), c(10, 7, 10), 
                  c(9, 4, 9), c(8, 4, 8), c(7, 4, 7), c(6, 4, 6), 
                  c(11, 6, 11), c(10, 6, 10), c(9, 3, 9), c(8, 3, 8), 
                  c(7, 3, 7), c(6, 3, 6), c(11, 5, 11), c(10, 5, 10),
                  c(11, 4, 11), c(10, 4, 10), c(11, 8, 5, 8, 11)),
    activation = c("Tanh"))
}

gridAutoencoder <- h2o.grid(
  x = predictors,
  autoencoder = TRUE,
  training_frame = train,
  hyper_params = hyperParamsAutoencoder,
  search_criteria = list(strategy = "Cartesian"),
  algorithm = "deeplearning",
  grid_id = "grid_autoencoder", 
  reproducible = TRUE, 
  seed = 1,
  variable_importances = TRUE,
  categorical_encoding = "AUTO",
  score_interval = 10,
  epochs = 800,
  adaptive_rate = TRUE,
  standardize = TRUE,
  ignore_const_cols = FALSE)

The following table summarizes the grid results (it is sorted increasingly by ‘mse’):

	activation	hidden	mse
1	Tanh	[8, 4, 8]	2.2115573249558083E-5
2	Tanh	[8, 3, 8]	6.43305860410787E-4
3	Tanh	[7, 3, 7]	6.519543824901826E-4
4	Tanh	[6, 3, 6]	6.730825397209355E-4

Considering the “best” autoencoder (i.e., the one with the lowest ‘mse’, which is the one with the hidden layers [8, 4, 8]), the two following figures illustrate the fact that it performs rather well; only a limited portion of the input signal could not be reconstructed.

Deep Features Visualization

Second Layer

layer <- 2

deepFeature2 <- h2o.deepfeatures(bestAutoencoder, train, layer = layer)

data <- as.data.frame(deepFeature2)
data$log_trip_duration <- log(X$trip_duration)

summary(data)

##     DF.L2.C1           DF.L2.C2           DF.L2.C3          DF.L2.C4      
##  Min.   :-0.48997   Min.   :-0.70796   Min.   :-0.4266   Min.   :-0.9829  
##  1st Qu.:-0.28532   1st Qu.:-0.45367   1st Qu.:-0.1012   1st Qu.:-0.9661  
##  Median :-0.05116   Median :-0.04050   Median : 0.1981   Median :-0.9644  
##  Mean   :-0.05273   Mean   :-0.18749   Mean   : 0.1545   Mean   :-0.5969  
##  3rd Qu.: 0.12783   3rd Qu.:-0.01238   3rd Qu.: 0.4004   3rd Qu.:-0.1662  
##  Max.   : 0.47711   Max.   : 0.52161   Max.   : 0.6832   Max.   : 0.5846  
##  log_trip_duration
##  Min.   :0.000    
##  1st Qu.:5.951    
##  Median :6.446    
##  Mean   :6.376    
##  3rd Qu.:6.894    
##  Max.   :7.624

Third Layer

layer <- 3

deepFeature3 <- h2o.deepfeatures(bestAutoencoder, train, layer = layer)

data <- as.data.frame(deepFeature3)
data$log_trip_duration <- log(X$trip_duration)

summary(data)

##     DF.L3.C1           DF.L3.C2           DF.L3.C3          DF.L3.C4      
##  Min.   :-0.47818   Min.   :-0.72271   Min.   :-1.0000   Min.   :-0.7213  
##  1st Qu.:-0.14682   1st Qu.:-0.66474   1st Qu.:-1.0000   1st Qu.: 0.9941  
##  Median : 0.02289   Median :-0.32626   Median :-0.9999   Median : 0.9997  
##  Mean   : 0.03259   Mean   :-0.09229   Mean   :-0.9998   Mean   : 0.9889  
##  3rd Qu.: 0.26142   3rd Qu.: 0.46748   3rd Qu.:-0.9998   3rd Qu.: 0.9997  
##  Max.   : 0.47005   Max.   : 0.68681   Max.   :-0.9976   Max.   : 1.0000  
##     DF.L3.C5           DF.L3.C6         DF.L3.C7          DF.L3.C8       
##  Min.   :-0.60460   Min.   :0.7071   Min.   :-0.9202   Min.   :-0.27290  
##  1st Qu.:-0.12264   1st Qu.:0.8905   1st Qu.:-0.8631   1st Qu.:-0.09507  
##  Median :-0.03823   Median :0.9506   Median :-0.8037   Median :-0.09062  
##  Mean   :-0.03506   Mean   :0.9294   Mean   :-0.7637   Mean   : 0.33366  
##  3rd Qu.: 0.01698   3rd Qu.:0.9848   3rd Qu.:-0.6714   3rd Qu.: 0.87194  
##  Max.   : 0.44521   Max.   :0.9960   Max.   :-0.4528   Max.   : 0.99467  
##  log_trip_duration
##  Min.   :0.000    
##  1st Qu.:5.951    
##  Median :6.446    
##  Mean   :6.376    
##  3rd Qu.:6.894    
##  Max.   :7.624

Predictions Using Deep Features and GBM

We use the second layer of the autocencoder (bestAutoencoder) and the gradient boosting machine algorithm offered by H2O (h2o.gbm). Those are arbitrary choices for the purpose of exemplifying a use of deep features.

layer <- 2

Get Deep Features & Split Data

deepFeatureTrain <- h2o.deepfeatures(bestAutoencoder, train, layer = layer)
deepFeatureTrain[["trip_duration"]] <- as.h2o(X$trip_duration)

splits <- h2o.splitFrame(
  data = deepFeatureTrain, 
  ratios = c(0.7),
  seed = 1
)
trainDf <- splits[[1]]
validDf <- splits[[2]]

Grid Search

deepfeatures <- setdiff(names(deepFeatureTrain), c("trip_duration"))

if (kIsOnKaggle) {
  hyperParamsGbm <- list(
    max_depth = c(3, 4),
    min_split_improvement = c(0, 1e-6))
} else {
  hyperParamsGbm <- list( 
    max_depth = seq(3, 15, 2),
    sample_rate = seq(0.2, 1, 0.01),
    col_sample_rate_per_tree = seq(0.2, 1, 0.01),
    col_sample_rate_change_per_level = seq(0.6, 1.4, 0.02),
    min_rows = 2^seq(0, log2(nrow(train))-3, 1),
    nbins = 2^seq(2, 10, 1),
    nbins_cats = 2^seq(1, 4, 1),
    histogram_type = c("UniformAdaptive", "QuantilesGlobal", "RoundRobin"),
    min_split_improvement = c(0, 1e-8, 1e-6, 1e-4))
} 
  
gridGbm <- h2o.grid(
  hyper_params = hyperParamsGbm,
  search_criteria = list(strategy = "Cartesian"),
  algorithm = "gbm",
  grid_id = "grid_gbm", 
  x = deepfeatures, 
  y = "trip_duration", 
  training_frame = trainDf, 
  validation_frame = validDf,
  nfolds = 0,
  ntrees = ifelse(kIsOnKaggle, 200, 1000),
  learn_rate = 0.05,
  learn_rate_annealing = 0.99,
  max_runtime_secs = ifelse(kIsOnKaggle, 120, 3600),
  stopping_rounds = 5, 
  stopping_tolerance = 1e-5, 
  stopping_metric = "MSE", 
  score_tree_interval = 10,
  seed = 1,
  fold_assignment = "AUTO",
  keep_cross_validation_predictions = FALSE)

sortedGridGbm <- h2o.getGrid("grid_gbm", sort_by = "mse", decreasing = TRUE)

	max_depth	min_split_improvement	mse
1	3	1.0E-6	187782.15952044743
2	3	0.0	187782.15952044743
3	4	0.0	183708.52366487522
4	4	1.0E-6	183708.52366487522

Best Model & Performance on the Validation Set

bestGbmModel <- h2o.getModel(sortedGridGbm@model_ids[[1]])
perf <- h2o.performance(bestGbmModel, newdata = validDf)

## H2ORegressionMetrics: gbm
## 
## MSE:  187782.2
## RMSE:  433.3384
## MAE:  349.3986
## RMSLE:  0.7301732
## Mean Residual Deviance :  187782.2

Retrain the Best Model Using the Entire Train Set

Here we use k-folds cross validation to evaluate the final model (trained using the entire train set—i.e., rbind(trainDf, validDf)).

nfolds <- ifelse(kIsOnKaggle, 2, 8)

finalModel <- do.call(h2o.gbm,
                      {
                        p <- bestGbmModel@parameters
                        p$model_id = NULL         
                        p$training_frame = h2o.rbind(trainDf, validDf)      
                        p$validation_frame = NULL  
                        p$nfolds = nfolds               
                        p
                      })

finalModel@model$cross_validation_metrics_summary

## Cross-Validation Metrics Summary: 
##                               mean           sd  cv_1_valid  cv_2_valid
## mae                       350.8387   0.10633527   350.68832    350.9891
## mean_residual_deviance   189029.88    113.88638   188868.83   189190.94
## mse                      189029.88    113.88638   188868.83   189190.94
## r2                     0.029764593 1.9335767E-4 0.030038042 0.029491143
## residual_deviance        189029.88    113.88638   188868.83   189190.94
## rmse                     434.77563   0.13097145    434.5904   434.96085
## rmsle                   0.73611194 0.0039748037  0.73049074   0.7417332

Test Set & Submission

deepFeatureTest <- h2o.deepfeatures(bestAutoencoder, test, layer = layer)
deepFeatureTest[["trip_duration"]] <- as.h2o(Xt$trip_duration)

pred <- h2o.predict (finalModel, newdata = deepFeatureTest)

fwrite(data.table(id = Xt$id, trip_duration = as.vector(pred$predict)), 
       file = file.path(".", paste0("output-gbm-", Sys.Date(), ".csv"), 
                        fsep = .Platform$file.sep), 
       row.names = FALSE,  quote = FALSE)

Remark

In general, using the top model out of the grid search might not be the wisest approach. Instead, considering the k best models might yield improved performances (k > 1). An easy way to use k models is to average their predicted probabilities. A more sophisticated way consists in using ensemble learning (super learner), for which the choice of the meta-learner can help to even further improve the result (e.g., see H2O’s documentation on Stacked Ensembles).

Predictions Using Pretrained Neural Networks

This section describes an alternative approach to the previous section and is presented in a dedicated kernel: Deep Semi-supervised Learning. ■

License and Source Code

References

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://goo.gl/6GjADx