Predicting Collision Risk from Historical Collision Data Messages (CDMs) using Machine Learning
$3^{rd}$ place solution for ESA's Collision Risk Prediction Challenge hosted on Kelvins platform. (Please note that an extension of this solution is being prepated for publication)
1. Introduction
The number of close approaches between space objects is increasing due to the proliferation of satellites and debris in orbit. It is a routine operation for satellite operators to avoid probable collisions by conducting maneuvers based on conjunction data messages (CDMs). The number of CDMs issued weekly are a couple of orders of magnitude larger than the actionable ones, and this puts pressure on satellite operators. Therefore, an automated process that can predict collision risk with desired accuracy a couple of days ahead to allow satellite operators to plan for collision avoidance maneuvers is desired. In this work, the feasibility of leveraging deep learning to predict the final collision risk between two space objects from the history of CDMs is investigated. The CDMs issued for the satellites that ESA's Space Debris Office supports are used to train an ensemble of Manhattan-LSTM models. The proposed ensemble model leverages the similarity scores between the input pairs instead of determining the decision boundary between high and low-risk close approach classes to address the imbalanced data. The significant imbalance and class overlapping in the data is also addressed by casting the problem as an anomaly detection problem and utilizing ensemble of various models. The proposed ensemble model with majority vote can predict collision risks two days ahead with a precision of 91% for high-risk cases (after adding up the predictions of two-stages). It should ne noted that the solution is two-stage. First stage involves feature engineering and casting the problem as anomaly detection problem by branching normal and anomalous cases with the most important feature, and that gives the precision score of 75% for high-risk cases. The second stage leverages the Manhattan-LSTM to decide whether each input pairs are similar, and this gives f1 score of 80% for anomaly detection of high-risk cases. Therefore, it is possible to determine low-risk to high-risk at close approach time anomalies.
2. Data Preparation
Machine learning models are sensitive to the training data, and the attributes of the data should be understood before training models with them. The dataset used in this work includes time-series of CDMs issued for close approaches between a satellite and a space object. However, each example includes some information that exists in the standard CDM format, yet not all. The training dataset has 162634 rows (CDMs), 13154 of which are unique close approaches (approximately 12 CDMs for each event), and 103 columns(features from each CDM). Some features from CDMs are probability of collision risk, miss distance, relative position and velocity, solar activity indices,uncertainty in position and velocity, radar cross sections, and some orbitaldata about both space object. Further details about all features available in the dataset can be accessed on Kelvins platform [1]. Since the training dataset is small, 20 most informative features are used for training models to makesure the model doesn’t learn the noise in the data.
The training dataset has time-series of CDMs ranging from 7 days to a couple hours before the close approach. ESA Space Debris Office notifies flight teams to plan for possible maneuvers 2 days before the time of the closest approaches for their satellites, and they conduct futher analysis to decide whether to maneuver within 1 day. Since there exists uncertainty in orbit determination and propagation of space objects, it is preffered to start planning for collision avoidance maneuvers a couple of days ahead of the close approach time to keep uncertainty small and let flight teams have enough time to analyse the maneuvers to be conducted.
Since it is desired to predict the collision risk two or more days ago, the latest available collision risk value (target label) and a CDM (input features) which is issued two or more days before the close approach should be selected for each close approach event. Figure 1 shows how some input features of a CDM and a collision risk value as a target label (log base 10) are selected as described above to form a candidate input-output pair (most informative features are selected during feature engineering) for training machine learning models from time-series of CDMs.

Due to my domain expertise, I directly investigated whether the latest available collision risk values that are issued 2 or more days before the close approach are enough to predict the target risk class accurately, and I realised this was the case. Collision risk is derived from other input features, such as miss distance, position covariance and physical sizes of space objects, and it is an important feature. However, there are 23 cases that are turn out to be high-risk and 269 cases that turn out to be low-risk when the latest available collision risk values predict low-risk and high-risk respectively. Therefore, this work casts the collision risk prediction problem as an anomaly detection problem. It investigates the feasibility of using machine learning to predict collision risk class (high or low-risk) with desired accuracy by detecting anomalies (Figure 2).

The dataset has many challenges due to its small size, and it is more sensitive to the quality of data in general. Imbalance ratio, which is the ratio of the majority to the minority class, is 370 for low-risk target class and 3 for high-risk target class. Anomalies for low-risk target class has only 23 examples which is a challenge to even split data for validation purposes because the validation dataset should have enough examples for results to be statistically meaningful. In addition, the total number of samples for high-risk target class is 339. The stratified 3-folds cross-validator is implemented to avoid overfitting. An ensemble of various simple models with different hyperparameters that determines the similarity in input pairs is leveraged for better generalization. Figure shows the 2-D feature space representation of some input features (features that fuse multiple other features), namely time to close approach, mahalanobis distance, maximum risk estimate and scaling, and target labels using t-distributed stochastic neighbor embedding (t-SNE) algorithm [2]. It should be noted that due to significant class imbalance and class overlapping high-risk to low-risk anomalies, no statistical model can be developed for that case.

There are two neural networks in Manhattan-LSTM model, and they share weights [4]. The model utilizes LSTM to read the features that are derived from CDMs. Then, the similarity score between the input pairs is computed by using the similarity function $g = exp(-|h_{a}-h_{b}|_{1})$ (Figure 4). The LSTM is chosen because it allows variable lengths (to allow missing data to be included) as inputs, and this is not because the inputs are time-series. The total number of pairs for training is 1 million (normal-abnormal), and 10000 pairs are prepared for validation.

Table 1 shows the features that are selected based on the distribution differences between normal and anomalous case.
| Features | Variable | Description | 
|---|---|---|
| Time to tca | time_to_tca | Time interval between CDM creation and time-of-closest approach (days) | 
| Max risk estimate | max_risk_estimate | Maximum collision probability obtained by scaling combined covariance | 
| Mahalonobis distance | mahalanobis_distance | Miss distance scaled with uncertainty | 
| Miss distance | miss_distance | Relative position between chaser & target at tca (m) | 
| Number of CDMs issued | no_larger_2 | Number of CDMs issued before 2 days to the tca | 
| Mean of risk values | mean_larger_2 | Mean of collsion risk values for the CDMs issued before 2 days to the tca | 
| STD of risk values | std_larger_2 | STD of collsion risk values for the CDMs issued before 2 days to the tca | 
| Debris positional uncertainty (cross-track) | c_sigma_n | Covariance (cross-track) position standard deviation (sigma) in meters | 
| Debris positional uncertainty (along-track) | c_sigma_t | Covariance transverse (along-track) position standard deviation (sigma) in meters | 
| Debris positional uncertainty (radial) | c_sigma_r | Covariance radial position standard deviation (sigma) in meters | 
| Debris positional covariance determinant | c_position_covariance_det | Determinant of covariance | 
| Debris number of observations used | c_obs_used | Number of observations used for orbit determination (per CDM) | 
Table 2 shows the hyperparameters that are used to train the Manhattan-LSTM.
| Hyperparameters | Values | 
|---|---|
| Hidden layer size | 32 | 
| Batch size | 64 | 
| Epoch number | 100 | 
| Activation function | tanh | 
| Dropout | 0.05 | 
| Learning rate | 2e-5 | 
4. Results
Figure 5 shows the loss values for both training and validation data, and Figure 5 shows the area-under-curve values during training.


In Figure 7, correlation matrix for the test data is presented using the model for the epoch 55, which is the epoch when overfitting starts.

5. References
[1] https://kelvins.esa.int/collision-avoidance-challenge/
[2] Maaten, L.V.D. and Hinton, G., 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), pp.2579-2605. http://www.jmlr.org/papers/v9/vandermaaten08a.html
[3] Mueller, J. and Thyagarajan, A., 2016, March. Siamese recurrent architectures for learning sentence similarity. In thirtieth AAAI conference on artificial intelligence. https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12195
# Load the required libraries
from sklearn.metrics import fbeta_score
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns
import os
import random
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from time import time
from sklearn.model_selection import train_test_split
import itertools
import datetime
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Lambda, Dense, Conv1D, Flatten, Dropout
import tensorflow.keras.backend as K
from tensorflow.keras.optimizers import Adadelta, Adam, SGD
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.metrics import AUC
from sklearn.manifold import TSNE
from datagen import prepare_data
pd.set_option('display.max_columns', None)
#prepare the data to test challenge metric
#define paths
path_to_dataset = 'dataset/'
#read the training data
train = pd.read_csv(os.path.join(path_to_dataset,'train_data.csv'))
test = pd.read_csv(os.path.join(path_to_dataset,'test_data.csv'))
#Shapes of the train and test dataset
print("shape of train data {}".format(train.shape))
print("shape of test data {}".format(test.shape))
#train time to close approach min and max
train["time_to_tca"].describe().loc[["min","max"]]
#train time to close approach min and max
test["time_to_tca"].describe().loc[["min","max"]]
print("number of events that has less than 2.0 days data : ",len(train[train["time_to_tca"]<2.0].groupby("event_id").count()["time_to_tca"].tolist()))
print("number of events that has more than 2.0 days data : ",len(train[train["time_to_tca"]>2.0].groupby("event_id").count()["time_to_tca"].tolist()))
print("total number of events  : ",len(train.groupby("event_id").count()["time_to_tca"].tolist()))
events_less_than_two_days = train[train["time_to_tca"]<2.0].groupby("event_id")["event_id"].first().tolist()
events_more_than_two_days = train[train["time_to_tca"]>2.0].groupby("event_id")["event_id"].first().tolist()
# number of events that has time_to_tca larger and smaller than 2 at the same time
events_trainable = [value for value in events_less_than_two_days if value in events_more_than_two_days]
print("number of trainable events : ",len(events_trainable))
#generate new training data that uses last available CDM (closest to 2 days) and adds latest CDM risk as target variable
lenData = len(events_trainable)
new_train = []
target_variable = []
no_larger_2 = []
mean_larger_2 =[]
std_larger_2 = []
mean_nan_number =[]
std_nan_number = []
for cnt in range(lenData):
    new_train.append(train[(train["event_id"]==events_trainable[cnt])&(train["time_to_tca"]>=2.0)][-1:].values.tolist()[0])
    target_variable.append(train[(train["event_id"]==events_trainable[cnt])][-1:].risk.values[0])
    no_larger_2.append(train[(train["event_id"]==events_trainable[cnt])&(train["time_to_tca"]>=2.0)].shape[0])
    mean_larger_2.append(train[(train["event_id"]==events_trainable[cnt])&(train["time_to_tca"]>=2.0)].risk.mean())
    std_larger_2.append(train[(train["event_id"]==events_trainable[cnt])&(train["time_to_tca"]>=2.0)].risk.std(ddof=0))
    mean_nan_number.append(train[(train["event_id"]==events_trainable[cnt])&(train["time_to_tca"]>=2.0)].isnull().sum(axis=1).mean())
    std_nan_number.append(train[(train["event_id"]==events_trainable[cnt])&(train["time_to_tca"]>=2.0)].isnull().sum(axis=1).std(ddof=0))
new_traintrain_pd = pd.DataFrame(new_train,columns=train.columns.tolist())
#add target value to the dataset
target_variable_pd = pd.DataFrame(target_variable,columns=["target_risk"])
no_larger_2_pd = pd.DataFrame(no_larger_2,columns=["no_larger_2"])
mean_larger_2_pd = pd.DataFrame(mean_larger_2,columns=["mean_larger_2"])
std_larger_2_pd = pd.DataFrame(std_larger_2,columns=["std_larger_2"])
mean_nan_number_pd = pd.DataFrame(mean_nan_number,columns=["mean_nan_number"])
std_nan_number_pd = pd.DataFrame(std_nan_number,columns=["std_nan_number"])
new_traintrain_pd["target_risk"] = target_variable_pd.values
new_traintrain_pd["no_larger_2"] = no_larger_2_pd.values
new_traintrain_pd["mean_larger_2"] = mean_larger_2_pd.values
new_traintrain_pd["std_larger_2"] = std_larger_2_pd.values
new_traintrain_pd["mean_nan_number"] = mean_nan_number_pd.values
new_traintrain_pd["std_nan_number"] = std_nan_number_pd.values
new_traintrain_pd.head(5)
# Determines the high risk events
new_traintrain_pd['target_risk_class']=np.nan
new_traintrain_pd.loc[new_traintrain_pd['target_risk']>=-6.0,'target_risk_class'] = int(1)
new_traintrain_pd.loc[new_traintrain_pd['target_risk']<-6.0,'target_risk_class'] = int(0)
print("number of high risk events",new_traintrain_pd[new_traintrain_pd['target_risk_class']==1].shape[0])
print("number of low risk events",new_traintrain_pd[new_traintrain_pd['target_risk_class']==0].shape[0])
print("ratio of high risk versus low risk ",new_traintrain_pd[new_traintrain_pd['target_risk_class']==1].shape[0]/new_traintrain_pd[new_traintrain_pd['target_risk_class']==0].shape[0]*100)
# Uses the latest available risk value to determine the normal and anomalous cases
new_traintrain_pd['predicted_risk'] = 0 
new_traintrain_pd.loc[new_traintrain_pd['risk']<-6.0,'predicted_risk'] =  -6.000000000000001
new_traintrain_pd.loc[new_traintrain_pd['risk']>=-6.0,'predicted_risk'] =  new_traintrain_pd.loc[new_traintrain_pd['risk'] >=-6.0].risk.values
# Assigns the classes based on risk values
new_traintrain_pd['predicted_risk_class']=np.nan
new_traintrain_pd.loc[new_traintrain_pd['predicted_risk']>=-6.0,'predicted_risk_class'] = int(1)
new_traintrain_pd.loc[new_traintrain_pd['predicted_risk']<-6.0,'predicted_risk_class'] = int(0)
print("number of high risk events with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_class']==1].shape[0])
print("number of low risk events with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_class']==0].shape[0])
print("ratio of high risk versus low risk with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_class']==1].shape[0]/new_traintrain_pd[new_traintrain_pd['predicted_risk_class']==0].shape[0]*100)
new_traintrain_pd['predicted_risk_anomaly']=np.nan
new_traintrain_pd.loc[(new_traintrain_pd['predicted_risk']>=-6.0)&(new_traintrain_pd['target_risk']>=-6.0),'predicted_risk_anomaly'] = int(0)
new_traintrain_pd.loc[(new_traintrain_pd['predicted_risk']<-6.0)&(new_traintrain_pd['target_risk']<-6.0),'predicted_risk_anomaly'] = int(1)
new_traintrain_pd.loc[(new_traintrain_pd['predicted_risk']>=-6.0)&(new_traintrain_pd['target_risk']<-6.0),'predicted_risk_anomaly'] = int(2)
new_traintrain_pd.loc[(new_traintrain_pd['predicted_risk']<-6.0)&(new_traintrain_pd['target_risk']>=-6.0),'predicted_risk_anomaly'] = int(3)
print("number of 0 to 1 risk anomaly events with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_anomaly']==3].shape[0])
print("number of 1 to 0 risk anomaly events with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_anomaly']==2].shape[0])
print("number of 0 to 0 risk events with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_anomaly']==1].shape[0])
print("number of 1 to 1 risk events with above criterion",new_traintrain_pd[new_traintrain_pd['predicted_risk_anomaly']==0].shape[0])
#convert categorical c_object_type
le = LabelEncoder()
new_traintrain_pd["c_object_type"] = le.fit_transform(new_traintrain_pd["c_object_type"])
# Computes the missing value percentage
obj = new_traintrain_pd.isnull().sum()
for key,value in obj.iteritems():
    if value!=0:
        print("key : {} ----> missing data (percentage) :{} %".format(key,\
                                                                      value/8892.0*100))
# Fills missing values with mean values and checks whether there exist more missing data
new_traintrain_pd.fillna(new_traintrain_pd.mean(), inplace=True)
obj = new_traintrain_pd.isnull().sum()
for key,value in obj.iteritems():
    if value!=0:
        print("key : {} ----> missing data (percentage) :{} %".format(key,\
                                                                      value/8892.0*100))
# Scales the input data for standard scaler
from sklearn.preprocessing import StandardScaler
qt_risk = StandardScaler()
new_traintrain_pd['qt_risk'] = qt_risk.fit_transform(np.expand_dims(new_traintrain_pd['risk'].values,axis=1))
# new_test_pd['qt_risk'] = qt_risk.transform(np.expand_dims(new_test_pd['risk'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_risk'])
qt_time_to_tca = StandardScaler()
new_traintrain_pd['qt_time_to_tca'] = qt_time_to_tca.fit_transform(np.expand_dims(new_traintrain_pd['time_to_tca'].values,axis=1))
# new_test_pd['qt_time_to_tca'] = qt_time_to_tca.transform(np.expand_dims(new_test_pd['time_to_tca'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_time_to_tca'])
qt_max_risk_estimate = StandardScaler()
new_traintrain_pd['qt_max_risk_estimate'] = qt_max_risk_estimate.fit_transform(np.expand_dims(new_traintrain_pd['max_risk_estimate'].values,axis=1))
# new_test_pd['qt_max_risk_estimate'] = qt_max_risk_estimate.transform(np.expand_dims(new_test_pd['max_risk_estimate'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_max_risk_estimate'])
qt_max_risk_scaling = StandardScaler()
new_traintrain_pd['qt_max_risk_scaling'] = qt_max_risk_scaling.fit_transform(np.expand_dims(new_traintrain_pd['max_risk_scaling'].values,axis=1))
# new_test_pd['qt_max_risk_scaling'] = qt_max_risk_scaling.transform(np.expand_dims(new_test_pd['max_risk_scaling'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_max_risk_scaling'])
qt_mahalanobis_distance = StandardScaler()
new_traintrain_pd['qt_mahalanobis_distance'] = qt_mahalanobis_distance.fit_transform(np.expand_dims(new_traintrain_pd['mahalanobis_distance'].values,axis=1))
# new_test_pd['qt_mahalanobis_distance'] = qt_mahalanobis_distance.transform(np.expand_dims(new_test_pd['mahalanobis_distance'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_mahalanobis_distance'])
qt_miss_distance = StandardScaler()
new_traintrain_pd['qt_miss_distance'] = qt_miss_distance.fit_transform(np.expand_dims(new_traintrain_pd['miss_distance'].values,axis=1))
# new_test_pd['qt_miss_distance'] = qt_miss_distance.transform(np.expand_dims(new_test_pd['miss_distance'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_miss_distance'])
qt_no_larger_2 = StandardScaler()
new_traintrain_pd['qt_no_larger_2'] = qt_no_larger_2.fit_transform(np.expand_dims(new_traintrain_pd['no_larger_2'].values,axis=1))
# new_test_pd['qt_no_larger_2'] = qt_no_larger_2.transform(np.expand_dims(new_test_pd['no_larger_2'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_no_larger_2'])
qt_mean_larger_2 = StandardScaler()
new_traintrain_pd['qt_mean_larger_2'] = qt_mean_larger_2.fit_transform(np.expand_dims(new_traintrain_pd['mean_larger_2'].values,axis=1))
# new_test_pd['qt_mean_larger_2'] = qt_mean_larger_2.transform(np.expand_dims(new_test_pd['mean_larger_2'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_mean_larger_2'])
qt_std_larger_2 = StandardScaler()
new_traintrain_pd['qt_std_larger_2'] = qt_std_larger_2.fit_transform(np.expand_dims(new_traintrain_pd['std_larger_2'].values,axis=1))
# new_test_pd['qt_std_larger_2'] = qt_std_larger_2.transform(np.expand_dims(new_test_pd['std_larger_2'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_std_larger_2'])
qt_c_sigma_n = StandardScaler()
new_traintrain_pd['qt_c_sigma_n'] = qt_c_sigma_n.fit_transform(np.expand_dims(new_traintrain_pd['c_sigma_n'].values,axis=1))
# new_test_pd['qt_c_sigma_n'] = qt_c_sigma_n.transform(np.expand_dims(new_test_pd['c_sigma_n'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_c_sigma_n'])
qt_c_sigma_t = StandardScaler()
new_traintrain_pd['qt_c_sigma_t'] = qt_c_sigma_t.fit_transform(np.expand_dims(new_traintrain_pd['c_sigma_t'].values,axis=1))
# new_test_pd['qt_c_sigma_t'] = qt_c_sigma_t.transform(np.expand_dims(new_test_pd['c_sigma_t'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_c_sigma_t'])
qt_c_sigma_r = StandardScaler()
new_traintrain_pd['qt_c_sigma_r'] = qt_c_sigma_r.fit_transform(np.expand_dims(new_traintrain_pd['c_sigma_r'].values,axis=1))
# new_test_pd['qt_c_sigma_r'] = qt_c_sigma_r.transform(np.expand_dims(new_test_pd['c_sigma_r'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_c_sigma_r'])
qt_c_position_covariance_det = StandardScaler()
new_traintrain_pd['qt_c_position_covariance_det'] = qt_c_position_covariance_det.fit_transform(np.expand_dims(new_traintrain_pd['c_position_covariance_det'].values,axis=1))
# new_test_pd['qt_c_position_covariance_det'] = qt_c_position_covariance_det.transform(np.expand_dims(new_test_pd['c_position_covariance_det'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_c_position_covariance_det'])
qt_c_obs_used = StandardScaler()
new_traintrain_pd['qt_c_obs_used'] = qt_c_obs_used.fit_transform(np.expand_dims(new_traintrain_pd['c_obs_used'].values,axis=1))
# new_test_pd['qt_c_obs_used'] = qt_c_obs_used.transform(np.expand_dims(new_test_pd['c_obs_used'].values,axis=1))
# sns.distplot(new_traintrain_pd['qt_c_obs_used'])
# Prepares the data for Manhattan-LSTM
X_train, Y_train, X_train_val, Y_train_val, new_data_pd = prepare_data(new_traintrain_pd)
# Model variables
n_hidden = 32
gradient_clipping_norm = 0.5
batch_size = 64
n_epoch = 100
def exponent_neg_manhattan_distance(left, right):
    ''' Helper function for the similarity estimate of the LSTMs outputs'''
    return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))
left_input = Input(shape=(len(new_data_pd.feature1[0]),1), dtype='float32')
right_input = Input(shape=(len(new_data_pd.feature1[0]),1), dtype='float32')
print(left_input.shape)
# Since this is a siamese network, both sides share the same LSTM
shared_lstm = LSTM(n_hidden,activation='tanh',dropout=0.05)
left_output = shared_lstm(left_input)
right_output = shared_lstm(right_input)
# Calculates the distance as defined by the MaLSTM model
malstm_distance = Lambda(function=lambda x: exponent_neg_manhattan_distance(x[0], x[1]),output_shape=lambda x: (x[0][0], 1))([left_output, right_output])
# Pack it all up into a model
malstm = Model([left_input, right_input], [malstm_distance])
optimizer = Adam(learning_rate=0.00002,decay=0.000002)
malstm.compile(loss='mean_squared_error', optimizer=optimizer, metrics=[AUC()])
# Start training
training_start_time = time()
# Include the epoch in the file name (uses `str.format`)
checkpoint_path = "cp_epoch_{epoch:04d}.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
# Create a callback that saves the model's weights every 5 epochs
cp_callback = ModelCheckpoint(
    filepath=checkpoint_path, 
    verbose=1, 
    save_weights_only=True,
    period=5)
malstm_trained = malstm.fit([X_train['left'], X_train['right']], Y_train, batch_size=batch_size, nb_epoch=n_epoch,
                            validation_data=([X_train_val['left'], X_train_val['right']], Y_train_val),callbacks=[cp_callback])
print("Training time finished.\n{} epochs in {}".format(n_epoch, datetime.timedelta(seconds=time()-training_start_time)))
sns.set()
plt.plot(malstm_trained.history['loss'],'r-.',label='loss')
plt.plot(malstm_trained.history['val_loss'],'g-*',label='val_loss')
plt.legend()
plt.plot(malstm_trained.history['auc'],'r-.',label='auc')
plt.plot(malstm_trained.history['val_auc'],'g-*',label='val_auc')
plt.xlabel('Iterations')
plt.ylabel('Auc')
plt.legend()
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
import itertools
def plot_confusion_matrix(cm,classes,normalize=False,title='confusion matrix',cmap=plt.cm.Blues):
  if normalize:
    cm = cm.astype('float')/cm.sum(axis=1)[:,np.newaxis]
    print('normalized')
  else:
    print('not normalized')
  # print(cm)
  plt.imshow(cm,interpolation='nearest',cmap=cmap)
  plt.title(title)
  plt.colorbar()
  tick_marks = np.arange(-1,len(classes))
  plt.xticks(tick_marks,np.array(classes)-1,rotation=45)
  plt.yticks(tick_marks,np.array(classes)-1)
  fmt = '.2f' if normalize else 'd'
  thresh = cm.max()/2.0
  for i,j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
    plt.text(j,i,format(cm[i,j],fmt),horizontalalignment='center',color='white' if cm[i,j]>thresh else 'black')
  plt.tight_layout()
  plt.ylabel('True Label')
  plt.xlabel('Predicted Label')
  plt.show()
path = 'cp_epoch_0055.ckpt'
malstm.load_weights(path)
p_test = malstm.predict([X_train_val['left'], X_train_val['right']])
p_test_pd = pd.DataFrame(p_test,columns=['prob'])
p_test_pd.loc[p_test_pd['prob']>=0.5,'prob'] = int(1)
p_test_pd.loc[p_test_pd['prob']<0.5,'prob'] = int(0)
cm = confusion_matrix(Y_train_val,p_test_pd.prob)
plot_confusion_matrix(cm,list(range(3)))
import sklearn
sklearn.metrics.f1_score(Y_train_val,p_test_pd.prob)