# set matplotlib backend to inline
%matplotlib inline 

# import modules
from sklearn import datasets 
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
import math

# Collections
from collections import Counter

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# load data
breast_cancer=datasets.load_breast_cancer()
print(breast_cancer.DESCR)

# this dataset has 30 features 
df_breast_cancer = pd.DataFrame(breast_cancer.data, columns = breast_cancer.feature_names )


# extract the data as numpy arrays of features, X, and target, y
X = df_breast_cancer
y = breast_cancer.target

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.


# Define plotting function
# We can create a function that, given data X and labels y, plots this grid.  The function should be invoked 
# something like this: myplotGrid(X,y,...)

# where X is our training data and y are the labels (you may also supply additional optional arguments). We can 
# We can use an appropriate library to help us create the visualisation. We can code it ourselves using 
# matplotlib functions, scatter and hist

# This populates info regarding the dataset. 
# Plot x against y, using the target labels/ class i.e. benign or malignant
# There are 2, which are indicated by the 2 different colours on plot

X=breast_cancer.data
Y=breast_cancer.target
feature_names = np.array(breast_cancer.feature_names)
target_names = breast_cancer.target_names

"""
Creating plotting function in which given x and y is transformed to dataframe
and after that seaborn module is used to plot scatter and histogram
"""
def myplotgrid(x,y,col):
    df = pd.DataFrame(x, columns=col)
    df['label'] = y
    sns.pairplot(df, hue='label')
    plt.show()


#****************************************** #
# Plot using function (NO NOISE added)#
#****************************************** #

#Use plotting function for any two features
#that is, parameterise, to plot different feature combinations

print(feature_names)

# We can index the features using variables, e.g.
a = 22 #substitute different values here and play around with the feature combinations
b = 27 ##substitute different values here and play around with the feature combinations

# (remember that indices in python start at 0!)

# plt.scatter(X[:, a], X[:, b], c=Y, cmap=plt.cm.Paired)
# plt.xlabel(feature_names[a])
# plt.ylabel(feature_names[b])
myplotgrid(X[:, [a,b]], Y, col=feature_names[[a,b]])

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


#****************************************** #
# Plot using function (WITH NOISE added)#
#****************************************** #

#use plotting function for any two features
#that is, parameterise, to plot different feature combinations

print(feature_names)

# we can index the features using variables, e.g.
a = 22 #substitute different values here and play around with the feature combinations
b = 27 ##substitute different values here and play around with the feature combinations

# (remember that indices in python start at 0!)

# plt.scatter(X[:, a], X[:, b], c=Y, cmap=plt.cm.Paired)
# plt.xlabel(feature_names[a])
# plt.ylabel(feature_names[b])

myseed = 12345
np.random.seed(myseed)
X_noisy=X+np.random.normal(0,0.5,X.shape)

myplotgrid(X_noisy[:, [a,b]], y, col=feature_names[[a,b]])

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


#x_train = np.append(X_train,y_train.reshape(-1,1),axis=1)
#x_test = np.append(X_test,y_test.reshape(-1,1),axis=1)


# mykNN code
# There are only 3 steps for kNN:

# 1. Function for calculating Euclidean distance (known as L2 vector norm) between two vectors
def euclidean_distance(row1,row2) -> float:
    distance = 0.0
    
    for i in range(len(row1)):
        distance += (row1[i] - row2[i])**2

    return round(math.sqrt(distance), 4)

# 1. Function for calculating Manhattan distance (known as L1 vector norm) between two vectors 
def manhattan(row1,row2): 
    distance = 0.0
    for i in range(len(row1)-1):
        distance += abs(row1[i] - row2[i])
    return distance

# 1. Function to calculate Chebyshev distance (known as the L∞ metric) between two vectors
def chebyshev_distance(array1, array2) -> float:
    """
    Function that takes in 2 vectors and calculates the Chebyshev Distance between them.
    Chebyshev distance is the absolute magnitude of the maximum distance between 2 points.
    Input parameters: array1, array2 -> 2 vectors.
    Output: Returns a floating point number representing the Chebyshev Distance between array1 and array2.
    """
    distance = 0.0
    
    for i in range(len(array1)):
        
        # Calculate absolute value of the distance between elements of the vectors
        dist = abs(array1[i] - array2[i])

        # Updating the maximum distance value
        if dist > distance:
            distance = dist
        
    return float(distance)

# Creating a function mapping
function_mappings = {'euclidean': euclidean_distance, 'manhattan' : manhattan, 'chebyshev' : chebyshev_distance}

# 2. Locate the most similar neighbors using our manual function
def mykNN(X,y,X_, num_neighbors,distance=euclidean_distance):
    distances = list()
    train = np.append(X,y.reshape(-1,1),axis=1)
    test_row = X_
    for train_row in train:
        dist = distance(test_row,train_row)
        distances.append((train_row, dist))

    #for each point distance is measured, a list of these distances is created. our requirement is the shortest distance
    #as we need to assign data to the neighbours.  for this purpose, lambda function takes distances list and sorts list in ascending order
    #only top distance assigned to each neighbour i.e. 1st kNN gets 1st distance from list, 2nd kNN gets 2nd distance from list...and so on
    distances.sort(key=lambda tup: tup[1])
    
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction


# 3. vote for labels
# testing mykNN with manhattan distance
predict = mykNN(X,y,X[100],3,manhattan)
print('Actual:', y[100])
print('Predicted:', predict)

Actual: 0
Predicted: 0.0


"""
testing mykNN with euclidean distance. By default the function is 
set to euclidean, so even if distance is not defined it will choose euclidean
"""
predict = mykNN(X,y,X[100],3)
print('Actual:', y[100])
print('Predicted:', predict)

Actual: 0
Predicted: 0.0


# 3.vote for labels
# testing mykNN with chebychev distance
predict = mykNN(X,y,X[100],3,chebyshev_distance)
print('Actual:', y[100])
print('Predicted:', predict)

Actual: 0
Predicted: 0.0


#import scikit-learn for kneighboursclassifier (for verification purposes only)
#guide and verifying manual model above
np.random.seed(myseed) #random selections will be consistent every run, so results stay the same
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test
#80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. Refer to reference 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#define knn classifier, with 10 neighbors and use the euclidian distance
knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')

#define training and testing data, fit the classifier
knn.fit(X_train,y_train)

#predict values for test data based on training data
y_pred=knn.predict(X_test)

# predicting with mykNN
myknn_pred = list()
for i in range(len(X_test)):
    myknn_pred.append(int(mykNN(X_train,y_train,X_test[i], 10)))
    
#print values
print("True Values", y_test) # true values
print("\nSklearn's KNN Classifier: ", y_pred) 
print("\nOur KNN Classifier: ", myknn_pred) 
print(len(y_pred) == len(myknn_pred))
print(len(y_pred))

True Values [1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1
 1 0 1 1 0 0 1 1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 1 1 0 0 0 1 1 1 1
 0 0 0 1 1 1 1 1 0 1 0 1 0 1 1 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0
 0 1 0]

Sklearn's KNN Classifier:  [1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1
 1 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 0 1 0 1 1 0 0 0 1 1 1 1
 0 0 0 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0
 0 1 1]

Our KNN Classifier:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
True
114


# Calculating accuracies of both classifiers
from sklearn.metrics import accuracy_score as acc
print("Accuracy of Sklearn KNN Classifier: ", acc(y_test, y_pred))
print("Accuracy of Our KNN Classifier: ", acc(y_test, myknn_pred))

Accuracy of Sklearn KNN Classifier:  0.9298245614035088
Accuracy of Our KNN Classifier:  0.9298245614035088


labels = breast_cancer.target_names


def evaluation_metric(true, predict, labels=labels):
    actual = np.array(true)
    predicted = np.array(predict)

    #calculate the confusion matrix; labels is numpy array of classification labels
    cm = np.zeros((len(labels), len(labels)))
    for a, p in zip(actual, predicted):
            #print(a, p)
            cm[a][p] += 1

    #also get the accuracy easily with numpy
    accuracy = (actual == predicted).sum() / float(len(actual))

    #also get the precision easily with numpy. precision = tp / tp + fp
    prec = list()
    for i in range(2): #range is 2, as we're deling with a binary classifier (as opposed to multi-class)
        fn = 0.0
        tp = cm[i][i]
        for j in range(2):        
            fn += cm[j][i]
        prec.append(tp/(fn))
    precision = sum(prec)/2

    #also get the recall easily with numpy
    recall = list()
    for i in range(2):
        fn = 0.0
        tp = cm[i][i]
        for j in range(2):        
            fn += cm[i][j]
        recall.append(tp/(fn))
    recall = sum(recall)/2

    # also get the error rate easily with numpy
    error_rate = (np.sum(cm) - np.trace(cm)) / np.sum(cm)

    return(cm, accuracy,precision,recall,error_rate)
    

cm, accuracy, precision, recall, error_rate = evaluation_metric(y_test, y_pred, labels)
print('Confusion Matrix: \n',cm)
print('Accuracy: %.3f'%accuracy)
print('Precision: %.3f'%precision)
print('Recall: %.3f'%recall)
print('Error Rate: %.3f'%error_rate)

Confusion Matrix: 
 [[38.  4.]
 [ 4. 68.]]
Accuracy: 0.930
Precision: 0.925
Recall: 0.925
Error Rate: 0.070


# test evaluation code

#compare the results to sklearnlibrary (for verification purposes)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score,precision_score,recall_score

print('Confusion Matrix: \n',confusion_matrix(y_test,y_pred))
print('Accuracy: %.3f'%accuracy_score(y_test,y_pred))
print('Precision: %.3f'%precision_score(y_test,y_pred,average='weighted'))
print('Recall: %.3f'%recall_score(y_test,y_pred,average='weighted'))

Confusion Matrix: 
 [[38  4]
 [ 4 68]]
Accuracy: 0.930
Precision: 0.930
Recall: 0.930


# Function to implement Nested Cross Validation:

# 1) Return accuracy per distinct set of hyperparameters
# 2) Return best set of parameters(k-neighbours & distance method) after evaluating validation fold (inner loop)
# 3) Confusion matrix per testing fold (outer loop)

def NestedCV(X, y, n_folds, seed = 0):
    """
    Input Parameters:
    -----------------
    X, y: arrays of features, labels
    n_folds: Number of folds to evaluate
    n_neighbors: List of the number of neighbours to evaluate
    distance: List of the distance methods to use 
    seed: Setting random seed for reproducibility
    
    Returns:
    --------
    most_common: Best performing parameters from all 5 folds.
    params_per_fold: List of best fitting hyper parameters per fold. 
    cm_df: Confusion Matrix in a Pandas DataFrame evaluated on the test set.
    test_accuracy: Test Accuracy of the model.
    validation_metrics: Validation precision and recall.
    test_metrics: Test Precision and recall.
    """
    
    # Setting random seed for numpy operations
    np.random.seed(seed)
    
    # Dividing data samples into folds
    indices = np.random.permutation(np.arange(0, len(X), 1))
    indices = np.array_split(indices, n_folds)
    
    # To preserve the parameters and their metrics
    params_per_fold = list()  # Stores best parameters along with accuracy for every fold
    test_accuracies, test_metrics = list(), list()  # Stores accuracy, precision and recall of test folds
    
    
    # Hyperparameters -> the model's performance will be evaluated with different combinations of these
    function_mappings = {'euclidean': euclidean_distance, 'manhattan' : manhattan, 'chebyshev' : chebyshev_distance}

    distance_functions = ['euclidean', 'manhattan' , 'chebyshev']
    n_neighbors = list(range(1, 11))

    
    # Preparing a list of different permutations of the hyperparameters
    parameter_list = [(x, y) for x in distance_functions for y in n_neighbors]
    
    # Outer for loop to divide dataset into train/valid/test subset
    
    for iFold in range(0, n_folds):
        testFold = indices[iFold]
        print('\n\n-----------------------  Fold %s  -----------------------'%iFold)

        remaining_folds = np.delete(range(0, n_folds), iFold)
        validationFold = indices[ remaining_folds[0] ]
        trainfold = indices[remaining_folds[1]]
        
        for i in range(2,len(remaining_folds)):
            trainfold = np.concatenate([trainfold, indices[i]])
        
        print('\n1) Accuracy per distinct set of parameters on the validation set')
        
        # Storing results (to calculate max accuracy for every fold)
        results = list()
        
        # Inner loop to fit each parameter set on the (same) validation fold
        # The inner loop is responsible for model selection/hyperparameter tuning (similar to validation set)
        for each_combo in parameter_list:
            
            predictions = list() 
            # Inner inner for loop to train & test model
            for valid in validationFold:
                
                # Getting predictions
                
                predictions.append(int(mykNN(X[trainfold], y[trainfold], X[valid], num_neighbors = each_combo[1], distance = function_mappings[each_combo[0]])))
                    
            
            # Calculate metrics for these predictions; only retain accuracy for each set
            _, validation_accuracy, validation_precision, validation_recall, _ = evaluation_metric(y[validationFold], predictions)
                
            # Rounding accuracy, precision and recall to 4 decimal points
            validation_accuracy = round(validation_accuracy, 4)
            validation_precision = round(validation_precision, 4)
            validation_recall = round(validation_recall, 4)
            
            # Retain accuracy per set & accuracy along with distance & neighbours
            results.append((validation_accuracy, validation_precision, validation_recall, each_combo[0], each_combo[1]))
                
            # Print accuracy for the distinct set of paramters
            print('\n     Accuracy: {} for {} distance method with {} neighbours'.format(validation_accuracy, 
                                                                                         each_combo[0], 
                                                                                         each_combo[1]))
        
        # Best set of parameters for the fold after validation
        best_set = max(results)
        
        print('\n\n2) Best set of paramters for the fold (after validation): ', best_set)

        # Saving best parameters after every fold
        params_per_fold.append(best_set)
        
    
            
        # Evaluate on the test set
        test_pred = list()
        for test in testFold:
            test_pred.append(int(mykNN(X[trainfold], y[trainfold], X[test], 
                                       num_neighbors = each_combo[1], 
                                       distance = function_mappings[each_combo[0]])))
        
        
        cm, test_accuracy, test_precision, test_recall,_ = evaluation_metric(y[testFold],test_pred)
        test_accuracies.append(test_accuracy)
        test_metrics.append((test_precision, test_recall))
        
        cm_df = pd.DataFrame(cm, columns = target_names, index = target_names)
        print('\n\n3) Confusion Matrix per fold (on the test set) \n\n',cm_df)
    
    most_common = max(set(params_per_fold), key = params_per_fold.count)
    print('\n\n4) Best Parameters: ', most_common)
    
    return most_common, cm_df, test_accuracies, params_per_fold, test_metrics


# Evaluate Nested CV on Clean Data Code

best_parameters, clean_cm, clean_test_accuracies, per_fold_clean, clean_test_metrics = NestedCV(X, y, n_folds = 5)


-----------------------  Fold 0  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.9649 for euclidean distance method with 1 neighbours

     Accuracy: 0.9386 for euclidean distance method with 2 neighbours

     Accuracy: 0.9211 for euclidean distance method with 3 neighbours

     Accuracy: 0.9123 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9737 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9737 for manhattan distance method with 1 neighbours

     Accuracy: 0.9649 for manhattan distance method with 2 neighbours

     Accuracy: 0.9737 for manhattan distance method with 3 neighbours

     Accuracy: 0.9561 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9649 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9649 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9825 for manhattan distance method with 10 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 1 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9298 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9298 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9825, 0.9797, 0.9797, 'manhattan', 10)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       44.0     3.0
benign           2.0    65.0


-----------------------  Fold 1  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.9035 for euclidean distance method with 1 neighbours

     Accuracy: 0.886 for euclidean distance method with 2 neighbours

     Accuracy: 0.8947 for euclidean distance method with 3 neighbours

     Accuracy: 0.9035 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9474 for euclidean distance method with 8 neighbours

     Accuracy: 0.9561 for euclidean distance method with 9 neighbours

     Accuracy: 0.9474 for euclidean distance method with 10 neighbours

     Accuracy: 0.8947 for manhattan distance method with 1 neighbours

     Accuracy: 0.8947 for manhattan distance method with 2 neighbours

     Accuracy: 0.9123 for manhattan distance method with 3 neighbours

     Accuracy: 0.9298 for manhattan distance method with 4 neighbours

     Accuracy: 0.9474 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9474 for manhattan distance method with 7 neighbours

     Accuracy: 0.9386 for manhattan distance method with 8 neighbours

     Accuracy: 0.9386 for manhattan distance method with 9 neighbours

     Accuracy: 0.9474 for manhattan distance method with 10 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8684 for chebyshev distance method with 2 neighbours

     Accuracy: 0.8772 for chebyshev distance method with 3 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9298 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9561, 0.9562, 0.9532, 'euclidean', 9)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       32.0     4.0
benign           1.0    77.0


-----------------------  Fold 2  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.9035 for euclidean distance method with 1 neighbours

     Accuracy: 0.8947 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9298 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9561 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9123 for manhattan distance method with 1 neighbours

     Accuracy: 0.9035 for manhattan distance method with 2 neighbours

     Accuracy: 0.9474 for manhattan distance method with 3 neighbours

     Accuracy: 0.9386 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9561 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9649 for manhattan distance method with 10 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9649 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9649, 0.9671, 0.9606, 'manhattan', 10)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       33.0     7.0
benign           5.0    69.0


-----------------------  Fold 3  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.9035 for euclidean distance method with 1 neighbours

     Accuracy: 0.8947 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9298 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9561 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9123 for manhattan distance method with 1 neighbours

     Accuracy: 0.9035 for manhattan distance method with 2 neighbours

     Accuracy: 0.9474 for manhattan distance method with 3 neighbours

     Accuracy: 0.9386 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9561 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9649 for manhattan distance method with 10 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9649 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9649, 0.9671, 0.9606, 'manhattan', 10)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       41.0     5.0
benign           2.0    66.0


-----------------------  Fold 4  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.9035 for euclidean distance method with 1 neighbours

     Accuracy: 0.8947 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9298 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9561 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9123 for manhattan distance method with 1 neighbours

     Accuracy: 0.9035 for manhattan distance method with 2 neighbours

     Accuracy: 0.9474 for manhattan distance method with 3 neighbours

     Accuracy: 0.9386 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9561 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9649 for manhattan distance method with 10 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9649 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9649, 0.9671, 0.9606, 'manhattan', 10)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       39.0     4.0
benign           2.0    68.0


4) Best Parameters:  (0.9649, 0.9671, 0.9606, 'manhattan', 10)


#clean_test_accuracy, best_parameters
per_fold_clean

[(0.9825, 0.9797, 0.9797, 'manhattan', 10),
 (0.9561, 0.9562, 0.9532, 'euclidean', 9),
 (0.9649, 0.9671, 0.9606, 'manhattan', 10),
 (0.9649, 0.9671, 0.9606, 'manhattan', 10),
 (0.9649, 0.9671, 0.9606, 'manhattan', 10)]


clean_test_metrics

[(0.9562020460358056, 0.9531597332486503),
 (0.9601571268237935, 0.938034188034188),
 (0.8881578947368421, 0.8787162162162162),
 (0.9415329184408778, 0.9309462915601023),
 (0.9478319783197832, 0.9392026578073089)]


# Evaluate Nested CV on Noisy Data Code:

best_parameter_noisy, noisy_cm, noisy_test_accuracies, per_fold_noisy, noisy_test_metrics = NestedCV(X_noisy, y, n_folds = 5)


-----------------------  Fold 0  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.9561 for euclidean distance method with 1 neighbours

     Accuracy: 0.9298 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9123 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9649 for euclidean distance method with 8 neighbours

     Accuracy: 0.9561 for euclidean distance method with 9 neighbours

     Accuracy: 0.9561 for euclidean distance method with 10 neighbours

     Accuracy: 0.9561 for manhattan distance method with 1 neighbours

     Accuracy: 0.9474 for manhattan distance method with 2 neighbours

     Accuracy: 0.9561 for manhattan distance method with 3 neighbours

     Accuracy: 0.9474 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9649 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9737 for manhattan distance method with 10 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 1 neighbours

     Accuracy: 0.9298 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9298 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9298 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9737, 0.9731, 0.9658, 'manhattan', 10)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       44.0     3.0
benign           2.0    65.0


-----------------------  Fold 1  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.8947 for euclidean distance method with 1 neighbours

     Accuracy: 0.886 for euclidean distance method with 2 neighbours

     Accuracy: 0.8947 for euclidean distance method with 3 neighbours

     Accuracy: 0.886 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9474 for euclidean distance method with 8 neighbours

     Accuracy: 0.9561 for euclidean distance method with 9 neighbours

     Accuracy: 0.9561 for euclidean distance method with 10 neighbours

     Accuracy: 0.886 for manhattan distance method with 1 neighbours

     Accuracy: 0.8947 for manhattan distance method with 2 neighbours

     Accuracy: 0.9035 for manhattan distance method with 3 neighbours

     Accuracy: 0.9298 for manhattan distance method with 4 neighbours

     Accuracy: 0.9474 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9474 for manhattan distance method with 7 neighbours

     Accuracy: 0.9474 for manhattan distance method with 8 neighbours

     Accuracy: 0.9561 for manhattan distance method with 9 neighbours

     Accuracy: 0.9474 for manhattan distance method with 10 neighbours

     Accuracy: 0.9035 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8772 for chebyshev distance method with 2 neighbours

     Accuracy: 0.886 for chebyshev distance method with 3 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9386 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9561, 0.9562, 0.9532, 'manhattan', 9)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       32.0     4.0
benign           1.0    77.0


-----------------------  Fold 2  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.8947 for euclidean distance method with 1 neighbours

     Accuracy: 0.8947 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9298 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9561 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9035 for manhattan distance method with 1 neighbours

     Accuracy: 0.9123 for manhattan distance method with 2 neighbours

     Accuracy: 0.9474 for manhattan distance method with 3 neighbours

     Accuracy: 0.9386 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9561 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9561 for manhattan distance method with 10 neighbours

     Accuracy: 0.9035 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9649 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9649, 0.9671, 0.9606, 'manhattan', 9)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       33.0     7.0
benign           5.0    69.0


-----------------------  Fold 3  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.8947 for euclidean distance method with 1 neighbours

     Accuracy: 0.8947 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9298 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9561 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9035 for manhattan distance method with 1 neighbours

     Accuracy: 0.9123 for manhattan distance method with 2 neighbours

     Accuracy: 0.9474 for manhattan distance method with 3 neighbours

     Accuracy: 0.9386 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9561 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9561 for manhattan distance method with 10 neighbours

     Accuracy: 0.9035 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9649 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9649, 0.9671, 0.9606, 'manhattan', 9)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       41.0     5.0
benign           2.0    66.0


-----------------------  Fold 4  -----------------------

1) Accuracy per distinct set of parameters on the validation set

     Accuracy: 0.8947 for euclidean distance method with 1 neighbours

     Accuracy: 0.8947 for euclidean distance method with 2 neighbours

     Accuracy: 0.9298 for euclidean distance method with 3 neighbours

     Accuracy: 0.9298 for euclidean distance method with 4 neighbours

     Accuracy: 0.9474 for euclidean distance method with 5 neighbours

     Accuracy: 0.9474 for euclidean distance method with 6 neighbours

     Accuracy: 0.9561 for euclidean distance method with 7 neighbours

     Accuracy: 0.9561 for euclidean distance method with 8 neighbours

     Accuracy: 0.9649 for euclidean distance method with 9 neighbours

     Accuracy: 0.9649 for euclidean distance method with 10 neighbours

     Accuracy: 0.9035 for manhattan distance method with 1 neighbours

     Accuracy: 0.9123 for manhattan distance method with 2 neighbours

     Accuracy: 0.9474 for manhattan distance method with 3 neighbours

     Accuracy: 0.9386 for manhattan distance method with 4 neighbours

     Accuracy: 0.9561 for manhattan distance method with 5 neighbours

     Accuracy: 0.9474 for manhattan distance method with 6 neighbours

     Accuracy: 0.9649 for manhattan distance method with 7 neighbours

     Accuracy: 0.9561 for manhattan distance method with 8 neighbours

     Accuracy: 0.9649 for manhattan distance method with 9 neighbours

     Accuracy: 0.9561 for manhattan distance method with 10 neighbours

     Accuracy: 0.9035 for chebyshev distance method with 1 neighbours

     Accuracy: 0.8947 for chebyshev distance method with 2 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 3 neighbours

     Accuracy: 0.9211 for chebyshev distance method with 4 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 5 neighbours

     Accuracy: 0.9474 for chebyshev distance method with 6 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 7 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 8 neighbours

     Accuracy: 0.9649 for chebyshev distance method with 9 neighbours

     Accuracy: 0.9561 for chebyshev distance method with 10 neighbours


2) Best set of paramters for the fold (after validation):  (0.9649, 0.9671, 0.9606, 'manhattan', 9)


3) Confusion Matrix per fold (on the test set) 

            malignant  benign
malignant       39.0     4.0
benign           2.0    68.0


4) Best Parameters:  (0.9649, 0.9671, 0.9606, 'manhattan', 9)


#noisy_test_accuracy, best_parameters
per_fold_noisy

[(0.9737, 0.9731, 0.9658, 'manhattan', 10),
 (0.9561, 0.9562, 0.9532, 'manhattan', 9),
 (0.9649, 0.9671, 0.9606, 'manhattan', 9),
 (0.9649, 0.9671, 0.9606, 'manhattan', 9),
 (0.9649, 0.9671, 0.9606, 'manhattan', 9)]


noisy_test_metrics

[(0.9562020460358056, 0.9531597332486503),
 (0.9601571268237935, 0.938034188034188),
 (0.8881578947368421, 0.8787162162162162),
 (0.9415329184408778, 0.9309462915601023),
 (0.9478319783197832, 0.9392026578073089)]


# Summarising results ---> Clean data
best_parameters, clean_cm, clean_test_accuracies, per_fold_clean
print('\nSummarising Results of KNN Classifier on clean data')
print('---------------------------------------------------')

# Average Validation Accuracy of 5 folds
accuracies = [x[0] for x in per_fold_clean]
print('Average Validation Accuracy: ', round(np.mean(accuracies), 4))

# Average Validation Precision of 5 folds
precision = [x[0] for x in clean_test_metrics]
print('Average Validation Precision: ', round(np.mean(precision), 4))

# Average Validation Recall of 5 folds
recall = [x[1] for x in clean_test_metrics]
print('Average Validation Recall: ', round(np.mean(recall), 4))

# Average Test accuracy of 5 folds
print('\nAverage Test accuracy: ', round(np.mean(clean_test_accuracies), 4))
print('\n')

# Confusion Matrix
fig, ax = plt.subplots()
im = ax.imshow(clean_cm)

# Setting the class labels as ticks
ax.set_xticks(np.arange(len(range(0, 2))))
ax.set_yticks(np.arange(len(range(0, 2))))

# Setting up x- & y-axes labels
ax.set_xlabel('Predicted Class', fontweight = 'semibold')
ax.set_ylabel('Actual Class', fontweight = 'semibold')

# Loop over data dimensions and create text annotations.
for i in range(0, 2):
    for j in range(0, 2):
        text = ax.text(j, i, noisy_cm.iloc[i, j], fontsize = 'x-large', fontweight = 'semibold', 
                       ha = "center", va = "center", color = "w")

ax.set_title("Confusion Matrix (Clean Data)", fontweight = "bold")
fig.tight_layout()
plt.show()

Summarising Results of KNN Classifier on clean data
---------------------------------------------------
Average Validation Accuracy:  0.9667
Average Validation Precision:  0.9388
Average Validation Recall:  0.928

Average Test accuracy:  0.9385


# Summarising results ---> Noisy data
print('\nSummarising Results of KNN Classifier on noisy data')
print('---------------------------------------------------')

# Average Validation Accuracy of 5 folds
accuracies = [x[0] for x in per_fold_noisy]
print('Average Validation Accuracy: ', round(np.mean(accuracies), 4))


# Average Test Precision of 5 folds
precision = [x[0] for x in noisy_test_metrics]
print('Average Validation Precision: ', round(np.mean(precision), 4))

# Average Test Recall of 5 folds
recall = [x[1] for x in noisy_test_metrics]
print('Average Validation Recall: ', round(np.mean(recall), 4))

# Average Test accuracy of 5 folds
print('\nAverage Test accuracy: ', round(np.mean(noisy_test_accuracies), 4))
print('\n\n')

# Plotting Confusion Matrix
fig, ax = plt.subplots()
im = ax.imshow(noisy_cm)

# Setting the class labels as ticks
ax.set_xticks(np.arange(len(range(0, 2))))
ax.set_yticks(np.arange(len(range(0, 2))))

# Setting up x- & y-axes labels
ax.set_xlabel('Predicted Class', fontweight = 'semibold')
ax.set_ylabel('Actual Class', fontweight = 'semibold')

# Loop over data dimensions and create text annotations.
for i in range(0, 2):
    for j in range(0, 2):
        text = ax.text(j, i, noisy_cm.iloc[i, j], fontsize = 'x-large', fontweight = 'semibold', 
                       ha = "center", va = "center", color = "w")

ax.set_title("Confusion Matrix (Noisy Data)", fontweight = "bold")
fig.tight_layout()
plt.show()

Summarising Results of KNN Classifier on noisy data
---------------------------------------------------
Average Validation Accuracy:  0.9649
Average Validation Precision:  0.9388
Average Validation Recall:  0.928

Average Test accuracy:  0.9385

Diagnosis of Breast Cancer¶

Utilising Binary, Supervised Machine Learning (Distance Based Algorithm i.e. KNN) and Using 5 Fold Nested Cross Validation for Robust Model Evaluation and Hyper-parameter Tuning¶

Introduction¶

Dataset¶

Exploratory Data Analysis¶

Visualising the data¶

Exploratory Data Analysis under noise¶

Exploratory Data Analysis Summary¶

Data with noise¶

Implementing the kNN Algorithm¶

Summary of Implementing KNN (Distance Based Algorithm):¶

Classifier evaluation¶

Summary of Classifier Evaluation:¶

Nested Cross-Validation Using Our Implementation of KNN¶

Summary of Nested Cross Validation Results - Utilising Best Parameters¶

Confusion Matrices¶

Confusion Matrices Summary¶

Conclusion¶

References¶

Fold	accuracy	k	distance
1	98.3	10	manhattan
2	95.6	9	euclidian
3	96.5	10	manhattan
4	96.5	10	manhattan
5	96.5	10	manhattan
total	96.68 $\pm$ 0.88

Fold	accuracy	k	distance
1	97.4	10	manhatten
2	95.6	9	manhattan
3	96.5	9	manhattan
4	96.5	9	manhattan
5	96.5	9	manhattan
total	96.50 $\pm$ 0.57