Confusion Matrix Explained
(CM Explained)
Visualise and Validate Machine Learning Data in VS Code.
Explainable models create trust
The further development of code visualization has brought about some interesting and promising innovations in recent years. This includes in particular the ongoing integration of machine-learning-specific technologies, such as the integration of the Jupyter notebook format in VS Code and MS Power BI, or calling TensorBoard from TensorFlow to record and display training results. This illustrates how far the optimization of code visualization has already progressed.
by Max Kleiner
However, an immediate benefit is already clear today: areas such as robotics, expert systems, mathematical optimization, anomaly detection, feature reduction or model-based control would be easier to explain if the model could directly show, by means of a corresponding graphic, the features it used for a decision. That is the purpose of this article.
The goal is to understand as exactly as possible why and how an AI makes certain decisions. With image recognition algorithms, for example, a colored heat map shows the locations in an image that are particularly relevant for its classification.
We start with a simple data set for a classification system and visualize the decision of the classifier with a confusion matrix and an associated heat map. As an IDE, I use Visual Studio Code with the two configuration files tasks.json and the project-specific settings.json, including test units and path details. Both files are shown in the listings below.
As an introduction to VS Code with Python, I can recommend the tutorial [1], which Microsoft published with the then-current March 2020 release (version 1.44): “Tutorials for creating Python containers and building Data Science models”.
Now we start with the imported modules in Listing 1 and call our script logregclassifier2.py [2] or [7] as a notebook.
Listing 1
# get the modules we need
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
End
The Dataset
After various ML projects, I wanted to write this article to share my experience and maybe help some of you integrate machine learning with classification. The data itself is deliberately neutral and simple, so that it stays clear and understandable. As training data I chose a (completely meaningless) series of numbers from 0 to 9 (samples) to classify a target with 0 and 1¹. When the labels (targets) are known, one also speaks of supervised learning. So we want to train the system so that the low numbers are likely to be classified as 0 and the high numbers as 1:
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
Listing 2
# arrays for the input (X) and output (y) values:
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
End
We can use the np.arange(10) command to create an array that contains all the integers from 0 to 9. As a convention, I treat X as a two-dimensional array (matrix) and y as a one-dimensional target (vector). reshape(-1, 1) means we have only 1 feature as a column. Features are the attribute carriers that help the model to find unknown patterns.
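A quick shape check makes this convention visible; a minimal sketch that only repeats the two arrays from Listing 2:

import numpy as np

X = np.arange(10).reshape(-1, 1)   # 10 rows, 1 feature column
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

print(X.shape)  # (10, 1) -> two-dimensional matrix
print(y.shape)  # (10,)   -> one-dimensional target vector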
Now I define the model, which is immediately trained on the data with the fit method to create a relationship between the influencing variables (determinants) and the target:
Listing 3
# Once you have input and output prepared, define your classification model.
model = LogisticRegression(solver='liblinear', random_state=0)
model.fit(X, y)
print(model)
End
Now the model is set up, and I can use predict() to try a first classification with a score and immediately create the confusion matrix to validate it.
Needless to say, the implementation of ML-based solutions can lead to major cost savings, higher predictability, and increased availability of the systems.
Listing 4
print(model.predict(X))
print(model.score(X, y))
# One false positive prediction: the fourth observation is a zero that was wrongly predicted as one.
print(confusion_matrix(y, model.predict(X)))

Real:    [0 0 0 0 1 1 1 1 1 1]
Predict: [0 0 0 1 1 1 1 1 1 1]
Score: 0.9
Confusion Matrix:
0: [[3 1]  :4
1:  [0 6]] :6
And lo and behold, a false positive (false alarm) has crept in. The model mistakenly classified a 0 as a 1, as if the system raised a fire alarm in a quiet situation. The confusion matrix shows this as a false alarm (false positive).
Ideally, with a 100% score, the matrix would look like this:
0: [[4 0]  :4
1:  [0 6]] :6
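If you prefer the four cells as named values instead of reading them out of the matrix, scikit-learn lets you unpack them with ravel(); a small sketch, assuming y and the fitted model from Listings 2 and 3 are still in scope:

from sklearn.metrics import confusion_matrix

# unpack the 2x2 matrix into its four cells (rows: actual, columns: predicted)
tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)  # TN: 3  FP: 1  FN: 0  TP: 6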
The data set becomes an image
The next step is the visual preparation of the matrix in order to create an optical relationship between the real data and the predicted ones (from present to target).
Listing 5
cm = confusion_matrix(y, model.predict(X))
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(4, 4))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()
End

Fig. 1: Confusion matrix with Pyplot
file: logreg2cm2.png
This graphic can also be made simpler and more modern with an additional library. We need the Python library Seaborn, which is best installed directly in VS Code with pip install using the integrated command-line shell.
Listing 6
import seaborn as sns

# get the instance of confusion_matrix:
cm = confusion_matrix(y, model.predict(X))
sns.heatmap(cm, annot=True)
plt.title('heatmap confusion matrix')
plt.show()
End

Fig. 2: Confusion matrix with Seaborn
File: heatmapconfusionmatrix.png
Class 0 has 3 correct cases (true negatives) and class 1 has 6 correct cases (true positives). The user accuracy also reveals a single false positive result. User accuracy (consumer risk versus producer risk) relates to errors of commission, i.e. type I errors; type II errors are then the false negatives.
The .heatmap() function from the Seaborn library defines the type of diagram I am using; the following arguments parameterize the appearance of the diagram. Let us take a look at the error analysis, which is governed by the default probability threshold of 0.5. The discrimination between 0 and 1 took place too early, so our model classified a 0 as a 1 too soon. Of course, these so-called hyperparameters can be optimized to find a fairer distribution of the classification.
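To see the effect of the threshold, we can classify directly on predict_proba() instead of predict(); a sketch, assuming the model from Listing 3 (the value 0.7 is just an illustrative alternative threshold):

import numpy as np

# probability of class 1 for every sample
proba = model.predict_proba(X)[:, 1]

# predict() implicitly uses the 0.5 threshold
print((proba >= 0.5).astype(int))   # [0 0 0 1 1 1 1 1 1 1]

# a stricter threshold shifts the decision boundary to the right
print((proba >= 0.7).astype(int))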
It has to be said that the effect on discrete, dichotomous variables [0, 1] cannot be explained and verified with classic linear regression analysis.
Hyperparameters
The current distribution with the associated classification looks like this:

Fig. 3: The first 3 samples are counted as 0 and the rest as 1.
File: class_logplot2.png
Listing 7
sns.set(style='whitegrid')
sns.regplot(X, model.predict_proba(X)[:, 1], logistic=True,
            scatter_kws={"color": "red"}, line_kws={"color": "blue"})
# label=model.predict(X))
plt.title('Logistic Probability Plot')
plt.show()
Listing 7 passes the estimated probability as the target to the regplot function. Not every classifier offers these internal probabilities. The Naive Bayes classifier, which is named after the English mathematician Thomas Bayes, is also probabilistic²; it is derived from Bayes' theorem. The corresponding decision boundary is also visually recognizable for the analysis and helps to interpret the result or to find a better solver (see below):

Fig. 4: Decision Boundary with the false positive (blue dot in white area)
file: classifier_decision2.png
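For comparison, a probabilistic classifier such as Gaussian Naive Bayes can be dropped into the same pipeline; a minimal sketch with the same X and y (GaussianNB is my own addition here, it is not part of the original script):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

nb = GaussianNB().fit(X, y)

# class probabilities per sample, analogous to predict_proba of LogisticRegression
print(nb.predict_proba(X)[:, 1])
print(confusion_matrix(y, nb.predict(X)))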
Imagine a medical research institute proposing a screening to test a large group of people for the presence of a particular disease (which one is, for the moment, context-dependent). An important counterargument against such a screening is the false positive results, which we have to consider as a conditional probability:
T    precision    recall    f1-score    support    CM
0    1.00         0.75      0.86        4          [[3 1]
1    0.86         1.00      0.92        6           [0 6]]
Table 1: Classification Report
We can see from the table that there is 1 false positive case and no false negative case. This means that only in 86% of all cases does a positive result also correspond to a disease; the precision is calculated as follows: true positives / (true positives + false positives) =
6 / (6+1) = 0.8571 = 0.86
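The same numbers can be reproduced with the metric functions of scikit-learn instead of calculating them by hand; a sketch, reusing y and the fitted model from above:

from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X)
print(precision_score(y, y_pred))  # 6 / (6 + 1) = 0.86
print(recall_score(y, y_pred))     # 6 / (6 + 0) = 1.00
print(f1_score(y, y_pred))         # harmonic mean: 0.92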
It is therefore crucial to include the false positive cases in the accuracy of the tests (screening). By the way, similar examples of conditional probability can be found on the website “Lies with Statistics” [3]. I have again calculated and visualized such a case (from the field of mammography), where the false positives look more complex:

Fig. 5: Non-linear analysis of false positives in a hyperplane (Support Vector Machine)
File: cell_class_boundaries.png

Optimise with Optic
Now we want to bring the hyperparameters mentioned above into play; several of them exist, and they are part of the model evaluation.
- C is a positive floating point number (1.0 by default) that defines the relative strength of the regularization. Smaller values indicate a stronger regularization.
- Solver is a string (‘liblinear’ by default) that decides which solver is used to fit the model. Other options are ‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’.
- max_iter is an integer (100 by default) that defines the maximum number of iterations through the solver during model fitting.
Listing 8
model = LogisticRegression(solver='liblinear', C=1, random_state=0).fit(X, y)
# show more model details
print(model)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='warn',
                   n_jobs=None, penalty='l2', random_state=0, solver='liblinear',
                   tol=0.0001, verbose=0, warm_start=False)
The actual adjustment is simple and just means using a different solver:
model = LogisticRegression(solver='lbfgs', C=1, random_state=0).fit(X, y)
print(classification_report(y, model.predict(X)))
In Listing 8 above we can see the preset model parameters, which can of course be changed. However, I cannot directly determine the best value of a model hyperparameter for a specific problem. You can use empirical values, copy values that I have used for other problems, or try to find the best value by trial and error. I mainly tune the value C (the regularizer), the kernel or the solver.
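Trial and error can also be automated; a sketch with GridSearchCV, where the candidate values for C and solver are just examples and the two folds are owed to the tiny data set of ten samples:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# candidate values are only examples
param_grid = {'C': [0.1, 1.0, 10.0],
              'solver': ['liblinear', 'lbfgs']}

search = GridSearchCV(LogisticRegression(random_state=0, max_iter=200),
                      param_grid, cv=2)
search.fit(X, y)
print(search.best_params_, search.best_score_)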
The difference between parameters and hyperparameters:
Model hyperparameters have to be defined before training and cannot be learned by the model (e.g. learning rate, hidden layers, regularizer).
Model parameters are then learned by the model and are derived from the data (e.g. word frequency, weighting, bias, variance).
Hyperparameters are those which we supply to the model, for example the number of hidden nodes and layers, input features, learning rate, activation function etc. in a neural network, while parameters are those which are learned by the machine, like weights and biases.
In machine learning, a model M with parameters and hyper-parameters can be written as

Y ≈ M_H(Φ | D)

where Φ are the parameters and H are the hyper-parameters; D is the training data and Y is the output data (class labels in the case of a classification task). For prediction on new inputs X, the same model gives y ≈ M_H(Φ | X).
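In the logistic regression example, Φ corresponds to the learned coefficient and intercept, while H covers constructor arguments such as C and solver; a short sketch to inspect both on the fitted model:

# hyperparameters H: supplied to the constructor, visible via get_params()
print(model.get_params()['solver'], model.get_params()['C'])

# parameters Phi: learned from the data D during fit()
print(model.coef_)       # weight of the single feature
print(model.intercept_)  # bias term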
A model hyper parameter is a configuration that is external to the model and whose value cannot be estimated from data.
- They are often used in processes to help estimate model parameters.
- They are often specified by the practitioner.
- They can often be set using heuristics.
- They are often tuned for a given predictive modelling problem.
We cannot know the best value for a model hyper-parameter on a given problem. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error.
Once we have all this information, it becomes possible to decide which modelling strategy fits best with the available data and the desired output. The results are now optimal in terms of the quality of the algorithm for our number series, and they also stand up to a visual comparison with a decision tree.
There are multiple modelling strategies for predictive maintenance; here I describe those I have worked with the most, in terms of the question they aim to answer and the kind of data they require:
- Regression models to predict remaining useful lifetime (RUL)
- Classification models to predict failure within a given time window
- Ensemble Models
For this scenario, we need static and historical data, and every event must be labelled. Moreover, several events of each type of failure must be part of the dataset. Ideally, we prefer to build such models when the degradation process is linear [9].
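For the classification variant, the label is typically derived from the remaining time until the next failure; a hypothetical sketch with pandas, where the column name remaining_useful_life and the 30-cycle window are made-up examples:

import pandas as pd

# hypothetical maintenance records: cycles until the next recorded failure
df = pd.DataFrame({'machine_id': [1, 1, 2, 2],
                   'remaining_useful_life': [120, 25, 80, 10]})

# binary target: does the machine fail within the next 30 cycles?
df['fails_within_window'] = (df['remaining_useful_life'] <= 30).astype(int)
print(df)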

Fig. 6: Optimal decision of the classification
File: class_logplot3optsolver.png
The decision tree procedure in Listing 9 is a common option for regression or classification with a multivariate data set. I can use the procedure, for example, to classify the solvency of customers or to build a function that predicts false reports³ or fake news.
In practice, however, the procedure presents data scientists with major challenges regarding interpretation and overfitting (memorizing the trained examples), even though the tree itself offers transparent and legible graphics. For this I use Graphviz 2.38 installed alongside VS Code and an additional line in the code that sets the path information directly in the OS path. This way I can configure adjustments to another version or platform directly in the code.
Listing 9
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from converter import app, request
import unittest
import os

os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
os.environ["PATH"] += os.pathsep + 'C:/Program Files/Pandoc/'
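With the path set, the tree can be fitted on the same number series and exported as a DOT file for Graphviz; a sketch in which the file name tree.dot is just an example:

from sklearn.tree import DecisionTreeClassifier, export_graphviz

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict(X))

# write the tree structure to a DOT file, then render it with Graphviz:
#   dot -Tpng tree.dot -o tree.png
export_graphviz(tree, out_file='tree.dot',
                feature_names=['x'], class_names=['0', '1'], filled=True)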

Fig. 7: The confusion matrix no longer contains any misclassifications.
File: heatmapconfusionmatrix_solver.png
Important: the layout of the confusion matrix is unfortunately not standardized. In this example, the truth (“Real/Actual”) is in the rows and the estimate (“Predict”) in the columns (from present to target), but depending on the software used, the dimensions can be reversed. It seems important to me to start the matrix at 0, i.e. to standardize true negatives at the top left, see Fig. 8. For an N-class problem, the confusion matrix is an NxN matrix, so it is not limited to binary classification.

Fig. 8: A standardized confusion matrix. File: cm_mock_template.png
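To pin down the orientation independently of the plotting library, the labels argument of confusion_matrix fixes the row and column order, and the same call works for N classes; a small sketch with a made-up three-class example:

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

# rows = actual classes, columns = predicted classes, order fixed by labels
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))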
Jupyter Notebook in VS Code
Here is a look at the integration of Jupyter [6]. Jupyter (formerly IPython Notebook) is an open-source project with which I can easily combine interactive Markdown text and executable Python source code on a canvas known as a notebook. Visual Studio Code supports working with Jupyter notebooks as well as Python code files, and my experience with debugging and code metrics there has also been good.
To work with Jupyter notebooks, an Anaconda environment or another Python environment in VS Code is required, and the Jupyter package must be installed beforehand. This gives us the possibility to directly integrate graphics, document, or execute interactive code in VS Code:

Fig. 9: Work with Jupyter
File: vscode_jupyter_librosa_demo2.png

Fig. 10: With the terminal, images can also be controlled interactively in code!
File: vscode_jupyter_librosa_demo3.png
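In a plain .py file, the Python extension also recognizes # %% cell markers, so the same script can be run cell by cell like a notebook; a short sketch, assuming the Jupyter package is installed:

# %% [markdown]
# ## Confusion matrix as an interactive cell

# %%
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(LogisticRegression(solver='liblinear', random_state=0).fit(X, y).score(X, y))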
Listing 10 viper2\.vscode\settings.json
{"python.pythonPath": "C:\\Users\\Max\\AppData\\Local\\Programs\\Python\\Python37\\python.exe","python.testing.pytestArgs": ["freshonion"],"python.testing.unittestEnabled": false,"python.testing.nosetestsEnabled": false,"python.testing.pytestEnabled": false,"python.testing.unittestArgs": ["-v","-s","./freshonion","-p","*test.py"],"python.testing.promptToConfigure": false}
End
Listing 11 \viper2\.vscode\tasks.json
{
    // See https://go.microsoft.com/fwlink/?LinkId=733558
    // for the documentation about the tasks.json format
    // build from older win8.1 to win10.2 by max
    "version": "2.0.0",
    "tasks": [
        {
            "label": "buildpython",
            "type": "shell",
            "command": "C:\\Users\\Max\\AppData\\Local\\Programs\\Python\\Python37\\python.exe",
            "args": ["${file}"],
            "showOutput": "always",
            "problemMatcher": [],
            "group": {
                "kind": "build",
                "isDefault": true
            }
        }
    ]
}
End
Max Kleiner’s professional environment lies in the areas of machine learning, e-learning, OOP, UML and system architecture, where he works as a trainer, developer, consultant and publicist. His focus is on training, IT security, databases and event-driven frameworks. As a lecturer and consultant at a university of applied sciences and on behalf of a company, microcontrollers and IoT have also been added to his portfolio. His book “Patterns in C#”, published in 2003, is still up to date with the Clean Code Initiative.
https://basta.net/speaker/max-kleiner/
Links & Literature
[1] https://code.visualstudio.com/docs/python/data-science-tutorial
[2] http://www.softwareschule.ch/examples/logregclassifier2.py.txt
[3] https://de.statista.com/statistik/lexikon/definition/8/luegen_mit_statistiken/
[4] https://sourceforge.net/projects/cai/
[5] https://maxbox4.wordpress.com/blog/
[6] https://code.visualstudio.com/docs/python/jupyter-support
[7] https://github.com/maxkleiner/maXbox/blob/master/logisticregression2.ipynb
Literature of the Free Book:
[8] https://www.oreilly.com/programming/free/python-data-for-developers.csp
Appendix Source package for MS PowerBI: PBIDesktop_x64.msi
¹ It could also be patients 0 to 9 who are taking a medical test.
² The basic assumption of the naive Bayes classifier is (hence naive) that the characteristics used are strictly independent.
³ Fraud detection is a knowledge-intensive activity.