It enables us to dabble in vicarious vice and to sit in smug judgment on the result.

~ Online Quote Generator
First, i hope everyone is safe. Second, i haven't written a Snake_Byte [ ] in quite some time, so here goes. This is a library i ran across late last night, and for what it achieves, even just for data exploration, it is well worth the pip install dabl cost of it all.
Data analysis is an essential task in the field of machine learning and artificial intelligence. However, it can be a challenging and time-consuming task, especially for those who are not familiar with programming. That's where the dabl library comes into play.
dabl, short for Data Analysis Baseline Library, is a high-level data analysis library in Python, designed to make data analysis as easy and effortless as possible. It is an open-source library, developed and maintained by the scikit-learn community.
The library provides a collection of simple and intuitive functions for exploring, cleaning, transforming, and visualizing data. With dabl, users can perform various data analysis tasks such as regression, classification, clustering, anomaly detection, and more, with just a few lines of code.
One of the main benefits of dabl is that it helps users get started quickly by providing sensible defaults for each task. For example, to perform a regression analysis, users can simply instantiate a SimpleRegressor, pass in their data, and dabl will take care of the rest.
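Here is a minimal sketch of that defaults-first workflow (my own illustration, not from the post; it assumes the Ames housing data that ships with dabl and its SalePrice target column):

import dabl

# load the bundled Ames housing data (a regression task)
ames = dabl.datasets.load_ames()
# SimpleRegressor tries a portfolio of baseline models with sane defaults
fr = dabl.SimpleRegressor()
fr.fit(ames, target_col="SalePrice")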
Another advantage of dabl is that it provides easy-to-understand visualizations, allowing users to quickly interpret the results of their analysis and make informed decisions based on the data. This is particularly useful for non-technical users who may not be familiar with complex mathematical models or graphs.
dabl also integrates well with other popular data analysis libraries such as pandas, numpy, and matplotlib, making it a convenient tool for those already familiar with these libraries.
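A tiny interop sketch of my own: dabl consumes plain pandas DataFrames and renders with ordinary matplotlib figures, so the usual tooling applies before and after (the synthetic frame here is just an assumption for illustration):

import dabl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "x2": rng.normal(size=200)})
df["label"] = (df.x1 + df.x2 > 0).astype(int)

dabl.plot(df, target_col="label")  # dabl picks sensible default plots
plt.show()                         # ...rendered as regular matplotlib figures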
So let us jump into the code, shall we?
This code uses the dabl library to explore the Titanic dataset and fit a baseline classifier. The dataset is loaded using the pandas library, cleaned with dabl.clean, and passed to dabl.SimpleClassifier for analysis. The fit method fits the model to the data, reporting metrics for each candidate model it tries, and the dabl.plot function is used to visualize the dataset against the target.
import dabl
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

# check shape, columns, etc.
titanic.shape
titanic.head()

# all that is good, tons of stuff going on here, but now let us ask dabl what's up:
titanic_clean = dabl.clean(titanic, verbose=1)

# a cool call to detect types
types = dabl.detect_types(titanic_clean)
print(types)

# let's do some eye candy
dabl.plot(titanic, 'survived')

# let's check the distribution
plt.show()

# let us try a simple classifier; if it works it works
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
Ok, so let's break this down a little.
We load the data set (make sure the target directory is the same):
# Load the Titanic dataset from the disk
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
Of note, we loaded this into a pandas dataframe. Assuming we can use Python and load a comma-separated values file, let's now do some exploration:
#check shape columns etc
titanic.shape
titanic.head()
You should see the following:
(1309, 14)
Which is [1309 rows x 14 columns]
and then:
      pclass  survived                                              name  \
0          1         1                     Allen, Miss. Elisabeth Walton
1          1         1                    Allison, Master. Hudson Trevor
2          1         0                      Allison, Miss. Helen Loraine
3          1         0              Allison, Mr. Hudson Joshua Creighton
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
...      ...       ...                                               ...
1304       3         0                              Zabour, Miss. Hileni
1305       3         0                             Zabour, Miss. Thamine
1306       3         0                         Zakarian, Mr. Mapriededer
1307       3         0                               Zakarian, Mr. Ortin
1308       3         0                                Zimmerman, Mr. Leo

         sex     age  sibsp  parch  ticket      fare    cabin embarked boat  \
0     female      29      0      0   24160  211.3375       B5        S    2
1       male  0.9167      1      2  113781    151.55  C22 C26        S   11
2     female       2      1      2  113781    151.55  C22 C26        S    ?
3       male      30      1      2  113781    151.55  C22 C26        S    ?
4     female      25      1      2  113781    151.55  C22 C26        S    ?
...      ...     ...    ...    ...     ...       ...      ...      ...  ...
1304  female    14.5      1      0    2665   14.4542        ?        C    ?
1305  female       ?      1      0    2665   14.4542        ?        C    ?
1306    male    26.5      0      0    2656     7.225        ?        C    ?
1307    male      27      0      0    2670     7.225        ?        C    ?
1308    male      29      0      0  315082     7.875        ?        S    ?

      body                        home.dest
0        ?                     St Louis, MO
1        ?  Montreal, PQ / Chesterville, ON
2        ?  Montreal, PQ / Chesterville, ON
3      135  Montreal, PQ / Chesterville, ON
4        ?  Montreal, PQ / Chesterville, ON
...    ...                              ...
1304   328                                ?
1305     ?                                ?
1306   304                                ?
1307     ?                                ?
1308     ?                                ?

[1309 rows x 14 columns]
Wow, tons of stuff going on here, and really this is cool data from an awful disaster. Ok, let's have dabl exercise some muscle here and ask it to clean things up a bit:
titanic_clean = dabl.clean(titanic, verbose=1)
types = dabl.detect_types(titanic_clean)
print(types)
i set verbose=1 in this case, and dabl.detect_types() shows the types detected, which i found helpful:
Detected feature types:
continuous 0
dirty_float 3
low_card_int 2
categorical 5
date 0
free_string 4
useless 0
dtype: int64
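A small aside of mine: the frame detect_types() returns is ordinary pandas, so you can slice columns with it. A sketch, not from the post:

# pull out just the columns dabl flagged as categorical
cat_cols = types.index[types["categorical"]]
print(list(cat_cols))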
However, look at what dabl did for us:
continuous dirty_float low_card_int categorical \
pclass False False False True
survived False False False True
name False False False False
sex False False False True
sibsp False False True False
parch False False True False
ticket False False False False
cabin False False False False
embarked False False False True
boat False False False True
home.dest False False False False
age_? False False False True
age_dabl_continuous True False False False
fare_? False False False False
fare_dabl_continuous True False False False
body_? False False False True
body_dabl_continuous True False False False
date free_string useless
pclass False False False
survived False False False
name False True False
sex False False False
sibsp False False False
parch False False False
ticket False True False
cabin False True False
embarked False False False
boat False False False
home.dest False True False
age_? False False False
age_dabl_continuous False False False
fare_? False False True
fare_dabl_continuous False False False
body_? False False False
body_dabl_continuous False False False
Target looks like classification
Linear Discriminant Analysis training set score: 0.578
Ah, sweet! So data science, machine learning, or data mining is 80% cleaning up the data. Take what you can get and go with it, folks. dabl even informs us that the target looks like a classification problem. As the name suggests, classification means classifying the data on some grounds; it is a type of supervised learning. In classification, the target column should be a categorical column. If the target has only two categories, like the one in the dataset above (survived / did not survive), it's called a binary classification problem. When there are more than two categories, it's a multi-class classification problem. The "target" column is also called the "class" in a classification problem.
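You can sanity-check that on our own target column (a quick sketch of mine):

# two distinct values in the target -> a binary classification problem
print(titanic_clean["survived"].nunique())   # 2
print(titanic_clean["survived"].unique())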
Now let's do some analysis. Yep, we are just getting to some statistics. There are univariate and bivariate analyses in this case.
Bivariate analysis is the simultaneous analysis of two variables. It explores the relationship between two variables: whether an association exists and how strong it is, or whether there are differences between the two variables and how significant those differences are.
The main three types we will see here are (a hand-rolled sketch follows the list):
- Categorical vs. Numerical
- Numerical vs. Numerical
- Categorical vs. Categorical
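Here is a quick hand-rolled look at those three flavors on the same Titanic frame (my own sketch; the dabl.plot call below automates all of this):

import matplotlib.pyplot as plt
import pandas as pd

fare = pd.to_numeric(titanic["fare"], errors="coerce")  # '?' becomes NaN
age = pd.to_numeric(titanic["age"], errors="coerce")

# categorical vs. numerical: fare distribution per passenger class
titanic.assign(fare=fare).boxplot(column="fare", by="pclass")

# numerical vs. numerical: age against fare
plt.figure()
plt.scatter(age, fare, s=5)
plt.xlabel("age")
plt.ylabel("fare")

# categorical vs. categorical: survival counts by sex
print(pd.crosstab(titanic["sex"], titanic["survived"]))
plt.show()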
Also of note, Linear Discriminant Analysis, or LDA, is a dimensionality reduction technique used as a pre-processing step in machine learning. The goal of LDA is to project features in a higher-dimensional space onto a lower-dimensional space, in order to avoid the curse of dimensionality and to reduce resource and computational costs. The original technique was developed in 1936 by Ronald A. Fisher and was named Linear Discriminant or Fisher's Discriminant Analysis.
(NOTE: there is another LDA, Latent Dirichlet Allocation, used in semantic engineering, which is quite different.)
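A minimal, self-contained sketch of the dimensionality-reduction LDA with scikit-learn (my illustration, using the iris data rather than the Titanic frame):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
# project the 4-D features onto 2 discriminant axes, using the labels
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)
print(X_2d.shape)  # (150, 2)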
dabl.plot(titanic, 'survived')
What auto-magically happens in the following plots is a set of continuous feature plots for the discriminant analysis.
In the plots you will also see PCA (Principal Component Analysis). PCA was invented in 1901 by Karl Pearson as an analog of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also called the discrete Karhunen-Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, and proper orthogonal decomposition (POD) in mechanical engineering. PCA is used extensively in many fields, and my first usage of it was in 1993 for three-dimensional rendering of sound.
What is old is new again.
The main difference is that Linear Discriminant Analysis is a supervised dimensionality reduction technique that also achieves classification of the data simultaneously; LDA focuses on finding a feature subspace that maximizes the separability between the groups. Principal Component Analysis, by contrast, is an unsupervised dimensionality reduction technique that ignores the class labels; PCA focuses on capturing the directions of maximum variation in the data set.
Both reduce the dimensionality of the dataset, and with it the computational cost, and both form a new set of components.
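And the unsupervised counterpart on the same data, for contrast (again my own sketch):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
# directions of maximum variance; the labels are never consulted
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # first axis carries most of the variance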
The last plot is categorical versus target.
So now let's try, as dabl suggested, a SimpleClassifier, and then fit the model to the data. (hey, some machine learning!)
fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
This should produce the following outputs with accuracy metrics:
Running DummyClassifier(random_state=0)
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
Running GaussianNB()
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968
=== new best GaussianNB() (using recall_macro):
accuracy: 0.970 average_precision: 0.975 roc_auc: 0.984 recall_macro: 0.964 f1_macro: 0.968
Running MultinomialNB()
accuracy: 0.964 average_precision: 0.988 roc_auc: 0.990 recall_macro: 0.956 f1_macro: 0.961
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0) (using recall_macro):
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.969 average_precision: 0.965 roc_auc: 0.983 recall_macro: 0.965 f1_macro: 0.967
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01,
random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000,
random_state=0)
accuracy: 0.974 average_precision: 0.991 roc_auc: 0.993 recall_macro: 0.970 f1_macro: 0.972
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.975 average_precision: 0.991 roc_auc: 0.994 recall_macro: 0.971 f1_macro: 0.973
Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
Best Scores:
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
This actually calls the sklearn routines in aggregate. Looks like a depth-one decision tree (a decision stump) wins on recall_macro, with our old friend logistic regression right behind it on the other metrics. Keep it simple, sam; it ain't gotta be complicated.
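If you want to see roughly what the winning model is doing without dabl's wrapper, here is a rough hand-rolled equivalent (my own sketch; the get_dummies stand-in is far cruder than dabl's actual preprocessing):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# crude stand-in for dabl's cleaning/encoding: one-hot a couple of columns
X_num = pd.get_dummies(titanic_clean[["pclass", "sex"]])
stump = DecisionTreeClassifier(class_weight="balanced", max_depth=1,
                               random_state=0)
print(cross_val_score(stump, X_num, y, scoring="recall_macro").mean())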
In conclusion, dabl is a highly recommended library for those looking to simplify their data analysis tasks. With its intuitive functions and visualizations, it provides a quick and easy way to perform data analysis, making it an ideal tool for both technical and non-technical users. Again, the real strength of dabl is in providing simple interfaces for data exploration. For more information:
dabl github. <- click here
Until Then,
#iwishyouwater <- hold your breath on a dive with my comrade at arms @corepaddleboards. Great video, and the clarity was astounding.
Muzak To Blog By: "Ballads For Two", Chet Baker and Wolfgang Lackerschmid; trumpet meets vibraphone sparsity. The space between the notes is where all of the action lives.