YData Profiling: Streamlining Data Analysis

Bryan Paget
3 min read · Nov 29, 2023


YData Profiling is a tool that simplifies data exploration for statisticians and data scientists. It provides a one-line Exploratory Data Analysis (EDA) solution: an extended analysis of your DataFrame that is as convenient to call as pandas’ df.describe(). The resulting report can be exported to formats such as HTML and JSON.

Quickstart:

Getting started with YData Profiling is straightforward. Begin by installing the library from a notebook cell with the following command (the %%capture magic hides pip's output):

%%capture
! pip install -U ydata-profiling ipywidgets scikit-learn

Key Features:

  • Type Inference: Automatically detects columns’ data types, including Categorical, Numerical, Date, etc.
  • Warnings: Summarizes potential data quality issues such as missing data, inaccuracies, skewness, etc.
  • Univariate Analysis: Provides descriptive statistics and informative visualizations like distribution histograms.
  • Multivariate Analysis: Includes correlations, detailed analysis of missing data, duplicate rows, and visual support for pairwise interactions.
  • Time-Series Analysis: Offers statistical information relative to time-dependent data, along with auto-correlation and seasonality insights.
  • Text and File Analysis: Includes text category detection and file/image analysis for comprehensive exploration.
  • Comparing Datasets: Enables a fast and complete report on the comparison of datasets with a single line of code.
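To make the Type Inference and Warnings bullets more concrete, here is a minimal pure-Python sketch of the idea. It is not YData Profiling's implementation: the function names, the set of missing-value markers, the single date format, and both thresholds are assumptions chosen for illustration.

```python
from datetime import datetime
from statistics import mean, pstdev

# Hypothetical markers treated as "missing" in this sketch
MISSING = {"", "NA", "NaN", None}


def infer_type(values):
    """Guess a coarse type for a column of raw strings."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "Unsupported"

    def all_parse(parser):
        # True only if every non-missing value parses without error
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "Numerical"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "Date"
    return "Categorical"


def skewness(xs):
    """Population skewness: mean of cubed z-scores."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        return 0.0
    return mean(((x - m) / s) ** 3 for x in xs)


def column_warnings(values, missing_threshold=0.05, skew_threshold=1.0):
    """Return warning labels for one column (thresholds are assumptions)."""
    warnings = []
    present = [v for v in values if v not in MISSING]
    if 1 - len(present) / len(values) > missing_threshold:
        warnings.append("MISSING")
    if infer_type(values) == "Numerical":
        if abs(skewness([float(v) for v in present])) > skew_threshold:
            warnings.append("SKEWED")
    return warnings


# Hypothetical column: one missing entry and one large outlier
hp = ["45", "60", "80", "", "255", "50", "35", "70"]
print(infer_type(hp), column_warnings(hp))
```

The real report applies many more checks per column; the point here is only that each warning boils down to a statistic compared against a threshold.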

Example Code:

For a practical demonstration, consider the following example generating a standard report for a Pokémon dataset:

Generating a Standard Report

This cell fetches a Pokémon dataset and displays a ProfileReport from ydata_profiling. The report includes some additional correlation calculations (see the correlations argument below). The call profile_report.to_file("../output/pokemon.html") saves the report to an HTML file, which you can open in a new browser tab.

import pandas as pd

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

# Download the dataset once and reuse the cached copy on later runs
file_name = cache_file(
    "pokemon.csv",
    "https://raw.githubusercontent.com/bryanpaget/html/main/pokemon.csv",
)

pokemon_df = pd.read_csv(file_name)

profile_report = ProfileReport(
    pokemon_df,
    sort=None,
    html={"style": {"full_width": True}},
    progress_bar=False,
    # Choose which correlation measures the report computes
    correlations={
        "auto": {"calculate": True},
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": True},
        "cramers": {"calculate": True},
    },
    explorative=True,
    title="Profiling Report",
)

# Save the report as an HTML file
profile_report.to_file("../output/pokemon.html")

# In a notebook, the last expression renders the report inline
profile_report
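The correlations argument above enables "cramers", a measure of association between two categorical columns. As a rough illustration of what such a measure computes, here is a sketch of the plain textbook Cramér's V in pure Python. The function name and toy data are hypothetical, and the library may use a different variant (for example, one with a bias correction), so treat this as background rather than a description of its internals.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Textbook Cramér's V between two equally long categorical sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    pair_counts = Counter(zip(xs, ys))

    k = min(len(x_counts), len(y_counts))
    if k < 2:
        raise ValueError("each column needs at least two distinct categories")

    # Chi-squared statistic over the full contingency table
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = pair_counts[(x, y)]
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * (k - 1)))


# Hypothetical toy columns: the second is fully determined by the first
type_1 = ["Fire", "Water", "Grass"] * 4
color = ["Red", "Blue", "Green"] * 4
print(round(cramers_v(type_1, color), 3))
```

Values near 0 suggest the two columns are close to independent, while values near 1 suggest that one column largely determines the other.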

Comparing Datasets

We can also generate reports comparing two datasets. The example below compares training and test Pokémon datasets, which are created with train_test_split from scikit-learn.

import pandas as pd
from sklearn.model_selection import train_test_split

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

file_name = cache_file(
    "pokemon.csv",
    "https://raw.githubusercontent.com/bryanpaget/html/main/pokemon.csv",
)

pokemon_df = pd.read_csv(file_name)

# Numeric stats are the features; the two type columns are the targets
X = pokemon_df[['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
y = pokemon_df[['Type 1', 'Type 2']]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Profile each split separately (only the feature columns are profiled)
train_df = X_train
train_report = ProfileReport(train_df, title="Train")

test_df = X_test
test_report = ProfileReport(test_df, title="Test")

# Combine the two profiles into a single side-by-side report
comparison_report = train_report.compare(test_report)
comparison_report.to_file("../output/comparison.html")

comparison_report
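Conceptually, a comparison report places the same statistics for two datasets side by side so that differences between the splits stand out. The sketch below illustrates that idea for a single numeric column using only the standard library. The helper names, the hand-rolled split (which is not scikit-learn's algorithm), and the toy values are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def split_rows(values, test_size=0.33, seed=42):
    """Shuffle a copy of `values` and cut it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def compare_splits(train, test):
    """Return side-by-side summary statistics for two numeric samples."""
    return {
        "count": (len(train), len(test)),
        "mean": (mean(train), mean(test)),
        "std": (pstdev(train), pstdev(test)),
        "min": (min(train), min(test)),
        "max": (max(train), max(test)),
    }


# Hypothetical stand-in for a numeric column such as HP
hp = [35 + (7 * i) % 90 for i in range(100)]
train_hp, test_hp = split_rows(hp)

for name, (train_value, test_value) in compare_splits(train_hp, test_hp).items():
    print(f"{name:>5}  train={train_value:8.2f}  test={test_value:8.2f}")
```

If the two columns of numbers diverge noticeably, the split may not be representative, which is the kind of issue a train/test comparison report is meant to surface.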

Time-Series Data Analysis:

For time-series datasets, YData Profiling introduces a specialized time-series mode, as illustrated in the analysis of Microsoft’s stock price:


import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

file_name = cache_file(
    "msft.csv",
    "https://raw.githubusercontent.com/bryanpaget/html/main/msft.csv",
)

msft_df = pd.read_csv(file_name)
# Parse the date strings so the column can be used for chronological sorting
msft_df["Date"] = pd.to_datetime(msft_df["Date"])

# Set tsmode=True to automatically identify time-series variables
# Pass the column that defines the chronological order of the series as sortby
profile = ProfileReport(msft_df, tsmode=True, sortby="Date", title="Time-Series EDA")

profile.to_file("../output/msft-report-timeseries.html")

profile
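The time-series report includes auto-correlation insights. For readers unfamiliar with the term, the following sketch computes the sample autocorrelation of a series at a given lag using only the standard library. The function name and the toy series are hypothetical (this is not the MSFT data), and the code is not taken from YData Profiling.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive integer `lag`."""
    if not 0 < lag < len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Covariance between the series and a copy of itself shifted by `lag`
    numer = sum(
        (series[t] - m) * (series[t + lag] - m)
        for t in range(len(series) - lag)
    )
    return numer / denom


# Hypothetical toy series that repeats every 4 steps
closing = [10, 12, 14, 12] * 6
for lag in (1, 2, 4):
    print(lag, round(autocorrelation(closing, lag), 3))
```

For this toy series the value at lag 4 is strongly positive, because the pattern repeats every four steps, and the value at lag 2 is strongly negative. Peaks like these at regular lags are what hint at seasonality in an auto-correlation plot.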

Explore YData Profiling to make your data analysis workflow more seamless and comprehensive, with a reporting experience that will feel familiar to users of Excel, Power BI, and SAS.
