Exploratory Data Analysis
Data Understanding
Primary analysis methods
df.shape
df.head()
df.dtypes
df.describe()
df.columns
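As a quick illustration of these first-look calls, here is a minimal sketch on a small made-up dataframe (the column names and values are hypothetical). Note that shape, dtypes and columns are attributes, while head and describe are methods.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune", "Lyon"],
    "year": [2019, 2020, 2021, 2022],
    "score": [7.5, 6.0, 8.25, 9.0],
})

n_rows, n_cols = df.shape      # attribute: (rows, columns)
first_rows = df.head(2)        # method: first 2 rows
col_types = df.dtypes          # attribute: dtype of each column
summary = df.describe()        # method: summary statistics of the numeric columns
col_names = list(df.columns)   # attribute: column labels
```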
Data Preparation
Copy
When an operation changes a dataframe (e.g. subsetting columns or dropping rows), it is safer to store the result as a copy in a new dataframe than to modify the original in place.
df.copy()
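A minimal sketch of why the copy matters, using a hypothetical one-column dataframe: changes made to the copy do not reach the original.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"a": [1, 2, 3]})

df_clean = df.copy()                  # independent copy (deep by default)
df_clean["a"] = df_clean["a"] * 10    # modify only the copy
```

After this runs, df still holds the original values, so any step can be re-run from the untouched data.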
Datetime
It is convenient to convert date values to a datetime data type. This allows us to perform additional operations which would not be possible on a string data type.
pd.to_datetime(df['Column Name'])
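For example, once a column holds datetimes, the .dt accessor exposes date parts that a plain string column does not have. The sketch below assumes a hypothetical "date" column of ISO-formatted strings; to_datetime is a top-level pandas function applied to a column.

```python
import pandas as pd

# Hypothetical toy data: dates stored as strings
df = pd.DataFrame({"date": ["2021-01-15", "2021-02-20", "2021-03-05"]})

df["date"] = pd.to_datetime(df["date"])   # convert string column to datetime64
df["month"] = df["date"].dt.month         # numeric month extracted from each date
df["weekday"] = df["date"].dt.day_name()  # weekday name of each date
```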
Numeric
Sometimes, we want to change the string data type to numeric, e.g. year.
pd.to_numeric(df['Column Name'])
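A minimal sketch with a hypothetical "year" column stored as strings. The errors="coerce" option is one possible choice here: it turns values that cannot be parsed into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data: years stored as strings, one of them invalid
df = pd.DataFrame({"year": ["1999", "2005", "unknown"]})

# Unparseable entries become NaN, so the column ends up as a float dtype
df["year"] = pd.to_numeric(df["year"], errors="coerce")
```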
Isna
Summarize the number of null values in each column.
df.isna().sum()
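To illustrate, isna returns a boolean dataframe and sum then counts the True values column by column. The data below is made up.

```python
import pandas as pd

# Hypothetical toy data with missing values in both columns
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", None, None]})

null_counts = df.isna().sum()   # Series: number of nulls per column
```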
Duplicated
To find duplicates in a column, pass the boolean mask returned by the duplicated method to loc, naming the column label in the subset parameter.
df.loc[df.duplicated(subset = ['Column Name'])]
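A minimal sketch on hypothetical data. By default duplicated flags only the later occurrences; passing keep=False flags every row that shares a value.

```python
import pandas as pd

# Hypothetical toy data: "Ann" appears twice
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cy"], "age": [30, 25, 31, 40]})

# Rows whose 'name' already appeared earlier (first occurrence not flagged)
dupes = df.loc[df.duplicated(subset=["name"])]

# Every row whose 'name' occurs more than once
all_dupes = df.loc[df.duplicated(subset=["name"], keep=False)]
```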
Unique
Check the unique values in a given column.
pd.unique(df['Column Name'])
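For instance, on a hypothetical "color" column, pd.unique returns the distinct values in order of first appearance, and nunique gives their count.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

values = pd.unique(df["color"])    # distinct values, in order of first appearance
n_values = df["color"].nunique()   # number of distinct values
```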
Query
We can filter rows where a given column equals a given value with the query method. Column names that contain spaces must be wrapped in backticks.
df.query('`Column Name` == "Value"')
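A minimal sketch on made-up data. Because the column label contains a space, it is wrapped in backticks inside the query string; the boolean-mask form on the last line selects the same rows.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Column Name": ["Value", "Other", "Value"], "n": [1, 2, 3]})

matches = df.query('`Column Name` == "Value"')   # rows where the column equals "Value"
same = df[df["Column Name"] == "Value"]          # equivalent boolean-mask filter
```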
Reset index
We often want the index to stay sequential after an operation such as dropping rows. With drop=True, the old index is discarded rather than inserted into the dataframe as an additional column.
df.reset_index(drop = True)
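The effect of the drop parameter can be seen in a small sketch (hypothetical data): without it, the old index comes back as a column named "index".

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"a": [10, 20, 30, 40]})

kept = df.drop(index=[1, 2])          # remaining index labels have a gap
reset = kept.reset_index(drop=True)   # fresh sequential index, old one discarded
with_old = kept.reset_index()         # old index kept as an extra 'index' column
```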
Univariate Analysis
Plotting Feature Distributions
- Histogram
- KDE
- Boxplot
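The distribution plots listed above can be drawn with the pandas plot method, which relies on Matplotlib. The following is only a sketch on a hypothetical "score" column; the Agg backend is an assumption made so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with one outlier
df = pd.DataFrame({"score": [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0]})

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
df["score"].plot(kind="hist", bins=5, ax=ax_hist, title="Histogram")
df["score"].plot(kind="box", ax=ax_box, title="Boxplot")
# df["score"].plot(kind="kde") draws a density estimate; it additionally needs SciPy
fig.tight_layout()
```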
# Bar Plot
df['Column Name'].value_counts().plot(kind = 'bar', title = 'title')
# Horizontal Bar Plot
df['Column Name'].value_counts().plot(kind = 'barh', title = 'title')
Feature Relationships
- Scatterplot
- Heat-map Correlation
sns.heatmap(df.corr(numeric_only=True), annot=True)
- Pairplot
sns.pairplot(df, vars=['Columns'])
- Group-by Comparisons
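Group-by comparisons have no example in these notes; a minimal sketch, assuming a hypothetical categorical "team" column and a numeric "points" column, could look like this:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 10],
})

# Compare groups by the mean and the number of observations per team
by_team = df.groupby("team")["points"].agg(["mean", "count"])
```

The resulting table has one row per team, which can be plotted as a bar chart in the same way as the value counts above.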