Exploratory data analysis using Python Pandas and SQL
The Pandas Python library is becoming more and more popular among data scientists and analysts. It allows you to quickly load, process, transform, analyze, and visualize data.
When you work with Pandas, the most important thing to understand is that there are two main data structures — Series and DataFrame:
- Series is a one-dimensional indexed array which can hold any data type (integer, float, etc.).
- DataFrame is the primary Pandas data structure. It is two-dimensional (rows and columns), and every column may contain a different data type. You may also pass an index and additional columns when constructing a DataFrame.
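As a quick illustration of the two structures (the values and column names here are made up for the example, not taken from the Reddit dataset):

```python
import pandas as pd

# A Series: one-dimensional, with an index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: two-dimensional; columns may hold different data types
df_example = pd.DataFrame({
    "title": ["post one", "post two"],  # object (string) column
    "score": [42, 7],                   # integer column
})
```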
Below I will use some common commands for exploratory data analysis using Pandas and SQL for a public sample of random Reddit posts.
Importing the packages
We start by importing the necessary packages which will be used for our data analysis:
import numpy as np #linear algebra
import pandas as pd #data processing
import seaborn as sns #statistical graph package
import matplotlib.pyplot as plt #plot package for visualisations
import pandasql as ps #sql package
You will need the pandasql package only if you are going to run SQL queries. If you are working with visuals, you may want to choose a plot style:
#plt.style.use('bmh') #setting up 'bmh' as "Bayesian Methods for Hackers" style sheet
plt.style.use('ggplot') #R ggplot style
To see all available plot styles, you can run:
print(plt.style.available)
Reading the data
As I mentioned above, I’ll be running the analysis on a publicly released dataset of random Reddit posts published on Kaggle. You can find it by clicking on this Kaggle link.
Reading CSV file in Pandas:
df = pd.read_csv('/kaggle/input/dataisbeautiful/r_dataisbeautiful_posts.csv')
We can see that there are 170,674 entries in the given dataset. Note that not all columns are complete; there are some missing/null values. You can flag missing or non-missing values with df.isnull() or df.notnull() respectively, both of which return boolean masks you can filter on.
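For example, you can combine isnull() with sum() to count missing values per column. A minimal sketch on a small made-up frame (the real Reddit DataFrame works the same way):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    "score": [1, 2, np.nan],
    "removed_by": ["moderator", None, None],
})

# isnull() returns a boolean mask; summing it counts the nulls per column
missing_counts = toy.isnull().sum()
```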
Exploratory data analysis is often a first brief look at your dataset which helps you understand its structure, form, size, and discover patterns. Below I’ll demonstrate a few common commands for EDA and will show a way how to run SQL statements in Pandas.
We can start with running basic DataFrame exploratory commands:
df.describe() #or df.count()
Now we know that the DataFrame we’re working with contains 12 columns with boolean, float, integer, and Python object data types. We can also see which columns have missing values and get a basic numerical distribution for the score, num_comments, and total_awards_received columns.
Additionally, we can dig deeper into the columns by running built-in statistical methods such as mean(), sum(), or max(), or by inspecting attributes such as shape and dtypes. These can be applied to the whole DataFrame or to each column separately:
df.score.median()
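On a toy frame (the column name is illustrative), per-column statistics and attributes look like this:

```python
import pandas as pd

toy = pd.DataFrame({"score": [1, 3, 5, 100]})

median_score = toy["score"].median()  # middle of the sorted values: (3 + 5) / 2
mean_score = toy["score"].mean()      # arithmetic average: 109 / 4
dims = toy.shape                      # (rows, columns) tuple, an attribute, not a method
```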
Running SQL in Pandas
One of the biggest advantages of the Pandas library is that it can work well with SQL and tabular data. One of the ways to run SQL statements is to import pandasql package and call the following commands:
q1 = """SELECT removed_by, count(distinct id)as number_of_removed_posts
where removed_by is not null
group by removed_by
order by 2 desc """grouped_df = ps.sqldf(q1, locals())
This returns the following DataFrame:
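The same aggregation can also be expressed in plain Pandas with groupby, without pandasql. A sketch on a small made-up frame standing in for the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "removed_by": ["moderator", "moderator", "user", None],
})

equivalent = (
    toy.dropna(subset=["removed_by"])          # where removed_by is not null
       .groupby("removed_by")["id"].nunique()  # count(distinct id) ... group by removed_by
       .sort_values(ascending=False)           # order by 2 desc
       .reset_index(name="number_of_removed_posts")
)
```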
Visualizing data using SQL output:
Let’s build a bar chart from the output from the code above to visualize our data:
removed_by = grouped_df['removed_by'].tolist()
number_of_removed_posts = grouped_df['number_of_removed_posts'].tolist()
plt.bar(removed_by, number_of_removed_posts)
plt.ylabel("Number of deleted reddits")
plt.show()
From the chart above we can see that, of all deleted reddits, 68% were removed by moderators and 6% by users.
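Percentages like these can be derived directly from the grouped counts. A sketch with hypothetical numbers (the categories and counts here are made up, not the real dataset's values):

```python
import pandas as pd

counts = pd.DataFrame({
    "removed_by": ["moderator", "deleted", "user"],
    "number_of_removed_posts": [680, 260, 60],  # hypothetical counts
})

# Each group's share of all removed posts, as a percentage
total = counts["number_of_removed_posts"].sum()
counts["share_pct"] = (counts["number_of_removed_posts"] / total * 100).round(1)
```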
There are many types of plots and analyses, from basic statistics to complex visuals and forecasts, that you can build and perform in Pandas. You can see a more complete analysis on Kaggle here.