Path: blob/master/SQL for Data Science/Analzing real world data set using Python and SQLpy.ipynb
3074 views
Lab: Analyzing a real world data-set with SQL and Python
Introduction
This notebook shows how to store a dataset into a database using and analyze data using SQL and Python. In this lab you will:
Understand a dataset of selected socioeconomic indicators in Chicago
Learn how to store data in an Db2 database on IBM Cloud instance
Solve example problems to practice your SQL skills
Selected Socioeconomic Indicators in Chicago
The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal. This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.
Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.
A detailed description of the dataset can be found on the city of Chicago's website, but to summarize, the dataset has the following variables:
Community Area Number (
ca
): Used to uniquely identify each row of the datasetCommunity Area Name (
community_area_name
): The name of the region in the city of ChicagoPercent of Housing Crowded (
percent_of_housing_crowded
): Percent of occupied housing units with more than one person per roomPercent Households Below Poverty (
percent_households_below_poverty
): Percent of households living below the federal poverty linePercent Aged 16+ Unemployed (
percent_aged_16_unemployed
): Percent of persons over the age of 16 years that are unemployedPercent Aged 25+ without High School Diploma (
percent_aged_25_without_high_school_diploma
): Percent of persons over the age of 25 years without a high school educationPercent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (
percent_aged_under_18_or_over_64
): (ie. dependents)Per Capita Income (
per_capita_income_
): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total populationHardship Index (
hardship_index
): Score that incorporates each of the six selected socioeconomic indicators
In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.
Connect to the database
Let us first load the SQL extension and establish a connection with the database
Store the dataset in a Table
In many cases the dataset to be analyzed is available as a .CSV (comma separated values) file, perhaps on the internet. To analyze the data using SQL, it first needs to be stored in the database.
We will first read the dataset source .CSV from the internet into pandas dataframe
Then we need to create a table in our Db2 database to store the dataset. The PERSIST command in SQL "magic" simplifies the process of table creation and writing the data from a pandas
dataframe into the table
You can verify that the table creation was successful by making a basic query like:
Problems
Problem 1
How many rows are in the dataset?
Problem 2
How many community areas in Chicago have a hardship index greater than 50.0?
Problem 3
What is the maximum value of hardship index in this dataset?
Problem 4
Which community area which has the highest hardship index?
Double-click here for the solution.
Problem 5
Which Chicago community areas have per-capita incomes greater than $60,000?
Double-click here for the solution.
Problem 6
Create a scatter plot using the variables per_capita_income_
and hardship_index
. Explain the correlation between the two variables.
Insights
You can see that as Per Capita Income rises as the Hardship Index decreases. We see that the points on the scatter plot are somewhat closer to a straight line in the negative direction, so we have a negative correlation between the two variables. -->