Projects

Socioeconomic Indicators in Chicago Data Analysis (using sqlite3)

DaTALK 2021. 4. 25.

The city of Chicago released a dataset of socioeconomic data to the Chicago City Portal. This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” for each Chicago community area, for the years 2008 – 2012.

Scores on the hardship index can range from 1 to 100, with a higher index number representing a greater level of hardship.

A detailed description of the dataset can be found on the city of Chicago's website, but to summarize, the dataset has the following variables:

  • Community Area Number (ca): Used to uniquely identify each row of the dataset
  • Community Area Name (community_area_name): The name of the region in the city of Chicago
  • Percent of Housing Crowded (percent_of_housing_crowded): Percent of occupied housing units with more than one person per room
  • Percent Households Below Poverty (percent_households_below_poverty): Percent of households living below the federal poverty line
  • Percent Aged 16+ Unemployed (percent_aged_16_unemployed): Percent of persons over the age of 16 years that are unemployed
  • Percent Aged 25+ without High School Diploma (percent_aged_25_without_high_school_diploma): Percent of persons over the age of 25 years without a high school education
  • Percent Aged Under 18 or Over 64:Percent of population under 18 or over 64 years of age (percent_aged_under_18_or_over_64): (ie. dependents)
  • Per Capita Income (per_capita_income_): Community Area per capita income is estimated as the sum of tract-level aggragate incomes divided by the total population
  • Hardship Index (hardship_index): Score that incorporates each of the six selected socioeconomic indicators

In this Lab, we'll take a look at the variables in the socioeconomic indicators dataset and do some basic analysis with Python.

 

- the dataset looks like:

 

 

- we will connect the sqlite3 DB browser and make a new database file:

 

import sqlite3
conn = sqlite3.connect('new_name.sqlite')
cur = conn.cursor()
chicago_socioeconomic_data.to_sql('chicago',conn) # table name is 'chicago'

 

Q) How many rows are in the dataset?

 

cur.execute('select count(*) from chicago')
for row in cur:
    print(row[0])

 

ans) 78

 

Q) How many community areas in Chicago have a hardship index greater than 50.0?

 

cur.execute('select count(*) from chicago where hardship_index > 50.0')
for row in cur:
    print(row[0])

 

ans) 38

 

Q) What is the maximum value of hardship index in this dataset?

 

cur.execute('select MAX(hardship_index) from chicago')
for row in cur:
    print(row[0])

 

ans) 98.0

 

Q) Which community area which has the highest hardship index?

 

cur.execute('select community_area_name from chicago where hardship_index = 98.0')
for row in cur:
    print(row[0])

 

ans) Riverdale

 

Q) Which Chicago community areas have per-capita incomes greater than $60,000?

 

cur.execute('select community_area_name from chicago where per_capita_income_ > 60000')
for row in cur:
    print(row[0])

 

ans) Lake View, Lincoln Park, Near North Side, Loop

 

visualization Q) Create a scatter plot using the variables per_capita_income_ and hardship_index. Explain the correlation between the two variables.

 

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas

selectQuery = "select per_capita_income_, hardship_index from chicago"
income_vs_hardship = pandas.read_sql(selectQuery, conn)
plot = sns.jointplot(x='per_capita_income_',y='hardship_index', data=income_vs_hardship)

 

- as Per Capita Income rises as the Hardship Index decreases

- the points on the scatter plot are somewhat closer to a straight line in the negative direction

(we have a negative correlation between the two variables)

댓글