Data

CS 424 Spring 2021

David Shumway


Data source: https://www.kaggle.com/chicago/chicago-energy-usage-2010

The spatial dataset is published by the City of Chicago and combines City of Chicago data with US Census data. It is organized primarily by Census block, the smallest geographic unit in the US Census, so it can be grouped into larger Census units such as Census tracts. Data points include Gas Use, Electricity Use, Building Age, Building Type, Building Height, and Total Population. In addition to the original dataset, two further data points are retrieved from the US Census and included in the visualization as a proof of concept: resident age and migrant worker housing units.

Data types in the original dataset

Building type: qualitative (Residential, Commercial, Industrial)
   No NA values.
KWH (month) 2010: sequential
THERMS (month) 2010: sequential
TOTAL KWH: sequential
TOTAL THERMS: sequential
Total population: sequential
   In blocks with multiple rows, each row contains the same population value.
Average stories: sequential (0-110)
   In blocks with multiple rows, each row contains a different average stories value, so the rows are averaged to obtain a single average stories value for the block.
Average building age: sequential (0-158)
   In blocks with multiple rows, each row contains a different average building age value, so the rows are averaged to obtain a single average building age for the block.
Occupied units percentage: sequential (0-1)
   In blocks with multiple rows, each row contains the same value.
Renter-occupied housing percentage: sequential (0-1)
   In blocks with multiple rows, each row contains the same value.
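
The per-block handling noted in the list above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the field names (block, population, avg_stories, avg_building_age) and the sample values are hypothetical stand-ins for the dataset's columns.

```python
from collections import defaultdict
from statistics import mean


def collapse_blocks(rows):
    """Collapse multiple rows per Census block into one record per block.

    The field names are hypothetical stand-ins for the dataset's columns.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["block"]].append(row)

    result = {}
    for block, group in grouped.items():
        result[block] = {
            # Population repeats on every row of a block, so take it once.
            "population": group[0]["population"],
            # Stories and building age differ per row, so average them.
            "avg_stories": mean(r["avg_stories"] for r in group),
            "avg_building_age": mean(r["avg_building_age"] for r in group),
        }
    return result


# Made-up sample rows: block "A" has two rows, block "B" has one.
sample_rows = [
    {"block": "A", "population": 120, "avg_stories": 2.0, "avg_building_age": 80.0},
    {"block": "A", "population": 120, "avg_stories": 4.0, "avg_building_age": 40.0},
    {"block": "B", "population": 45, "avg_stories": 1.0, "avg_building_age": 95.0},
]
blocks = collapse_blocks(sample_rows)
```

Note that this takes an unweighted mean of the rows, which is the simplest reading of "the rows are averaged".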

Data types added from the 2010 US Census

In all cases, data was retrieved at the block level. Tract-level data was then generated from the block-level data.
Resident age: sequential
Migrant housing units: sequential
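
Generating tract-level values from block-level values could look something like the sketch below. It relies on the standard Census GEOID layout, in which a 15-digit block identifier begins with the 11-digit identifier of its tract (2-digit state, 3-digit county, 6-digit tract). Summing is an assumption that suits counts such as housing units; the source does not say how each variable was aggregated, and the function names and sample values are made up for illustration.

```python
from collections import defaultdict


def tract_id(block_geoid):
    """Return the 11-digit tract GEOID that prefixes a 15-digit block GEOID."""
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {block_geoid!r}")
    return block_geoid[:11]


def sum_by_tract(block_counts):
    """Sum a per-block count (e.g. housing units) up to the tract level.

    Summation is an assumed aggregation rule, not one stated in the source.
    """
    totals = defaultdict(int)
    for geoid, count in block_counts.items():
        totals[tract_id(geoid)] += count
    return dict(totals)


# Made-up counts keyed by made-up block GEOIDs in Cook County (FIPS 17031).
sample_counts = {
    "170310101001000": 3,
    "170310101001001": 5,
    "170310102002000": 2,
}
tract_totals = sum_by_tract(sample_counts)
```

Identifiers are kept as strings so that leading zeros are not lost.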

Data issues

The April gas-use column, which should be named THERM.APRIL.2010, is misspelled "TERM.APRIL.2010" (sic) in the original dataset.
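
One way to handle the misspelled column name is to correct it as the file is loaded, so that later code can refer to the intended name. The following is a minimal Python sketch under that assumption; the BLOCK column and the sample values are made up, and only the two April column names come from the dataset.

```python
import csv
import io

# Misspelled header as reported above, mapped to the intended name.
HEADER_FIXES = {"TERM.APRIL.2010": "THERM.APRIL.2010"}


def fix_header(fieldnames):
    """Return the column names with known misspellings corrected."""
    return [HEADER_FIXES.get(name, name) for name in fieldnames]


# Made-up two-column extract standing in for the real file.
raw = "BLOCK,TERM.APRIL.2010\nA,12\n"
reader = csv.reader(io.StringIO(raw))
header = fix_header(next(reader))
records = [dict(zip(header, row)) for row in reader]
```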