Spooky Author Identification (Kaggle Competition)
To gain familiarity with working with text data, I entered the “Spooky Author Identification” competition on Kaggle in the fall of 2017. The goal of the competition was to predict which of three authors from a similar genre and era (Edgar Allan Poe, Mary W. Shelley, and H.P. Lovecraft) wrote a particular sentence. As part of my data investigation, I examined how each author’s style and word usage varied before beginning to model. I tried a variety of models, including logistic regression, random forest classification, a feed-forward neural network, and a convolutional neural network, all for multiclass classification. The notebooks containing all my code, as well as the original data files, are on my GitHub, here.
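To give a flavor of the simplest of these approaches, here is a minimal sketch of a TF-IDF plus logistic regression baseline for this task. It is not my actual pipeline; the file name is an assumption, though the text and author columns do match the competition’s training data, and the competition is scored on multiclass log loss.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Load the competition's training data (id, text, author columns).
train = pd.read_csv("train.csv")

# TF-IDF features feeding a multinomial logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# The competition is scored with multiclass log loss.
scores = cross_val_score(model, train["text"], train["author"],
                         scoring="neg_log_loss", cv=5)
print(f"CV log loss: {-scores.mean():.3f}")
```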
Predicting the Sale Price of Homes in Ames, IA
To practice working with messy data and building a predictive model, I used the Ames Housing Dataset. The data is particularly tricky, as it includes many null values and a number of columns containing various kinds of categorical variables. I’ve condensed my project into a single Jupyter notebook that walks through the steps I took to evaluate and clean the data and then build and evaluate a series of predictive models. Both the notebook and the original data file are on my GitHub, here.
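As a rough sketch of the kind of cleaning this data requires, the snippet below imputes nulls and one-hot encodes the categorical columns before fitting a simple ridge regression. It is an illustration under assumptions (the file name is hypothetical; SalePrice is the dataset’s actual target column), not a walkthrough of my notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the Ames data; the file name is an assumption.
df = pd.read_csv("ames.csv")
y = df["SalePrice"]
X = df.drop(columns=["SalePrice"])

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Impute nulls, scale the numeric columns, and one-hot
# encode the many categorical columns.
prep = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"),
                          StandardScaler()), num_cols),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")), cat_cols),
])

model = make_pipeline(prep, Ridge(alpha=10.0))
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```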
Pokemon Stay in Python
I created a Jupyter notebook that contains the project objectives and the Python code to complete them. The purpose of this project was to use base Python 3 functionality to gain familiarity with how lists and dictionaries work. The end deliverable is a function that takes a CSV file and converts it into a dictionary.
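A minimal sketch of what such a function might look like, using only base Python, is below. The column layout is an assumption (a header row, with the first column as a unique key), and the naive comma split ignores quoted fields.

```python
def csv_to_dict(filepath):
    """Read a CSV file and return a dictionary keyed by the first column.

    Assumes a header row followed by data rows, with the first
    column serving as a unique key for each row.
    """
    with open(filepath, "r") as f:
        # Parse the raw text into a list of lists, one inner list per row.
        rows = [line.strip().split(",") for line in f]

    header, data = rows[0], rows[1:]
    # Build a nested dictionary: {key: {column_name: value, ...}}
    return {row[0]: dict(zip(header[1:], row[1:])) for row in data}
```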
The project is great practice for creating and appending to Python dictionaries as well as accessing nested dictionaries. There is also an emphasis on creating and applying functions to nested dictionaries using both dictionary comprehensions and for loops. Another key element of the project is reading in a CSV file, parsing it, and then extracting information from the resulting list of lists.
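For instance, applying a transformation across a nested dictionary with both styles might look like the following. The `pokedex` structure and the 10% stat bump are hypothetical illustrations, not the project’s actual data or objectives.

```python
# A hypothetical nested dictionary in the shape the project works with.
pokedex = {
    "pikachu": {"type": "electric", "hp": 35, "attack": 55},
    "eevee": {"type": "normal", "hp": 55, "attack": 55},
}

# Dictionary comprehension: bump every numeric stat by 10%.
powered_up = {
    name: {stat: (round(val * 1.1) if isinstance(val, (int, float)) else val)
           for stat, val in stats.items()}
    for name, stats in pokedex.items()
}

# The equivalent with for loops, building a fresh copy as it goes.
powered_up_loops = {}
for name, stats in pokedex.items():
    powered_up_loops[name] = {}
    for stat, val in stats.items():
        powered_up_loops[name][stat] = (
            round(val * 1.1) if isinstance(val, (int, float)) else val
        )
```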
You can check out this fun exercise on my GitHub.
311, Weather, and the Economy
Recently I took a quick break from working on my research to play around with some fun data. I wanted to look at the relationship between 311 calls in the City of New York, the weather, and economic performance. The purpose of the project is to see whether 311 call volume can be predicted. If call volume and complaint types can be accurately predicted, this would have several potential applications. The most immediate is for the City itself: if the City knows to expect a rush of calls, or better yet a specific type of call, it can better staff its call center and prepare response teams. Private companies might also be able to improve their targeted marketing. For example, I found that rodent complaints spike at regular intervals (see the plot titled “311 Complaint Types Across 2013,” here). By finding what predicts these spikes, companies that provide extermination services, or that offer products to address rodent problems, can increase their marketing efforts precisely when demand is higher, optimizing their marketing budgets. It might also be a good time for pet adoption agencies to step up their efforts, as more people might be interested in adopting cats out of shelters to deal with mice and rats.
I gathered data from three sources. First are all 311 calls in New York City in 2013, available for download at NYC OpenData (specifically here); this is a 124 MB .csv file. Looking through the 311 data, I noticed that the most common call type concerned heating, so I thought it important to collect weather data. I downloaded daily weather reports for New York City in 2013 via NOAA’s National Centers for Environmental Information, which provided a 0.54 MB .csv file. Finally, I thought that economic performance might influence how people feel about their current situation: if the economy is doing well, perhaps they will be slightly less cranky or less easily annoyed, and therefore less likely to call 311. To capture economic conditions, I used daily information on the performance of the Dow Jones Industrial Average (DJIA), pulled with Quandl’s native R package using the data described here. To visually assess the three data sources, please see my figure here.
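The project itself is written in R, but the shape of the data assembly translates directly; a rough Python/pandas sketch of collapsing the calls to daily counts and joining the three sources is below. The file names are assumptions, though “Created Date” and “DATE” match the 311 and NOAA column conventions.

```python
import pandas as pd

# File names are assumptions; see the links above for the actual sources.
calls = pd.read_csv("311_calls_2013.csv", parse_dates=["Created Date"])
weather = pd.read_csv("noaa_nyc_2013.csv", parse_dates=["DATE"])
djia = pd.read_csv("djia_2013.csv", parse_dates=["Date"])

# Collapse the raw calls into one row per day.
calls["date"] = calls["Created Date"].dt.normalize()
daily_calls = calls.groupby("date").size().rename("call_count").reset_index()

# Join the three sources on the calendar date.
merged = (daily_calls
          .merge(weather, left_on="date", right_on="DATE", how="left")
          .merge(djia, left_on="date", right_on="Date", how="left"))
```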
Once I had all of my data, I subset it to Manhattan-based 311 calls only and created daily counts of 311 calls to use as my dependent variable. I then built a simple regression model using the change in the day’s DJIA, the day’s DJIA volume, the minimum observed temperature, and the amount of precipitation. While I found no effect for the change in the DJIA or for DJIA volume, I did find that lower temperatures significantly increased the number of 311 calls. Serial autocorrelation is of course an issue in time series data like these, and I checked the robustness of my results accordingly. Even though the economic indicator showed no effect, I believe the project is still worth pursuing for two reasons. First, I did find a significant predictor of 311 call volume. Second, I did not disaggregate complaint types in my analysis; heating complaints are the single most common type of 311 complaint, and it’s likely that my model results are driven by this category. A more nuanced investigation might reveal that different classes of complaints are predicted by different identifiable events and circumstances.
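Again, the analysis was done in R, but the equivalent regression in Python would look roughly like the sketch below. It assumes the `merged` daily data frame from the previous sketch with hypothetical column names, and uses Newey-West (HAC) standard errors as one standard way to guard against serial autocorrelation; this is an illustration, not necessarily the robustness check I ran.

```python
import statsmodels.formula.api as smf

# Daily 311 call counts regressed on weather and market variables.
# Column names are assumptions carried over from the sketch above.
ols = smf.ols("call_count ~ djia_change + djia_volume + TMIN + PRCP",
              data=merged)

# HAC (Newey-West) standard errors are one common guard against
# serial autocorrelation in daily time series like these.
fit = ols.fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(fit.summary())
```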
The R code for this project, along with code for the other projects mentioned here, can be found on my GitHub at github.com/DavidAGelman.