This website summarizes several of the technical projects that I’ve worked on in the last few years. From a transit routing algorithm and a traffic congestion predictor to a Messenger chatbot that talks like a hockey player in an interview, the projects range from practical and informative to entertaining. Each served as an opportunity to deepen my understanding of a particular method or technology while tackling an interesting problem.
With the exception of the Camera-Based Cycling Cadence Tracker, all projects were completed before August 2020.
My Contact and Other Work
Table of Contents
- Research
- Machine Learning
- Computer Vision
- Data Exploration and Modelling
- Data Collection and Cleaning
- Statistics
- Blogging
- Honorable Mentions
- Recommended Links
1. Research
1.1. Data Augmentation For Text Classification Tasks - University of Waterloo Master’s Thesis
A detailed study of synthetic data generation techniques that provide accuracy increases of several percentage points to state-of-the-art text classification models. The generation methods range from rule-based to deep-learning-based approaches.
Libraries Used
- PyTorch, PyTorch-Transformers, NLTK, Gensim, spaCy
2. Machine Learning
2.1. HockeyBot - The Facebook Messenger Chatbot
Product
Upon receiving a message that could plausibly begin a hockey player’s interview response (e.g. “Well you know”), the chatbot replies with a five-sentence continuation of that message. Active for over 4 months, it has maintained a 100% response rate within 30 seconds of receiving a message.
Data
The dataset described in the National Hockey League Interview Transcripts project.
Approach
A unidirectional recurrent neural network is trained on the interview transcript data. Each string of 6 contiguous words forms a training example: the first 5 are the input and the 6th is the label. Once the model has learned a probability distribution over candidate next words, that distribution is sampled to generate new text. The model is deployed to a Heroku server so that the bot can respond to messages at any time.
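Below is a minimal, self-contained sketch of this setup. The toy corpus, the choice of a GRU cell, and all layer sizes are illustrative stand-ins rather than the deployed model’s actual configuration.

```python
import torch
import torch.nn as nn

# Toy corpus; the real model trains on the full interview transcript data.
corpus = "well you know we just have to take it one game at a time".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}

# Each string of 6 contiguous words is one example: 5 input words, 1 label.
windows = [corpus[i:i + 6] for i in range(len(corpus) - 5)]
X = torch.tensor([[stoi[w] for w in win[:5]] for win in windows])
y = torch.tensor([stoi[win[5]] for win in windows])

class NextWordRNN(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)  # unidirectional
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        _, h = self.rnn(self.emb(x))   # final hidden state of the sequence
        return self.out(h.squeeze(0))  # logits over the vocabulary

model = NextWordRNN(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# Generation: sample the learned distribution rather than taking the
# argmax, so repeated prompts yield varied responses.
seed = [stoi[w] for w in "well you know we just".split()]
for _ in range(10):
    probs = torch.softmax(model(torch.tensor([seed[-5:]])), dim=-1)
    seed.append(torch.multinomial(probs, 1).item())
print(" ".join(vocab[i] for i in seed))
```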
Relevant Links
- Facebook Messenger link to interact with the bot
- Bot training GitHub repository
- Bot Heroku deployment GitHub repository
Libraries Used
- PyTorch
3. Computer Vision
3.1. Camera-Based Cycling Cadence Tracker
Product
A stationary-bike cadence (i.e., RPM) tracker that runs in real time on a MacBook Pro. Current and average speed are also estimated from the RPM and two bike-specific parameters.
Approach
The algorithm proposed in this paper is implemented. The frame-by-frame pixel-level intensity difference is calculated, compressed, and reshaped to a vector. The frequency of the periodic motion is determined via Fourier analysis of the resultant vector.
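The sketch below illustrates the frequency-estimation step, with one simplification: each difference image is collapsed to a single motion-energy value per frame, whereas the paper’s method compresses the difference images less aggressively. The synthetic example fabricates a scene whose motion energy pulses at 1.5 Hz.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

def cadence_rpm(frames, fps):
    """frames: grayscale video as an array of shape (n_frames, h, w)."""
    frames = frames.astype(np.float32)
    # Pixel-level intensity difference between consecutive frames,
    # reduced to one motion-energy value per frame.
    motion = np.abs(np.diff(frames, axis=0)).sum(axis=(1, 2))
    motion -= motion.mean()                     # remove the DC component
    spectrum = np.abs(rfft(motion))
    freqs = rfftfreq(len(motion), d=1.0 / fps)  # in Hz
    return freqs[np.argmax(spectrum)] * 60.0    # revolutions per minute

# Synthetic check: brightness ramps only during the "on" half of each
# 1.5 Hz cycle, so the motion energy pulses at 1.5 Hz (i.e., 90 RPM).
t = np.arange(300) / 30.0                       # 10 seconds at 30 fps
pulse = (np.sin(2 * np.pi * 1.5 * t) > 0).astype(np.float32)
fake = np.cumsum(pulse)[:, None, None] * np.ones((300, 8, 8))
print(cadence_rpm(fake, fps=30))                # ~90
```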
Relevant Links
- GitHub repository
- Research Paper
Libraries Used
- SciPy, NumPy
3.2. Motion Activated Camera
Product
A security camera that begins recording video once motion is detected. Used to investigate my housemates’ suspicions that a cat has been sneaking into our house in the early morning via our front door mail slot (the results are inconclusive).
Approach
Motion is detected by background subtraction, in which the difference between consecutive frames is calculated and studied. Each frame is converted to grayscale and blurred with a Gaussian filter, and the element-wise absolute difference between consecutive frames is computed. A threshold is applied to the result to map each pixel to 0 or 255, and the resulting image is dilated for display purposes. When motion is detected in 5 consecutive frames, the program begins saving the video feed, and continues until 30 seconds have passed without any movement. The videos are saved to timestamped folders, and the camera feed is annotated with the time and the status (whether motion is detected at each moment).
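A condensed sketch of the detection loop follows. The webcam index and the motion-pixel cutoff are stand-ins, and the recording logic (timestamped folders, the 30-second timeout) is elided.

```python
import cv2

cap = cv2.VideoCapture(0)   # assumed webcam index
prev = None
consecutive = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    if prev is None:
        prev = gray
        continue
    # Background subtraction: absolute difference against the previous
    # frame, thresholded so every pixel becomes 0 or 255.
    delta = cv2.absdiff(prev, gray)
    mask = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
    mask = cv2.dilate(mask, None, iterations=2)   # for display only
    moving = cv2.countNonZero(mask) > 500         # hypothetical cutoff
    consecutive = consecutive + 1 if moving else 0
    if consecutive >= 5:
        pass  # motion confirmed: begin (or continue) saving the feed
    prev = gray
    cv2.imshow("mask", mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```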
Libraries Used
- OpenCV, NumPy
4. Data Exploration and Modelling
4.1. Traffic Congestion Prediction
Product
A program that predicts the traffic congestion at more than 10,000 intersections in 4 major American cities. It is estimated that the model would achieve results in the 25th percentile of the Kaggle leaderboard. Please see the modelling notebook for an explanation of this estimate.
Data
The 20th, 50th, and 80th percentiles of the following two traffic congestion metrics were used:
- The total time spent waiting at an intersection.
- The distance between the center of the intersection and the place at which the vehicle first stops.
These metrics are aggregated by city, intersection, month, hour of the day, entrance and exit orientation, and several other variables.
Goal
Given a training set with all variables and a test set with only the aggregation variables, predict each of the six metric-percentile pairs on the test set.
Approach
Weakly predictive variables such as coordinates and entrance and exit orientation are used to engineer strongly predictive variables such as distance from the city center and direction of turn. Dummy encoding the more than 10,000 unique intersections would not fit into memory, so this important feature is kept in the training set by using a gradient-boosted trees library with a novel encoding method for high-cardinality categorical variables. Hyperparameter tuning and early stopping ensure that the model fits the data well.
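The sketch below shows the encoding strategy in miniature: CatBoost accepts the raw intersection ID as a categorical feature, avoiding a dummy encoding with over 10,000 columns. The file and column names are hypothetical stand-ins for the competition data.

```python
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")   # hypothetical file and column names
features = ["IntersectionId", "City", "Hour", "Month",
            "DistanceToCenter", "TurnDirection"]
cat_cols = ["IntersectionId", "City", "TurnDirection"]
target = "TotalTimeStopped_p50"    # one of the six metric-percentile pairs

X_tr, X_val, y_tr, y_val = train_test_split(
    train[features], train[target], test_size=0.2, random_state=0)

# CatBoost encodes high-cardinality categoricals internally (ordered
# target statistics), so IntersectionId never becomes 10,000+ columns.
model = CatBoostRegressor(iterations=2000, learning_rate=0.05, verbose=200)
model.fit(X_tr, y_tr, cat_features=cat_cols,
          eval_set=(X_val, y_val), early_stopping_rounds=50)
```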
Libraries Used
- CatBoost, Pandas, NumPy, scikit-learn, seaborn
4.2. Public Transit Optimization for the Greater Toronto Area
The Problem
Imagine Jack and Jill, starting from different locations, have a shared destination. Jill is planning to drive, and agrees to pick Jack up at a transit stop on the condition that she does not have to go any further out of her way than strictly necessary. Which stop should Jack travel to?
The Solution
This program answers the above question within 10 seconds. It gives the correct answer as long as Jack’s starting location is within range of the Toronto public transit system.
Data
Relevant data is extracted from a set of 7 CSV files made publicly available on the Toronto Open Data website.
Approach
The data is organized into Pandas dataframes and used to create a NetworkX graph of the 9,155 stops and 213 routes. With this graph representation in hand, Google API use is limited to a single call per query.
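A hedged sketch of the graph construction, assuming GTFS-style files with standard column names (the actual Toronto files may differ slightly):

```python
import networkx as nx
import pandas as pd

stops = pd.read_csv("stops.txt")            # stop_id, stop_name, stop_lat, stop_lon
stop_times = pd.read_csv("stop_times.txt")  # trip_id, stop_id, stop_sequence

G = nx.Graph()
for _, row in stops.iterrows():
    G.add_node(row["stop_id"], name=row["stop_name"],
               pos=(row["stop_lat"], row["stop_lon"]))

# Connect consecutive stops on each trip so that routes become edges.
ordered = stop_times.sort_values(["trip_id", "stop_sequence"])
for _, trip in ordered.groupby("trip_id"):
    ids = trip["stop_id"].tolist()
    G.add_edges_from(zip(ids, ids[1:]))

# Candidate pickup stops for Jack can then be found with graph search,
# e.g. nx.single_source_shortest_path_length(G, jack_nearest_stop),
# leaving only a single Google API call to score Jill's detour.
```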
Libraries Used
- NetworkX, Pandas, NumPy
4.3. Contrasting Hockey Players’ and Coaches’ Speech Patterns
Findings and Product
- The average sentiment of players is more positive than that of coaches, and this difference is statistically significant.
- The difference between coaches’ and players’ average selfishness is not statistically significant.
- The variance in both sentiment and selfishness is greater for players than it is for coaches.
- A model classifying speech as from a coach or from a player is trained and achieves an F1 score of 0.969 on the test set.
Data
See “National Hockey League Interview Transcripts” in the Data Collection and Cleaning section of this page.
Approach
Sentiment is scored using the AFINN sentiment lexicon, in which each word is assigned a score between -5 and +5. Selfishness is scored with the same approach but with a simple lexicon of my own creation, in which first-person singular and first-person plural pronouns are assigned scores of +1 and -1, respectively. Both sentiment and selfishness are normalized by the number of words in the interview response. The player-vs-coach classifier is a logistic regression model with TF-IDF features as input.
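The scoring logic in miniature (the selfishness lexicon below is an abbreviated version of the +1/-1 scheme described above, and the classifier is shown on toy data):

```python
from afinn import Afinn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

afinn = Afinn()
SELF_LEXICON = {"i": 1, "me": 1, "my": 1, "mine": 1, "myself": 1,
                "we": -1, "us": -1, "our": -1, "ours": -1, "ourselves": -1}

def scores(response):
    """Per-word sentiment and selfishness for one interview response."""
    words = response.lower().split()
    sentiment = afinn.score(response) / len(words)
    selfishness = sum(SELF_LEXICON.get(w, 0) for w in words) / len(words)
    return sentiment, selfishness

# Player-vs-coach classifier: TF-IDF features into logistic regression.
texts = ["well you know we battled hard out there tonight",
         "we need better puck management in the neutral zone"]
labels = ["player", "coach"]   # toy examples, not the real dataset
clf = LogisticRegression().fit(TfidfVectorizer().fit_transform(texts), labels)
```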
Relevant Links
- ASAP Sports, the sports interview aggregation site
- Kaggle Kernel, published alongside my dataset
Libraries Used
- AFINN, NLTK, Pandas, NumPy, seaborn, matplotlib, scikit-learn
5. Data Collection and Cleaning
5.1. Web Browser Activity Tracker
Product
A Python script that tracks daily Firefox browser use, saving the information to a JSON file. An accompanying program reads the saved JSON file and organizes it into a Pandas dataframe. It currently plots a bar graph of a use metric for each website, but the data can be analyzed in any desired way. The following use metrics are currently supported:
- Number of visits to each website
- Length of time that each webpage was open
Data
As Firefox runs, it stores the current session’s information (e.g. each open tab’s title and URL) in an LZ4-compressed JSON file. This file contains no historical data; it only describes the tabs open at the time of the update. The frequency at which the file is updated can vary, but it is usually every 5 seconds.
Approach
The Python script reads the LZ4 file at a predefined frequency, 1 Hz by default. It compares the current reading to the previous one, noting the current time and the webpages that have appeared or disappeared. In doing so it compiles a history of when the user opened and closed each visited webpage.
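A sketch of one polling step appears below. It assumes the session file follows Firefox’s mozLz4 convention (an 8-byte magic header in front of a standard LZ4 block) and a simplified tab structure; the exact path and JSON layout vary by profile and Firefox version.

```python
import json
import time
import lz4.block   # pip install lz4

SESSION_FILE = "recovery.jsonlz4"   # path varies by Firefox profile

def read_open_urls(path):
    with open(path, "rb") as f:
        raw = f.read()
    # Skip the 8-byte "mozLz40\0" magic; the rest is a size-prefixed
    # LZ4 block that lz4.block.decompress handles directly.
    data = json.loads(lz4.block.decompress(raw[8:]))
    urls = set()
    for window in data.get("windows", []):
        for tab in window.get("tabs", []):
            urls.add(tab["entries"][-1]["url"])   # most recent entry per tab
    return urls

previous = set()
while True:
    current = read_open_urls(SESSION_FILE)
    opened, closed = current - previous, previous - current
    # ...timestamp and record `opened`/`closed` here...
    previous = current
    time.sleep(1)   # 1 Hz polling, as described above
```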
Libraries Used
- Pandas, NumPy, seaborn, matplotlib
5.2. National Hockey League Interview Transcripts
Product
A CSV file with columns for the two teams in the Stanley Cup Final (team1 and team2), the interview date, the interviewee’s name and job type (player, coach, or other), and the interview transcript.
Approach
The website is organized as sport -> year -> date -> interview page. BeautifulSoup is used to crawl the bottom three levels, gathering the relevant information along the way. A data cleaning script settles naming inconsistencies (such as Mike Babcock vs. Coach Babcock) and determines job type. If a job type or name cannot be inferred from the data alone, the script asks the user for the relevant input.
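The crawling pattern, sketched below; the URL fragments and link filters are illustrative rather than taken from the site’s actual markup.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "http://www.asapsports.com"   # the interview aggregation site

def links_on(url, keyword):
    """Return absolute URLs of links on `url` whose href contains `keyword`."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)
            if keyword in a["href"]]

# Descend the date -> interview levels, collecting transcript text.
# "show_event" and "show_interview" are hypothetical href markers.
for date_url in links_on(BASE, "show_event"):
    for interview_url in links_on(date_url, "show_interview"):
        page = BeautifulSoup(requests.get(interview_url).text, "html.parser")
        transcript = page.get_text(separator=" ", strip=True)
        # ...parse interviewee name, job type, and teams from `transcript`...
```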
Relevant Links
- ASAP Sports, the sports interview aggregation site
- Project GitHub repository
Libraries Used
- Beautiful Soup, NLTK, Pandas
5.3. Medical Mask Donation Hubs
Product
A web scraping script that creates a CSV file with a row for each entry and a column for each field on a medical mask donation hub aggregation website. The fields describe the hub locations and the types of donations they accept. The program was completed as part of an Upwork project proposal.
Libraries Used
- Beautiful Soup, Selenium, Pandas, NumPy
6. Statistics
6.1. Inference on the Boston Housing Dataset Using Linear Regression
Findings
- One can say with 95% confidence that for a fixed percentage of the population that is lower status, an increase of 1 in the average number of rooms per house results in an increase of between 5.44% and 13.20% in the median house value.
- One can say with 95% confidence that for a fixed average number of rooms per house, an increase of 1 percent in the percentage of the population that is lower status results in a decrease of between 0.47% and 0.56% in the median house value.
- Given the effects due to the percentage of the population that is lower status, the average number of rooms has a statistically significant effect on the log of the median house value.
- Given the effects due to the average number of rooms, the logarithm of the percentage of the population that is lower status has a statistically significant effect on the log of the median house value.
Data
The Boston Housing Dataset contains medians, means, and proportions of various attributes of 506 different segments of the city of Boston. The median house value is used as the target variable.
Approach
The assumptions of linear regression are validated, allowing the confidence intervals enumerated above to be constructed and the hypothesis tests to be conducted.
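For concreteness, here is how one such interval can be computed from first principles. The usage comment assumes the fitted model is log(MEDV) ~ RM + log(LSTAT); exponentiating the interval endpoints converts a coefficient on the log scale into the percent changes quoted above.

```python
import numpy as np
from scipy import stats

def coef_ci(X, y, j, level=0.95):
    """t-based confidence interval for coefficient j of an OLS fit.
    X must already include an intercept column."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                     # residual variance
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[j, j])  # std. error of beta_j
    t = stats.t.ppf((1 + level) / 2, df=n - p)
    return beta[j] - t * se, beta[j] + t * se

# Usage (hypothetical arrays): X columns are [1, RM, log(LSTAT)], y = log(MEDV).
# lo, hi = coef_ci(X, y, j=1)   # the RM coefficient
# print(f"{(np.exp(lo) - 1) * 100:.2f}% to {(np.exp(hi) - 1) * 100:.2f}%")
```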
Libraries Used
- SciPy, Pandas, NumPy, seaborn, matplotlib
7. Blogging
Medium posts explaining elements of certain projects.
7.1. NHL Player Chatbot
- Selected by Medium curators for distribution in the site’s AI and Machine Learning sections
- Published on Analytics Vidhya’s Medium page
- Medium link
7.2. A Quantitative Study of NHL Interviews
- Published on Analytics Vidhya’s Medium page
- Medium link
8. Honorable Mentions
Data wrangling, feature engineering, and model fine-tuning with two classic Kaggle datasets.
8.1. Kaggle House Prices
8.2. Titanic Classification
9. Recommended Links
- Alan Perlis’ “Epigrams on Programming”
- Datasets Subreddit
- Our World In Data