This website summarizes several of the technical projects that I’ve worked on in the last few years. From a transit routing algorithm and a traffic congestion predictor to a Messenger chatbot that talks like a hockey player in an interview, the projects range from practical and informative to entertaining. Each served as an opportunity to deepen my understanding of a particular method or technology while tackling an interesting problem.
With the exception of the Camera-Based Cycling Cadence Tracker, all projects were completed before August 2020.
My Contact and Other Work
Table of Contents
- Research
- Machine Learning
- Computer Vision
- Data Exploration and Modelling
- Data Collection and Cleaning
- Statistics
- Blogging
- Honorable Mentions
- Recommended Links
1. Research
1.1. Data Augmentation For Text Classification Tasks - University of Waterloo Master’s Thesis
A detailed study of synthetic data generation techniques that provide accuracy increases of several percentage points to state-of-the-art text classification models. The generation methods range from rule-based to deep-learning-based approaches.
Libraries Used
- PyTorch, PyTorch-Transformers, NLTK, Gensim, spaCy
2. Machine Learning
2.1. HockeyBot - The Facebook Messenger Chatbot
Product
Upon receiving a message that could plausibly begin a hockey player’s interview response (e.g. “Well you know”), the chatbot replies with a five-sentence continuation of that message. Active for over 4 months, it has maintained a 100% response rate within 30 seconds of receiving a message.
Data
The dataset described in the National Hockey League Interview Transcripts project.
Approach
A unidirectional recurrent neural network is trained on the interview transcript data. Each string of 6 contiguous words forms a training example: the first 5 are the input and the 6th is the label. Once the model has learned a probability distribution over candidate next words, that distribution is sampled to generate new text. The model is deployed to a Heroku server so that the bot can respond to messages at any time.
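Below is a minimal, self-contained sketch of this setup. The toy corpus, the choice of a GRU cell, and all layer sizes are illustrative stand-ins rather than the deployed model’s actual configuration.

```python
import torch
import torch.nn as nn

# Toy corpus; the real model trains on the full interview transcript data.
corpus = "well you know we just have to take it one game at a time".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}

# Each string of 6 contiguous words is one example: 5 input words, 1 label.
windows = [corpus[i:i + 6] for i in range(len(corpus) - 5)]
X = torch.tensor([[stoi[w] for w in win[:5]] for win in windows])
y = torch.tensor([stoi[win[5]] for win in windows])

class NextWordRNN(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)  # unidirectional
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        _, h = self.rnn(self.emb(x))   # final hidden state of the sequence
        return self.out(h.squeeze(0))  # logits over the vocabulary

model = NextWordRNN(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# Generation: sample the learned distribution rather than taking the
# argmax, so repeated prompts yield varied responses.
seed = [stoi[w] for w in "well you know we just".split()]
for _ in range(10):
    probs = torch.softmax(model(torch.tensor([seed[-5:]])), dim=-1)
    seed.append(torch.multinomial(probs, 1).item())
print(" ".join(vocab[i] for i in seed))
```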
Relevant Links
- Facebook Messenger link to interact with the bot
- Bot training GitHub repository
- Bot Heroku deployment GitHub repository
Libraries Used
- PyTorch
3. Computer Vision
3.1. Camera-Based Cycling Cadence Tracker
Product
A stationary-bike cadence (i.e., RPM) tracker that runs in real time on a MacBook Pro. Current and average speed are also estimated from the RPM and two bike-specific parameters.
Approach
The algorithm proposed in this paper is implemented. The frame-by-frame pixel-level intensity difference is calculated, compressed, and reshaped to a vector. The frequency of the periodic motion is determined via Fourier analysis of the resultant vector.
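The sketch below illustrates the frequency-estimation step, with one simplification: each difference image is collapsed to a single motion-energy value per frame, whereas the paper’s method compresses the difference images less aggressively. The synthetic example fabricates a scene whose motion energy pulses at 1.5 Hz.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

def cadence_rpm(frames, fps):
    """frames: grayscale video as an array of shape (n_frames, h, w)."""
    frames = frames.astype(np.float32)
    # Pixel-level intensity difference between consecutive frames,
    # reduced to one motion-energy value per frame.
    motion = np.abs(np.diff(frames, axis=0)).sum(axis=(1, 2))
    motion -= motion.mean()                     # remove the DC component
    spectrum = np.abs(rfft(motion))
    freqs = rfftfreq(len(motion), d=1.0 / fps)  # in Hz
    return freqs[np.argmax(spectrum)] * 60.0    # revolutions per minute

# Synthetic check: brightness ramps only during the "on" half of each
# 1.5 Hz cycle, so the motion energy pulses at 1.5 Hz (i.e., 90 RPM).
t = np.arange(300) / 30.0                       # 10 seconds at 30 fps
pulse = (np.sin(2 * np.pi * 1.5 * t) > 0).astype(np.float32)
fake = np.cumsum(pulse)[:, None, None] * np.ones((300, 8, 8))
print(cadence_rpm(fake, fps=30))                # ~90
```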
Relevant Links
- GitHub repository
- Research Paper
Libraries Used
- SciPy, NumPy
3.2. Motion Activated Camera
Product
A security camera that begins recording video once motion is detected. Used to investigate my housemates’ suspicions that a cat has been sneaking into our house in the early morning via our front door mail slot (the results are inconclusive).
Approach
Motion is detected by background subtraction, in which the difference between consecutive frames is calculated and studied. Each frame is converted to grayscale and blurred with a Gaussian filter, and the element-wise absolute difference between consecutive frames is computed. A threshold is applied to the result to map each pixel to 0 or 255, and the resulting image is dilated for display purposes. When motion is detected in 5 consecutive frames, the program begins saving the video feed, and continues until 30 seconds have passed without any movement. The videos are saved to timestamped folders, and the camera feed is annotated with the time and the status (whether motion is detected at each moment).
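A condensed sketch of the detection loop follows. The webcam index and the motion-pixel cutoff are stand-ins, and the recording logic (timestamped folders, the 30-second timeout) is elided.

```python
import cv2

cap = cv2.VideoCapture(0)   # assumed webcam index
prev = None
consecutive = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)
    if prev is None:
        prev = gray
        continue
    # Background subtraction: absolute difference against the previous
    # frame, thresholded so every pixel becomes 0 or 255.
    delta = cv2.absdiff(prev, gray)
    mask = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
    mask = cv2.dilate(mask, None, iterations=2)   # for display only
    moving = cv2.countNonZero(mask) > 500         # hypothetical cutoff
    consecutive = consecutive + 1 if moving else 0
    if consecutive >= 5:
        pass  # motion confirmed: begin (or continue) saving the feed
    prev = gray
    cv2.imshow("mask", mask)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```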
Libraries Used
- OpenCV, NumPy
4. Data Exploration and Modelling
4.1. Traffic Congestion Prediction
Product
A program that predicts the traffic congestion at more than 10,000 intersections in 4 major American cities. It is estimated that the model would achieve results in the 25th percentile of the Kaggle leaderboard. Please see the modelling notebook for an explanation of this estimate.
Data
The 20th, 50th, and 80th percentiles of the following two traffic congestion metrics were used:
- The total time spent waiting at an intersection.
- The distance between the center of the intersection and the place at which the vehicle first stops.
These metrics are aggregated by city, intersection, month, hour of the day, entrance and exit orientation, and several other variables.
Goal
Given a training set with all variables and a test set with only the aggregation variables, predict each of the six metric-percentile pairs on the test set.
Approach
Weakly predictive variables such as coordinates and entrance and exit orientation are used to engineer strongly predictive variables such as distance from the city center and direction of turn. Dummy encoding the more than 10,000 unique intersections would not fit into memory, so this important feature is kept in the training set by using a gradient-boosted trees library with a novel encoding method for high-cardinality categorical variables. Hyperparameter tuning and early stopping ensure that the model fits the data well.
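The sketch below shows the encoding strategy in miniature: CatBoost accepts the raw intersection ID as a categorical feature, avoiding a dummy encoding with over 10,000 columns. The file and column names are hypothetical stand-ins for the competition data.

```python
import pandas as pd
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")   # hypothetical file and column names
features = ["IntersectionId", "City", "Hour", "Month",
            "DistanceToCenter", "TurnDirection"]
cat_cols = ["IntersectionId", "City", "TurnDirection"]
target = "TotalTimeStopped_p50"    # one of the six metric-percentile pairs

X_tr, X_val, y_tr, y_val = train_test_split(
    train[features], train[target], test_size=0.2, random_state=0)

# CatBoost encodes high-cardinality categoricals internally (ordered
# target statistics), so IntersectionId never becomes 10,000+ columns.
model = CatBoostRegressor(iterations=2000, learning_rate=0.05, verbose=200)
model.fit(X_tr, y_tr, cat_features=cat_cols,
          eval_set=(X_val, y_val), early_stopping_rounds=50)
```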
Libraries Used
- CatBoost, Pandas, NumPy, scikit-learn, seaborn
4.2. Public Transit Optimization for the Greater Toronto Area
The Problem
Imagine Jack and Jill, starting from different locations, have a shared destination. Jill is planning to drive, and agrees to pick Jack up at a transit stop on the condition that she does not have to go any further out of her way than strictly necessary. Which stop should Jack travel to?
The Solution
This program answers the above question within 10 seconds. It gives the correct answer as long as Jack’s starting location is within range of the Toronto public transit system.
Data
Relevant data is extracted from a set of 7 CSV files made publicly available on the Toronto Open Data website.
Approach
The data is organized into Pandas dataframes and used to create a NetworkX graph of the 9,155 stops and 213 routes. With this graph representation in hand, Google API use is limited to a single call per query.
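A hedged sketch of the graph construction, assuming GTFS-style files with standard column names (the actual Toronto files may differ slightly):

```python
import networkx as nx
import pandas as pd

stops = pd.read_csv("stops.txt")            # stop_id, stop_name, stop_lat, stop_lon
stop_times = pd.read_csv("stop_times.txt")  # trip_id, stop_id, stop_sequence

G = nx.Graph()
for _, row in stops.iterrows():
    G.add_node(row["stop_id"], name=row["stop_name"],
               pos=(row["stop_lat"], row["stop_lon"]))

# Connect consecutive stops on each trip so that routes become edges.
ordered = stop_times.sort_values(["trip_id", "stop_sequence"])
for _, trip in ordered.groupby("trip_id"):
    ids = trip["stop_id"].tolist()
    G.add_edges_from(zip(ids, ids[1:]))

# Candidate pickup stops for Jack can then be found with graph search,
# e.g. nx.single_source_shortest_path_length(G, jack_nearest_stop),
# leaving only a single Google API call to score Jill's detour.
```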
Libraries Used
- NetworkX, Pandas, NumPy
4.3. Contrasting Hockey Players’ and Coaches’ Speech Patterns
Findings and Product
- The average sentiment of players is more positive than that of coaches, and this difference is statistically significant.
- The difference between coaches’ and players’ average selfishness is not statistically significant.
- The variance in both sentiment and selfishness is greater for players than it is for coaches.
- A model classifying speech as from a coach or from a player is trained and achieves an F1 score of 0.969 on the test set.
Data
See “National Hockey League Interview Transcripts” in the Data Collection and Cleaning section of this page.
Approach
Sentiment is scored using the AFINN sentiment lexicon, in which each word is assigned a score between -5 and +5. Selfishness is scored with the same approach but with a simple lexicon of my own creation, in which first-person singular and first-person plural pronouns are assigned scores of +1 and -1, respectively. Both sentiment and selfishness are normalized by the number of words in the interview response. The player-vs-coach classifier is a logistic regression model with TF-IDF features as input.
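The scoring logic in miniature (the selfishness lexicon below is an abbreviated version of the +1/-1 scheme described above, and the classifier is shown on toy data):

```python
from afinn import Afinn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

afinn = Afinn()
SELF_LEXICON = {"i": 1, "me": 1, "my": 1, "mine": 1, "myself": 1,
                "we": -1, "us": -1, "our": -1, "ours": -1, "ourselves": -1}

def scores(response):
    """Per-word sentiment and selfishness for one interview response."""
    words = response.lower().split()
    sentiment = afinn.score(response) / len(words)
    selfishness = sum(SELF_LEXICON.get(w, 0) for w in words) / len(words)
    return sentiment, selfishness

# Player-vs-coach classifier: TF-IDF features into logistic regression.
texts = ["well you know we battled hard out there tonight",
         "we need better puck management in the neutral zone"]
labels = ["player", "coach"]   # toy examples, not the real dataset
clf = LogisticRegression().fit(TfidfVectorizer().fit_transform(texts), labels)
```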
Relevant Links
- ASAP Sports, the sports interview aggregation site
- Kaggle Kernel, published alongside my dataset
Libraries Used
- AFINN, NLTK, Pandas, NumPy, seaborn, matplotlib, scikit-learn
5. Data Collection and Cleaning
5.1. Web Browser Activity Tracker
Product
A Python script that tracks daily Firefox browser use, saving the information to a JSON file. An accompanying program reads the saved JSON file and organizes it into a Pandas dataframe. It currently plots a bar graph of a use metric for each website, but the data can be analyzed in any desired way. The following use metrics are currently supported:
- Number of visits to each website
- Length of time that each webpage was open
Data
As Firefox runs, it stores the current session’s information (e.g. each open tab’s title and URL) in an LZ4-compressed JSON file. This file contains no historical data; it only describes the tabs open at the time of the update. The frequency at which the file is updated can vary, but it is usually every 5 seconds.
Approach
The Python script reads the LZ4 file at a predefined frequency, 1 Hz by default. It compares the current reading to the previous one, noting the current time and the webpages that have appeared or disappeared. In doing so it compiles a history of when the user opened and closed each visited webpage.
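A sketch of one polling step appears below. It assumes the session file follows Firefox’s mozLz4 convention (an 8-byte magic header in front of a standard LZ4 block) and a simplified tab structure; the exact path and JSON layout vary by profile and Firefox version.

```python
import json
import time
import lz4.block   # pip install lz4

SESSION_FILE = "recovery.jsonlz4"   # path varies by Firefox profile

def read_open_urls(path):
    with open(path, "rb") as f:
        raw = f.read()
    # Skip the 8-byte "mozLz40\0" magic; the rest is a size-prefixed
    # LZ4 block that lz4.block.decompress handles directly.
    data = json.loads(lz4.block.decompress(raw[8:]))
    urls = set()
    for window in data.get("windows", []):
        for tab in window.get("tabs", []):
            urls.add(tab["entries"][-1]["url"])   # most recent entry per tab
    return urls

previous = set()
while True:
    current = read_open_urls(SESSION_FILE)
    opened, closed = current - previous, previous - current
    # ...timestamp and record `opened`/`closed` here...
    previous = current
    time.sleep(1)   # 1 Hz polling, as described above
```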
Libraries Used
- Pandas, NumPy, seaborn, matplotlib
5.2. National Hockey League Interview Transcripts
Product
A CSV file with columns for the two teams in the Stanley Cup Final (team1 and team2), the interview date, the interviewee’s name and job type (player, coach, or other), and the interview transcript.
Approach
The website is organized as sport -> year -> date -> interview page. BeautifulSoup is used to crawl the bottom three levels, gathering the relevant information along the way. A data cleaning script settles naming inconsistencies (such as Mike Babcock vs. Coach Babcock) and determines job type. If a job type or name cannot be inferred from the data alone, the script asks the user for the relevant input.
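The crawling pattern, sketched below; the URL fragments and link filters are illustrative rather than taken from the site’s actual markup.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "http://www.asapsports.com"   # the interview aggregation site

def links_on(url, keyword):
    """Return absolute URLs of links on `url` whose href contains `keyword`."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)
            if keyword in a["href"]]

# Descend the date -> interview levels, collecting transcript text.
# "show_event" and "show_interview" are hypothetical href markers.
for date_url in links_on(BASE, "show_event"):
    for interview_url in links_on(date_url, "show_interview"):
        page = BeautifulSoup(requests.get(interview_url).text, "html.parser")
        transcript = page.get_text(separator=" ", strip=True)
        # ...parse interviewee name, job type, and teams from `transcript`...
```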
Relevant Links
- ASAP Sports, the sports interview aggregation site
- Project GitHub repository
Libraries Used
- Beautiful Soup, NLTK, Pandas
5.3. Medical Mask Donation Hubs
Product
A web scraping script that creates a CSV file with a row for each entry and a column for each field on a medical mask donation hub aggregation website. The fields describe the hub locations and the types of donations they accept. The program was completed as part of an Upwork project proposal.
Libraries Used
- Beautiful Soup, Selenium, Pandas, NumPy
6. Statistics
6.1. Inference on the Boston Housing Dataset Using Linear Regression
Findings
- One can say with 95% confidence that for a fixed percentage of the population that is lower status, an increase of 1 in the average number of rooms per house results in an increase of between 5.44% and 13.20% in the median house value.
- One can say with 95% confidence that for a fixed average number of rooms per house, an increase of 1 percent in the percentage of the population that is lower status results in a decrease of between 0.47% and 0.56% in the median house value.
- Given the effects due to the percentage of the population that is lower status, the average number of rooms has a statistically significant effect on the log of the median house value.
- Given the effects due to the average number of rooms, the logarithm of the percentage of the population that is lower status has a statistically significant effect on the log of the median house value.
Data
The Boston Housing Dataset contains medians, means, and proportions of various attributes of 506 different segments of the city of Boston. The median house value is used as the target variable.
Approach
The assumptions of linear regression are validated, allowing the confidence intervals enumerated above to be constructed and the hypothesis tests to be conducted.
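For concreteness, here is how one such interval can be computed from first principles. The usage comment assumes the fitted model is log(MEDV) ~ RM + log(LSTAT); exponentiating the interval endpoints converts a coefficient on the log scale into the percent changes quoted above.

```python
import numpy as np
from scipy import stats

def coef_ci(X, y, j, level=0.95):
    """t-based confidence interval for coefficient j of an OLS fit.
    X must already include an intercept column."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                     # residual variance
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[j, j])  # std. error of beta_j
    t = stats.t.ppf((1 + level) / 2, df=n - p)
    return beta[j] - t * se, beta[j] + t * se

# Usage (hypothetical arrays): X columns are [1, RM, log(LSTAT)], y = log(MEDV).
# lo, hi = coef_ci(X, y, j=1)   # the RM coefficient
# print(f"{(np.exp(lo) - 1) * 100:.2f}% to {(np.exp(hi) - 1) * 100:.2f}%")
```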
Libraries Used
- SciPy, Pandas, NumPy, seaborn, matplotlib
7. Blogging
Medium posts explaining elements of certain projects.
7.1. NHL Player Chatbot
- Selected by Medium curators for distribution in the site’s AI and Machine Learning sections
- Published on Analytics Vidhya’s Medium page
- Medium link
7.2. A Quantitative Study of NHL Interviews
- Published on Analytics Vidhya’s Medium page
- Medium link
8. Honorable Mentions
Data wrangling, feature engineering, and model fine-tuning with two classic Kaggle datasets.
8.1. Kaggle House Prices
8.2. Titanic Classification
9. Recommended Links
- Alan Perlis’ “Epigrams on Programming”
- Datasets Subreddit
- Our World In Data