IMPORTANT NOTE: Most of the analysis done on this webpage can be found in the pdf report here

Problem Description

Transportation is the life-blood of civilization. For many people, the two largest purchases in their lives will be a house and a car. However, not all cars are built equally and, from personal experience, many car salesmen will not eagerly disclose these inequalities. Thus, I will be investigating the used car market for my project, thinking from the perspective of a prospective buyer and from a used car dealership.

It’s common knowledge that car prices depreciate at an exponential rate, and new cars begin to lose value the moment they are driven off of the lot. My analysis focuses on understanding factors that affect the decay, as well as factors that affect price in general.

How do factors such as year, mileage, wheel configuration, and body type influence the price of used cars?

In particular, I want to find actionable insights that can benefit people who are considering purchasing a vehicle, whether they want to save money, want to find a vehicle with a low depreciation rate, or to understand what is within their budget, given a set of desired properties.

Data description

I used a car listings dataset was sourced from this github page, and contains information for around 1000 unique Kijiji listings in the Greater Toronto Area. The data was acquired in 2019 through web scraping.

Using the brand, model and model year attributes from the car listings dataset, I attempted to scrape the market price for each car from MotorTrend.com, to augment the data. MotorTrend.com gives a “clean retail price” or “market price” for many cars, representing a “reasonable asking price” for cars with clean history, and no defects/damage.

Note that the market price was obtained in 2023, and may be lower than the listing prices obtained in 2019 due to depreciation. However, with prior knowledge that the depreciation rate should be fixed for different categories of cars and can be adjusted for, I believe relationships between these variables are still worth investigating.

Data cleaning summary

I removed columns that I’m not investigating, such as links and vehicle identification numbers, and converted the remaining columns to the correct data types. Cross-referencing with other sites, I believe MotorTrend market prices are given in CAD, so I did not perform any currency conversion.

I then removed listings with a listing price of 0. Kijiji sellers will often list a price of 0 or ‘Please Contact’ to attract buyers, which does not reflect the actual asking price.

Since market prices are scraped in 2023 and the dataset is from 2019, there is no reasonable way to impute the missing values in that column. Market price is an important variable in this investigation, so I decided to remove all rows with missing market prices.

I also created a new price range variable that indicates if the car mentioned in a listing had a market price below the 25th percentile, between the 25-75th percentile or above the 75th percentile, comparing to other recorded listings in the same year. This variable is intended to roughly indicate whether a vehicle is a “luxury vehicle” or not, and takes the values ‘low,’ ‘medium,’ and ‘high.’

Methods and results

I performed some data analysis and created some interactive visualizations, available on the data analysis page.

The visualizations and numerical analysis mostly suggest that most brands of cars decay at a similar exponential rate every year. In addition, we find that prices differ between types of cars and wheel configuration, and that certain types of cars such as trucks and sports cars may experience less price decay than normal cars due to the features they provide the buyer.

I also fit some models that attempt to predict the listing price of a car given other variables. This model is meant to find how much of the variation in listing prices can be explained using the data I collected and state-of-the-art prediction methods, such as random forests and gradient boosting. In reality, we would probably scrape present-day data and fit a new model if we wanted to make accurate predictions.

Visualizations/PDF report

Visualizations are on the “Methods and Results” tab.

To download the PDF report, click here (alternate link)

Code for the pdf report can be found here

Code for the visualization and the final project website can be found here