Professor Kate Ashley MISM 6202
Foundations of Data Analysis for Business PROBLEM SET 2
Use the file ‘commutervan.csv’ for all questions. Refer to ‘commutervan_data_dictionary.csv’ for descriptions of the variables included in this dataset.
Commuter Van Express, Inc. (CVE) is a commuter shuttle service with operations in a large US city where public transportation is frequently delayed and overcrowded. The company operates a fleet of 14-passenger vans, equipped with WiFi and comfortable seats, to provide shuttle service along fixed routes between residential neighborhoods and the city center during commuting hours (running toward the city center in the morning and toward the residential areas in the evening). CVE’s customers use a website or mobile app to book a seat in one of the shuttles, thereby guaranteeing that space will be available. Vans run on a regular schedule along each route, and the app includes location tracking to provide users with real-time arrival information.
CVE uses analytics platforms to collect and organize data on its ongoing operations. One set of metrics includes ride volume (the number of rides actually taken each day) and revenues, which can vary per ride due to volume discounts on package purchases (e.g., 12 rides for the price of 10), a monthly subscription option, and various short-term coupons and promotions. Its platforms also track user activity in the mobile app, including actions such as starting a new session, tapping on a stop, and booking a ride.
It is April 1, 2016, and CVE has hired you as a consultant to help them understand their recent performance and develop a method to forecast future rides and revenues. To assist in your analysis, the company has provided you with daily data from its analytics platforms for the first quarter of 2016. The dataset has been reviewed by CVE’s analytics team and confirmed to be clean and free of errors.
Use RStudio to answer the following questions. Provide your written answers, along with any relevant tables and charts, in a single PDF file. Any charts included in your report should be properly labeled and formatted for an audience of company executives. Do not include R code in your PDF report. R Markdown is neither required nor suggested for this assignment. You should also submit a single .R script file with your code for the analysis.
Regression Analysis.
1. Because customers value flexibility in their commuting plans, CVE allows customers to cancel a booking without penalty up until the van they booked arrives at their chosen stop. As a result, not all ride bookings result in a ride actually taking place. Estimate a simple linear regression model to understand the relationship between daily bookings and daily completed rides. Report the estimated regression equation and R² value and interpret them in words.
2. CVE would like to know if ride bookings through the mobile app can be predicted using the actions that an app user may perform prior to booking: namely, starting a session, tapping on a stop, tapping on the sidebar, and viewing van ETAs. Estimate a multiple regression model that uses the relevant variables to predict ride bookings. Multiple models involving these variables are possible; select the best model and explain your choice, citing specific numerical evidence from the regression output. Report the estimated regression equation and R² value and interpret them in words.
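As a starting point for Q1, the following is a minimal R sketch of fitting a simple linear regression and extracting the coefficients and R². The data are simulated so that the snippet runs on its own, and the column names `bookings` and `rides_completed` are placeholders only; the actual names should be taken from ‘commutervan_data_dictionary.csv’.

```r
# Simulated stand-in data (hypothetical values and column names).
# In the assignment, load 'commutervan.csv' with read.csv() instead and
# use the variable names listed in the data dictionary.
set.seed(1)
demo <- data.frame(bookings = round(runif(60, 200, 400)))
demo$rides_completed <- round(0.9 * demo$bookings + rnorm(60, sd = 10))

# Simple linear regression: completed rides as a function of bookings
fit <- lm(rides_completed ~ bookings, data = demo)

coef(fit)                # intercept and slope of the estimated equation
summary(fit)$r.squared   # share of variation in completed rides explained
```

The same pattern carries over to a multiple regression such as Q2 by adding predictors to the formula (`y ~ x1 + x2`); `summary(fit)$adj.r.squared` then gives the adjusted R², which is one common basis for comparing models with different numbers of predictors.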
Forecasting.
3. Create a well-formatted and labeled scatter plot to visually inspect the ‘rides’ variable. Describe any trend and seasonality that appear to be present.
4. Construct a k-period simple moving average for the rides variable, where k is chosen based on your assessment of the seasonality patterns in the data. Explain your choice of k and report MSE, MAD, and MAPE for this forecasting model.
5. Estimate a linear trend model for the ‘rides’ variable. Report the estimated linear trend equation and the R² of the model, and interpret both the equation and the R² in words.
6. Estimate a linear trend model with day-of-week dummy variables for the ‘rides’ variable. Interpret both the estimated regression equation and the R² in words, and comment on the magnitude of the adjusted R² relative to the adjusted R² from the regression you performed in (5).
7. (a) Use the estimated regression equation from (6) to calculate a forecast of ‘rides’ for each day in your dataset. Calculate MSE, MAD, and MAPE for this forecast. Comment on which of the two forecasts you have calculated in this problem set (from Q4 and from this question) performs better, and explain why that method is better suited to this data.
(b) Use the estimated regression equation from (6), as applied in (7a), to forecast daily completed rides for each weekday in the next month (April 1 to April 29). Optional: Also forecast revenues for each day.
8. Write a concise but thorough 1-2 paragraph summary of the forecasting analysis you performed in this problem set, focusing on the most important findings. In other words, think about the work you did for Q3-Q7 and summarize what you would communicate to CVE to help them better understand their ridership data.
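For the accuracy measures requested in Q4 and Q7(a), the following minimal R sketch shows one way to build a k-period simple moving average forecast and compute MSE, MAD, and MAPE. It assumes the usual textbook definitions (mean squared error, mean absolute deviation of the forecast errors, and mean absolute percentage error). The series is simulated, and the value of `k` is arbitrary for illustration; neither is taken from the CVE dataset.

```r
# Simulated daily series with a repeating 4-period pattern (hypothetical values)
set.seed(2)
rides <- 300 + rep(c(15, 5, -5, -15), times = 15) + rnorm(60, sd = 5)

k <- 4  # illustrative window length; choose k from the pattern in the real data

# Trailing k-period averages: element i is the mean of rides[(i - k + 1):i].
# stats::filter is namespaced to avoid a clash with dplyr::filter if loaded.
sma <- as.numeric(stats::filter(rides, rep(1 / k, k), sides = 1))

# The forecast for day t is the average of the k days before t
n <- length(rides)
forecast <- c(NA, sma[-n])

err <- rides - forecast
ok  <- !is.na(err)   # the first k days have no forecast

mse  <- mean(err[ok]^2)                        # mean squared error
mad  <- mean(abs(err[ok]))                     # mean absolute deviation
mape <- mean(abs(err[ok] / rides[ok])) * 100   # mean absolute % error

c(MSE = mse, MAD = mad, MAPE = mape)
```

The same three lines that compute `mse`, `mad`, and `mape` can be reused for any other forecast vector, for example fitted values from a regression, as long as both forecasts are compared over days where each has a value.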