An analysis of the Bike Share data using R programming.
Ifeoma Ojialor 10/16/2020
In this project, we will use a bike-sharing dataset to create machine
learning models. Before moving forward, I will briefly explain the
bike-sharing system and how it works. A bike-sharing system is a service
in which users can rent/use bicycles on a short term basis for a fee.
The goal of these programs is to provide affordable access to bicycles
for short distance trips as opposed to walking or taking public
transportation. Imagine how many people use these systems on a given
day, the numbers can vary greatly based on some elements. The goal of
this project is to build a predictive model to find out the number of
people that use these bikes in a given time period using available
information about that time/day. This in turn, can help businesses that
oversee this systems to manage them in a cost efficient manner.
We will be using the bike-sharing dataset from the UCL Machine Learning
Repository. We will use the regression and boosted tree method to model
the response variable cnt
.
First we will read in the data using a relative path.
#read in data and filter to desired weekday
day1 <- read.csv("Bike-Sharing-Dataset/day.csv")
head(day1,5)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 6 0 2
## 2 2 2011-01-02 1 0 1 0 0 0 2
## 3 3 2011-01-03 1 0 1 0 1 1 1
## 4 4 2011-01-04 1 0 1 0 2 1 1
## 5 5 2011-01-05 1 0 1 0 3 1 1
## temp atemp hum windspeed casual registered cnt
## 1 0.344167 0.363625 0.805833 0.160446 331 654 985
## 2 0.363478 0.353739 0.696087 0.248539 131 670 801
## 3 0.196364 0.189405 0.437273 0.248309 120 1229 1349
## 4 0.200000 0.212122 0.590435 0.160296 108 1454 1562
## 5 0.226957 0.229270 0.436957 0.186900 82 1518 1600
Next, we will remove the casual and registered variables since the
cnt
variable is a combination of both.
day1 <- select(day1, -casual, -registered)
day1$weekday <- as.factor(day1$weekday)
levels(day1$weekday) <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
day <- filter(day1, weekday == params$days)
#Check for missing values
miss <- data.frame(apply(day,2,function(x){sum(is.na(x))}))
names(miss)[1] <- "missing"
miss
## missing
## instant 0
## dteday 0
## season 0
## yr 0
## mnth 0
## holiday 0
## weekday 0
## workingday 0
## weathersit 0
## temp 0
## atemp 0
## hum 0
## windspeed 0
## cnt 0
There are no missing values in the dataset, so we can continue with our analysis.
#Change the variables into their appropriate format.
day$season <- as.factor(day$season)
day$weathersit <- as.factor(day$weathersit)
day$holiday <- as.factor(day$holiday)
day$workingday <- as.factor(day$workingday)
day$yr <- as.factor(day$yr)
day$mnth <- as.factor(day$mnth)
levels(day$season) <- c("winter", "spring", "summer", "fall")
levels(day$yr) <- c("2011", "2012")
str(day)
## 'data.frame': 104 obs. of 14 variables:
## $ instant : int 4 11 18 25 32 39 46 53 60 67 ...
## $ dteday : chr "2011-01-04" "2011-01-11" "2011-01-18" "2011-01-25" ...
## $ season : Factor w/ 4 levels "winter","spring",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : Factor w/ 2 levels "2011","2012": 1 1 1 1 1 1 1 1 1 1 ...
## $ mnth : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ weekday : Factor w/ 7 levels "Sunday","Monday",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ workingday: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ weathersit: Factor w/ 3 levels "1","2","3": 1 2 2 2 2 1 1 1 1 1 ...
## $ temp : num 0.2 0.169 0.217 0.223 0.192 ...
## $ atemp : num 0.212 0.191 0.232 0.235 0.235 ...
## $ hum : num 0.59 0.686 0.862 0.617 0.83 ...
## $ windspeed : num 0.1603 0.1221 0.1468 0.1298 0.0532 ...
## $ cnt : int 1562 1263 683 1985 1360 1530 1815 1450 1851 2133 ...
The cnt
is the response variable, so we’ll use a histogram to get a
visual understanding of the variable.
ggplot(day, aes(x = cnt)) + theme_bw() + geom_histogram(aes(y =..density..), color = "black", fill = "white", binwidth = 1000) + geom_density(alpha = 0.2, fill = "blue") + labs(title = "Count Density", x = "Count", y = "Density")
summary(day$cnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 683 3579 4576 4511 5769 7767
From the histogram and summary statistics output, it is pretty evident that the count of total rental bikes are in the sub 5000 range. We will investigate if there is a relationship between the response variable and other relevant predictor variables in the next section. Lets look at the other variables individually.
#visualize numeric predictor variables using a histogram
p1 <- ggplot(day) + geom_histogram(aes(x = temp), fill = "red", binwidth = 0.03)
p2 <- ggplot(day) + geom_histogram(aes(x = atemp), fill = "red", binwidth = 0.03)
p3 <- ggplot(day) + geom_histogram(aes(x = hum), fill = "red", binwidth = 0.025)
p4 <- ggplot(day) + geom_histogram(aes(x = windspeed), fill = "red", binwidth = 0.03)
gridExtra::grid.arrange(p1,p2,p3,p4, nrow = 2)
Observations:
* No clear cut pattern in temp
and atemp
.
hum
appears to be skewed to the left when the dataset is not
filtered to a specific weekday.
windspeed
appears to be skewed(right). This variable should be
transformed to curb its skewness.
The distribution of temp
and atemp
looks very similar. We should
think about taking out one of the variables.
#visualize categorical predictor variables
h1 <- ggplot(day) + geom_bar(aes(x = season),fill = "pink")
h2 <- ggplot(day) + geom_bar(aes(x = yr),fill = "pink")
h3 <- ggplot(day) + geom_bar(aes(x = holiday),fill = "pink")
h4 <- ggplot(day) + geom_bar(aes(x = workingday),fill = "pink")
h5 <- ggplot(day) + geom_bar(aes(x = mnth),fill = "pink")
h6 <- ggplot(day) + geom_bar(aes(x = weathersit),fill = "pink")
gridExtra::grid.arrange(h1,h2,h3,h4,h5,h6, nrow = 3)
Observations:
* The variation between the four seasons is little to none.
About the same number of people rode bikes in 2011 and 2012.
Many people rode bikes on days that are not holidays.
Most people used the bike-sharing system on days that were neither weekends nor holidays.
Most people used the bike sharing system on days with clear weather.
In this section, we will explore the predictor variables with respect to the response variable. The objective is to discover hidden relationships between the independent and response variables and use those findings in the model building process.
# First, we will explore the relationship between the target and numerical variables.
p1 <- ggplot(day) +geom_point(aes(x = temp, y = cnt), colour = "violet") + labs(title = "Normalized Temperature vs Total Rental Bikes")
p2 <- ggplot(day) +geom_point(aes(x = atemp, y = cnt), colour = "#FF99CC") +labs(title = "Normalized Feeling Temperature vs Total Rental Bikes")
p3 <- ggplot(day) +geom_point(aes(x = hum, y = cnt), colour = "pink") + labs(title = "Normalized Humidity vs Total rental Bikes")
p4 <- ggplot(day) +geom_point(aes(x = windspeed, y = cnt), colour = "#FF66CC") +labs(title= "Normalized Windspeed vs Total rental Bikes")
gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2)
Observations:
* There appears to be a positive linear relationship between cnt
,
temp
, and atemp
.
cnt
, hum
, and
windspeed
.# Now we'll visualize the relationship between the target and categorical variables.
# Instead of using a boxplot, I will use a violin plot which is the blend of both a boxplot and density plot
g1 <- ggplot(day) + geom_col(aes(x = yr, y = cnt, fill = season))+theme_bw()
g2 <- ggplot(day) + geom_violin(aes(x = yr, y = cnt))+theme_bw()
g3 <- ggplot(day) + geom_col(aes(x = mnth, y = cnt, fill = season))+theme_bw()
g4 <- ggplot(day) + geom_col(aes(x = holiday, y = cnt, fill = season)) + theme_bw()
g6 <- ggplot(day) + geom_col(aes(x = workingday, y = cnt, fill = season))
g7 <- ggplot(day) + geom_col(aes(x = weathersit, y = cnt, fill = season))
gridExtra::grid.arrange(g1, g2, g3, nrow = 2)
gridExtra::grid.arrange(g4, g6, g7, nrow = 2)
Observations:
* The total bike rental count is higher in 2012 than 2011.
During workingday, the bike rental counts quite the highest compared to during no working day for different seasons.
During clear,partly cloudy weather, the bike rental count is highest and the second highest is during mist cloudy weather and followed by third highest during light snow and light rain weather.
The highest bike rental count was during the summer and lowest in the winter.
Correlation matrix helps us to understand the linear relationship between variables.
day_c <- day[ , c(10:14)]
round(cor(day_c), 2)
## temp atemp hum windspeed cnt
## temp 1.00 1.00 0.11 -0.09 0.63
## atemp 1.00 1.00 0.12 -0.12 0.64
## hum 0.11 0.12 1.00 -0.10 -0.18
## windspeed -0.09 -0.12 -0.10 1.00 -0.15
## cnt 0.63 0.64 -0.18 -0.15 1.00
From the above matrix, we can see that temp
and atemp
are highly
correlated. So we only need to include one of these variables in the
model to prevent multicollinearity. We will also transform the humidity
and windspeed variable.
day <- mutate(day, log_hum = log(day$hum+1))
day <- mutate(day, log_ws = log(day$windspeed + 1))
#Remove irrelevant variables
day <- select(day, -weekday,-holiday,-workingday,-dteday,-temp, -instant)
First we split the data into train and test sets.
set.seed(23)
dayIndex<- createDataPartition(day$cnt, p = 0.7, list=FALSE)
dayTrain <- day[dayIndex, ]
dayTest <- day[-dayIndex, ]
# Build a tree-based model using loocv;
fitTree <- train(cnt~ ., data = dayTrain, method = "rpart",
preProcess = c("center", "scale"),
trControl = trainControl(method = "loocv", number = 10), tuneGrid = data.frame(cp = 0.01:0.10))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
# Build a boosted tree model using cv
fitBoost <- train(cnt~., data = dayTrain, method = "gbm",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneGrid = expand.grid(n.trees=c(10,20,50,100,500,1000),shrinkage=c(0.01,0.05,0.1,0.5),n.minobsinnode =c(3,5,10),interaction.depth=c(1,5,10)))
# Adding a linear regression model part 2!
FitLinear <- train(cnt~ atemp + mnth*season, data = dayTrain, method = "lm", trControl = trainControl(method = "cv", number = 10))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
# Display information from the tree fit
fitTree$results
## cp RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.01 823.6188 NaN 823.6188 658.3374 NA 658.3374
# Display information from the boost fit
fitBoost$results
## shrinkage interaction.depth n.minobsinnode n.trees RMSE
## 1 0.01 1 3 10 1716.5908
## 7 0.01 1 5 10 1717.0477
## 13 0.01 1 10 10 1716.3381
## 55 0.05 1 3 10 1530.0472
## 61 0.05 1 5 10 1511.4705
## 67 0.05 1 10 10 1512.9832
## 109 0.10 1 3 10 1307.5567
## 115 0.10 1 5 10 1340.6528
## 121 0.10 1 10 10 1307.6531
## 163 0.50 1 3 10 848.1331
## 169 0.50 1 5 10 864.4649
## 175 0.50 1 10 10 883.0600
## 19 0.01 5 3 10 1655.3577
## 25 0.01 5 5 10 1660.7433
## 31 0.01 5 10 10 1694.9238
## 73 0.05 5 3 10 1310.6083
## 79 0.05 5 5 10 1338.8179
## 85 0.05 5 10 10 1452.5004
## 127 0.10 5 3 10 1093.6222
## 133 0.10 5 5 10 1078.3550
## 139 0.10 5 10 10 1251.7282
## 181 0.50 5 3 10 887.4160
## 187 0.50 5 5 10 853.1468
## 193 0.50 5 10 10 937.2025
## 37 0.01 10 3 10 1655.3585
## 43 0.01 10 5 10 1666.9992
## 49 0.01 10 10 10 1693.4027
## 91 0.05 10 3 10 1323.5993
## 97 0.05 10 5 10 1355.0405
## 103 0.05 10 10 10 1453.2335
## 145 0.10 10 3 10 1071.1757
## 151 0.10 10 5 10 1100.2979
## 157 0.10 10 10 10 1231.7675
## 199 0.50 10 3 10 940.4929
## 205 0.50 10 5 10 736.3302
## 211 0.50 10 10 10 985.8810
## 2 0.01 1 3 20 1665.9342
## 8 0.01 1 5 20 1661.8524
## 14 0.01 1 10 20 1662.3235
## 56 0.05 1 3 20 1322.9027
## 62 0.05 1 5 20 1332.0855
## 68 0.05 1 10 20 1308.2128
## 110 0.10 1 3 20 1100.2823
## 116 0.10 1 5 20 1095.9149
## 122 0.10 1 10 20 1120.3766
## 164 0.50 1 3 20 749.2292
## 170 0.50 1 5 20 801.4227
## 176 0.50 1 10 20 816.5135
## 20 0.01 5 3 20 1562.4757
## 26 0.01 5 5 20 1564.3416
## 32 0.01 5 10 20 1629.9705
## 74 0.05 5 3 20 1075.5512
## 80 0.05 5 5 20 1101.6318
## 86 0.05 5 10 20 1249.3434
## 128 0.10 5 3 20 861.0841
## 134 0.10 5 5 20 859.0220
## 140 0.10 5 10 20 1053.6645
## 182 0.50 5 3 20 884.6596
## 188 0.50 5 5 20 832.7330
## 194 0.50 5 10 20 935.9977
## 38 0.01 10 3 20 1554.6247
## 44 0.01 10 5 20 1573.5106
## 50 0.01 10 10 20 1628.6780
## 92 0.05 10 3 20 1077.9416
## 98 0.05 10 5 20 1108.9181
## 104 0.05 10 10 20 1249.3974
## 146 0.10 10 3 20 858.6573
## 152 0.10 10 5 20 889.8509
## 158 0.10 10 10 20 1017.0416
## 200 0.50 10 3 20 928.5215
## 206 0.50 10 5 20 767.4830
## 212 0.50 10 10 20 924.7482
## 3 0.01 1 3 50 1519.4490
## 9 0.01 1 5 50 1513.4075
## 15 0.01 1 10 50 1517.2969
## 57 0.05 1 3 50 1031.2226
## 63 0.05 1 5 50 1046.4894
## 69 0.05 1 10 50 1034.2262
## 111 0.10 1 3 50 896.7500
## 117 0.10 1 5 50 852.8522
## 123 0.10 1 10 50 903.7798
## 165 0.50 1 3 50 757.9963
## 171 0.50 1 5 50 788.0966
## 177 0.50 1 10 50 824.9130
## 21 0.01 5 3 50 1331.9673
## 27 0.01 5 5 50 1339.9759
## 33 0.01 5 10 50 1460.8342
## 75 0.05 5 3 50 798.1827
## 81 0.05 5 5 50 812.3188
## 87 0.05 5 10 50 963.8011
## 129 0.10 5 3 50 723.6393
## 135 0.10 5 5 50 724.1141
## 141 0.10 5 10 50 849.9416
## 183 0.50 5 3 50 925.4825
## 189 0.50 5 5 50 894.7305
## 195 0.50 5 10 50 903.7591
## 39 0.01 10 3 50 1305.8857
## 45 0.01 10 5 50 1341.0549
## 51 0.01 10 10 50 1452.3778
## 93 0.05 10 3 50 800.1248
## Rsquared MAE RMSESD RsquaredSD MAESD
## 1 0.4995247 1368.2560 278.4846 0.29188269 236.2993
## 7 0.5034043 1366.1694 276.5985 0.34151759 234.3472
## 13 0.5470584 1367.0568 279.8480 0.34528524 238.3186
## 55 0.6166238 1220.3976 228.8579 0.21549712 215.7865
## 61 0.5992804 1209.8016 251.7771 0.28780666 221.8344
## 67 0.6267791 1208.1030 241.1058 0.22451341 219.1175
## 109 0.6361756 1038.1264 195.3830 0.23470612 188.4539
## 115 0.6458185 1072.0875 202.6734 0.28481117 163.6836
## 121 0.6904525 1059.0336 228.3313 0.24671766 182.9455
## 163 0.7938250 675.1701 174.6852 0.08808613 109.4318
## 169 0.8009097 692.2895 262.9275 0.11670843 230.9146
## 175 0.7741463 692.5080 211.2900 0.15123919 142.9257
## 19 0.7425823 1321.0026 278.9219 0.21361378 237.3641
## 25 0.7225502 1323.1130 286.8718 0.19414590 243.3129
## 31 0.6793973 1348.2000 289.5358 0.27114180 246.2144
## 73 0.7362373 1052.7441 261.0868 0.19630044 227.4761
## 79 0.7300301 1064.4333 247.3542 0.21577047 219.4944
## 85 0.7013240 1159.0097 217.8897 0.25431391 193.4974
## 127 0.7521239 871.8920 221.4919 0.18764654 180.5949
## 133 0.7692609 857.5000 237.2030 0.16105545 207.2112
## 139 0.7134378 1005.4486 242.5242 0.22269217 219.2442
## 181 0.7725663 704.6625 374.8409 0.14334905 299.9728
## 187 0.7545219 666.0393 269.6432 0.24660022 162.3346
## 193 0.7459773 739.6750 275.6533 0.18985839 224.2260
## 37 0.7709712 1317.7848 283.4699 0.19130424 242.1085
## 43 0.7343074 1329.8123 280.7197 0.20484303 237.4894
## 49 0.6667386 1346.4833 280.9182 0.24807336 240.2880
## 91 0.7149839 1054.5835 249.6020 0.20867830 225.5004
## 97 0.7212097 1086.9977 228.5808 0.17955438 209.6899
## 103 0.6784243 1166.3735 252.9842 0.22357194 220.3575
## 145 0.7298770 853.1199 230.7811 0.24156721 184.2407
## 151 0.7519276 895.2388 189.9199 0.17009430 174.0389
## 157 0.7230562 992.9730 249.7142 0.18731228 209.7406
## 199 0.7287976 725.6525 308.0989 0.19079608 208.4010
## 205 0.8251560 569.2180 272.7229 0.14815968 199.6474
## 211 0.7464707 803.0327 217.5783 0.15914438 192.7326
## 2 0.5144040 1329.6452 260.7201 0.29083374 220.9358
## 8 0.5515374 1326.8092 265.5388 0.30395359 229.3872
## 14 0.5905884 1326.2697 267.4346 0.31016651 234.5227
## 56 0.6835973 1057.0767 194.1938 0.18820118 209.4190
## 62 0.6623531 1071.1566 205.3567 0.23536442 202.3980
## 68 0.7201917 1050.2668 194.9380 0.22671296 174.7627
## 110 0.7112104 859.7503 155.6109 0.17062018 129.1280
## 116 0.7420594 878.4130 195.3607 0.19536596 167.4214
## 122 0.7227394 885.3423 183.6312 0.17762823 164.2852
## 164 0.8377712 590.3547 183.0580 0.07870399 118.7724
## 170 0.8213271 626.7663 217.4401 0.09485754 194.5677
## 176 0.7962767 662.5631 219.5165 0.14881362 166.2876
## 20 0.7387506 1251.5901 268.2870 0.21153241 234.9512
## 26 0.7354499 1246.1535 276.9063 0.16626374 238.1250
## 32 0.6697558 1298.7687 277.1584 0.24435081 237.4882
## 74 0.7694359 860.5797 200.6024 0.16002615 175.3659
## 80 0.7406642 880.5964 214.7738 0.20928639 183.1864
## 86 0.7129254 999.6647 193.9955 0.22391792 171.8131
## 128 0.8046880 665.9934 235.2078 0.15135911 172.7503
## 134 0.8171326 654.6261 176.2806 0.13068185 120.5231
## 140 0.7318833 834.4031 220.8784 0.19749617 190.2966
## 182 0.7816048 689.7437 319.5899 0.14308295 221.5091
## 188 0.7619017 679.2801 248.1547 0.22243894 151.4112
## 194 0.7414056 732.0173 251.6836 0.18668582 186.9716
## 38 0.7611641 1240.0997 266.1051 0.20623334 228.9780
## 44 0.7411338 1257.4322 269.5754 0.18932820 230.7784
## 50 0.6730501 1298.0420 270.2845 0.23813045 235.9130
## 92 0.7596934 843.3668 205.7324 0.16210039 175.2430
## 98 0.7549935 897.3562 212.5258 0.15429599 198.9881
## 104 0.7188121 1003.5041 224.8050 0.19279223 194.4018
## 146 0.7903858 667.0352 191.8491 0.16715315 139.1503
## 152 0.7981431 692.0509 199.2165 0.14628422 154.1627
## 158 0.7532367 817.5209 210.1581 0.18775919 181.4655
## 200 0.7447641 721.9185 290.8893 0.18056998 208.2906
## 206 0.8174779 597.5451 277.7737 0.16796263 188.9898
## 212 0.7807533 737.2905 272.5805 0.13296872 208.6624
## 3 0.6093349 1209.6385 223.5396 0.26879744 203.6401
## 9 0.6237480 1207.9983 236.3507 0.24241346 216.2193
## 15 0.6435731 1212.8435 242.8390 0.25804638 220.2799
## 57 0.7476101 801.7754 163.5269 0.16440046 129.3925
## 63 0.7408354 821.2109 165.2304 0.17818732 138.1103
## 69 0.7433222 808.0047 181.4002 0.18590576 156.2358
## 111 0.7675169 664.1607 184.6435 0.16417663 105.0252
## 117 0.8096277 667.2055 192.9701 0.13283204 136.8722
## 123 0.7888526 692.6520 212.4543 0.13310100 180.5477
## 165 0.8487365 597.9470 153.2532 0.10358372 107.6930
## 171 0.8289969 612.0995 229.9321 0.10010066 179.5903
## 177 0.7988055 681.7760 254.9745 0.14930621 194.3396
## 21 0.7485089 1069.3849 228.4671 0.21869937 202.7998
## 27 0.7524890 1072.5057 250.7276 0.18131553 214.3675
## 33 0.6959177 1170.3438 250.8293 0.23394611 221.3799
## 75 0.8206071 616.8297 191.9819 0.12882282 127.8895
## 81 0.8071795 610.4984 226.5508 0.15536400 145.3822
## 87 0.7688499 763.1171 206.8758 0.18228194 164.8269
## 129 0.8421719 556.5137 230.3419 0.13755253 128.2761
## 135 0.8494594 561.3162 180.3276 0.10199378 118.9437
## 141 0.8042901 654.4629 212.0457 0.13670224 158.6837
## 183 0.7598067 708.3258 291.6380 0.15103441 161.3153
## 189 0.7278948 696.8230 246.1292 0.22838821 151.9918
## 195 0.7660891 685.3364 277.9059 0.19146498 218.5333
## 39 0.7749215 1045.4187 232.5509 0.17923736 195.1204
## 45 0.7647839 1079.3463 246.3002 0.17330809 215.2293
## 51 0.6955528 1159.2167 231.8086 0.22104986 213.9103
## 93 0.8231709 604.4619 170.8831 0.11666692 111.0394
## [ reached 'max' / getOption("max.print") -- omitted 116 rows ]
# Display information from the linear model fit
FitLinear$results
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 1557.86 0.3838726 1302.344 275.088 0.2097322 262.3488
Now, we make predictions on the test data sets using the best model fits. Then we compare RMSE to determine the best model.
predTree <- predict(fitTree, newdata = select(dayTest, -cnt))
postResample(predTree, dayTest$cnt)
## RMSE Rsquared MAE
## 1021.7688382 0.7457244 783.3663881
boostPred <- predict(fitBoost, newdata = select(dayTest, -cnt))
postResample(boostPred, dayTest$cnt)
## RMSE Rsquared MAE
## 742.9536892 0.8598832 568.6585069
linearPred <- predict(FitLinear, newdata = select(dayTest, -cnt))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
postResample(linearPred, dayTest$cnt)
## RMSE Rsquared MAE
## 1915.0286349 0.1673958 1652.6410116
When we compare the two models, the boosted tree model has lower RMSE values when applied on the test dataset. Hence, the boosted tree model is our final model and best model for interpreting the bike rental count on a daily basis.