I can’t currently find the source of data but it is an extensive set made of multiple tables for years (1950-2017)
x = c(rep(1,10), 1:5, 1:5, rep(1,10), 9:13, rep(11,10), 9:10)
y <- c(1:10, rep(10, 5), rep(7,5), 1:10, rep(1, 5), 1:10, 8:9)
plot(x,y, pch=19, xlab="", ylab = "", xaxt="n", yaxt="n")
qplot(x= x, y=y)
Data is broken down into multiple csv files which will need to be joined on common parameters to gain any insights: * Circuits * Lap Times * Races * Drivers * Driver Standings * Results * More used for reference
circuits = read.csv("Data(1950_2017)/circuits.csv")
lapTimes = read.csv("Data(1950_2017)/lapTimes.csv", stringsAsFactors = FALSE)
races = read.csv("Data(1950_2017)/races.csv", stringsAsFactors = FALSE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
## embedded nul(s) found in input
results = read.csv("Data(1950_2017)/results.csv")
drivers = read.csv("Data(1950_2017)/drivers.csv")
driverStandings = read.csv("Data(1950_2017)/driverStandings.csv")
Let’s start by exploring the Drivers dataset.
# What data types are the features in this set?
sapply(drivers, class)
## driverId driverRef number code forename surname
## "integer" "factor" "integer" "factor" "factor" "factor"
## dob nationality url
## "factor" "factor" "factor"
F1 is scored in points, looking at which drivers have the most all-time points. Keep in mind this is dating back to 1950. I’m not familiar if the way Grand Prix are scored has changed. It’s safe to say that for our purpose of predicting future grand prix winners, we will narrow our focus to current drivers. However out of curiousity, let’s look at the all time high scorers.
drivers_ = merge(drivers, driverStandings, by="driverId")
driverPoints = drivers_ %>%
group_by(driverId, driverRef, dob) %>%
summarize(points = sum(points))
ggplot(data = driverPoints, aes(x=driverId,y=points, label=driverRef)) + geom_point()
Too much, cut off at 10,000 points and label
fil = as_tibble(driverPoints) %>%
select(driverId, points, driverRef) %>%
group_by(driverId, driverRef) %>%
summarise(points = sum(points))
fil = fil[fil$points > 10000,]
ggplot(data = fil, aes(x=driverId, y = points, label = driverRef)) + geom_point() + geom_label()
Not a very good visual, we can better compare the drivers points with a bar/histogram style
ggplot(data = fil, aes(x=reorder(driverRef, points), y=points, fill=points)) + geom_bar(stat="sum") + ggtitle("Top drivers by points") +
xlab("Driver Name") + ylab("Points") +
scale_x_discrete(labels = function(labels) {
fixedLabels <- c()
for(n in 1:length(labels)) {
fixedLabels[n] <- paste0(ifelse(n %% 2 ==0, "" ,"\n"), labels[n])
}
return(fixedLabels)
})
Are all of these drivers still alive and competing?
summary(races)
## raceId year round circuitId
## Min. : 1 Min. :1950 Min. : 1.000 Min. : 1.00
## 1st Qu.: 250 1st Qu.:1974 1st Qu.: 4.000 1st Qu.: 9.00
## Median : 499 Median :1990 Median : 8.000 Median :18.00
## Mean : 500 Mean :1989 Mean : 8.234 Mean :21.76
## 3rd Qu.: 748 3rd Qu.:2005 3rd Qu.:12.000 3rd Qu.:30.00
## Max. :1009 Max. :2018 Max. :21.000 Max. :73.00
## name date time url
## Length:997 Length:997 Length:997 Length:997
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
Let’s do some EDA on the circuits. Take a look at which races have the shortest lap time.
driver_laps <- merge(races, lapTimes, by = "raceId")
head(driver_laps)
## raceId year round circuitId name date time.x
## 1 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## 2 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## 3 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## 4 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## 5 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## 6 1 2009 1 1 Australian Grand Prix 2009-03-29 06:00:00
## url driverId lap position
## 1 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix 7 5 15
## 2 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix 7 6 16
## 3 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix 7 7 16
## 4 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix 7 8 16
## 5 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix 7 9 16
## 6 http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix 7 10 16
## time.y milliseconds
## 1 1:32.565 92565
## 2 1:33.294 93294
## 3 1:30.891 90891
## 4 1:31.645 91645
## 5 1:52.298 112298
## 6 1:40.160 100160
# Only looking at last 10 years
driver_laps <- driver_laps[driver_laps$year > 2008,]
#Add driver name to set
driver_laps <- merge(driver_laps, drivers[,1:2], by = "driverId")
length(unique(driver_laps$circuitId))
## [1] 26
length(unique(driver_laps$raceId))
## [1] 173
temp = driver_laps %>%
group_by(driverRef, circuitId, raceId, name)
prix_list = races[races$year > 2015,]
prix_list = unique(prix_list$name)
temp = temp[as.character(temp$name) %in% prix_list,]
unique(temp$circuitId)
## [1] 73 18 11 10 20 6 15 3 24 4 9 1 7 17 2 70 32 12 13 14 71 22 69
for(i in 1:length(prix_list)){
set = temp %>%
group_by(driverRef) %>%
filter(driverRef %in% driverId_vec) %>%
filter(name == prix_list[i]) %>%
summarise(avg = mean(milliseconds))
driverLapTimePlot = ggplot(data = set, aes(x= reorder(driverRef,avg), y = avg, fill = avg)) + geom_bar(stat="sum") + ggtitle(prix_list[i]) + scale_x_discrete(labels = function(labels){
fixedLabels = c()
for(n in 1:length(labels)){
fixedLabels[n] = paste0(ifelse(n %% 2 == 0, "", "\n"), labels[n])
}
return(fixedLabels)
}
)
print(driverLapTimePlot)
}
## Warning: Removed 836 rows containing non-finite values (stat_sum).
## Warning: Removed 836 rows containing non-finite values (stat_sum).
## Warning: Removed 836 rows containing non-finite values (stat_sum).
## Warning: Removed 838 rows containing non-finite values (stat_sum).
## Warning: Removed 842 rows containing non-finite values (stat_sum).
summary(results)
## resultId raceId driverId constructorId
## Min. : 1 Min. : 1.0 Min. : 1.0 Min. : 1.00
## 1st Qu.: 5945 1st Qu.:273.0 1st Qu.: 55.0 1st Qu.: 6.00
## Median :11889 Median :478.0 Median :154.0 Median : 25.00
## Mean :11889 Mean :487.2 Mean :226.5 Mean : 46.28
## 3rd Qu.:17833 3rd Qu.:718.0 3rd Qu.:314.0 3rd Qu.: 57.00
## Max. :23781 Max. :988.0 Max. :843.0 Max. :210.00
##
## number grid position positionText
## Min. : 0.00 Min. : 0.00 Min. : 1.000 R :8517
## 1st Qu.: 7.00 1st Qu.: 5.00 1st Qu.: 4.000 F :1368
## Median : 15.00 Median :11.00 Median : 7.000 3 : 986
## Mean : 16.97 Mean :11.27 Mean : 7.782 4 : 986
## 3rd Qu.: 23.00 3rd Qu.:17.00 3rd Qu.:11.000 2 : 984
## Max. :208.00 Max. :34.00 Max. :33.000 5 : 982
## NA's :6 NA's :10550 (Other):9954
## positionOrder points laps time
## Min. : 1.00 Min. : 0.000 Min. : 0.00 :17773
## 1st Qu.: 7.00 1st Qu.: 0.000 1st Qu.: 20.00 +8:22.19: 5
## Median :13.00 Median : 0.000 Median : 52.00 +1:29.6 : 4
## Mean :13.08 Mean : 1.601 Mean : 45.27 0.7 : 4
## 3rd Qu.:19.00 3rd Qu.: 1.000 3rd Qu.: 66.00 46.2 : 4
## Max. :39.00 Max. :50.000 Max. :200.00 5.7 : 4
## (Other) : 5983
## milliseconds fastestLap rank fastestLapTime
## Min. : 1474899 Min. : 2.00 Min. : 0.0 :18394
## 1st Qu.: 5442948 1st Qu.:29.00 1st Qu.: 5.0 01:17.2: 28
## Median : 5859428 Median :44.00 Median :11.0 01:16.5: 27
## Mean : 6303313 Mean :41.06 Mean :10.6 01:16.8: 26
## 3rd Qu.: 6495440 3rd Qu.:53.00 3rd Qu.:16.0 01:28.0: 26
## Max. :15090540 Max. :78.00 Max. :24.0 01:19.1: 24
## NA's :17774 NA's :18394 NA's :18246 (Other): 5252
## fastestLapSpeed statusId
## :18394 Min. : 1.00
## 189.423: 3 1st Qu.: 1.00
## 195.933: 3 Median : 11.00
## 196.785: 3 Mean : 18.24
## 200.091: 3 3rd Qu.: 16.00
## 201.33 : 3 Max. :136.00
## (Other): 5368
summary(drivers)
## driverId driverRef number code forename
## Min. : 1.0 abate : 1 Min. : 2.00 :757 John : 14
## 1st Qu.:211.2 abecassis: 1 1st Qu.:10.25 BIA : 2 Mike : 14
## Median :421.5 acheson : 1 Median :21.50 HAR : 2 Peter : 13
## Mean :421.5 adamich : 1 Mean :30.50 MAG : 2 Bill : 11
## 3rd Qu.:631.8 adams : 1 3rd Qu.:38.25 VER : 2 Tony : 11
## Max. :843.0 ader : 1 Max. :99.00 ALB : 1 Bob : 10
## (Other) :836 NA's :804 (Other): 76 (Other):769
## surname dob nationality
## Taylor : 5 02/10/1921: 2 British :162
## Wilson : 4 04/09/1920: 2 American :157
## Brabham : 3 05/05/1932: 2 Italian : 99
## Brown : 3 06/06/1923: 2 French : 73
## Fittipaldi: 3 06/10/1918: 2 German : 49
## Hill : 3 12/12/1946: 2 Brazilian: 31
## (Other) :821 (Other) :830 (Other) :271
## url
## : 1
## http://en.wikipedia.org/wiki/%C3%89lie_Bayol : 1
## http://en.wikipedia.org/wiki/%C3%89ric_Bernard : 1
## http://en.wikipedia.org/wiki/%C3%89rik_Comas : 1
## http://en.wikipedia.org/wiki/%C3%93scar_Alfredo_G%C3%A1lvez: 1
## http://en.wikipedia.org/wiki/A.J._Foyt : 1
## (Other) :836
Let’s create our MASTER set here, containing all info we can foresee as necessary going forward.
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.