Here’s a classic question from discrete mathematics: how should one go about summing the integers from 1 to 1000? Using summation notation, we can express this sum as
\[\begin{equation} \sum_{i = 1}^{1000} i = 1+2 + \dots + 999 +1000. \end{equation}\]
Q: What does each part in the above formula correspond to?
There are two ways to go about evaluating this sum. First, you can always use brute force by listing the numbers you want to sum in a vector. In R, the colon operator writes the sequence from 1 to n.
V = 1:1000
Now you can just sum the elements by writing:
sum(V)
## [1] 500500
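As a small aside on the colon operator used above, here is a hedged sketch of how it behaves at the edges. The variable `n` and the comparison with `seq_len` are our own additions for illustration, not part of the original example.

```r
n = 5
1:n          # 1 2 3 4 5
seq_len(n)   # the same sequence

# The colon counts down when the right end is smaller than the left,
# which matters if n could ever be 0.
1:0          # 1 0
seq_len(0)   # integer(0), an empty sequence
```

For a fixed range like 1 to 1000 either form works equally well; `seq_len` is mostly a safeguard when the upper limit is computed.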
A simpler method comes from using a closed formula. You’ve hopefully encountered (and remembered!) the following formula in calculus when evaluating Riemann sums: \[\begin{equation} \sum_{i = 1}^{n} i = \frac{n(n+1)}2 \end{equation}\]
Q: Do you know how to prove this formula? If not, especially if you’re a computer science or math major, you should take a look at mathematical induction. The (most common) proof of the above formula is a textbook example of using induction. An alternate proof is given in the old chestnut about how Gauss summed 1 to 100 as a schoolboy by “folding” the list of numbers. Whether the story is actually true is unknown.
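The folding idea can be sketched in a few lines of R. This is only an illustration of the pairing argument for n = 100, with variable names (`low`, `high`, `pair_sums`) chosen here for convenience; it is not a proof for general n.

```r
n = 100
low  = 1:(n / 2)        # 1, 2, ..., 50
high = n:(n / 2 + 1)    # 100, 99, ..., 51

# Fold the list in half: each number is paired with its mirror image.
pair_sums = low + high
unique(pair_sums)             # 101, every pair has the same sum

# 50 pairs, each summing to n + 1
length(pair_sums) * (n + 1)   # 5050
```

Since there are n/2 pairs and each sums to n + 1, the total is n(n+1)/2, which matches the closed formula.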
The beauty of this closed formula is that you don’t need an entire list of numbers to find what you want. Here, we simply write
n = 1000
s = n*(n+1)/2
s
## [1] 500500
The above is a first example of computational complexity in algorithms. The first method requires adding up \(n\) numbers (technically, \(n-1\) additions), while the second needs only one addition, one multiplication, and one division, regardless of what \(n\) is! We would say that the first method is \(O(n)\) (“order \(n\)”) while the second is \(O(1)\) (“order \(1\)”). A common goal in developing algorithms is to make the number of operations as small as possible.
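To make the comparison concrete, here is a minimal sketch that writes both methods as functions. The names `sum_loop` and `sum_closed` are made up for this illustration. The loop performs one addition per term, so its work grows with n, while the closed form does the same fixed amount of arithmetic for every n.

```r
# Brute force: one addition per term, so roughly n operations.
sum_loop = function(n) {
  total = 0
  for (i in seq_len(n)) {
    total = total + i
  }
  total
}

# Closed form: one addition, one multiplication, one division.
sum_closed = function(n) n * (n + 1) / 2

# The two methods agree for several values of n.
for (n in c(10, 1000, 100000)) {
  stopifnot(sum_loop(n) == sum_closed(n))
}
sum_closed(1000)   # 500500
```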
What about summing the first 1000 squares? This would be given by the sum
\[\begin{equation} \sum_{i = 1}^{1000} i^2 = 1^2+2^2 + \dots + 999^2 +1000^2. \end{equation}\]
To list squares of integers, we could write something that would look quite ghastly to a mathematician:
s = (1:1000)**2
s[1:10]
## [1] 1 4 9 16 25 36 49 64 81 100
sum(s)
## [1] 333833500
Q: Why would such an expression raise eyebrows for a mathematician?
As with the first sum we looked at, we can use induction to show \[\begin{equation} \sum_{i = 1}^{n} i^2 = \frac{n(n+1)(2n+1)}6. \end{equation}\]
This is another closed form, meaning we have the \(O(1)\) solution
n = 1000
s = n*(n+1)*(2*n+1)/6
s
## [1] 333833500
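A numerical check is not a proof, but it is a quick way to catch a mistyped formula. The sketch below compares the brute force sum of squares with the closed form for a few values of n; the function name `sq_closed` and the test values in `ns` are our own choices.

```r
# Closed form for 1^2 + 2^2 + ... + n^2
sq_closed = function(n) n * (n + 1) * (2 * n + 1) / 6

ns = c(1, 2, 10, 1000)
brute  = sapply(ns, function(n) sum((1:n)^2))
closed = sq_closed(ns)

all(brute == closed)   # TRUE
```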
Matrices are the main currency of dealing with multiple dimensions. A fancy term for a matrix is a linear transformation, which means that matrices are functions that take in lists of numbers (vectors) and spit out another list of numbers. The linear part means that the function has very nice properties. I won’t go into the details now, but I strongly, strongly recommend that you bone up on linear algebra, regardless of your discipline. For now, we will essentially view matrices as “numbers in a box”, as much as this phrase is an insult to the discipline. Matrices have rows and columns, and each number we put in a matrix is called an entry. For instance, let’s define the matrix
\[\begin{equation} M = \begin{bmatrix}1 & 3 \\ 5 & 2 \\ 6 & 8 \\ \end{bmatrix} \end{equation}\]
The matrix \(M\) has three rows and two columns. The \(i,j\)th entry of \(M\), often written as \(M_{i,j}\), corresponds to the entry in the \(i\)th row and \(j\)th column. So entry \(M_{3,1} = 6\). In R, the matrix is written out as
M = matrix(c(1,5,6,3,2,8), 3,2)
M
##      [,1] [,2]
## [1,]    1    3
## [2,]    5    2
## [3,]    6    8
Note a few things here: the data is passed as a single vector in the first argument, and R fills the matrix column by column, so the first three entries form the first column and the next three form the second. The second and third arguments give the shape of the matrix (rows, then columns).
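If typing the data column by column feels unnatural, `matrix` also accepts a `byrow` argument. The sketch below builds the same matrix both ways; the names `M_col` and `M_row` are made up for this comparison.

```r
# Default: the vector fills the matrix column by column.
M_col = matrix(c(1, 5, 6, 3, 2, 8), nrow = 3, ncol = 2)

# byrow = TRUE: the vector fills the matrix row by row,
# so the data can be typed in the order it reads on the page.
M_row = matrix(c(1, 3, 5, 2, 6, 8), nrow = 3, ncol = 2, byrow = TRUE)

identical(M_col, M_row)   # TRUE
dim(M_col)                # 3 2  (rows, then columns)
M_col[3, 1]               # 6, the entry in row 3, column 1
```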
Summing matrices is what you would expect (element-wise addition). This only makes sense if the two matrices we’re adding have the same size:
M = matrix(c(1,5,6,3,2,8), 3,2)
N = matrix(c(2,1,4,3,6,-1), 3,2)
M+N
##      [,1] [,2]
## [1,]    3    6
## [2,]    6    8
## [3,]   10    7
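To see what happens when the sizes do not match, here is a hedged sketch that tries to add a 3 by 2 matrix to a 2 by 3 matrix. The name `N_wide` is ours, and `tryCatch` is used only so the script keeps running; in an interactive session you would simply see an error, typically one mentioning non-conformable arrays.

```r
M = matrix(c(1, 5, 6, 3, 2, 8), 3, 2)         # 3 rows, 2 columns
N_wide = matrix(c(2, 1, 4, 3, 6, -1), 2, 3)   # 2 rows, 3 columns

# Capture the error message instead of stopping the script.
result = tryCatch(M + N_wide, error = function(e) conditionMessage(e))
result
```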
Subtracting is essentially the same, but multiplication is not, at least in the traditional sense. For those who know how to properly multiply matrices, multiplication is given by the operator %*%.
M = matrix(c(1,5,6,3,2,8), 3,2)
N = matrix(c(2,1,4,3,6,-1), 2,3)
M %*% N
##      [,1] [,2] [,3]
## [1,]    5   13    3
## [2,]   12   26   28
## [3,]   20   48   28
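For readers who have not seen matrix multiplication before, the following sketch unpacks a single entry of the product above. Each entry is a row of the first matrix multiplied term by term with a column of the second, then summed, which is why the number of columns of M must equal the number of rows of N. The name `P` is our own.

```r
M = matrix(c(1, 5, 6, 3, 2, 8), 3, 2)    # 3 x 2
N = matrix(c(2, 1, 4, 3, 6, -1), 2, 3)   # 2 x 3

ncol(M) == nrow(N)   # TRUE, so the product is defined
P = M %*% N
dim(P)               # 3 3

# Entry (2, 3): row 2 of M combined with column 3 of N.
sum(M[2, ] * N[, 3])   # 5*6 + 2*(-1) = 28
P[2, 3]                # 28
```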
The “naive product” of multiplying elements entry by entry (known as the Hadamard product), denoted \(M \circ N\), is again what you’d expect:
M = matrix(c(1,5,6,3,2,8), 3,2)
N = matrix(c(2,1,4,3,6,-1), 3,2)
M*N
##      [,1] [,2]
## [1,]    2    9
## [2,]    5   12
## [3,]   24   -8
Thus far, we’ve been going through basic information on different datatypes in R. But recall, we’re in a data science class, so let’s take a look at some data. We’ll obtain data from many different sources, but for now, we will use built-in datasets provided by R. For our first dataset, we’ll be looking at the mtcars dataset.
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
There’s a bunch to take in here. It looks like each row is a car, and each column is a variable describing some feature of that car. What do the variables stand for? For built-in datasets, we can simply type in
?mtcars
On the bottom right of the RStudio window, we get a summary description of mtcars. This includes a quick description (we’re looking at cars from a 1974 Motor Trend magazine), what the variables stand for, and where the dataset comes from (the source).
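Printing a whole dataset gets unwieldy quickly, so it can help to look at just a few rows and some basic facts about its shape. The calls below are a short sketch using standard base R functions; showing three rows is an arbitrary choice.

```r
head(mtcars, 3)   # only the first three rows
dim(mtcars)       # 32 11: number of rows, then number of columns
names(mtcars)     # the variable (column) names
```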
Q: What kind of an object is mtcars?
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
The object type is a data frame, which is the main form of representing datasets in R, at least for the purposes of this class. Data frames allow each column to hold a different type of variable (although in this case, mtcars is all numeric variables).
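To illustrate columns of different types, here is a tiny data frame with invented values. The name `pets` and everything in it are hypothetical and exist only for this sketch. `stringsAsFactors = FALSE` is set explicitly so the text column stays a character vector in older versions of R as well.

```r
# A made-up data frame with one text, one numeric, and one logical column.
pets = data.frame(
  name   = c("Rex", "Tom", "Goldie"),
  legs   = c(4, 4, 0),
  indoor = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

sapply(pets, class)     # character, numeric, logical
sapply(mtcars, class)   # every column of mtcars is numeric
```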
To look at a specific variable, we use the accessor symbol $. So, for instance, if we’d like to look at the miles per gallon of each vehicle, we type
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
Q: What is the highest mpg for all the cars? Which car does it belong to?
This is found using the max and which.max functions.
max(mtcars$mpg)
## [1] 33.9
which.max(mtcars$mpg)
## [1] 20
While max gives the largest value, which.max gives what’s called the argmax, meaning the argument which provides the maximum value. In our case, the maximum occurs at the 20th element. To get a name, we type
rownames(mtcars)[which.max(mtcars$mpg)]
## [1] "Toyota Corolla"
Q: Why did we use brackets in one place and parentheses in others?
We can do the same thing for gears:
max(mtcars$gear)
## [1] 5
rownames(mtcars)[which.max(mtcars$gear)]
## [1] "Porsche 914-2"
Q: Notice anything fishy here? How can we fix it?
The problem here is that several cars have 5 gears, and which.max only reports the first position where the maximum occurs. How do we list all of them? One way is to index by a boolean vector:
b = mtcars$gear == max(mtcars$gear)
rownames(mtcars)[b]
## [1] "Porsche 914-2" "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [5] "Maserati Bora"
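The same boolean vector can do more than pick out row names. As a sketch, the lines below count how many cars match, find their positions, and pull the matching rows of the data frame directly; the choice of the `mpg` and `gear` columns is just for display.

```r
b = mtcars$gear == max(mtcars$gear)

sum(b)     # 5, since TRUE counts as 1
which(b)   # 27 28 29 30 31, the positions where b is TRUE

# Index the rows of the data frame with b and keep two columns.
mtcars[b, c("mpg", "gear")]
```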
Q: Return a list of cars ordered from the least mpg to the most mpg.
As we explore different packages for R (these are additional sets of tools we can download for various purposes), solving this question will be a one-liner. For now, we use the order function.
b = order(mtcars$mpg)
rownames(mtcars)[b][1:10]
## [1] "Cadillac Fleetwood" "Lincoln Continental" "Camaro Z28"
## [4] "Duster 360" "Chrysler Imperial" "Maserati Bora"
## [7] "Merc 450SLC" "AMC Javelin" "Dodge Challenger"
## [10] "Ford Pantera L"
Q: What does the order function do? How about the function?
Q: What if we wanted the list to be ordered from most mpg to least mpg?
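One way to explore these questions is to try `order` on a vector small enough to check by hand. The sketch below uses made-up values in `x`. It suggests that `order` returns positions rather than values, and that its `decreasing` argument reverses the direction.

```r
x = c(30, 10, 20)   # made-up values

order(x)            # 2 3 1: the smallest value sits at position 2, and so on
x[order(x)]         # 10 20 30: indexing by those positions sorts x

order(x, decreasing = TRUE)   # 1 3 2

# Applied to mtcars: the first car when ordering from most to least mpg.
rownames(mtcars)[order(mtcars$mpg, decreasing = TRUE)][1]   # "Toyota Corolla"
```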