Building Your Own Functions in R

You can make your own functions in R to perform any calculations or tasks you might wish. This can be done with the keyword function.

Let's consider a simple example. Suppose we want to create a function to calculate the two real solutions (assuming they exist) to an arbitrary quadratic $ax^2 + bx + c = 0$. We would like our function to take in a vector of three values (i.e., $a$, $b$, and $c$) and produce a vector of two values that represent our solutions.

A function that does this is defined below. Take a look, and then we'll discuss the various parts involved with its definition...

solve.quadratic = function(coefficients) {
  a = coefficients[1]
  b = coefficients[2]
  c = coefficients[3]
  discriminant = sqrt(b^2-4*a*c)
  solution1 = (-b-discriminant)/(2*a)
  solution2 = (-b+discriminant)/(2*a)
  return (c(solution1,solution2))
}

If you wish to be able to use this function later, you can make R aware of it in one of two ways. First, you can copy it into a script file, select all of it, and either click the run "Run" button, or hit ⌘-enter (OS X) or Ctrl-Enter (windows), as you would to evaluate any expression. Alternatively, you can type the whole function directly in the console and hit enter.

Once the function has been evaluated, you can use/invoke it whenever you like. As an example of using the function above:

> solutions = solve.quadratic(c(1,5,6))
> solutions
[1] -3 -2

In the first line of the first code box above, solve.quadratic is the name we have given to this function. This is the name with which we will invoke it later -- as demonstrated by what is typed in the second code box immediately above.

Then in the second line of the first code box above comes the assignment operator. "=", followed by the keyword "function", and a list of parameters we want this function to use inside parentheses. Here, we require only a single parameter -- a vector that inside this function will be referred to as coefficients.

Then comes the body of the function, delimited by curly braces. Everything inside the body of the function will execute every time the function is invoked. Here, we simply create a couple of more conveniently named variables to help calculate the two solutions.

Last, but not least, we see a return statement. The value returned -- here, that would be c(solution1,solution2) -- is the value the function will take on when invoked later. This is seen to be the case when we type solutions in the last line of the second code box above, to display its elements.

Multiple Parameters

A function can have more than one parameter if desired. One needs only separate them by commas inside the parentheses. This allows us to write the function above in a different way. Before, we lumped the three values $a$, $b$, and $c$ into a single vector that we could pass as the lone input to our function. While this is a useful technique when writing a function that takes as input a vector representing some data (potentially a large amount of data) -- for a function that relies on just a few values, it might be better to give each input its own name, as shown below:

solve.quadratic = function(a,b,c) {
  discriminant = sqrt(b^2-4*a*c)
  solution1 = (-b-discriminant)/(2*a)
  solution2 = (-b+discriminant)/(2*a)
  return (c(solution1,solution2))
}

This time, we invoke the function in the following way:

> solve.quadratic(1,5,6)
[1] -3 -2

Identifying Parameters at Invocation

When you have more than one parameter for a function, you may find it useful to identify parameters by name when you invoke the function, as shown below. In this way, you can list the parameters in whatever order you wish (even backwards):

> solve.quadratic(c=6,b=5,a=1)
[1] -3 -2

This is particularly helpful when there are many parameters and it is difficult to remember which one comes before another.

Parameters with Default Values

One has the option to give parameters default values in a function's definition -- whereby one only has to provide values for those parameters not using their default values.

As an example, consider the definition of the root function below that computes a square root by default, unless one provides an index parameter. If an index is provided, that root is calculated instead (e.g., if the index is 3, a cube root should be calculated):

root = function(x,index=2) {
   return(x^(1/index))
}
And below, we invoke the function in a couple of different ways:
> root(25)
[1] 5
> root(index=3,8)
[1] 2

Scope

When writing functions, it is important to understand the concept of variable scope.

Variables that are seen only within the body of a function are called local variables and these are created when the function is invoked and disappear when the function returns.

Variables created outside of functions are called global variables and can be referenced both outside and inside functions.

Importantly, when a function is invoked, the values supplied for its parameters are copied to the corresponding local variables. When these local variables are modified inside the function, it does not affect any global variables used in the function's invocation.

See if you can follow all the different assignments below and make sense of the output shown in the following example:

x = 1           # <-- this is a global variable
d = c(1,2,3)    # <-- this is a global variable
y = 15          # <-- this is a global variable

mystery = function(x,d,e) { # <-- Inside this function x, d, and e
                            #     are variables local to the mystery() function. 
                            #
                            #     In particular, when x & d are now used inside
                            #     this function, they will refer to new and different
                            #     variables than the global variables x & d defined
                            #     earlier.
                            #
                            #     The values given to this new x & d are copies of the 
                            #     values given to them at invocation (see below).
                            #
                            #     On the other hand, if y is used inside this function
                            #     it refers to the global y variable defined earlier
  
  x = x + y        # <-- Here, the local x given at invocation and the global y are 
                   #     added together and their sum is stored in the local x. 
  print(x)         #     Then, this local x is printed.
  
  d[2]=5           # <-- Here, the local vector d's second element is assigned the value 5
  print(d)         #     and the local d is printed
  
  print(e+y)       # <-- Here, the sum of the local e given at invocation and the 
                   #     global y are summed and printed
  
  y = 200          # <-- Curiously, this statement has no real effect.  This actually
                   #     creates a local y -- but as the function is about to end, its
                   #     value disappears before it is used.  If one had wanted to assign
                   #     a value to the global y, one needs to use a special operator
                   #     called the "superassignment operator, <<-"
                   
  y <<- 300        # <-- Here, we use the superassignment operator just mentioned to
                   #     actually change the global variable y.

}

mystery(7,c(8,9,10),100)
print(x)
print(d)
print(y)
The first three assignments above (i.e., the initial assignments of the global variables x,d, and y) and the definition of the function mystery() produce easily predictable outputs. Starting with the function's invocation, predicting what R will do requires a bit more thinking. Below are the corresponding outputs:
> mystery(7,c(8,9,10),100)
[1] 22                         # <-- printed local x equal to 7 + 15
[1]  8  5 10                   # <-- printed local d with 2nd element changed to 5
[1] 115                        # <-- printed sum of local e (i.e., 100) and global y (15)
> print(x)                            
[1] 1                          # <-- printed global x, which was not changed by mystery() 
> print(d)
[1] 1 2 3                      # <-- printed global d, also not changed by mystery() 
> print(y)
[1] 300                        # <-- printed global y, explicitly changed by mystery()

Applying a function to every element of a vector

One can use the sapply(v,f) to apply a function $f$ to every element of a vector $v$. As an example, suppose one has defined the function given below:
square = function(n) {
  return(n^2)
}
We can apply this function to every integer from 1 to 10 with the following:
sapply(1:10,square)
 [1]   1   4   9  16  25  36  49  64  81 100
Of course, the following also works in this case:
square(1:10)
 [1]   1   4   9  16  25  36  49  64  81 100
Given this last example, one might wonder why R provides sapply() at all. The answer to that questions lies in the fact that some operations (like exponentiation via the "^") can be applied to any vector of values and work just fine -- but others don't.

While we will save discussing the particulars of it for a later time, the use of an if-conditional statement counts itself among these things that cause problems when applied to vectors of length greater than one.

As an example, consider the function below that returns the value $n/2$ if $n$ is even, and $(3n+1)$ otherwise:

collatz = function(n) {
  if (n %% 2 == 0) {
    return(n/2)
  } else {
    return(3*n+1)
  }
}

Note what happens when we try to apply the function to the vector 1:10 directly:

collatz(1:10)

> collatz(1:10)
 [1]  4  7 10 13 16 19 22 25 28 31
Warning message:
In if (n%%2 == 0) { :
  the condition has length > 1 and only the first element will be used
versus using sapply() to do the same:
sapply(1:10,collatz)
 [1]  4  1 10  2 16  3 22  4 28  5
Only in the second case was the output what we actually wanted. Looks like sapply() might be valuable after all!