12 Exam 1 Practice

You should expect to be able to do 7 to 10 of these type problems for your exam (50 minutes).

  1. Load the rock data set. How many observations and variables are in the rock data set?

  2. What was the purpose of this dataset?

  3. What are the units of measurement for peri?

  4. In the rock data set what is the standard deviation of area? 2683.8488617

  5. How many rocks have an area of greater than 4333? 41

  6. How many rocks have an area of greater than 4333 and less than 7333? 14

  7. Create a variable called good.fun which is area divided by perimeter squared. What is the average of good.fun? 0.0020362

  8. Add a new observation that to this dataframe that has an area of 17777, peri of 7777, shape of 7777, and a perm of 7777. Can you do this?

  9. What is the mean of area for including the new rock you have added?

  10. Using the women data set, what is the average BMI for all women?

  11. What is the average BMI for women?

  12. What is the correlation between height and weight in the women dataset?

  13. Make a histogram plot of the area variable from rock.

  14. Use a bivariate plot of rock data to identify which variable are most correlated. Confirm your answer with a correlation table. Were you right?

  15. Do this math in R: 3333 squared divided by the square root of 78.

  16. Make this sequence using rep and seq functions: 1 2 3 1 2 3 1 2 3 1 2 3

  17. Make this sequence using rep and seq functions: 97, 96, 95, 94, 93, 92, 91, 90, 89, 97, 96, 95, 94, 93, 92, 91, 90, 89, 97, 96, 95, 94, 93, 92, 91, 90, 89

  18. Make this plot:
  19. What is the heaviest car in the mtcars dataset in kg?

  20. How many cars in mtcars dataset have more than 4 cylinders?

  21. Create this dataframe called Gerry:

  22. Add column called aa to Gerry which are the numbers 11120, 11121, 11122, 11123, 11124, 11125, 11126, 11127, 11128, 11129

  23. Have fun. You are doing awesome!

#Machine Learning?

Have we talked about the Target Pregnancy Story yet?

How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did

This is where the fun stuff begins! What we have learned up to this point has barely scratched the surface of what R is capable of. In the world of data science, R is used for three primary purposes, those purposes are (1) data transformation, (2) data wrangling, (3) machine learning. The other two purposes have been covered in earlier chapters of this book. The reason we covered the other topics first is that that lay the foundation. In the real world, it is likely you will never be given a clean data set, and you will have to do some wrangling and transformation before anything else is possible. After all, in the experience of many data science students, cleaning the data is the most tedious and time consuming process of a project.

Enough of the old stuff, what is machine learning? According to the Merriam-Webster Dictionary, machine learning is “the process by which a computer is able to improve its own performance by continuously incorporating new data into an existing statistical model.” Let’s take a trip back to the Target story discussed in the introductory chapter. The data scientist, or more likely data scientists (collaborative work is essential in this field), that worked on that model were likely experts in machine learning. They were able to train the computer to look through thousands (probably more) of customers’ data and the computer, based off the algorithms written by these “data nerds,” was able to predict whether a customer was pregnant! Think of other possible applications of this technology? We could predict how well a student will preform on an exam, the risk of someone suffering from a heart attack, or the likelihood that someone will default on a loan. Every field in existence today could find a way to implement machine learning to optimize their business.

There are two branches of machine learning, supervised and unsupervised. Both have their own unique uses, however in this course we will focus on supervised machine learning. Supervised machine learning required us to provide a clean data set with clearly defined variables and instructions. Essentially, we give the computer the information it needs and provide it with specific instructions detailing what we would like to see happen, and it does the rest. Linear regression is typically the first method of supervised learning people are introduced to, and it will be the focus of this chapter.

Note, these concepts are not all common sense and can be difficult to wrap your head around at times. Be sure to constantly turn to your instructor or peers for assistance and remember that there are hundreds of online resources at your disposal. As with anything though, practice makes perfect. The popular rule states that mastering a skill can take upwards of 10,000 hours! Now, this course is not going to take you 10 years to complete, however the goal is that by the end of this chapter you will know your way around the basics of linear regression.