Problem set 1
For all questions, please explain, step by step, all your computations and R commands.

1. (5 points) Computations
Calculate the following sum by implementing it as a for() loop in R: $\sum_{i=10}^{99} (i^2 + i^3)$

2. (15 points) Random Variable Generation
For the following function: $f(x) = \exp\left(-\frac{(x-1)^2}{2x}\right) \frac{x+1}{12}$
2.1. Plot the graph of the function for x in [0.001, 20].
2.2. Generate more than 1000 sample points in [0.001, 20] for a random variable whose density is f(x), and plot its density.

3. (10 points) Data Split
The dataset mtcars has 32 rows.
Randomly divide the dataset mtcars into 4 folds so that each fold has exactly 8 rows, and calculate the summary statistics (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) for each fold.

4. (10 points) Working with Character Vectors
Use the function paste() to create the following character vectors of length 30:
4.1. (“label 1”, “label 2”, …, “label 30”). Note that there is a single space between label and the number following.
4.2. (“fn1”, “fn2”, …, “fn30”). In this case, there is no space between fn and the number following.

5. (10 points) Understanding Vectorized Instructions and Quirkiness of R
Execute the following line on the R console. What do you get? Can you explain, step by step, how R is interpreting the command?
1:10 > 5

6. (20 points) Data Split Quality
Load the mushroom5_1.csv data. We treat the column “class” as the dependent variable and the other columns as independent variables. For each column, do the following:
• split the data using the values in the column
• calculate the entropy for the column “class” and find the information gain for the split (using 0*log(0) = 0)
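The entropy and information-gain computation described in the bullets above can be sketched as follows. This is a minimal sketch on a small made-up data frame, not on mushroom5_1.csv. The helper names entropy and info_gain and the toy columns a and b are assumptions introduced for illustration, and log base 2 is assumed because the problem does not fix the base.

```r
# Entropy of a categorical vector (log base 2 is an assumption).
entropy <- function(y) {
  p <- table(y) / length(y)   # class proportions
  p <- p[p > 0]               # drop empty classes: convention 0 * log(0) = 0
  -sum(p * log2(p))
}

# Information gain obtained by splitting `y` on the values of `x`.
info_gain <- function(y, x) {
  groups  <- split(y, x, drop = TRUE)            # one subset of y per value of x
  weights <- sapply(groups, length) / length(y)  # share of rows in each subset
  entropy(y) - sum(weights * sapply(groups, entropy))
}

# Toy data (hypothetical, for illustration only).
toy <- data.frame(
  class = c("e", "e", "p", "p"),
  a     = c("x", "x", "y", "y"),  # separates the two classes perfectly
  b     = c("x", "y", "x", "y")   # carries no information about class
)

# Information gain of every column other than "class".
gains <- sapply(setdiff(names(toy), "class"),
                function(col) info_gain(toy$class, toy[[col]]))
print(gains)
```

On the toy data, column a yields the larger gain because each of its values contains a single class, while splitting on b leaves the class mix unchanged.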
Then find which column’s split has the maximal information gain for the column “class”.

7. (30 points) Exploratory Data Analysis (EDA)
A common business problem that we discussed in class is customer churn: existing customers that leave for the competition. Please download churn.csv data.
• Load the data set into R.
• Explore the general structure of the data: How many rows and columns does the data have?
• Create a new variable IncomeGroup that classifies users by income as <$35k; $35k-$45k; $45k-$65k; $65k-$100k; >$100k. Use the function cut() to divide the data into intervals.
• Plot the distribution of OVERAGE for each of the income groups.
• The dataset has a variable that captures whether a certain customer canceled his/her cellphone contract (variable LEAVE). Explore the data with regard to LEAVE decisions and make at least one visual and one quantitative comparison across income groups and other variables to find out who is most likely to leave.
• Create one or two metrics/measures/statistics that summarize the data. Examples of potential metrics include min, max, mean, and variance (standard deviation), and these can be calculated across various user segments. Be selective. Think about what might be most important to track.
• You should aim to include 2-3 plots or tables in your submission. Describe and interpret any patterns you find with a short paragraph.
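
As an illustration of the cut() step and the per-segment statistics mentioned in the bullets above, here is a minimal sketch. It runs on made-up numbers rather than churn.csv. The column name INCOME, the sample values, the right-closed intervals, and the reading of the last group as "above $100k" are all assumptions, since the problem names only the variables OVERAGE and LEAVE.

```r
# Hypothetical data standing in for churn.csv (the column INCOME is an assumption).
toy <- data.frame(
  INCOME  = c(20000, 40000, 50000, 80000, 150000, 30000),
  OVERAGE = c(0, 50, 120, 200, 10, 75)
)

# Break points follow the income groups listed in the problem;
# intervals are right-closed by default, e.g. (35000, 45000].
toy$IncomeGroup <- cut(
  toy$INCOME,
  breaks = c(-Inf, 35000, 45000, 65000, 100000, Inf),
  labels = c("<$35k", "$35k-$45k", "$45k-$65k", "$65k-$100k", ">$100k")
)

# Number of rows per income group.
print(table(toy$IncomeGroup))

# Mean OVERAGE per income group, one example of a per-segment statistic.
mean_overage <- tapply(toy$OVERAGE, toy$IncomeGroup, mean)
print(mean_overage)
```

The same grouping factor can then be passed to a plotting function such as boxplot() to compare distributions across groups.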