STAT 19000: Project 1 — Spring 2021

Motivation: In this course we require the majority of project submissions to include a compiled PDF, a .Rmd file based off of our template, and a code file (a .R file if the project is in R, a .py file if the project is in Python). Although RStudio makes it easy to work with both Python and R, there are occasions where working out a Python problem in a Jupyter Notebook could be convenient. For that reason, we will introduce Jupyter Notebook in this project.

Context: This is the first in a series of projects that will introduce Python and its tooling to students.

Scope: jupyter notebooks, rstudio, python

Learning objectives

Use Jupyter Notebook to run Python code and create Markdown text.
Use RStudio to run Python code and compile your final PDF.
Gain exposure to Python control flow and reading external data.

Make sure to read about, and use the template found here, and the important information about projects submissions here.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/open_food_facts/openfoodfacts.tsv

Questions

Question 1

Navigate to notebook.scholar.rcac.purdue.edu/ and sign in with your Purdue credentials (without BoilerKey). This is an instance of Jupyter Notebook. The main screen will show a series of files and folders that are in your $HOME directory. Create a new notebook by clicking on New > f2020-s2021.

Change the name of your notebook to "LASTNAME_FIRSTNAME_project01" where "LASTNAME" is your family name, and "FIRSTNAME" is your given name. Try to export your notebook (using the File dropdown menu, choosing the option Download as), what format options (for example, .pdf) are available to you?

f2020-s2021 is the name of our course notebook kernel. A notebook kernel is an engine that runs code in a notebook. ipython kernels run Python code. f2020-s2021 is an ipython kernel that we’ve created for our course Python environment, which contains a variety of compatible, pre-installed packages for you to use. When you select f2020-s2021 as your kernel, all of the packages in our course environment are automatically made available to you.

If the kernel f2020-s2021 does not appear in Jupyter Notebooks, you can make it appear as follows:

Login to rstudio.scholar.rcac.purdue.edu
Click on Tools > Shell… (in the menu)
In the shell (terminal looking thing that should say something like: bash-4.2$), type the following followed by Enter/Return: /class/datamine/apps/runme
Then click on Session > Restart R (in the menu) You should now have access to the course kernel named f2020-s2021 in notebook.scholar.rcac.purdue.edu

Items to submit

A list of export format options.

Question 2

Each "box" in a Jupyter Notebook is called a cell. There are two primary types of cells: code, and markdown. By default, a cell will be a code cell. Place the following Python code inside the first cell, and run the cell. What is the output?

from thedatamine import hello_datamine
hello_datamine()

You can run the code in the currently selected cell by using the GUI (the buttons), as well as by pressing Ctrl+Return/Enter.

Items to submit

Output from running the provided code.

Question 3

Jupyter Notebooks allow you to easily pull up documentation, similar to ?function in R. To do so, use the help function, like this: help(my_function). What is the output from running the help function on hello_datamine? Can you modify the code from question (2) to print a customized message? Create a new markdown cell and explain what you did to the code from question (2) to make the message customized.

Some Jupyter-only methods to do this are:

Click on the function of interest and type Shift+Tab or Shift+Tab+Tab.
Run function?, for example, print?.

You can also see the source code of a function in a Jupyter Notebook by typing function??, for example, print??.

Items to submit

Output from running the help function on hello_datamine.
Modified code from question (2) that prints a customized message.

Question 4

At this point in time, you’ve now got the basics of running Python code in Jupyter Notebooks. There is really not a whole lot more to it. For this class, however, we will continue to create RMarkdown documents in addition to the compiled PDFs. You are welcome to use Jupyter Notebooks for personal projects or for testing things out, however, we will still require an RMarkdown file (.Rmd), PDF (generated from the RMarkdown file), and .py file (containing your python code). For example, please move your solutions from Questions 1, 2, 3 from Jupyter Notebooks over to RMarkdown (we discuss RMarkdown below). Let’s learn how to run Python code chunks in RMarkdown.

Sign in to rstudio.scholar.rcac.purdue.edu (with BoilerKey). Projects in The Data Mine should all be submitted using our template found here or on Scholar (/class/datamine/apps/templates/project_template.Rmd).

Open the project template and save it into your home directory, in a new RMarkdown file named project01.Rmd. Prior to running any Python code, run datamine_py() in the R console, just like you did at the beginning of every project from the first semester.

Code chunks are parts of the RMarkdown file that contains code. You can identify what type of code a code chunk contains by looking at the engine in the curly braces "{" and "}". As you can see, it is possible to mix and match different languages just by changing the engine. Move the solutions for questions 1-3 to your project01.Rmd. Make sure to place all Python code in python code chunks. Run the python code chunks to ensure you get the same results as you got when running the Python code in a Jupyter Notebook.

Make sure to run datamine_py() in the R console prior to attempting to run any Python code.

The end result of the project01.Rmd should look similar to this.

Items to submit

project01.Rmd with the solutions from questions 1-3 (including any Python code in python code chunks).

Question 5

It is not a Data Mine project without data! [Here] (#p-csv-pkg) are some examples of reading in data line by line using the csv package. How many columns are in the following dataset: /class/datamine/data/open_food_facts/openfoodfacts.tsv? Print the first row, the number of columns, and then exit the loop after the first iteration using the break keyword.

You can get the number of elements in a list by using the len method. For example: len(my_list).

You can use the break keyword to exit a loop. As soon as break is executed, the loop is exited and the code immediately following the loop is run.

for my_row in my_csv_reader:
    print(my_row)
    break
print("Exited loop as soon as 'break' was run.")

'\t' represents a tab in Python.

If you get a Dtype warning, feel free to just ignore it.

Items to submit

Python code used to solve this problem.
The first row printed, and the number of columns printed.

Question 6 (OPTIONAL)

Unlike in R, where many of the tools you need are built-in (read.csv, data.frames, etc.), in Python, you will need to rely on packages like numpy and pandas to do the bulk of your data science work. {#p1-06}

In R it would be really easy to find the mean of the 151st column, caffeine_100g:

myDF <- read.csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t", quote="")
mean(myDF$caffeine_100g, na.rm=T) # 2.075503

If you were to try to modify our loop from question (5) to do the same thing, you will run into a myriad of issues, just to try and get the mean of a column. Luckily, it is easy to do using pandas:

import pandas as pd
myDF = pd.read_csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t")
myDF["caffeine_100g"].mean() # 2.0755028571428573

Take a look at some of the methods you can perform using pandas pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats]. Perform an interesting calculation in R, and replicate your work using pandas. Which did you prefer, Python or R?

Items to submit

R code used to solve the problem.
Python code used to solve the problem.