STAT 19000: Project 1 — Spring 2021
Motivation: In this course we require the majority of project submissions to include a compiled PDF, a .Rmd file based on our template, and a code file (a .R file if the project is in R, a .py file if the project is in Python). Although RStudio makes it easy to work with both Python and R, there are occasions when working out a Python problem in a Jupyter Notebook could be convenient. For that reason, we will introduce Jupyter Notebook in this project.
Context: This is the first in a series of projects that will introduce Python and its tooling to students.
Scope: jupyter notebooks, rstudio, python
Dataset
The following questions will use the dataset found in Scholar:
/class/datamine/data/open_food_facts/openfoodfacts.tsv
Questions
Question 1
Navigate to notebook.scholar.rcac.purdue.edu/ and sign in with your Purdue credentials (without BoilerKey). This is an instance of Jupyter Notebook. The main screen will show a series of files and folders that are in your $HOME directory. Create a new notebook by clicking on New > f2020-s2021.
Change the name of your notebook to "LASTNAME_FIRSTNAME_project01" where "LASTNAME" is your family name and "FIRSTNAME" is your given name. Try to export your notebook (using the File dropdown menu, choosing the option Download as). What format options (for example, .pdf) are available to you?
If the kernel f2020-s2021 does not appear in Jupyter Notebooks, you can make it appear as follows:
- Log in to rstudio.scholar.rcac.purdue.edu
- Click on Tools > Shell… (in the menu)
- In the shell (a terminal-like window that should say something like bash-4.2$), type the following, followed by Enter/Return: /class/datamine/apps/runme
- Then click on Session > Restart R (in the menu)
You should now have access to the course kernel named f2020-s2021 in notebook.scholar.rcac.purdue.edu
- A list of export format options.
Question 2
Each "box" in a Jupyter Notebook is called a cell. There are two primary types of cells: code, and markdown. By default, a cell will be a code cell. Place the following Python code inside the first cell, and run the cell. What is the output?
from thedatamine import hello_datamine
hello_datamine()
You can run the code in the currently selected cell by using the GUI (the buttons), as well as by using a keyboard shortcut.
- Output from running the provided code.
Question 3
Jupyter Notebooks allow you to easily pull up documentation, similar to ?function in R. To do so, use the help function, like this: help(my_function). What is the output from running the help function on hello_datamine? Can you modify the code from question (2) to print a customized message? Create a new markdown cell and explain what you did to the code from question (2) to make the message customized.
Some Jupyter-only methods to do this are:
|
You can also see the source code of a function in a Jupyter Notebook by typing |
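As a minimal sketch of how help works in plain Python, the block below defines a stand-in function and inspects it. The function greet and its name parameter are hypothetical, used only for illustration; the actual signature of hello_datamine is not reproduced here, so check its help output before assuming it accepts an argument.

```python
import inspect


def greet(name="Data Mine"):
    """Return a short greeting for `name` (hypothetical example function)."""
    return f"Hello, {name}!"


# help() prints the signature and docstring, much like ?function in R.
help(greet)

# The same information is also available programmatically.
doc_text = inspect.getdoc(greet)
signature_text = str(inspect.signature(greet))
print(signature_text)
print(doc_text)

# Passing a different argument is one way to customize the message.
custom_message = greet("Purdue")
print(custom_message)
```

Reading the signature first tells you which arguments, if any, can be changed to customize the output.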
- Output from running the help function on hello_datamine.
- Modified code from question (2) that prints a customized message.
Question 4
You now have the basics of running Python code in Jupyter Notebooks; there is not much more to it. For this class, however, we will continue to create RMarkdown documents in addition to the compiled PDFs. You are welcome to use Jupyter Notebooks for personal projects or for testing things out, but we will still require an RMarkdown file (.Rmd), a PDF (generated from the RMarkdown file), and a .py file (containing your Python code). For example, please move your solutions from Questions 1, 2, and 3 from Jupyter Notebooks over to RMarkdown (we discuss RMarkdown below). Let’s learn how to run Python code chunks in RMarkdown.
Sign in to rstudio.scholar.rcac.purdue.edu (with BoilerKey). Projects in The Data Mine should all be submitted using our template found here or on Scholar (/class/datamine/apps/templates/project_template.Rmd).
Open the project template and save it into your home directory, in a new RMarkdown file named project01.Rmd. Prior to running any Python code, run datamine_py() in the R console, just like you did at the beginning of every project from the first semester.
Code chunks are the parts of the RMarkdown file that contain code. You can identify what type of code a code chunk contains by looking at the engine in the curly braces "{" and "}". As you can see, it is possible to mix and match different languages just by changing the engine. Move the solutions for questions 1-3 to your project01.Rmd. Make sure to place all Python code in python code chunks. Run the python code chunks to ensure you get the same results as you got when running the Python code in a Jupyter Notebook.
Make sure to run |
The end result of the |
- project01.Rmd with the solutions from questions 1-3 (including any Python code in python code chunks).
Question 5
It is not a Data Mine project without data! [Here](#p-csv-pkg) are some examples of reading in data line by line using the csv package. How many columns are in the following dataset: /class/datamine/data/open_food_facts/openfoodfacts.tsv? Print the first row, the number of columns, and then exit the loop after the first iteration using the break keyword.
You can get the number of elements in a list by using the |
You can use the |
for my_row in my_csv_reader:
    print(my_row)
    break
print("Exited loop as soon as 'break' was run.")
If you get a Dtype warning, feel free to just ignore it.
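The pattern can be tried out on a small in-memory sample before pointing it at the real file. The sketch below is only an illustration: sample_tsv and its three columns are made up for this example and say nothing about the actual contents of openfoodfacts.tsv. For the real dataset, an open(...) call on the path above would take the place of io.StringIO.

```python
import csv
import io

# Hypothetical three-column sample standing in for openfoodfacts.tsv.
sample_tsv = "code\tproduct_name\tcaffeine_100g\n001\tCola\t1.5\n002\tWater\t\n"

with io.StringIO(sample_tsv) as my_file:
    # delimiter="\t" tells the reader that fields are tab-separated.
    my_csv_reader = csv.reader(my_file, delimiter="\t")
    for my_row in my_csv_reader:
        first_row = my_row          # each row is a list of strings
        n_columns = len(my_row)     # number of elements in the list
        print(first_row)
        print(n_columns)
        break                       # stop after the first iteration
print("Exited loop as soon as 'break' was run.")
```

Because of the break, only the header row is read, so the approach stays fast even on a very large file.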
- Python code used to solve this problem.
- The first row printed, and the number of columns printed.
Question 6 (OPTIONAL)
Unlike in R, where many of the tools you need are built-in (read.csv, data.frames, etc.), in Python, you will need to rely on packages like numpy and pandas to do the bulk of your data science work.
In R it would be really easy to find the mean of the 151st column, caffeine_100g:
myDF <- read.csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t", quote="")
mean(myDF$caffeine_100g, na.rm=T) # 2.075503
If you were to try to modify our loop from question (5) to do the same thing, you would run into a myriad of issues just trying to get the mean of a column. Luckily, it is easy to do using pandas:
import pandas as pd
myDF = pd.read_csv("/class/datamine/data/open_food_facts/openfoodfacts.tsv", sep="\t")
myDF["caffeine_100g"].mean() # 2.0755028571428573
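For comparison, here is a rough sketch of what the loop-based approach involves when only the standard library is used. The sample data is hypothetical and tiny; on the real file you would also have to deal with issues this sketch ignores, such as malformed rows. Passing quoting=csv.QUOTE_NONE is assumed here to play the same role as quote="" in the R call.

```python
import csv
import io
from statistics import mean

# Hypothetical sample standing in for openfoodfacts.tsv; blanks mark missing values.
sample_tsv = (
    "code\tproduct_name\tcaffeine_100g\n"
    "001\tCola\t1.5\n"
    "002\tWater\t\n"
    "003\tEnergy drink\t2.5\n"
)

values = []
reader = csv.DictReader(
    io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
)
for row in reader:
    raw = row["caffeine_100g"]  # every field arrives as a string
    if raw:                     # skip missing values, like na.rm=T in R
        values.append(float(raw))

caffeine_mean = mean(values)
print(caffeine_mean)
```

Every step that pandas handles for you (finding the column, converting types, skipping missing values) has to be written out by hand here.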
Take a look at some of the methods you can perform using pandas (pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats). Perform an interesting calculation in R, and replicate your work using pandas. Which did you prefer, Python or R?
- R code used to solve the problem.
- Python code used to solve the problem.