TDM 10200: Project 3 — 2024
Motivation: Learning about Big Data. When working with large data sets, it is important to know how to use control flow to find the information we need a little bit at a time, without reading in all of the files at once. Control flow is the order in which the statements in your code run.
Scope: Python, Control Flow, if statements, for loops
Questions
Question 1 (2 pts)
-
Explore the files in the provided data set directory. Find out how many years are included in the data set. Briefly describe the contents of the files.
-
Import the library pandas as pd, and import Path from pathlib.
-
Create a list named myfiles to hold Path objects for the files 1880.csv to 1883.csv in the data set folder, using a list comprehension. The following sample code returns a Path object for the file 1750.csv:
Path("/anvil/projects/tdm/data/noaa/1750.csv")
You can start with the for loop below, which creates a list of Path objects, BUT we want you to modify this example to use a list comprehension:
myfiles = []
for year in range(1880, 1884):
    file = Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
    myfiles.append(file)
print(myfiles)
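As a generic illustration of rewriting a loop as a list comprehension (not the answer to this question), the sketch below builds the same list two ways. The folder /tmp/example and the names loop_paths and comp_paths are hypothetical and chosen only for this example.

```python
from pathlib import Path

# Hypothetical folder, used only for illustration. The files need not exist,
# because creating a Path object does not touch the file system.
base = Path("/tmp/example")

# Version with an explicit for loop and append.
loop_paths = []
for n in range(1, 4):
    loop_paths.append(base / f"{n}.txt")

# Equivalent list comprehension: [expression for variable in iterable].
comp_paths = [base / f"{n}.txt" for n in range(1, 4)]

# Both approaches produce the same list of Path objects.
print(loop_paths == comp_paths)
```

The comprehension moves the expression that was passed to append to the front of the brackets, followed by the same for clause.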
Question 2 (2 pts)
-
Calculate how many records are in the file 1880.csv. (Each line is one record.) The following sample code counts the records in one file, given a Path object named file:
with open(file, "r") as f:
    mycount = 0
    for line in f:
        mycount += 1
print(f'There are {mycount} records in the file called {file}')
There are 370779 records in the file called /anvil/projects/tdm/data/noaa/1880.csv
-
Calculate how many records there are (altogether) in the 4 files from 1880.csv to 1883.csv. Use the list myfiles that you created in Question 1. Your output should give the total number of records, so it should say something like:
There are [put your number of records here] records in the 4 files altogether.
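The idea of streaming through files one line at a time and keeping a running total can be sketched on toy data. This is a minimal sketch under stated assumptions: the temporary files a.csv and b.csv, the helper count_lines, and the list demo_files are all hypothetical and are not part of the course data set.

```python
import tempfile
from pathlib import Path

# Create two small throwaway files (made-up data) so the example runs anywhere.
tmpdir = Path(tempfile.mkdtemp())
demo_files = []
for name, n_lines in [("a.csv", 3), ("b.csv", 5)]:
    p = tmpdir / name
    p.write_text("x,y\n" * n_lines)
    demo_files.append(p)

def count_lines(path):
    """Count lines by iterating over the file, one line at a time."""
    n = 0
    with open(path, "r") as f:
        for _ in f:
            n += 1
    return n

# Keep one running total across all of the files.
total = 0
for p in demo_files:
    total += count_lines(p)

print(f"There are {total} records in the {len(demo_files)} files altogether.")
```

Because the loop only ever holds one line in memory, the same pattern works for files far larger than the available RAM.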
Question 3 (2 pts)
-
Run the following statement to read the first file from the list myfiles into a DataFrame: myDF = pd.read_csv(myfiles[0]). Display the column names for myDF. Look at the head and tail of myDF. Do you see anything unexpected?
-
Please modify your work from Question 3a to correct the problem that you found. What are the column names now? Hint: The header=None argument will be useful.
-
Now let us add these 8 column names: id, date, element_code, value, mflag, qflag, sflag, and obstime to the data frame. You can do this using:
pd.read_csv(myfiles[0], names=["id","date","element_code","value","mflag","qflag","sflag","obstime"])
-
Make a list called mydataframes (of length 4) that contains 4 data frames, one for each year, from 1880.csv to 1883.csv. Starting with the sample code (above) for reading in the first file, modify our example so that you have a for loop that reads in all 4 files. Test your work with a for loop that displays the column names of each of the four data frames in mydataframes. You can show the column names of myDF using myDF.columns.
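To see how the header and names arguments of pd.read_csv change the result, the following minimal sketch reads the same two-row text three ways. The rows in raw are made-up values, and the four-column layout is an assumption for illustration only; it is not the layout of the NOAA files.

```python
import io
import pandas as pd

# Two made-up rows in a headerless CSV layout (hypothetical values).
raw = "A1,18800101,TMAX,10\nA1,18800102,TMIN,-5\n"

# Default: the first row is consumed as the header, so it is not counted as data.
df_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is data; columns are labelled 0, 1, 2, ...
df_nohdr = pd.read_csv(io.StringIO(raw), header=None)

# names=[...]: every row is data, with the labels that you supply.
df_named = pd.read_csv(io.StringIO(raw), names=["id", "date", "element_code", "value"])

print(list(df_default.columns), len(df_default))
print(list(df_nohdr.columns), len(df_nohdr))
print(list(df_named.columns), len(df_named))
```

Comparing the three row counts is a quick way to check whether a row of data was silently used as the header.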
Question 4 (2 pts)
Let’s look at the column element_code. Use a loop to answer the following questions for all 4 DataFrames:
-
Print out the (unique) elements of the column element_code (i.e., show each element just one time).
-
Find the number of times that SNOW occurs in the element_code column.
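One possible way to list distinct values and count matches in a column is sketched below on toy data. The list frames and its contents are hypothetical; only the column name element_code and the value SNOW come from the question.

```python
import pandas as pd

# Two tiny made-up DataFrames standing in for the real yearly data.
frames = [
    pd.DataFrame({"element_code": ["TMAX", "SNOW", "TMAX", "PRCP"]}),
    pd.DataFrame({"element_code": ["SNOW", "SNOW", "TMIN"]}),
]

for i, df in enumerate(frames):
    # unique() returns each distinct value once, in order of first appearance.
    codes = df["element_code"].unique()
    # Comparing a column to a value gives True/False per row; sum() counts the Trues.
    n_snow = int((df["element_code"] == "SNOW").sum())
    print(i, list(codes), n_snow)
```

Other approaches, such as value_counts(), would also work; this sketch just shows the comparison-and-sum idiom.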
Question 5 (2 pts)
Now let us practice using the chunksize feature for big data. You may refer to this document to get more information about chunksize.
-
Try to run the following 2 programs to find the number of times that SNOW occurs in the element_code column, from the year 1880 to the year 1883. Explain your understanding of chunksize.
Pre-work for the programs:
import pandas as pd
from pathlib import Path
myfiles = []
for year in range(1880, 1884):
    file = Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
    myfiles.append(file)
Version 1 of the program:
count = 0
for file in myfiles:
    for myDF in pd.read_csv(file, names=["id","date","element_code","value","mflag","qflag","sflag","obstime"], chunksize=10000):
        count += len(myDF[myDF['element_code'] == 'SNOW'])
print(count)
Version 2 of the program:
count = 0
for file in myfiles:
    for myDF in pd.read_csv(file, names=["id","date","element_code","value","mflag","qflag","sflag","obstime"], chunksize=10000):
        for index, row in myDF.iterrows():
            if row['element_code'] == 'SNOW':
                count += 1
print(count)
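To inspect what read_csv hands back when chunksize is set, the small sketch below may help. It uses ten made-up rows held in memory and a chunk size of 4; the two-column layout and the names raw, reader, chunk_lengths, and snow_count are assumptions for illustration only.

```python
import io
import pandas as pd

# Ten made-up rows in headerless CSV form; even-numbered rows are SNOW.
raw = "".join(
    f"S{i},SNOW\n" if i % 2 == 0 else f"S{i},PRCP\n" for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# rather than a single DataFrame.
reader = pd.read_csv(io.StringIO(raw), names=["id", "element_code"], chunksize=4)

chunk_lengths = []
snow_count = 0
for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most 4 rows.
    chunk_lengths.append(len(chunk))
    snow_count += int((chunk["element_code"] == "SNOW").sum())

print(chunk_lengths)
print(snow_count)
```

Printing the length of each chunk shows how the rows are split up, including a shorter final chunk when the row count is not a multiple of the chunk size.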
Project 03 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment: firstname-lastname-project03.ipynb
-
Python file with code and comments for the assignment: firstname-lastname-project03.py
-
Submit files through Gradescope
Please make sure to double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.