Jupyter notebook ATMS-305-Content/Week-6/Week 6 Exercise 3 intro_to_reading_and_writing_data.ipynb


Kernel: Python 3 (old Anaconda 3)

In [10]:

import numpy as np
data = np.genfromtxt('RMM1RMM2.74toRealtime.txt',delimiter='',skip_header=2)
data.shape

Out[10]:

(15073, 8)

Introduction to reading and writing datasets

So far we have been working with small arrays created in memory by Python. Much of a data scientist's work, however, involves input and output of data: reading and writing local files or files on the Internet.

Data formats

There are many types of data that computers can read. Some of the more common ones we deal with as geoscientists include:

Text data (advantage: human-readable; disadvantages: you have to deal with formatting, and it is slow to read and write)

delimited text (a character such as a space or comma separates the values on each line)
4,5.2,33.3,4.5,9.2
1.4,2,4.5,6,8.1
column-formatted text (each data point is arranged in a column using spaces or tabs).
423 5.2 33.3
1.4 2 4.5
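
As a rough illustration of the difference between the two text layouts above, the sketch below parses the same sample lines from memory rather than from a file. The variable names are made up for this example, and `io.StringIO` simply stands in for an open file.

```python
import io
import pandas as pd

# The two sample layouts from the list above, held in memory instead of in files.
delimited_text = "4,5.2,33.3,4.5,9.2\n1.4,2,4.5,6,8.1\n"
column_text = "423 5.2 33.3\n1.4 2 4.5\n"

# Comma-delimited: the separator character marks where each value ends.
delimited = pd.read_csv(io.StringIO(delimited_text), header=None)

# Whitespace-separated columns: treat any run of spaces or tabs as one separator.
columns = pd.read_csv(io.StringIO(column_text), sep=r"\s+", header=None)

print(delimited.shape)
print(columns.shape)
```

In both cases `header=None` tells pandas that the first line is data, not column names.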

Unstructured binary data (advantage: fast! disadvantages: not readable by humans, may depend on machine type)

unformatted binary (numbers are arranged in binary format, defined by the program that wrote them out; if you don't know how the data was written, good luck!)
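
To make the "good luck" point concrete, here is a minimal sketch using NumPy with made-up values. The raw bytes carry no record of their type, so reading them back with the wrong type silently produces different numbers.

```python
import numpy as np

# Three made-up values written as raw 32-bit floats.
values = np.array([1.5, 2.5, 3.5], dtype=np.float32)
raw = values.tobytes()  # 12 bytes: no labels, no shape, no type information

# Reading the bytes back works only if we already know the type that was written.
right = np.frombuffer(raw, dtype=np.float32)

# Same bytes interpreted as 32-bit integers: no error, just wrong numbers.
wrong = np.frombuffer(raw, dtype=np.int32)

print(len(raw))
print(right)
print(wrong)
```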

Self-describing binary (advantages: fast, and the metadata can be read with tools available in Python as well as other tools)

Examples (the last two are used particularly in the geosciences):

Python pickle
netCDF - Network Common Data Form
HDF - Hierarchical Data Format
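
As a small sketch of the first example in this list, the standard-library `pickle` module stores an object together with its structure and types, so it can be restored without describing the layout by hand. The record below is made-up sample data.

```python
import pickle
import numpy as np

# Made-up sample record; the keys and values are for illustration only.
record = {"station": "ORD", "temps_f": np.array([71.0, 68.5, 90.2])}

blob = pickle.dumps(record)     # bytes that include the structure and types
restored = pickle.loads(blob)   # no dtype or layout needs to be supplied

print(type(restored))
print(restored["temps_f"].dtype)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.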

Text data this week!

We will introduce the pandas package for reading and writing text files.

Comma-separated values (CSV) files

There is an example CSV file in your folder called chicago_hourly_aug_2015.csv. It contains data from the hourly airport weather observations from O'Hare Airport for August 2015. To see the contents of the file, do the following:

In [0]:

print(open('chicago_hourly_aug_2015.csv').read())

There are some header lines before we get to the actual data. We can use pandas to specify the column names, and read the subsequent data into what we call a pandas DataFrame.

In [0]:

import pandas as pd #this is how we typically load pandas 
pd.read_csv('chicago_hourly_aug_2015.csv', header=6)

Now, let's store the DataFrame in a variable, loading only the date, time, and temperature columns (Date, Time, and DryBulbFarenheit, spelled as in the file).

In [0]:

data=pd.read_csv('chicago_hourly_aug_2015.csv', header=6, usecols=['Date','Time','DryBulbFarenheit'])

In [0]:

data
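
To see what `header` and `usecols` do without opening the real file, the sketch below uses a hypothetical miniature file held in memory. Its preamble lines, the extra `WindSpeed` column, and all values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature file: two preamble lines, then column names, then data.
sample = (
    "Station report\n"
    "Generated for a class exercise\n"
    "Date,Time,DryBulbFarenheit,WindSpeed\n"
    "20150801,51,71,5\n"
    "20150801,151,69,3\n"
)

# header=2 means the line at 0-based position 2 holds the column names;
# the lines before it are skipped. usecols keeps only the listed columns.
small = pd.read_csv(io.StringIO(sample), header=2,
                    usecols=["Date", "Time", "DryBulbFarenheit"])
print(small)
```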

Accessing columns is easy. Note that pandas displays the row index next to each value.

In [0]:

data['DryBulbFarenheit']

How many hours was it above 90 degrees Fahrenheit? Easy!

In [0]:

# The comparison gives a True/False Series; summing it counts the True values
(data['DryBulbFarenheit'] > 90.).sum()

We can use numpy's where function to store the positions where the condition is met (for a one-dimensional Series it returns a tuple holding a single array of row positions):

In [0]:

import numpy as np
avb90=np.where(data['DryBulbFarenheit'] > 90.)
print(avb90)
print(data['Time'].iloc[avb90[0]])  # iloc selects by position, which is what np.where returns

Or we can subset the whole DataFrame!

In [0]:

data[data['DryBulbFarenheit'] > 90.]
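
The three approaches to the 90-degree question can be compared side by side on a tiny made-up table. This is only a sketch: the values are invented, and the column names are borrowed from the exercise.

```python
import numpy as np
import pandas as pd

# Made-up observations for illustration.
temps = pd.DataFrame({"Time": [1251, 1351, 1451, 1551],
                      "DryBulbFarenheit": [88, 91, 93, 90]})

mask = temps["DryBulbFarenheit"] > 90.0   # True/False for every row
n_hot = int(mask.sum())                   # True counts as 1, so the sum is a count
positions = np.where(mask)[0]             # row positions where the mask is True
hot_rows = temps[mask]                    # the matching rows, all columns

print(n_hot)
print(positions)
print(hot_rows)
```

Note that 90 itself is excluded, because the comparison is strictly greater than.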

We can compute statistics on a DataFrame. What was the mean temperature for the month?

In [0]:

print(np.mean(data['DryBulbFarenheit']))

What was the minimum temperature for the month?

In [0]:

print(np.min(data['DryBulbFarenheit']))

When did it happen?

In [0]:

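# Compare against the computed minimum instead of typing in the value by hand
data[data['DryBulbFarenheit'] == data['DryBulbFarenheit'].min()]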
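
As an alternative sketch for the "when did it happen?" question, pandas can return the row label of the minimum directly with `idxmin`. The table below is made up for illustration.

```python
import pandas as pd

# Made-up observations for illustration.
temps = pd.DataFrame({"Time": [351, 451, 551],
                      "DryBulbFarenheit": [55, 52, 54]})

# idxmin gives the index label of the first occurrence of the minimum.
coldest_label = temps["DryBulbFarenheit"].idxmin()
coldest_row = temps.loc[coldest_label]

# Comparing against min() keeps every row that ties for the minimum.
all_coldest = temps[temps["DryBulbFarenheit"] == temps["DryBulbFarenheit"].min()]

print(coldest_row)
print(all_coldest)
```

`idxmin` reports only the first occurrence, so the comparison form is the safer choice when the minimum may occur more than once.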

Column-formatted text

Pandas can help here too! We have a file that contains the daily MJO index (the RMM1 and RMM2 components, with phase and amplitude) going back more than 30 years.

In [1]:

print(open('RMM1RMM2.74toRealtime.txt').read())


To parse this file into a DataFrame, we supply the column widths and names to the read_fwf function along with the file name, and skip the two header lines:

In [3]:

import pandas as pd
widths=[12,12,12,15,15,12,15]
names=['year', 'month', 'day', 'RMM1', 'RMM2', 'phase', 'amplitude']
df = pd.read_fwf('RMM1RMM2.74toRealtime.txt', widths=widths, names=names, skiprows=2)
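
The role of `widths` is easier to see on a tiny sample. The sketch below uses two made-up fixed-width lines (not taken from the real file) in which the fields are 4, 3, 3, and 8 characters wide.

```python
import io
import pandas as pd

# Made-up fixed-width lines: year (4 chars), month (3), day (3), value (8).
sample = (
    "1974  6  1    1.63\n"
    "1974  6  2    1.61\n"
)

# Each width says how many characters belong to the next column.
small = pd.read_fwf(io.StringIO(sample), widths=[4, 3, 3, 8],
                    names=["year", "month", "day", "RMM1"], header=None)
print(small)
```

If a width is too small or too large, values are cut off or merged with their neighbors, so it is worth printing the first few rows to check.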

In [4]:

df

Out[4]:

How many days in each phase?

In [0]:

for i in range(1,9):
    print(i, len(df['phase'][df['phase'] == i]))
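
The same tally can be sketched without a loop using `value_counts`. The phase values below are made up; with the real data you would start from `df['phase']`.

```python
import pandas as pd

# Made-up phase values for illustration.
phases = pd.Series([1, 1, 2, 8, 8, 8, 3], name="phase")

# Count each distinct value, then order the result by phase number.
counts = phases.value_counts().sort_index()

# Phases that never occur are missing from counts; reindex fills them with 0.
full = counts.reindex(range(1, 9), fill_value=0)

print(full)
```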

We can also use pandas' built-in visualization!

In [0]:

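%matplotlib inline
import numpy as np
# Bin edges at 0.5, 1.5, ..., 8.5 center one bin on each phase from 1 to 8
df['phase'].plot(kind='hist', alpha=0.5, bins=np.arange(.5, 9.5))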

How about for each month?

In [0]:

for m in range(1,13):
    for i in range(1,9):
        print("Month=",m,"Phase=",i,"Counts=",len(df['phase'][(df['phase'] == i) & (df['month'] == m)]))
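
A month-by-phase table can also be built in one step with `pd.crosstab`. This is a sketch on made-up data; with the real data you would pass `df['month']` and `df['phase']`.

```python
import pandas as pd

# Made-up month and phase values for illustration.
sample = pd.DataFrame({"month": [1, 1, 1, 2, 2],
                       "phase": [3, 3, 8, 1, 8]})

# Rows are months, columns are phases, and each cell is a count of days.
table = pd.crosstab(sample["month"], sample["phase"])
print(table)
```

Combinations that never occur show up as 0, which makes gaps easier to spot than in a long printed list.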

Writing a text file

Easy as pie! Note that we were able to write a CSV file from data we read from a column-formatted file!

In [6]:

df.to_csv('all_mjo.csv')

Take a peek!

In [7]:

print(open('all_mjo.csv').read())
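
By default `to_csv` also writes the row index as an unnamed first column. The sketch below, on a made-up two-row table, shows the difference that `index=False` makes; calling `to_csv` with no file name returns the text instead of writing a file.

```python
import io
import pandas as pd

# Made-up table for illustration.
sample = pd.DataFrame({"year": [1974, 1974], "phase": [5, 6]})

with_index = sample.to_csv()                # first column is the row index
without_index = sample.to_csv(index=False)  # only the data columns

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the index-free text back gives the original table.
round_trip = pd.read_csv(io.StringIO(without_index))
```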


What does this do?

In [8]:

df[df['phase'] == 7].to_csv('phase7.csv')

In [9]:

print(open('phase7.csv').read())


Pandas can also write binary files (pickle, HDF)! We'll revisit this next week!