Jupyter notebook ATMS-305-Content/Week-6/Week 6 Exercise 3 intro_to_reading_and_writing_data.ipynb

Kernel: Python 3 (old Anaconda 3)
import numpy as np
data = np.genfromtxt('RMM1RMM2.74toRealtime.txt', delimiter='', skip_header=2)
data.shape
(15073, 8)

Introduction to reading and writing datasets

So far we have been working with small arrays created in memory by Python. Much of a data scientist's work involves input and output of data, that is, reading and writing data to and from local files or files on the Internet.

Data formats

There are many types of data that computers can read. Some of the more common ones we deal with as geoscientists include:

Text data (advantage: human-readable; disadvantages: you have to deal with formatting, and it is slow to read)

  • delimited text (has a character in each line that separates values, space, comma, etc.)

    4,5.2,33.3,4.5,9.2

    1.4,2,4.5,6,8.1

  • column-formatted text (each data point is arranged in a column using spaces or tabs).

    423 5.2 33.3

    1.4 2 4.5
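
As a rough sketch of the difference between the two text layouts, the snippet below parses one line of each kind with plain Python string methods. The sample lines are taken from the examples above (with extra padding added to the column-formatted one for illustration); the variable names are made up for this example.

```python
# Sample lines modeled on the two examples above (illustrative only).
delimited_line = "4,5.2,33.3,4.5,9.2"
column_line = " 423   5.2  33.3"

# Delimited text: split on the separator character (a comma here).
delimited_values = [float(v) for v in delimited_line.split(",")]

# Column-formatted text: split() with no argument breaks on any run of
# spaces or tabs, which works as long as no field is left blank.
column_values = [float(v) for v in column_line.split()]

print(delimited_values)
print(column_values)
```

Splitting on whitespace is only a shortcut. If a column can be empty, the values have to be cut out by character position instead, which is what fixed-width readers do.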

Unstructured binary data (advantage: fast; disadvantages: not readable by humans, may depend on machine type)

  • unformatted binary (numbers are arranged in binary format, defined by the program that wrote them out). If you don't know how the data was written, good luck!
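
To illustrate why you need to know how unformatted binary data was written, here is a small sketch using NumPy's `tobytes` and `frombuffer`. The array values are made up. The raw bytes carry no record of their type, so reading them back with the wrong type gives nonsense without raising an error.

```python
import numpy as np

# Three made-up values stored as 64-bit floats.
values = np.array([1.5, 2.0, 4.5], dtype=np.float64)
raw = values.tobytes()  # 3 values x 8 bytes each = 24 bytes, no metadata

# Reading back works only if we already know the type that was written.
right = np.frombuffer(raw, dtype=np.float64)

# Guessing the wrong type still runs, but yields the wrong count and values.
wrong = np.frombuffer(raw, dtype=np.float32)

print(len(raw), right, wrong.size)
```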

Self-describing binary (advantages: fast, and the metadata can be read with tools built into Python as well as other tools)

Examples (the last two are used particularly in the geosciences):
  • Python pickle

  • netCDF - Network Common Data Form

  • HDF - Hierarchical Data Format
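
As a minimal sketch of the first item in the list, the standard-library `pickle` module can turn a Python object into bytes and back without you having to describe its layout. The `record` dictionary here is hypothetical.

```python
import pickle

# A hypothetical record; any ordinary Python object would do.
record = {"station": "ORD", "temps_f": [71.0, 68.5, 90.2]}

blob = pickle.dumps(record)    # serialize to a bytes object
restored = pickle.loads(blob)  # rebuild the original object

print(type(blob).__name__, restored == record)
```

Only unpickle data from sources you trust, because loading a pickle can execute arbitrary code.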

Text data this week!

We will introduce the pandas package for reading and writing text files.

Comma-separated values (CSV) files

There is an example CSV file in your folder called chicago_hourly_aug_2015.csv. It contains data from the hourly airport weather observations from O'Hare Airport for August 2015. To see the contents of the file, do the following:

print(open('chicago_hourly_aug_2015.csv').read())

There are some header lines before we get to the actual data. We can tell pandas which row holds the column names (header=6 selects row 6, counting from zero) and read the subsequent data into what we call a pandas DataFrame.

import pandas as pd  # this is how we typically load pandas
pd.read_csv('chicago_hourly_aug_2015.csv', header=6)
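
To see what the `header` argument does without the real file, here is a sketch that reads a small made-up CSV from a string. The text, the `TempF` column name, and the two preamble lines are all invented for illustration and do not match the actual O'Hare file.

```python
import io
import pandas as pd

# Made-up file contents: two preamble lines, then the column names, then data.
sample_text = (
    "Hypothetical station report\n"
    "Generated for illustration only\n"
    "Date,Time,TempF\n"
    "20150801,51,71\n"
    "20150801,151,69\n"
)

# header=2 says the column names are on row 2 (counting from zero);
# everything before that row is skipped.
sample = pd.read_csv(io.StringIO(sample_text), header=2)
print(sample)
```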

Now, let's store the DataFrame into a variable, loading only the variables Date, Time, and Temperature.

data=pd.read_csv('chicago_hourly_aug_2015.csv', header=6, usecols=['Date','Time','DryBulbFarenheit'])
data

Accessing columns is easy. Note that pandas displays the row index next to each value for you.

data['DryBulbFarenheit']

How many hours was it above 90 degrees Fahrenheit? Easy!

(data['DryBulbFarenheit'] > 90.).sum()
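
The comparison itself only produces a column of True/False values. A small sketch with made-up temperatures shows how summing that mask turns it into a count, since each True counts as 1:

```python
import pandas as pd

# Made-up hourly temperatures in degrees Fahrenheit.
temps = pd.Series([88.0, 91.0, 93.5, 90.0, 79.0])

mask = temps > 90.0               # boolean Series, one entry per row
hours_above_90 = int(mask.sum())  # True counts as 1, False as 0

print(hours_above_90)  # → 2
```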

We can use numpy's where function to store the positions where the condition is met (for a one-dimensional input it returns a tuple holding a single array of row positions):

import numpy as np
avb90 = np.where(data['DryBulbFarenheit'] > 90.)
print(avb90)
print(data['Time'].iloc[avb90[0]])
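
The shape of what `np.where` returns is easy to trip over, so here is a sketch on a tiny made-up DataFrame (the `frame` variable and its values are hypothetical). The result is a tuple with one array per dimension, which is why the `[0]` is needed before using it as a row selector.

```python
import numpy as np
import pandas as pd

# Made-up observations: time of day (HHMM as an integer) and temperature.
frame = pd.DataFrame({"Time": [51, 151, 251, 351],
                      "TempF": [88.0, 91.0, 93.5, 79.0]})

hits = np.where(frame["TempF"] > 90.0)  # tuple containing one index array
rows = hits[0]                          # positions of the matching rows
times = frame["Time"].iloc[rows]        # select those rows by position

print(hits)
print(times.tolist())
```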

Or we can subset the whole DataFrame!

data[data['DryBulbFarenheit'] > 90.]

We can do statistics on a data frame: What was the mean temperature for the month?

print(np.mean(data['DryBulbFarenheit']))

What was the minimum temperature for the month?

print(np.min(data['DryBulbFarenheit']))

When did it happen?

data[data['DryBulbFarenheit'] == data['DryBulbFarenheit'].min()]
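
One way to avoid typing the minimum value by hand is to let pandas look it up. This sketch uses a made-up three-row DataFrame; `idxmin` returns the index label of the first minimum, while comparing against `min()` keeps every row that ties for it.

```python
import pandas as pd

# Made-up observations for illustration.
frame = pd.DataFrame({"Date": [20150801, 20150801, 20150802],
                      "Time": [51, 151, 51],
                      "TempF": [71, 52, 60]})

coldest_label = frame["TempF"].idxmin()  # index label of the first minimum
coldest_row = frame.loc[coldest_label]   # the full row at that label

# All rows that share the minimum temperature (there may be several).
all_coldest = frame[frame["TempF"] == frame["TempF"].min()]

print(coldest_row)
```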

Column-formatted text

Pandas can help here too! We have a file that contains the MJO index for the last 40-plus years.

print(open('RMM1RMM2.74toRealtime.txt').read())

In order to parse this file into a DataFrame, we simply need to supply the column specifications to the read_fwf function along with the file name:

import pandas as pd
widths = [12, 12, 12, 15, 15, 12, 15]
names = ['year', 'month', 'day', 'RMM1', 'RMM2', 'phase', 'amplitude']
df = pd.read_fwf('RMM1RMM2.74toRealtime.txt', widths=widths, names=names, skiprows=2)
df
year month day RMM1 RMM2 phase amplitude
0 1974 6 1 1.634470 1.203040 5 2.029480
1 1974 6 2 1.602890 1.015120 5 1.897290
2 1974 6 3 1.516250 1.085510 5 1.864760
3 1974 6 4 1.509810 1.035730 5 1.830920
4 1974 6 5 1.559060 1.305180 5 2.033260
5 1974 6 6 1.206260 1.628890 6 2.026900
6 1974 6 7 0.611101 1.722480 6 1.827670
7 1974 6 8 0.326395 1.778180 6 1.807890
8 1974 6 9 0.093828 1.356940 6 1.360180
9 1974 6 10 -0.086126 0.775476 7 0.780244
10 1974 6 11 0.111394 0.389534 6 0.405148
11 1974 6 12 0.120489 0.013885 5 0.121286
12 1974 6 13 0.019281 -0.217670 3 0.218519
13 1974 6 14 -0.104360 -0.381050 2 0.395082
14 1974 6 15 -0.182940 -0.645350 2 0.670775
15 1974 6 16 -0.235960 -0.471070 2 0.526865
16 1974 6 17 -0.498690 -0.487520 1 0.697402
17 1974 6 18 -0.569800 -0.363980 1 0.676134
18 1974 6 19 -0.695030 -0.355570 1 0.780705
19 1974 6 20 -0.729150 -0.476460 1 0.871015
20 1974 6 21 -1.094430 -0.832750 1 1.375230
21 1974 6 22 -1.098510 -0.838670 1 1.382060
22 1974 6 23 -1.062440 -0.504050 1 1.175940
23 1974 6 24 -0.885070 -0.324440 1 0.942658
24 1974 6 25 -0.765820 -0.333550 1 0.835304
25 1974 6 26 -0.834470 -0.414180 1 0.931607
26 1974 6 27 -0.801300 -0.359890 1 0.878413
27 1974 6 28 -0.564420 -0.297990 1 0.638250
28 1974 6 29 -0.097050 -0.401890 2 0.413442
29 1974 6 30 0.009149 -0.243850 3 0.244017
... ... ... ... ... ... ... ...
15043 2015 8 8 -0.058148 0.746690 7 0.748950
15044 2015 8 9 0.015080 0.658913 6 0.659085
15045 2015 8 10 -0.137224 0.611919 7 0.627116
15046 2015 8 11 -0.272228 0.465795 7 0.539512
15047 2015 8 12 -0.221913 0.110486 8 0.247896
15048 2015 8 13 -0.471974 -0.148419 1 0.494760
15049 2015 8 14 -0.687839 -0.429319 1 0.810825
15050 2015 8 15 -0.620249 -0.471498 1 0.779114
15051 2015 8 16 -0.630506 -0.391187 1 0.742001
15052 2015 8 17 -0.707025 -0.224477 1 0.741805
15053 2015 8 18 -0.812492 0.006166 8 0.812515
15054 2015 8 19 -0.751221 0.243483 8 0.789694
15055 2015 8 20 -0.685623 0.148649 8 0.701552
15056 2015 8 21 -0.890929 -0.088134 1 0.895277
15057 2015 8 22 -0.952048 -0.346081 1 1.013000
15058 2015 8 23 -0.982136 -0.536792 1 1.119258
15059 2015 8 24 -1.059988 -0.776334 1 1.313876
15060 2015 8 25 -1.027758 -0.928825 1 1.385281
15061 2015 8 26 -0.960436 -0.792910 1 1.245449
15062 2015 8 27 -0.732713 -0.953306 2 1.202356
15063 2015 8 28 -0.601171 -1.030717 2 1.193225
15064 2015 8 29 -0.636365 -0.870867 2 1.078596
15065 2015 8 30 -0.673667 -0.692894 2 0.966400
15066 2015 8 31 -0.501893 -0.660847 2 0.829829
15067 2015 9 1 -0.559014 -0.693567 2 0.890804
15068 2015 9 2 -0.823788 -0.940298 2 1.250115
15069 2015 9 3 -1.233832 -0.801107 1 1.471093
15070 2015 9 4 -1.155825 -0.746959 1 1.376183
15071 2015 9 5 -1.087684 -0.640634 1 1.262327
15072 2015 9 6 -0.915993 -0.578681 1 1.083473

15073 rows × 7 columns
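
To show how the `widths` list maps onto the text, here is a sketch that runs `read_fwf` on two made-up fixed-width lines. The narrower widths and the shortened column list are chosen for this example only and are not those of the MJO file.

```python
import io
import pandas as pd

# Two made-up lines: fields occupy 4, 4, 4, 7 and 7 characters.
fwf_text = (
    "1974   6   1   1.63   1.20\n"
    "1974   6   2   1.60   1.02\n"
)

sample_fwf = pd.read_fwf(
    io.StringIO(fwf_text),
    widths=[4, 4, 4, 7, 7],                         # characters per column
    names=["year", "month", "day", "RMM1", "RMM2"],
    header=None,                                    # the text has no header row
)
print(sample_fwf)
```

Each entry in `widths` is the number of characters taken by one column, read left to right, so the widths must add up to the portion of the line you want to parse.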

How many days in each phase?

for i in range(1, 9):
    print(i, len(df['phase'][df['phase'] == i]))
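
The same tally can be done without a loop. This sketch uses `value_counts` on a short made-up phase column; it is an alternative to the loop, not a replacement for it.

```python
import pandas as pd

# Made-up phase values for seven days.
phases = pd.Series([5, 5, 6, 7, 6, 5, 1], name="phase")

# value_counts tallies each distinct value; sort_index orders by phase number.
counts = phases.value_counts().sort_index()
print(counts)
```

Note that `value_counts` only lists values that actually occur, so a phase with zero days would be missing from the result rather than shown as 0.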

We can also use pandas built-in visualization!

%pylab inline
df['phase'].plot(kind='hist', alpha=0.5, bins=arange(.5, 9.5))

How about for each month?

for m in range(1, 13):
    for i in range(1, 9):  # phases run from 1 to 8
        print("Month=", m, "Phase=", i, "Counts=", len(df['phase'][(df['phase'] == i) & (df['month'] == m)]))
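
A month-by-phase count can also be built as a single table. The sketch below uses `pd.crosstab` on a small made-up DataFrame with the same column names as `df`. It is offered as an alternative to the nested loop.

```python
import pandas as pd

# Made-up rows using the same column names as the MJO DataFrame.
sample = pd.DataFrame({"month": [6, 6, 6, 7, 7],
                       "phase": [5, 5, 6, 1, 1]})

# Rows are months, columns are phases, cells are counts (0 where none occur).
table = pd.crosstab(sample["month"], sample["phase"])
print(table)
```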

Writing a text file

Easy as pie! Note we were able to write a CSV file from data that was read from a column-formatted file!

df.to_csv('all_mjo.csv')

Take a peek!

print(open('all_mjo.csv').read())
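
By default `to_csv` also writes the row index as an unnamed first column. The sketch below, using a made-up two-row DataFrame, compares the default with `index=False`. Calling `to_csv` with no file name returns the text as a string, which is handy for a quick look.

```python
import io
import pandas as pd

# Made-up data for illustration.
sample = pd.DataFrame({"year": [1974, 1974], "phase": [5, 6]})

with_index = sample.to_csv()                # first column holds the row index
without_index = sample.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)

# Reading the index-free text back gives the same table.
round_trip = pd.read_csv(io.StringIO(without_index))
```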

What does this do?

df[df['phase'] == 7].to_csv('phase7.csv')
print(open('phase7.csv').read())

Pandas can also write binary files (pickle, hdf)! We'll revisit this next week!
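
As a brief preview, here is a sketch of a pickle round trip with `to_pickle` and `read_pickle`. The DataFrame and the file name `sample.pkl` are made up, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration.
sample = pd.DataFrame({"phase": [5, 6], "amplitude": [2.03, 1.83]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.pkl")  # hypothetical file name
    sample.to_pickle(path)            # write the DataFrame in binary form
    restored = pd.read_pickle(path)   # read it back, dtypes and index intact

print(restored.equals(sample))
```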