Jupyter notebook ATMS-305-Content/Week-6/Week 6 Exercise 3 intro_to_reading_and_writing_data.ipynb
Introduction to reading and writing datasets
So far we have been working with small arrays created in memory by Python. Much of the work that data scientists perform involves input and output of data, that is, reading and writing data to/from local files or files on the Internet.
Data formats
There are many types of data that computers can read. Some of the more common ones we deal with as geoscientists include:
Text data (advantage: human readable. disadvantage: have to deal with formatting, slow!)
delimited text (a character such as a space or comma separates the values in each line)
4,5.2,33.3,4.5,9.2
1.4,2,4.5,6,8.1
column-formatted text (each data point is arranged in a column using spaces or tabs).
423 5.2 33.3
1.4 2 4.5
Unstructured binary data (advantage: fast! disadvantage: not readable by humans, may depend on machine type)
unformatted binary (numbers are arranged in binary format, defined by the program that wrote them out; if you don't know how the data was written, good luck!)
Self-describing binary (advantage: fast, and humans can read the metadata with a tool built into Python as well as other tools)
Python pickle
netCDF - Network Common Data Form
HDF - Hierarchical Data Format
Text data this week!
We will introduce the pandas package for reading and writing text files.
Comma-separated values (CSV) files
There is an example CSV file in your folder called chicago_hourly_aug_2015.csv. It contains the hourly airport weather observations from O'Hare Airport for August 2015. To see the contents of the file, do the following:
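One way to peek at a text file from Python is to print its first few lines. This is only a sketch: the real chicago_hourly_aug_2015.csv is not reproduced here, so the code writes a tiny stand-in file (sample_hourly.csv, with a made-up header line and made-up values) and reads that instead.

```python
import tempfile
from pathlib import Path

# Stand-in for chicago_hourly_aug_2015.csv. The header text and values are
# made up for illustration; the real file's layout may differ.
sample_text = (
    "Hourly observations (hypothetical header line)\n"
    "Date,Time,Temperature\n"
    "20150801,51,72\n"
    "20150801,151,71\n"
)
csv_path = Path(tempfile.mkdtemp()) / "sample_hourly.csv"
csv_path.write_text(sample_text)

# Read and print only the first three lines to see how the file is laid out.
with open(csv_path) as f:
    first_lines = [next(f).rstrip("\n") for _ in range(3)]

for line in first_lines:
    print(line)
```

With the real file, you would point csv_path at chicago_hourly_aug_2015.csv and skip the part that writes the sample.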
There are some header lines before we get to the actual data. We can use pandas to specify the column names, and read the subsequent data into what we call a pandas DataFrame.
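Here is a minimal sketch of that idea using pandas.read_csv. The two header lines, the column names, and the values are hypothetical stand-ins, because the real file's layout is not shown here. The skiprows argument tells pandas how many header lines to skip, and names supplies our own column names.

```python
import io

import pandas as pd

# Hypothetical sample: two header lines followed by comma-separated data.
sample_text = (
    "Station: hypothetical header line 1\n"
    "Units: hypothetical header line 2\n"
    "20150801,51,72,55\n"
    "20150801,151,71,56\n"
    "20150801,251,70,56\n"
)

# Column names are assumed for illustration.
column_names = ["Date", "Time", "Temperature", "DewPoint"]

# Skip the two header lines and label the columns ourselves.
df = pd.read_csv(
    io.StringIO(sample_text),
    skiprows=2,
    header=None,
    names=column_names,
)
print(df)
```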
Now, let's store the DataFrame into a variable, loading only the variables Date, Time, and Temperature.
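A sketch of loading only some columns, assuming the file has a header row with column names (the sample text below is made up). The usecols argument of pandas.read_csv keeps just the columns we ask for.

```python
import io

import pandas as pd

# Hypothetical sample with more columns than we need.
sample_text = (
    "Date,Time,Temperature,DewPoint,WindSpeed\n"
    "20150801,51,72,55,6\n"
    "20150801,151,71,56,5\n"
    "20150801,251,70,56,3\n"
)

# Keep only Date, Time, and Temperature; the other columns are never loaded.
data = pd.read_csv(
    io.StringIO(sample_text),
    usecols=["Date", "Time", "Temperature"],
)
print(data)
```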
Accessing columns is easy. Note that pandas displays the row index next to each value for you.
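For instance, indexing a DataFrame with a column name returns that column as a pandas Series. The small DataFrame here is a made-up stand-in for the O'Hare data.

```python
import pandas as pd

# Made-up stand-in for the hourly observations.
data = pd.DataFrame({
    "Date": [20150801, 20150801, 20150801],
    "Time": [51, 151, 251],
    "Temperature": [72, 71, 70],
})

# Square brackets with the column name pull out a single column (a Series).
temps = data["Temperature"]

# Printing a Series shows the row index on the left and the values on the right.
print(temps)
```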
How many hours was it above 90 degrees Fahrenheit? Easy!
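One way to count them: comparing a column to a number gives a Series of True/False values, and summing it counts the True entries. The temperatures below are made up, so the count is for the sample only.

```python
import pandas as pd

# Made-up hourly temperatures (degrees Fahrenheit).
data = pd.DataFrame({"Temperature": [88, 91, 93, 90, 85]})

# True where the temperature is strictly above 90, False elsewhere.
is_hot = data["Temperature"] > 90

# True counts as 1 and False as 0, so the sum is the number of hot hours.
n_hot_hours = int(is_hot.sum())
print(n_hot_hours)
```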
We can use numpy's where function to store the indices (it returns a tuple of index arrays, one per dimension, marking where the condition is met):
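A sketch with made-up temperatures. A single column is one-dimensional, so numpy.where returns a tuple holding just one array of row positions.

```python
import numpy as np
import pandas as pd

# Made-up hourly temperatures (degrees Fahrenheit).
data = pd.DataFrame({"Temperature": [88, 91, 93, 90, 85]})

# np.where with only a condition returns a tuple of index arrays.
# For a 1-D column there is a single array, so take element [0].
hot_idx = np.where(data["Temperature"] > 90)[0]
print(hot_idx)
```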
Or we can subset the whole DataFrame!
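Subsetting can be sketched with a boolean mask inside the square brackets, which keeps only the rows where the mask is True. The data are again made up.

```python
import pandas as pd

# Made-up stand-in for the hourly observations.
data = pd.DataFrame({
    "Date": [20150802, 20150802, 20150802, 20150802],
    "Time": [1251, 1351, 1451, 1551],
    "Temperature": [89, 91, 93, 90],
})

# Keep every column, but only the rows where the temperature is above 90.
hot_hours = data[data["Temperature"] > 90]
print(hot_hours)
```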
We can do statistics on a DataFrame: What was the mean temperature for the month?
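For example, a Series has a mean method. The three temperatures below are made up, so the result is not the real monthly mean.

```python
import pandas as pd

# Made-up temperatures (degrees Fahrenheit).
data = pd.DataFrame({"Temperature": [70, 80, 90]})

# Average over all rows of the column.
mean_temp = data["Temperature"].mean()
print(mean_temp)
```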
What was the minimum temperature for the month?
When did it happen?
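Both questions can be sketched with min and idxmin: min gives the smallest value, idxmin gives the row label where it occurs, and loc looks up the date and time on that row. The values are made up.

```python
import pandas as pd

# Made-up stand-in for the hourly observations.
data = pd.DataFrame({
    "Date": [20150801, 20150801, 20150802],
    "Time": [51, 151, 51],
    "Temperature": [72, 65, 80],
})

# Smallest temperature in the column.
min_temp = data["Temperature"].min()

# Row label of the (first) minimum.
row_of_min = data["Temperature"].idxmin()

# Look up when it happened.
when = data.loc[row_of_min, ["Date", "Time"]]
print(min_temp)
print(when)
```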
Column-formatted text
Pandas can help here too! We have a file that contains the daily Madden-Julian Oscillation (MJO) index from June 1974 through early September 2015, more than 40 years of data.
In order to parse this file into a DataFrame, we simply need to supply the column specifications to the read_fwf function along with the file name:
index | year | month | day | RMM1 | RMM2 | phase | amplitude |
---|---|---|---|---|---|---|---|
0 | 1974 | 6 | 1 | 1.634470 | 1.203040 | 5 | 2.029480 |
1 | 1974 | 6 | 2 | 1.602890 | 1.015120 | 5 | 1.897290 |
2 | 1974 | 6 | 3 | 1.516250 | 1.085510 | 5 | 1.864760 |
3 | 1974 | 6 | 4 | 1.509810 | 1.035730 | 5 | 1.830920 |
4 | 1974 | 6 | 5 | 1.559060 | 1.305180 | 5 | 2.033260 |
5 | 1974 | 6 | 6 | 1.206260 | 1.628890 | 6 | 2.026900 |
6 | 1974 | 6 | 7 | 0.611101 | 1.722480 | 6 | 1.827670 |
7 | 1974 | 6 | 8 | 0.326395 | 1.778180 | 6 | 1.807890 |
8 | 1974 | 6 | 9 | 0.093828 | 1.356940 | 6 | 1.360180 |
9 | 1974 | 6 | 10 | -0.086126 | 0.775476 | 7 | 0.780244 |
10 | 1974 | 6 | 11 | 0.111394 | 0.389534 | 6 | 0.405148 |
11 | 1974 | 6 | 12 | 0.120489 | 0.013885 | 5 | 0.121286 |
12 | 1974 | 6 | 13 | 0.019281 | -0.217670 | 3 | 0.218519 |
13 | 1974 | 6 | 14 | -0.104360 | -0.381050 | 2 | 0.395082 |
14 | 1974 | 6 | 15 | -0.182940 | -0.645350 | 2 | 0.670775 |
15 | 1974 | 6 | 16 | -0.235960 | -0.471070 | 2 | 0.526865 |
16 | 1974 | 6 | 17 | -0.498690 | -0.487520 | 1 | 0.697402 |
17 | 1974 | 6 | 18 | -0.569800 | -0.363980 | 1 | 0.676134 |
18 | 1974 | 6 | 19 | -0.695030 | -0.355570 | 1 | 0.780705 |
19 | 1974 | 6 | 20 | -0.729150 | -0.476460 | 1 | 0.871015 |
20 | 1974 | 6 | 21 | -1.094430 | -0.832750 | 1 | 1.375230 |
21 | 1974 | 6 | 22 | -1.098510 | -0.838670 | 1 | 1.382060 |
22 | 1974 | 6 | 23 | -1.062440 | -0.504050 | 1 | 1.175940 |
23 | 1974 | 6 | 24 | -0.885070 | -0.324440 | 1 | 0.942658 |
24 | 1974 | 6 | 25 | -0.765820 | -0.333550 | 1 | 0.835304 |
25 | 1974 | 6 | 26 | -0.834470 | -0.414180 | 1 | 0.931607 |
26 | 1974 | 6 | 27 | -0.801300 | -0.359890 | 1 | 0.878413 |
27 | 1974 | 6 | 28 | -0.564420 | -0.297990 | 1 | 0.638250 |
28 | 1974 | 6 | 29 | -0.097050 | -0.401890 | 2 | 0.413442 |
29 | 1974 | 6 | 30 | 0.009149 | -0.243850 | 3 | 0.244017 |
... | ... | ... | ... | ... | ... | ... | ... |
15043 | 2015 | 8 | 8 | -0.058148 | 0.746690 | 7 | 0.748950 |
15044 | 2015 | 8 | 9 | 0.015080 | 0.658913 | 6 | 0.659085 |
15045 | 2015 | 8 | 10 | -0.137224 | 0.611919 | 7 | 0.627116 |
15046 | 2015 | 8 | 11 | -0.272228 | 0.465795 | 7 | 0.539512 |
15047 | 2015 | 8 | 12 | -0.221913 | 0.110486 | 8 | 0.247896 |
15048 | 2015 | 8 | 13 | -0.471974 | -0.148419 | 1 | 0.494760 |
15049 | 2015 | 8 | 14 | -0.687839 | -0.429319 | 1 | 0.810825 |
15050 | 2015 | 8 | 15 | -0.620249 | -0.471498 | 1 | 0.779114 |
15051 | 2015 | 8 | 16 | -0.630506 | -0.391187 | 1 | 0.742001 |
15052 | 2015 | 8 | 17 | -0.707025 | -0.224477 | 1 | 0.741805 |
15053 | 2015 | 8 | 18 | -0.812492 | 0.006166 | 8 | 0.812515 |
15054 | 2015 | 8 | 19 | -0.751221 | 0.243483 | 8 | 0.789694 |
15055 | 2015 | 8 | 20 | -0.685623 | 0.148649 | 8 | 0.701552 |
15056 | 2015 | 8 | 21 | -0.890929 | -0.088134 | 1 | 0.895277 |
15057 | 2015 | 8 | 22 | -0.952048 | -0.346081 | 1 | 1.013000 |
15058 | 2015 | 8 | 23 | -0.982136 | -0.536792 | 1 | 1.119258 |
15059 | 2015 | 8 | 24 | -1.059988 | -0.776334 | 1 | 1.313876 |
15060 | 2015 | 8 | 25 | -1.027758 | -0.928825 | 1 | 1.385281 |
15061 | 2015 | 8 | 26 | -0.960436 | -0.792910 | 1 | 1.245449 |
15062 | 2015 | 8 | 27 | -0.732713 | -0.953306 | 2 | 1.202356 |
15063 | 2015 | 8 | 28 | -0.601171 | -1.030717 | 2 | 1.193225 |
15064 | 2015 | 8 | 29 | -0.636365 | -0.870867 | 2 | 1.078596 |
15065 | 2015 | 8 | 30 | -0.673667 | -0.692894 | 2 | 0.966400 |
15066 | 2015 | 8 | 31 | -0.501893 | -0.660847 | 2 | 0.829829 |
15067 | 2015 | 9 | 1 | -0.559014 | -0.693567 | 2 | 0.890804 |
15068 | 2015 | 9 | 2 | -0.823788 | -0.940298 | 2 | 1.250115 |
15069 | 2015 | 9 | 3 | -1.233832 | -0.801107 | 1 | 1.471093 |
15070 | 2015 | 9 | 4 | -1.155825 | -0.746959 | 1 | 1.376183 |
15071 | 2015 | 9 | 5 | -1.087684 | -0.640634 | 1 | 1.262327 |
15072 | 2015 | 9 | 6 | -0.915993 | -0.578681 | 1 | 1.083473 |
15073 rows × 7 columns
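The read_fwf call described above can be sketched as follows. The first three rows of the table are rebuilt here as fixed-width text, but the column widths (4, 4, 4, 12, 12, 4, 12 characters) are an assumption made for this example; the real file's column positions are not shown, so its colspecs will differ.

```python
import io

import pandas as pd

# First three rows of the table above.
rows = [
    (1974, 6, 1, 1.63447, 1.20304, 5, 2.02948),
    (1974, 6, 2, 1.60289, 1.01512, 5, 1.89729),
    (1974, 6, 3, 1.51625, 1.08551, 5, 1.86476),
]

# Build fixed-width text with ASSUMED widths (not the real file's layout).
fixed_width_text = "\n".join(
    f"{y:4d}{m:4d}{d:4d}{r1:12.6f}{r2:12.6f}{p:4d}{a:12.6f}"
    for (y, m, d, r1, r2, p, a) in rows
)

# Each (start, end) pair marks one column; end is exclusive.
colspecs = [(0, 4), (4, 8), (8, 12), (12, 24), (24, 36), (36, 40), (40, 52)]
names = ["year", "month", "day", "RMM1", "RMM2", "phase", "amplitude"]

mjo = pd.read_fwf(
    io.StringIO(fixed_width_text),
    colspecs=colspecs,
    names=names,
    header=None,
)
print(mjo)
```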
How many days in each phase?
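One way to answer this is value_counts, which counts how often each value appears in a column. The sketch uses only the phase values from the first ten rows of the table, so the counts are for that sample and not the full record.

```python
import pandas as pd

# Phase values from the first ten rows of the table above.
mjo = pd.DataFrame({"phase": [5, 5, 5, 5, 5, 6, 6, 6, 6, 7]})

# Count the days in each phase, then sort by phase number.
days_per_phase = mjo["phase"].value_counts().sort_index()
print(days_per_phase)
```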
We can also use pandas' built-in visualization!
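A sketch of a bar chart of days per phase, assuming matplotlib is installed (pandas plotting relies on it). The Agg backend line is there only so the example runs without a display; in a notebook you would likely drop it. The phase values are the first ten from the table.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for this sketch

import pandas as pd

# Phase values from the first ten rows of the table above.
mjo = pd.DataFrame({"phase": [5, 5, 5, 5, 5, 6, 6, 6, 6, 7]})
days_per_phase = mjo["phase"].value_counts().sort_index()

# Series.plot returns a matplotlib Axes that we can label.
ax = days_per_phase.plot(kind="bar")
ax.set_xlabel("MJO phase")
ax.set_ylabel("Number of days")
```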
How about for each month?
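One possible reading of this question is "how many days fall in each phase, month by month?" Under that assumption, groupby on month and phase does the counting. The sample uses a handful of rows from the table (June 1974 and August 2015).

```python
import pandas as pd

# Month and phase taken from a few rows of the table above.
mjo = pd.DataFrame({
    "month": [6, 6, 6, 8, 8, 8, 8],
    "phase": [5, 6, 7, 7, 6, 1, 1],
})

# Count rows for every (month, phase) pair, then pivot phases into columns.
# Pairs that never occur are filled with 0.
counts = mjo.groupby(["month", "phase"]).size().unstack(fill_value=0)
print(counts)
```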
Writing a text file
Easy as pie! Note that we were able to write a CSV file from data that we read out of a column-formatted file!
Take a peek!
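Writing and then peeking can be sketched like this. The output file name mjo_sample.csv is hypothetical, and the DataFrame holds only the first three rows of the table.

```python
import tempfile
from pathlib import Path

import pandas as pd

# First three rows of the MJO table above.
mjo = pd.DataFrame({
    "year": [1974, 1974, 1974],
    "month": [6, 6, 6],
    "day": [1, 2, 3],
    "RMM1": [1.63447, 1.60289, 1.51625],
    "RMM2": [1.20304, 1.01512, 1.08551],
    "phase": [5, 5, 5],
    "amplitude": [2.02948, 1.89729, 1.86476],
})

# Hypothetical output location in a temporary folder.
out_path = Path(tempfile.mkdtemp()) / "mjo_sample.csv"

# index=False leaves the row index out of the file.
mjo.to_csv(out_path, index=False)

# Take a peek at the first few lines that were written.
written_lines = out_path.read_text().splitlines()
for line in written_lines[:3]:
    print(line)
```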
What does this do?
Pandas can also write binary files (pickle, HDF)! We'll revisit this next week!