## OS3307 (Modeling Practices for Computing)

## Lesson 1: Introduction to Statistics with Python

### Author: Michelle L. Isenhour
### Last Updated: Fall 2018

<br>
References:  

* Section 2.1 (pp. 5 - 17), _An Introduction to Statistics with Python_ (2016) by Thomas Haslwanter.
* "Python for Data Analysis" (October 2016) by Andrew Tedstone __[http://chris35wills.github.io/courses/pydata_stack/](http://chris35wills.github.io/courses/pydata_stack/)__
* _Python Package Index_ by Python Software Foundation __[https://pypi.org](https://pypi.org)__


### Python Packages for Statistics

<center><img width="600px" src = "http://chris35wills.github.io/courses/pydata_stack.png" alt = "python data analysis stack" title="Python Packages for Statistics"></center>


### Foundational Packages

* __[Python](https://www.python.org)__ (_Python_ 3.7.0): a very popular open source programming language. The core distribution contains only the essential features of a general programming language. In order to conduct statistical analysis, we will need to explicitly load several additional packages.

> To manually find the current version of _Python_, open a new command window and type: `python --version`



* __[IPython](https://pypi.org/project/ipython/)__ (_ipython_ 6.5.0): the computational kernel that runs the _Python_ commands. It provides the tools for interactive data analysis: it lets you quickly display graphs, change directories, and explore the workspace, and it keeps a command history.

> To manually find the current version of _IPython_, open a new command window and type: `ipython --version`

> To manually upgrade to the current version of _IPython_, type: `pip install --upgrade ipython`
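
The same version check can also be done from inside a running interpreter or notebook cell, which is handy when it is unclear which installation a notebook is using. The following is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import platform
import sys

# Version string of the interpreter that is executing this code.
python_version = platform.python_version()
print("Python", python_version)

# sys.version_info supports numeric comparison against a minimum version.
is_python3 = sys.version_info >= (3, 0)
print("Running Python 3 or later:", is_python3)
```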


### PyPI: The Python Package Index

The __[Python Package Index](https://pypi.org)__ (_PyPI_) is a repository of software for the _Python_ programming language. Individual packages from _PyPI_ can be installed easily from the Windows command shell (`cmd`) or a terminal window:
> `pip install [_package_]`

To update a package:
> `pip install --upgrade [_package_]`

To get a list of all _Python_ packages on your computer:
> `pip list`
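
A similar listing can be produced from within _Python_ itself. The sketch below assumes _Python_ 3.8 or later (newer than the 3.7.0 release listed above), because it relies on the standard-library `importlib.metadata` module; the dictionary name `installed` is illustrative.

```python
from importlib.metadata import distributions

# Map each installed distribution name to its version string.
installed = {}
for dist in distributions():
    name = dist.metadata.get("Name")
    if name:  # skip entries whose metadata is missing a name
        installed[name] = dist.version

# Show the first ten packages in case-insensitive alphabetical order.
for name in sorted(installed, key=str.lower)[:10]:
    print(name, installed[name])
```

The output depends entirely on what is installed in the active environment, so it will differ from machine to machine.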


### Anaconda Navigator

__[Anaconda](https://www.anaconda.com)__ (_Anaconda Navigator 1.8.7_) is a package manager, an environment manager, a _Python_ distribution, and a collection of over 1,000 of the most commonly used open source packages. _Anaconda_ uses `conda` (_conda_ 4.5.11), a more powerful installation manager; however, `pip` also works from the command prompt with _Anaconda_.

To update _Anaconda_, read this first: __[https://stackoverflow.com/questions/45197777/how-do-i-update-anaconda](https://stackoverflow.com/questions/45197777/how-do-i-update-anaconda)__

### Basic Building Blocks

* __[NumPy](https://pypi.org/project/numpy/)__ (_numpy_ 1.15.1): the most important package for scientific applications, which makes working with vectors and matrices fast and efficient. Provides N-dimensional numerical arrays and vectors, linear algebra, and Fourier transforms.
<br><br>
* __[SciPy](https://pypi.org/project/scipy/)__ (_scipy_ 1.1.0): builds closely on _NumPy_, providing more advanced numerical methods, including integration and ordinary differential equation (ODE) solvers. For statistical data analysis, `scipy.stats` contains the algorithms for basic statistics.
<br><br>
* __[Matplotlib](https://pypi.org/project/matplotlib/)__ (_matplotlib_ 2.2.3): _Python’s_ main graphing/plotting library. The documentation on the __[*Matplotlib* website](https://matplotlib.org)__ is good, especially the gallery. 
<br><br>
* __[Jupyter](https://pypi.org/project/jupyter/)__ (_jupyter_ 1.0.0): rather than using the interactive _IPython_ command line, during class we will use _Python_ in a ‘notebook’ style from inside the web browser. This keeps the commands and their outputs together in a single document that you can reference later on.
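
To illustrate how _NumPy_ arrays and `scipy.stats` work together for basic statistics, here is a minimal sketch. The sample values are hypothetical and chosen only so the results are easy to check by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values, invented for illustration.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()              # arithmetic mean
std_sample = data.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)
sem = stats.sem(data)           # standard error of the mean
summary = stats.describe(data)  # nobs, minmax, mean, variance, skewness, kurtosis

print("mean:", mean)
print("sample standard deviation:", std_sample)
print("standard error of the mean:", sem)
print(summary)
```

Note that `data.std()` defaults to the population formula (`ddof=0`), while `stats.describe` reports the sample variance, so the two are not directly comparable unless `ddof=1` is passed.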

### Analyzing and Manipulating Data

* __[Pandas](https://pypi.org/project/pandas/)__ (_pandas_ 0.23.4): provides fast, flexible, and expressive data structures (called _DataFrames_) designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in _Python_.

* __[patsy](https://pypi.org/project/patsy/)__ (_patsy_ 0.5.0): for describing statistical models (especially linear models, or models that have a linear component) and building design matrices. Brings the convenience of *R* "formulas" to _Python_.

* __[xlrd](https://pypi.org/project/xlrd/)__ (_xlrd_ 1.1.0): extracts data from _MS Excel_ spreadsheets on any platform.

* __[PyMC](https://pypi.org/project/pymc/)__ (_pymc_ 2.3.6): for Bayesian statistics, including Markov chain Monte Carlo simulations.


* __[scikit-learn](https://pypi.org/project/scikit-learn/)__ (_scikit-learn_ 0.19.2): machine learning tools for _Python_. Increasingly popular, it contains the main algorithms used in this field, such as K-means clustering.

* __[scikits.bootstrap](https://pypi.org/project/scikits.bootstrap/)__ (_scikits.bootstrap_ 1.0.0): provides bootstrap confidence interval algorithms for _SciPy_.

* __[scikit-image](https://pypi.org/project/scikit-image/)__ (_scikit-image_ 0.14.0): a collection of functions for image analysis, including the analysis of satellite images.

* __[lifelines](https://pypi.org/project/lifelines/)__ (_lifelines_ 0.14.6): survival analysis in _Python_.

* __[xarray](https://pypi.org/project/xarray/)__ (_xarray_ 0.10.8): brings the labeled data power of _Pandas_ to the physical sciences, by providing N-dimensional variants of the core _Pandas_ data structures. Provides a _pandas_-like and _pandas_-compatible toolkit for analytics on multi-dimensional arrays.
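
As a brief illustration of the _Pandas_ data structures described above, the sketch below builds a small _DataFrame_, pulls out one column as a _Series_, and computes per-group means. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "height_cm": [170.0, 180.0, 160.0, 165.0, 175.0],
})

# Selecting a single column returns a one-dimensional Series.
heights = df["height_cm"]

# Split the rows by group and average the heights within each group.
group_means = df.groupby("group")["height_cm"].mean()

print(df.describe())
print(group_means)
```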

### Advanced Statistics

* __[Statsmodels](https://pypi.org/project/statsmodels/)__ (_statsmodels_ 0.9.0): provides implementations of all the major statistical algorithms. Preferentially works with _Pandas DataFrames_. Has the option of using *R*-like syntax, which you’ll probably like if you’re familiar with *R*.
<br><br>
* __[seaborn](https://pypi.org/project/seaborn/)__ (_seaborn_ 0.9.0): a set of statistical plotting tools which extends the plotting abilities of _matplotlib_. The plots look very elegant. Well worth looking at if you do a lot of statistical work. Takes _Pandas DataFrames_ as standard.
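
The *R*-like formula syntax mentioned for _statsmodels_ can be sketched as follows. The data are synthetic (a straight line with intercept 1 and slope 2 plus random noise), and the column names `x` and `y` are assumptions made for the example, which also assumes a _NumPy_ release recent enough to provide `np.random.default_rng`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 1 + 2x plus Gaussian noise, with a fixed seed.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
df = pd.DataFrame({"x": x, "y": y})

# The formula "y ~ x" regresses y on x; an intercept is included by default.
results = smf.ols("y ~ x", data=df).fit()
print(results.params)
```

Calling `results.summary()` prints a fuller regression table, including standard errors and confidence intervals.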

### Other Application-Dependent Packages

* __[SPy](https://pypi.org/project/spectral/)__ (_spectral_ 0.19): for processing hyperspectral image data (imaging spectroscopy data). It has functions for reading, displaying, manipulating, and classifying hyperspectral imagery.
* __[AstroPy](https://pypi.org/project/astropy/)__ (_astropy_ 3.0.4):  contains core functionality and some common tools needed for performing astronomy and astrophysics research with _Python_.
* __[PyTables](https://pypi.org/project/tables/)__ (_tables_ 3.4.4): for managing hierarchical datasets; designed to efficiently cope with extremely large amounts of data (note that _Pandas_ does this pretty well for the most part).
* __[Bokeh](https://pypi.org/project/bokeh/)__ (_bokeh_ 0.13.0): for interactive plotting.
* __[CartoPy](https://pypi.org/project/Cartopy/)__ (_cartopy_ 0.16.0): for geographic plotting. Requires installation of _Proj4_ 4.9.3.
* __[Matplotlib basemap](https://pypi.org/project/basemap/)__ (_basemap_ 1.0.7): An add-on toolkit for _matplotlib_ that lets you plot data on map projections with coastlines, lakes, rivers and political boundaries. See http://matplotlib.github.com/basemap/users/examples.html for examples of what it can do.
* __[GDAL](https://pypi.org/project/GDAL/)__ (_gdal_ 2.3.1) and OGR: geographic transformations and warping. Fantastic, and the gold standard if you can get it to work; expect a bit of a fight, but it is well worth it.
* __[PySAL](https://pypi.org/project/PySAL/)__ (_PySAL_ 1.14.4.post2): Spatial Analysis Library. Particularly good at spatial econometrics, location modelling...

### Python Tips

Packages should be imported with their commonly used names:
>`import numpy as np`<br>
>`import matplotlib.pyplot as plt`<br>
>`import scipy as sp`<br>
>`import pandas as pd`<br>
>`import seaborn as sns`<br>

### Lesson Summary

* _Python_ and _IPython_ provide the foundation for conducting statistical analysis in _Python_.


* The basic building blocks consist of the _Jupyter_, _NumPy_, _SciPy_, and _Matplotlib_ packages.


* _Pandas_ enables the use of _Series_ (1-dimensional) and _DataFrames_ (2-dimensional) to conduct statistical analysis.


* _Statsmodels_ and _Seaborn_ provide advanced statistical modeling, analysis, and visualization capabilities in _Python_.


* _Python_ packages for statistics can be managed using the package manager in _Anaconda_ (preferred) or through the use of the _Python Package Index_ (_PyPI_).

```python
import numpy as np
```