Path: blob/develop/src/doc/en/prep/Quickstarts/Statistics-and-Distributions.rst
4117 views
.. -*- coding: utf-8 -*- .. linkall .. _prep-quickstart-statistics-and-distributions: Sage Quickstart for Statistics ============================== This `Sage <https://www.sagemath.org>`_ quickstart tutorial was developed for the MAA PREP Workshop "Sage: Using Open\-Source Mathematics Software with Undergraduates" (funding provided by NSF DUE 0817071). It is licensed under the Creative Commons Attribution\-ShareAlike 3.0 license (`CC BY\-SA <https://creativecommons.org/licenses/by-sa/3.0/>`_). Basic Descriptive Statistics ---------------------------- Basic statistical functions are best to get from `numpy <https://numpy.org/doc/stable/reference/routines.statistics.html>`_ and `scipy.stats <https://docs.scipy.org/doc/scipy/reference/stats.html>`_. NumPy provides, for example, functions to compute the arithmetic mean and the standard deviation:: sage: import numpy as np sage: if int(np.version.short_version[0]) > 1: ....: _ = np.set_printoptions(legacy="1.25") sage: np.mean([1, 2, 3, 5]) 2.75 sage: np.std([1, 2, 2, 4, 5, 6, 8]) # rel to 1e-13 2.32992949004287 SciPy offers many more functions, for example a function to compute the harmonic mean:: sage: from scipy.stats import hmean sage: hmean([1, 2, 2, 4, 5, 6, 8]) # rel to 1e-13 2.5531914893617023 We do not recommend to use Python's built in ``statistics`` module with Sage. It has a known incompatibility with number types defined in Sage, see :issue:`28234`. Distributions ------------- Let's generate a random sample from a given type of continuous distribution using native Python random generators. - We use the simplest method of generating random elements from a log normal distribution with (normal) mean 2 and :math:`\sigma=3`. - Notice that there is really no way around making some kind of loop. :: sage: my_data = [lognormvariate(2, 3) for i in range(10)] sage: my_data # random [13.189347821530054, 151.28229284782799, 0.071974845847761343, 202.62181449742425, 1.9677158880100207, 71.959830176932542, 21.054742855786007, 3.9235315623286406, 4129.9880239483346, 16.41063858663054] We can check whether the mean of the log of the data is close to 2. :: sage: np.mean([log(item) for item in my_data]) # random 3.0769518857697618 Here is an example using the `Gnu scientific library <http://www.gnu.org/software/gsl/>`_ under the hood. - Let ``dist`` be the variable assigned to a continuous Gaussian/normal distribution with standard deviation of 3. - Then we use the ``.get_random_element()`` method ten times, adding 2 each time so that the mean is equal to 2. :: sage: dist=RealDistribution('gaussian',3) sage: my_data=[dist.get_random_element()+2 for _ in range(10)] sage: my_data # random [3.18196848067, -2.70878671264, 0.445500746768, 0.579932075555, -1.32546445128, 0.985799587162, 4.96649083229, -1.78785287243, -3.05866866979, 5.90786474822] For now, it's a little annoying to get histograms of such things directly. Here, we get a larger sampling of this distribution and plot a histogram with 10 bins. :: sage: my_data2 = [dist.get_random_element()+2 for _ in range(1000)] sage: T = stats.TimeSeries(my_data) sage: T.plot_histogram(normalize=False,bins=10) Graphics object consisting of 10 graphics primitives To access discrete distributions, we again use `Scipy <http://www.scipy.org>`_. - We have to ``import`` the module ``scipy.stats``. - We use ``binom_dist`` to denote the binomial distribution with 20 trials and 5% expected failure rate. - The ``.pmf(x)`` method gives the probability of :math:`x` failures, which we then plot in a bar chart for :math:`x` from 0 to 20. (Don't forget that ``range(21)`` means all integers from *zero to twenty*.) :: sage: import scipy.stats sage: binom_dist = scipy.stats.binom(20,.05) sage: bar_chart([binom_dist.pmf(x) for x in range(21)]) Graphics object consisting of 1 graphics primitive The ``bar_chart`` function performs some of the duties of histograms. Scipy's statistics can do other things too. Here, we find the median (as the fiftieth percentile) of an earlier data set. :: sage: scipy.stats.scoreatpercentile(my_data, 50) # random 0.51271641116183286 The key thing to remember here is to look at the documentation! Using R from within Sage ------------------------ The `R project <http://www.r-project.org>`_ is a leading software package for statistical analysis with widespread use in industry and academics. There are several ways to access R; they are based on the ``rpy2`` package. - One of the easiest is to just put ``r()`` around things you want to make into statistical objects, and then ... - Use R commands via ``r.method()`` to pass them on to Sage for further processing. The following example of the Kruskal\-Wallis test comes directly from the examples in ``r.kruskal_test?`` in the notebook. :: sage: # optional - rpy2 sage: x = r([2.9, 3.0, 2.5, 2.6, 3.2]) # normal subjects sage: y = r([3.8, 2.7, 4.0, 2.4]) # with obstructive airway disease sage: z = r([2.8, 3.4, 3.7, 2.2, 2.0]) # with asbestosis sage: a = r([x,y,z]) # make a long R vector of all the data sage: b = r.factor(5*[1] + 4*[2] + 5*[3]) # create something for R to tell ....: # which subjects are which sage: a; b # show them [1] 2.9 3.0 2.5 2.6 3.2 3.8 2.7 4.0 2.4 2.8 3.4 3.7 2.2 2.0 [1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3 Levels: 1 2 3 .. skip :: sage: r.kruskal_test(a,b) # do the KW test! # optional - rpy2 Kruskal-Wallis rank sum test data: sage17 and sage33 Kruskal-Wallis chi-squared = 0.7714, df = 2, p-value = 0.68 Looks like we can't reject the null hypothesis here. The best way to use R seriously is to simply ask each individual cell to evaluate completely in R, using a so\-called "percent directive". Here is a sample linear regression from John Verzani's `simpleR <http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf>`_ text. Notice that R also uses the ``#`` symbol to indicate comments. .. skip :: sage: %r # optional - rpy2 ....: x = c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37) # ages of individuals ....: y = c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178) # maximum heart rate of each one ....: png() # turn on plotting ....: plot(x,y) # make a plot ....: lm(y ~ x) # do the linear regression ....: abline(lm(y ~ x)) # plot the regression line ....: dev.off() # turn off the device so it plots Call: lm(formula = y ~ x) Coefficients: (Intercept) x 210.0485 -0.7977 null device 1 .. image:: ../media/Rplot001.png :align: center To get a whole worksheet to evaluate in R (and be able to ignore the ``%``), you could also drop down the ``r`` option in the menu close to the top which currently has ``sage`` in it.