CoCalc -- script_plots

GitHub Repository: robertopucp/1eco35_2022_2
Path: blob/main/Lab10/script_plots_py.ipynb

Kernel: Python 3 (ipykernel)

In [129]:

from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 80%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 80%; }
</style>
"""))

Out[129]:

1. Quantities:

These plots show the level of a variable (counts or totals), usually broken down by category or classification.

In [130]:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt  # plotting library
import seaborn as sns  # second plotting library
import datetime as dt  # date handling
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages

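The dataset contains Latin characters (accents and ñ). To read them correctly, convert the file's text encoding in Stata with the following commands, which declare the source encoding as latin1 and translate the file to Unicode (UTF-8):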

cd "....\documents"

unicode analyze enaho.dta

unicode encoding set "latin1"

unicode translate enaho.dta

Plots: useful links

World Bank repository of plots

https://worldbank.github.io/r-econ-visual-library/

Python translation of the World Bank repository

https://d2cml-ai.github.io/python_visual_library/python_notebooks/02_density.html

The following dataset combines ENAHO modules 200, 300 and 500 for the period 2011-2019.

In [131]:

enaho = pd.read_stata(r"../data/ENAHO/enaho.dta")
enaho

Out[131]:

In [132]:

# each observation indicates the type of firm the person works at (micro, small, medium or large enterprise)

enaho['empresa']

Out[132]:

0                     NaN
1            microempresa
2            microempresa
3                     NaN
4            gran empresa
               ...       
160767                NaN
160768       microempresa
160769    mediana empresa
160770                NaN
160771       microempresa
Name: empresa, Length: 160772, dtype: category
Categories (4, object): ['microempresa' < 'pequeña empresa' < 'mediana empresa' < 'gran empresa']

Firm classification by number of workers hired

Microenterprises (fewer than 10 workers); small enterprises (10-20 workers); medium enterprises (21-100 workers); large enterprises (more than 100 workers)

Why do many examples use fig, ax = plt.subplots() in Matplotlib/pyplot/python ?

https://stackoverflow.com/questions/34162443/why-do-many-examples-use-fig-ax-plt-subplots-in-matplotlib-pyplot-python
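
As a minimal sketch of the idea behind that question: plt.subplots() returns a Figure (the whole canvas) and an Axes (the plotting area), and every customization can go through those two objects instead of the implicit pyplot state. The categories and counts below are made up for illustration and are not ENAHO figures.

```python
import matplotlib.pyplot as plt

# Made-up counts, for illustration only (not ENAHO figures)
categories = ["A", "B", "C"]
counts = [5, 3, 8]

fig, ax = plt.subplots(figsize=(7, 4))  # fig: Figure object, ax: Axes object

ax.bar(categories, counts)          # draw on this specific Axes
ax.set_title("Counts by category")  # Axes-level equivalent of plt.title
ax.set_xlabel("")                   # Axes-level equivalent of plt.xlabel
ax.set_ylabel("")                   # Axes-level equivalent of plt.ylabel

width, height = fig.get_size_inches()  # size and saving belong to the Figure
```

Keeping explicit fig and ax handles becomes useful once a figure has several Axes, or when the figure is saved with fig.savefig.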

In [133]:

fig, ax = plt.subplots(figsize=(7, 4))  # the first line sets the figure size: fig is the Figure object, ax the Axes

sns.countplot(x="empresa", data=enaho[enaho['year'] == "2019"])  # countplot: bar chart of counts per category

# customize the plot

# title function from matplotlib.pyplot (plt)

plt.title('Cantidad de empresas según clasificación en el 2019')

# label functions to change the axis labels

plt.xlabel(' ')
plt.ylabel(' ')

Out[133]:

Text(0, 0.5, ' ')

1.0 Horizontal countplot in a single color (red)

In [134]:

fig, ax = plt.subplots(figsize=(8,4))

x = sns.countplot(y = "empresa", data=enaho[enaho['year'] == "2019" ], color = 'red')

plt.title('Cantidad de empresas según clasificación en el 2019')
plt.xlabel(' ')
plt.ylabel(' ')

# color codes

#fig.savefig(r'imagen_python.png', dpi=800, bbox_inches='tight') # code to save the figure


# red as a hex code is #FF0000
# red in RGB is rgb(255, 0, 0)

Out[134]:

Text(0, 0.5, ' ')

2.0 Evolution of firms over the period 2017-2019

In [135]:

sns.set_context("paper")  # "paper" plotting context

base2 = enaho[enaho['year'] > "2016"]

fig, ax = plt.subplots(figsize=(10, 6))

# hue: variable used to disaggregate the counts

ax = sns.countplot(x="empresa", hue="year", alpha=0.8, linewidth=1, data=base2)
plt.title('Evolución de las empresas en el período 2017-2019', fontsize=15)  # fontsize: title font size
plt.xlabel(' ')
plt.ylabel(' ')

txt = "Elaboración propia - ENAHO (2011-2019)"  # source note
plt.figtext(0.04, 0.02, txt, wrap=True, horizontalalignment='left', va="top", fontsize=10)  # position of the source note

# fontsize: font size of the note
# horizontalalignment: horizontal alignment, here left
# 0.04, 0.02: coordinates of the source note, as fractions of the figure
# alpha: transparency level

Out[135]:

Text(0.04, 0.02, 'Elaboración propia - ENAHO (2011-2019)')
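
The bar heights that countplot draws with hue are the counts of each (category, hue) pair. A minimal sketch with a toy DataFrame reproduces them with pd.crosstab; the rows are invented for illustration, and only the column names empresa and year mirror the dataset.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names mirror enaho
toy = pd.DataFrame({
    "empresa": ["microempresa", "microempresa", "gran empresa",
                "microempresa", "gran empresa", "gran empresa"],
    "year": ["2017", "2018", "2017", "2018", "2019", "2019"],
})

# One row per firm type, one column per year: the heights of the grouped bars
counts = pd.crosstab(toy["empresa"], toy["year"])
print(counts)
```

Printing such a table next to the plot is a quick way to check the figure against the underlying numbers.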

In [138]:

# Total number of firms by size and area

base_2 = enaho[ enaho['year'] == "2019" ].groupby( [ 'empresa', 'area' ] ).size().reset_index(name='num_firms')
base_2

Out[138]:

In [139]:

enaho[ enaho['year'] == "2019" ]

Out[139]:

In [140]:

enaho["cantidad_empresas"] = enaho[ enaho['year'] == "2019" ].groupby( [ 'empresa', 'area' ])['empresa'].transform('size')
enaho

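# transform returns the groupby result for every row of the filtered data, so it can be stored in the original DataFrame;
# rows from years other than 2019 get NaN because the assignment aligns on the index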

Out[140]:
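
The difference between the two groupby calls above can be sketched on a toy DataFrame: size() collapses the data to one row per group, while transform('size') returns one value per original row. The rows below are invented for illustration; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data, invented for illustration; column names mirror the notebook
toy = pd.DataFrame({
    "empresa": ["micro", "micro", "gran", "micro", "gran"],
    "area": ["urbana", "urbana", "urbana", "rural", "urbana"],
})

# size(): one row per (empresa, area) group, i.e. a collapsed table
collapsed = toy.groupby(["empresa", "area"]).size().reset_index(name="num_firms")

# transform('size'): the group size repeated on every row of the original data
toy["cantidad_empresas"] = toy.groupby(["empresa", "area"])["empresa"].transform("size")

print(collapsed)
print(toy)
```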

In [141]:

# stacked information
base_3 = base_2.pivot(index = 'empresa', columns = 'area', values = 'num_firms')
base_3

# pivot works like a reshape: the values of area (urban or rural) become columns

Out[141]:
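
A small sketch of the long-to-wide reshape that pivot performs, using made-up counts rather than the ENAHO totals. Each column of the wide table then becomes one layer of a stacked bar chart.

```python
import pandas as pd

# Long format: one row per (empresa, area) pair; the counts are made up
long_table = pd.DataFrame({
    "empresa": ["micro", "micro", "gran", "gran"],
    "area": ["rural", "urbana", "rural", "urbana"],
    "num_firms": [10, 30, 2, 18],
})

# Wide format: one row per firm type, one column per area
wide_table = long_table.pivot(index="empresa", columns="area", values="num_firms")
print(wide_table)
```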

In [142]:

# Plot directly from the DataFrame, which uses matplotlib underneath
base_3.plot( kind='bar', stacked=True, title='Empresa por estrato', color = ['blue', 'lightblue'] )
plt.xlabel(' ')

# color = ['blue', 'lightblue']: one color per category

Out[142]:

Text(0.5, 0, ' ')

2. Proportions

These plots are an easy-to-read way to present categorical variables. Use them to show each category's share of the total.

Pie

First of all, collapse the DataFrame to count the observations in each category of a variable.

In [143]:

base = enaho.groupby([ 'empresa' ])['empresa'].size()
base

# count by firm type

Out[143]:

empresa
microempresa       58002
pequeña empresa     9306
mediana empresa    10304
gran empresa       29642
Name: empresa, dtype: int64

In [145]:

# Labels with corrected category names

labels = ['Microempresa', 'Pequeña empresa', 'Mediana empresa', 'Gran empresa']

plt.figure(figsize=(10, 10))  # figure size

ax = plt.pie(base, labels=labels, autopct='%.1f %%')
plt.title('Distribución de las empresas peruanas (2011-2019)')  # base counts all survey years
plt.show()

# in '%.1f', the 1 is the number of decimal places ('%.3f' would show three)
# plt.pie computes the percentages internally

Out[145]:
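
plt.pie normalizes the values it receives, so the percentage printed by autopct for each wedge is that category's count divided by the total. The sketch below repeats that arithmetic by hand, using the counts shown in the groupby output above.

```python
# Counts copied from the groupby output above
counts = {
    "microempresa": 58002,
    "pequeña empresa": 9306,
    "mediana empresa": 10304,
    "gran empresa": 29642,
}
total = sum(counts.values())

# Share of each category in percent: the quantity that autopct formats
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print("%s: %.1f %%" % (name, pct))  # same format string as autopct
```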

In [146]:

base = enaho.groupby('labor')['labor'].size()  # count by occupation type

fig, ax = plt.subplots(figsize=(10, 10))

base.plot(kind='pie', autopct='%.1f %%')
plt.title("Ocupaciones laborales")
plt.ylabel("")
plt.show()

fig.savefig(r'../plots/imagen_python.png', dpi=800, bbox_inches='tight')

# dpi: resolution, i.e. the sharpness of the saved image

# bbox_inches='tight': no part of the plot is cut off

Out[146]:

Donuts

In [149]:

base2 = enaho.groupby([ 'sector' ]).size() # count by economic sector

labels=['Agricultura y pesca','Minería','Manufactura','Construcción','Comercio','Transporte y telecomunicaciones', 'Finanzas', 'Servicios']
plt.figure(figsize=(10, 6))

ax = plt.pie(base2, labels=labels,
        autopct='%1.1f%%', pctdistance=0.85)
  
# size and color of the center circle that turns the pie into a donut

center_circle = plt.Circle((0, 0), 0.50, fc='white')
fig = plt.gcf()

fig.gca().add_artist(center_circle)
  
plt.title('Distribución de la mano de obra contratada por sector económico')

# Adding notes

txt="Elaboración propia - ENAHO (2011-2019)"
plt.figtext(0.2, 0.01, txt, wrap=True, horizontalalignment='right', fontsize=10)

plt.show()

Out[149]:

3. Distributions

Distribution plots visually assess the distribution of sample data by comparing the empirical distribution of the data with the theoretical values expected from a specified distribution.

In [17]:

# filter the data to 2019
base4 = enaho[enaho['year'] == "2019"]

fig, ax = plt.subplots(figsize=(10, 10))  # figure object

base4['l_salario'].plot(kind='hist', bins=10)  # number of bars: split the variable's range into 10 intervals
plt.title('Logaritmo del salario por hora')

# kind='hist' draws a histogram of absolute frequencies

txt = "Elaboración propia - ENAHO (2011-2019)"
plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment='center', fontsize=12)
plt.show()

Out[17]:

Reducing intervals

Frequency distribution with narrower intervals. Because each bin covers a smaller range of values, it contains fewer observations, so each bar is shorter.

In [151]:

sns.set('paper')  # paper-style context
sns.set_style("ticks")  # white background with axis ticks

base4['l_salario'].plot(kind='hist', bins=20, figsize=(8, 6))
plt.title('Logaritmo del salario por hora')

txt = "Elaboración propia - ENAHO (2011-2019)"
plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment='center', fontsize=11)
plt.show()

Out[151]:
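
The effect of the number of bins can be checked numerically. This is a sketch on synthetic data standing in for l_salario (the normal distribution and its parameters are assumptions, not estimates from ENAHO): the total count is unchanged, but the tallest bar shrinks when the same range is split into twice as many intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=0.8, size=5_000)  # synthetic stand-in for l_salario

# Same data and range, split into 10 and then 20 equal-width intervals
counts_10, edges_10 = np.histogram(x, bins=10)
counts_20, edges_20 = np.histogram(x, bins=20)

print("tallest bar with 10 bins:", counts_10.max())
print("tallest bar with 20 bins:", counts_20.max())
print("observations counted:", counts_10.sum(), counts_20.sum())
```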

Multiple histograms

In [19]:

# to draw several histograms, one per category, use seaborn's FacetGrid

figure1 = sns.FacetGrid(base4, col="empresa", margin_titles=True)

figure1.map(plt.hist, 'l_salario', bins=np.linspace(0, 20, 30))

# np.linspace(0, 20, 30): 30 evenly spaced bin edges between 0 and 20, i.e. 29 bins

Out[19]:

<seaborn.axisgrid.FacetGrid at 0x1f2df9831c0>

Density of the real hourly wage:

The distribution does not resemble a normal distribution. Most of the mass is concentrated at low values, with a few very high observations forming a long right tail.

In [20]:

plt.figure(figsize=(8, 8))

sns.distplot(base4['salario'], label = "Densidad", color = 'blue')
plt.title('Salario por hora')
plt.xlabel(' ')
plt.show()

Out[20]:

Logarithm of real hourly wage

Taking logarithms corrects much of the asymmetry (right skew) in the original data.

In [152]:

#Alternative figure size
plt.figure(figsize=(8, 8))

sns.distplot(base4['l_salario'], label = "Densidad", color = 'black') # distplot: histogram with a density (KDE) curve; deprecated since seaborn 0.11 in favor of histplot/displot
plt.title('Logaritmo del salario por hora')
plt.xlabel(' ')
plt.show()

Out[152]:
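
A rough numerical check of why the log transformation helps, on synthetic data. The lognormal wages below are an assumption chosen to be right-skewed and are not ENAHO values, and skewness is a small helper defined here for illustration.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
wage = rng.lognormal(mean=1.5, sigma=0.8, size=10_000)  # synthetic, right-skewed

skew_level = skewness(wage)        # clearly positive: long right tail
skew_log = skewness(np.log(wage))  # close to zero: roughly symmetric

print("skewness in levels:", round(skew_level, 2))
print("skewness in logs:  ", round(skew_log, 2))
```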

Real wages for three sectors (construction, commerce and mining)

In [22]:

plt.figure(figsize=(10, 6))

#Adding densities 

sns.kdeplot(base4.l_salario[base4.sector=='Construcción'], label='Construcción', shade=True)
sns.kdeplot(base4.l_salario[base4.sector=='Comercio, hoteles y restaurantes'], label='Comercio, hoteles y restaurantes ', shade=True)
sns.kdeplot(base4.l_salario[base4.sector=='minería'], label='Minería ', shade=True)
plt.legend()  # show the labels passed to each kdeplot call
plt.xlabel('Logaritmo del salario by Industry')

# The construction sector shows some stochastic dominance over mining and commerce

# plots in Python, R and Stata are built up in layers

Out[22]:

Text(0.5, 0, 'Logaritmo del salario by Industry')

Labor occupations and real salary per hour densities

A for loop overlays the density of hourly wages for several occupations on the same plot.

In [23]:

fig, ax = plt.subplots(figsize=(8,6))
sector = [ 'Ocupaciones elementales', 'Profesionales y fuerzas armadas', 'Operadores de planta y maquinaria',
          'Trabajo en actividades agrícolas, selvicultura y pesca']
nombre = [ 'Ocupaciones elementales', 'Profesionales', 'Planta y maquinaria','Actividades extractivas']
    
for a, b in zip(sector, nombre):
    sns.kdeplot(base4.l_salario[base4.labor == a], label=b, shade=True)

plt.legend()  # show the occupation labels
plt.xlabel('Logaritmo del salario por tipo de ocupación')

# Two relevant findings: stochastic dominance of professionals' wages, and
# concentration at lower wage levels in primary-sector occupations (agriculture, forestry and fishing).

Out[23]:

Text(0.5, 0, 'Logaritmo del salario por tipo de ocupación')

Box plot of real wage by education level

In [24]:

fig, ax = plt.subplots(figsize=(10,6))

box = sns.boxplot(x="educ", y="l_salario", data=enaho[enaho['year'] == "2019" ] ,palette='rainbow')
plt.xlabel('Nivel educativo alcanzado')
plt.ylabel('Logaritmo del salario por hora')
(box.set_xticklabels(["Sec. completa", "No uni. incompleta", "No uni. completa", "Uni. incompleta", "Uni. completa", "Posgrado"])) # x-axis tick labels

# The real wage quartiles are increasing with the educational level.
# Lower salary dispersion for the postgraduate level.

Out[24]:

[Text(0, 0, 'Sec. completa'),
 Text(1, 0, 'No uni. incompleta'),
 Text(2, 0, 'No uni. completa'),
 Text(3, 0, 'Uni. incompleta'),
 Text(4, 0, 'Uni. completa'),
 Text(5, 0, 'Posgrado')]

In [25]:

fig, ax = plt.subplots(figsize=(10,6))

box = sns.boxplot(x="educ", y="l_salario", data=enaho[enaho['year'] == "2019" ] ,palette='rainbow', showfliers=False)
plt.xlabel('Nivel educativo alcanzado')
plt.ylabel('Logaritmo del salario por hora')
(box.set_xticklabels(["Sec. completa", "No uni. incompleta", "No uni. completa", "Uni. incompleta", "Uni. completa", "Posgrado"]))

# showfliers=False hides the outlier points (the data and quartiles are unchanged)

Out[25]:

[Text(0, 0, 'Sec. completa'),
 Text(1, 0, 'No uni. incompleta'),
 Text(2, 0, 'No uni. completa'),
 Text(3, 0, 'Uni. incompleta'),
 Text(4, 0, 'Uni. completa'),
 Text(5, 0, 'Posgrado')]
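
The points hidden by showfliers=False are the ones lying beyond the whiskers. With the default whisker setting (whis=1.5) these are the observations more than 1.5 times the interquartile range away from the box. A minimal sketch of that rule on a made-up sample:

```python
import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 12.0])  # made-up sample

q1, q3 = np.percentile(values, [25, 75])  # edges of the box
iqr = q3 - q1                             # interquartile range

# Observations outside these limits are drawn as individual outlier points
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("limits:", lower, upper)
print("outliers:", outliers)
```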

Macroeconomics

IPM: log commodity price index; TC: log exchange rate; RIN: log international reserves; IPC: log consumer price index; RATE: central bank reference rate

Prefix D: annual difference (for example, DIPM, DRIN and DTC)

In [153]:

macro = pd.read_csv(r"../data/macroeconomia.csv")
macro['YEAR']  = pd.to_datetime(macro['Fecha']) # parse the date into a new datetime column
macro

Out[153]:
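
The D-prefixed variables are annual differences of the log series, which approximate annual growth rates. How they were built is not shown in the notebook; the sketch below is one way to compute them, assuming monthly data (so the annual difference is x_t - x_{t-12}) and using a synthetic index rather than BCRP data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly index growing 0.5% per month (not BCRP data)
dates = pd.date_range("2013-01-01", periods=36, freq="MS")
level = 100 * 1.005 ** np.arange(36)

toy = pd.DataFrame({"Fecha": dates, "IPM": np.log(level)})  # IPM: log of the index

# Annual difference of the log series, assuming monthly frequency: x_t - x_{t-12}
toy["DIPM"] = toy["IPM"].diff(12)

print(toy.tail())
```

The first twelve values of the difference are missing because there is no observation a year earlier to compare against.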

In [158]:

sns.set_context("paper")  # "paper" plotting context

sns.relplot(x="YEAR", y="IPM", kind="line", color="red", data=macro, height=5, aspect=1.5)

# aspect=1.5: width-to-height ratio of the figure (width = aspect * height)
# height: height of the figure in inches
# kind='line': line plot
plt.xlabel(' ')
plt.ylabel(' ')
plt.title('Índice del precio de materias primas 2013-2019')

txt = "Elaboración propia - BCRP"
plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment='center', fontsize=12)

Out[158]:

Text(0.5, 0.01, 'Elaboración propia - BCRP')

Series in a single image

This graph shows a positive relationship between the change in international reserves and the commodity index.

In [159]:

sns.set_context("paper")
fig, ax = plt.subplots(figsize=(12,5)) # define the figure and axes objects

x = macro['YEAR']
y1 = macro['DIPM']
y2 = macro['DRIN']

plt.plot(x, y1, label ='Indice de materias primas (Var %)', color='blue')
plt.plot(x, y2, label ='Reservas internacionales (Var %)', color='red')
plt.axhline(y=0, color='black', linestyle='--', lw=0.8) # add a horizontal line at zero

# linestyle='--': dashed line style

plt.legend(loc='upper right')

txt="Elaboración propia - BCRP"  
plt.figtext(0.2, 0.01, txt, wrap=True, horizontalalignment='right', fontsize=10)

Out[159]:

Text(0.2, 0.01, 'Elaboración propia - BCRP')

Dual-line Plots

Exchange rate and monetary policy reaction

In [162]:

#sns.set('notebook')
# sns.set_theme(style="white")
sns.set_context("paper")

fig, ax = plt.subplots(figsize=(10,5))
lineplot = sns.lineplot(x= "YEAR" , y= "RATE", data=macro, 
                        label = 'Tasa interbancaria promedio ', color="k", legend=False)

#sns.despine()
plt.ylabel('Tasa de política monetaria')
plt.xlabel(' ')
plt.title('Exchange rate and monetary policy reaction');

ax2 = ax.twinx() # second y-axis that shares the same x-axis
lineplot2 = sns.lineplot(x= "YEAR", y= "DTC", data=macro, ax=ax2, color = "red", linestyle='--',
                          lw=1.5, label ='Tipo de cambio (variación anual %)', legend=False) 

# lw: line width

sns.despine(right=False) # remove the top spine but keep the right one for the secondary axis
plt.ylabel('Tipo de cambio')
ax.figure.legend(loc='lower center', bbox_to_anchor=(0.75, 0.15), ncol=1);


txt="Elaboración propia - BCRP"  
plt.figtext(0.2, 0.01, txt, wrap=True, horizontalalignment='right', fontsize=10)

# Add a text annotation ('Crisis Financiera'); its position is given in figure coordinates

fig.text(0.45,0.6,'Crisis Financiera',color = 'darkblue', size = 10,
        bbox=dict(facecolor='none', edgecolor='blue', pad=4.0)) # bbox: draws a box around the text; pad: padding between the text and the box

fig.savefig(r'../plots/BCRP.png', dpi=800, bbox_inches='tight')

Out[162]:
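
The core of the dual-line plot is ax.twinx(), which creates a second Axes that shares the x-axis but has its own y-axis on the right. A stripped-down sketch with plain matplotlib and synthetic series (not BCRP data; the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic series on very different scales (not BCRP data)
t = np.arange(24)
rate = 3.5 + 0.5 * np.sin(t / 4)  # stands in for a policy rate, in percent
change = 10 * np.cos(t / 4)       # stands in for an annual % change

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(t, rate, color="k", label="Rate")
ax.set_ylabel("Rate")

ax2 = ax.twinx()  # new Axes sharing the x-axis, with its own y-axis on the right
ax2.plot(t, change, color="red", linestyle="--", label="Change")
ax2.set_ylabel("Change")

fig.legend(loc="lower center")  # figure-level legend collects labels from both Axes
```

Because each series has its own y-scale, the two lines can be compared in timing and direction, but their vertical positions are not directly comparable.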

Reference:

Library of plots

https://www.python-graph-gallery.com/stacked-and-percent-stacked-barplot

Seaborn package:

https://seaborn.pydata.org/generated/seaborn.catplot.html

https://programmerclick.com/article/54791895404/