GitHub Repository: robertopucp/1eco35_2022_2
Path: blob/main/Trabajo_grupal/WG8/Grupo_4_jupyter.ipynb

Kernel: Python 3 (ipykernel)

In [11]:

from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 80%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 80%; }
</style>
"""))

Out[11]:

Assignment 8 - Group 4:

Plots in Jupyter Notebook: Population Survey - USA (2015)

Members:

Luana Morales
Marcela Quintero
Seidy Ascencios
Flavia Oré

In [127]:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt  # plotting library
import seaborn as sns  # second plotting library
import datetime as dt  # date handling
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages

We load the dataset

In [128]:

!pip install pyreadr

import pyreadr

Out[128]:

Requirement already satisfied: pyreadr in c:\users\marcela quintero\anaconda3\lib\site-packages (0.4.7)
Requirement already satisfied: pandas>=1.2.0 in c:\users\marcela quintero\anaconda3\lib\site-packages (from pyreadr) (1.4.2)
Requirement already satisfied: numpy>=1.18.5 in c:\users\marcela quintero\anaconda3\lib\site-packages (from pandas>=1.2.0->pyreadr) (1.21.5)
Requirement already satisfied: pytz>=2020.1 in c:\users\marcela quintero\anaconda3\lib\site-packages (from pandas>=1.2.0->pyreadr) (2021.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\marcela quintero\anaconda3\lib\site-packages (from pandas>=1.2.0->pyreadr) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\marcela quintero\anaconda3\lib\site-packages (from python-dateutil>=2.8.1->pandas>=1.2.0->pyreadr) (1.16.0)

In [129]:

encuesta = pyreadr.read_r("../1ECO35_2022_2/data/wage2015_subsample_inference.RData")
encuesta

Out[129]:

OrderedDict([('data',
                             wage     lwage  sex  shs  hsg  scl  clg   ad   mw   so   we  \
              rownames                                                                     
              10         9.615385  2.263364  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0   
              12        48.076923  3.872802  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0   
              15        11.057692  2.403126  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0   
              18        13.942308  2.634928  1.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0   
              19        28.846154  3.361977  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0   
              ...             ...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
              32620     14.769231  2.692546  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0   
              32624     23.076923  3.138833  1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  1.0   
              32626     38.461538  3.649659  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0   
              32631     32.967033  3.495508  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0   
              32643     17.307692  2.851151  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0   
              
                         ne  exp1  exp2    exp3     exp4   occ occ2   ind ind2  
              rownames                                                          
              10        1.0   7.0  0.49   0.343   0.2401  3600   11  8370   18  
              12        1.0  31.0  9.61  29.791  92.3521  3050   10  5070    9  
              15        1.0  18.0  3.24   5.832  10.4976  6260   19   770    4  
              18        1.0  25.0  6.25  15.625  39.0625   420    1  6990   12  
              19        1.0  22.0  4.84  10.648  23.4256  2015    6  9470   22  
              ...       ...   ...   ...     ...      ...   ...  ...   ...  ...  
              32620     0.0   9.0  0.81   0.729   0.6561  4700   16  4970    9  
              32624     0.0  12.0  1.44   1.728   2.0736  4110   13  8680   20  
              32626     0.0  11.0  1.21   1.331   1.4641  1550    4  3680    6  
              32631     0.0  10.0  1.00   1.000   1.0000  2920    9  6570   11  
              32643     0.0  14.0  1.96   2.744   3.8416  1610    5  7460   14  
              
              [5150 rows x 20 columns])])

In [130]:

type(encuesta)

Out[130]:

collections.OrderedDict

In [131]:

encuesta_eu = encuesta['data']
encuesta_eu

Out[131]:

In [132]:

type(encuesta_eu)

Out[132]:

pandas.core.frame.DataFrame
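
`pyreadr.read_r` returns a dictionary-like object with one entry per R object stored in the file, which is why the data frame has to be pulled out by its key. A minimal sketch of that access pattern follows, using a hand-built `OrderedDict` as a stand-in for the real result. The toy values and the helper name `first_frame` are illustrative assumptions, not part of the notebook.

```python
from collections import OrderedDict

import pandas as pd

# Stand-in for the object returned by pyreadr.read_r: keys are R object names.
toy_result = OrderedDict(
    [("data", pd.DataFrame({"wage": [9.6, 48.1], "lwage": [2.26, 3.87]}))]
)


def first_frame(result):
    """Return the name and the DataFrame of the first object in the result."""
    name = next(iter(result))
    return name, result[name]


name, frame = first_frame(toy_result)
print(list(toy_result.keys()))  # which objects the (toy) file contained
print(name, frame.shape)
```

Listing the keys first is a cheap way to check what the file contains before indexing into it.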

1) Wage histogram

In [133]:

# Absolute frequency of wages

In [134]:

encuesta_eu['wage']

Out[134]:

rownames
10        9.615385
12       48.076923
15       11.057692
18       13.942308
19       28.846154
           ...    
32620    14.769231
32624    23.076923
32626    38.461538
32631    32.967033
32643    17.307692
Name: wage, Length: 5150, dtype: float64

In [135]:

fig, ax = plt.subplots(figsize=(5, 5))  # figure object

encuesta_eu['wage'].plot(kind='hist', bins=30)  # number of bars: split the range of the variable into 30 intervals
plt.title('Hourly wage')

# kind='hist' draws a histogram of absolute frequencies

txt = "Own elaboration based on the Population Survey - USA (2015)"
plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment='center', fontsize=5)
plt.show()

Out[135]:
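
With `bins=30` the range of the variable is split into 30 equal-width intervals, and each bar counts the observations that fall in one of them. The sketch below reproduces that counting with `numpy.histogram` on simulated data. The log-normal sample and its parameters are assumptions made only for illustration; they are not the survey data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated right-skewed "wages"; the parameters are arbitrary illustration values.
toy_wage = rng.lognormal(mean=3.0, sigma=0.6, size=5150)

# counts[i] is the number of observations between edges[i] and edges[i + 1]
counts, edges = np.histogram(toy_wage, bins=30)

widths = np.diff(edges)
print(len(counts), len(edges))          # number of bars and number of bin edges
print(counts.sum())                     # absolute frequencies add up to the sample size
print(np.allclose(widths, widths[0]))   # whether all bins share the same width
```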

In [136]:

# Logarithm of the wage

In [137]:

encuesta_eu['lwage']

Out[137]:

rownames
10       2.263364
12       3.872802
15       2.403126
18       2.634928
19       3.361977
           ...   
32620    2.692546
32624    3.138833
32626    3.649659
32631    3.495508
32643    2.851151
Name: lwage, Length: 5150, dtype: float64

In [138]:

fig, ax = plt.subplots(figsize=(5, 5))  # figure object

encuesta_eu['lwage'].plot(kind='hist', bins=50)  # number of bars: split the range of the variable into 50 intervals
plt.title('Log hourly wage')

# kind='hist' draws a histogram of absolute frequencies

txt = "Own elaboration based on the Population Survey - USA (2015)"
plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment='center', fontsize=5)
plt.show()

Out[138]:

Looking at the histogram of wages, we see that a large share of the surveyed US workers earn less than 50 dollars per hour worked. The distribution is strongly right-skewed: most of the mass sits at the low end of the range, followed by a long right tail, which might lead one to suspect that the variable is censored or truncated. The log of the wage, in contrast, has a distribution much closer to a normal: its variability is lower and its range is far narrower than that of the original variable. Working with the log wage also reduces the sensitivity of estimates to extreme or outlying observations.
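
The effect of the log transformation on skewness, spread and range can be checked numerically. The following is only a sketch on simulated data: the log-normal sample, its parameters and the names `toy_wage` and `toy_lwage` are assumptions for illustration and do not come from the survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Simulated right-skewed "wages" (not the survey data).
toy_wage = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=5000), name="wage")
toy_lwage = np.log(toy_wage).rename("lwage")

# Compare asymmetry, spread and range before and after taking logs.
summary = pd.DataFrame(
    {
        "skewness": [toy_wage.skew(), toy_lwage.skew()],
        "std": [toy_wage.std(), toy_lwage.std()],
        "range": [toy_wage.max() - toy_wage.min(),
                  toy_lwage.max() - toy_lwage.min()],
    },
    index=["wage", "lwage"],
)
print(summary.round(3))
```

Running the same three statistics on `encuesta_eu['wage']` and `encuesta_eu['lwage']` would give the corresponding figures for the actual sample.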

2) Density plot

Wages of women and men who completed university

In [139]:

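# sex is taken to be 1 for women and 0 for men, matching the tick labels of the box plot in section 4
# Women: clg = 1 and sex = 1
# Men:   clg = 1 and sex = 0

In [140]:

# Both conditions must hold at once, so the masks are combined with & (AND), not ^ (XOR)
base_muj_uni = encuesta_eu[(encuesta_eu["sex"] == 1.0) & (encuesta_eu["clg"] == 1.0)]
base_hom_uni = encuesta_eu[(encuesta_eu["sex"] == 0.0) & (encuesta_eu["clg"] == 1.0)]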

In [141]:

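plt.figure(figsize=(4, 4))

# fill=True shades the area under each curve (it replaces the deprecated shade=True)
sns.kdeplot(base_muj_uni['lwage'], label="Density, women", color='red', fill=True)
sns.kdeplot(base_hom_uni['lwage'], label="Density, men", color='blue', fill=True)
plt.title('Density of log hourly wage by sex')
plt.xlabel('Log hourly wage')
plt.legend()  # show the labels passed to kdeplot
plt.show()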

Out[141]:
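
The two subsamples behind the density plot need rows where both conditions hold at once. In pandas, boolean masks are combined element-wise: `&` keeps rows where both conditions are true, while `^` (exclusive OR) keeps rows where exactly one of them is true. A small sketch on a made-up frame (the five rows are invented for illustration) shows how different the two selections are:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "sex": [0.0, 0.0, 1.0, 1.0, 0.0],
        "clg": [1.0, 0.0, 1.0, 0.0, 1.0],
    }
)

is_zero = toy["sex"] == 0.0
is_clg = toy["clg"] == 1.0

both = toy[is_zero & is_clg]         # rows meeting both conditions
exactly_one = toy[is_zero ^ is_clg]  # rows meeting one condition but not the other

print(both.index.tolist())
print(exactly_one.index.tolist())
```

Here the two selections do not share a single row, so swapping one operator for the other changes the subsample entirely.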

3) Pie chart

Percentage of people by education level

In [142]:

# Build a table that counts the observations in each education-level category

base_ne = encuesta_eu.groupby(['shs', 'hsg', 'scl', 'clg', 'ad']).size()
base_ne

Out[142]:

shs  hsg  scl  clg  ad 
0.0  0.0  0.0  0.0  1.0     706
               1.0  0.0    1636
          1.0  0.0  0.0    1432
     1.0  0.0  0.0  0.0    1256
1.0  0.0  0.0  0.0  0.0     120
dtype: int64
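
Every respondent has exactly one of the five education dummies equal to 1, so the same counts can also be obtained by summing each column, which keeps the categories in the order in which the columns are listed. `groupby(...).size()` instead sorts the rows by the group keys, so the categories come out in a different order, and any list of labels has to follow that order. A sketch on invented data (the toy frame below is an assumption, not the survey):

```python
import pandas as pd

levels = ["shs", "hsg", "scl", "clg", "ad"]
toy = pd.DataFrame(
    [
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
    ],
    columns=levels,
    dtype=float,
)

by_column = toy[levels].sum().astype(int)  # order follows `levels`: shs, hsg, scl, clg, ad
by_group = toy.groupby(levels).size()      # order follows the sorted keys: ad, clg, scl, hsg, shs

print(by_column.tolist())
print(by_group.tolist())
```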

In [143]:

# Labels listed in the same order as the rows of base_ne (ad, clg, scl, hsg, shs)

labels = ['Advanced degree', 'Complete university', 'Incomplete university', 'Complete secondary', 'Incomplete secondary']

# Pie chart of education levels

plt.figure(figsize=(10, 10))  # figure size

ax = plt.pie(base_ne, labels=labels, autopct='%.1f %%')
plt.title('Education levels')
plt.show()

Out[143]:

4) Box plot

In [144]:

# To build the box plot, we first create a dummy variable for the highest education level (advanced degree, ad)

encuesta_eu['dummyad'] = np.where(encuesta_eu['ad'] > 0.0, 1, 0)

encuesta_eu['dummyad']

Out[144]:

rownames
10       0
12       0
15       0
18       1
19       0
        ..
32620    0
32624    0
32626    1
32631    0
32643    1
Name: dummyad, Length: 5150, dtype: int32
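
`np.where(condition, 1, 0)` maps a boolean condition to a 0/1 indicator. As a sketch, the same indicator can be built by casting the boolean mask to integers. The four toy values below are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_ad = pd.Series([0.0, 1.0, 0.0, 1.0], name="ad")

with_where = np.where(toy_ad > 0.0, 1, 0)   # NumPy array of 0/1
with_astype = (toy_ad > 0.0).astype(int)    # pandas Series of 0/1, keeps the index

print(with_where.tolist())
print(with_astype.tolist())
```

The cast keeps the original index attached to the result, which can help when the indicator is inspected on its own.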

In [145]:

# Next, we draw the box plot, using the dummy to select people with the highest education level (ad), split by sex.

fig, ax = plt.subplots(figsize=(10, 6))

box = sns.boxplot(data=encuesta_eu[encuesta_eu.dummyad == 1], x='sex', y='lwage', palette='pastel')
plt.xlabel('Individuals with an advanced degree')
plt.ylabel('Log hourly wage')
box.set_xticklabels(["Men", "Women"])

Out[145]:

[Text(0, 0, 'Men'), Text(1, 0, 'Women')]
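
A box plot summarizes each group through its quartiles, and with seaborn's default settings the whiskers reach the most extreme observations lying within 1.5 times the interquartile range of the box. The sketch below computes those numbers per group on simulated data. The toy frame, the random draws and the column names `q1`, `median`, `q3`, `iqr`, `lower_fence` and `upper_fence` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Simulated log wages for two groups (not the survey data).
toy = pd.DataFrame(
    {
        "sex": rng.integers(0, 2, size=400),
        "lwage": rng.normal(loc=3.2, scale=0.5, size=400),
    }
)

# One row per group, one column per quartile.
quartiles = toy.groupby("sex")["lwage"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

# Points beyond these fences are the ones drawn individually as outliers.
quartiles["lower_fence"] = quartiles["q1"] - 1.5 * quartiles["iqr"]
quartiles["upper_fence"] = quartiles["q3"] + 1.5 * quartiles["iqr"]
print(quartiles.round(3))
```

Applying the same grouping to `encuesta_eu[encuesta_eu.dummyad == 1]` would give the values behind the two boxes in the plot.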