GitHub Repository: robertopucp/1eco35_2022_2
Path: blob/main/Lab11/Web_scrapping.ipynb

Kernel: Python 3 (ipykernel)

In [2]:

from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 85%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 80%; }
</style>
"""))

Out[2]:

7. Web Scraping

Web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.
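
As a minimal sketch of the "parse that data" step, the snippet below uses Python's standard-library html.parser on a hard-coded HTML string. The string and the TitleParser class are illustrative assumptions, not part of the course material, and no network request is made.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would first download this from a web server.
SAMPLE_HTML = "<html><head><title>Election results</title></head><body><p>Hello</p></body></html>"


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
```

Selenium's driver.title, used later in this notebook, gives the same piece of information for a live page.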

7.1 Selenium

Selenium automates browsers. That's it!
Selenium is a tool, with a Python library, for automating web browsers to perform a number of tasks. One such task is web scraping: extracting useful data and information that may otherwise be unavailable.
For this course, we use Chrome.

7.1 Installing Libraries

We need to install these two libraries:

In [1]:

#!pip install selenium
#!pip install webdriver-manager

Selenium library

LINK

URL to check the Chrome version installed on your computer

chrome://version

Link to download ChromeDriver

LINK

7.2 Calling Libraries

In [1]:


from selenium import webdriver  # controls the browser driver
from webdriver_manager.chrome import ChromeDriverManager  # manages different driver versions

import re  # regular expressions
import time  # pauses between actions
from selenium.webdriver.support.ui import Select  # works with the <select></select> tag
import os
import sys
from selenium.webdriver.common.by import By  # lets us select elements in an HTML page
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages

from selenium.webdriver.common.keys import Keys  # type information into the web page (names, dates)
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains  # move around the web page
import pandas as pd
import numpy as np
import unidecode  # used to remove accents
from tqdm import tqdm

Unidecode

https://pypi.org/project/Unidecode/

7.3 Launch/Set the Driver

This code opens a Chrome driver. We will use it to browse the web.

In [3]:

# Case 1 - Download the driver

driver = webdriver.Chrome("chromedriver_07.exe")
driver.maximize_window()

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'

driver.get( url ) # load the URL in the browser

Chrome is being controlled by automated test software!
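
The import cell brings in ChromeDriverManager, but the cell above points at a local chromedriver_07.exe instead. A possible alternative is sketched below, assuming Selenium 4 and the webdriver-manager package are installed. The function name make_chrome_driver is hypothetical.

```python
def make_chrome_driver():
    """Start Chrome with a driver binary fetched by webdriver-manager."""
    # Imports are kept inside the function so the sketch can be defined
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    service = Service(ChromeDriverManager().install())  # path to a downloaded chromedriver
    driver = webdriver.Chrome(service=service)
    driver.maximize_window()
    return driver


# Usage (opens a real browser window, so it is left commented out):
# driver = make_chrome_driver()
# driver.get('https://resultadoshistorico.onpe.gob.pe/EG2021/')
```

This avoids having to match the driver executable to the installed Chrome version by hand.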

In [4]:

# Access the content of the <title></title> tag
print('Title: ', driver.title)

Out[4]:

Title:  Presentación de Resultados Elecciones Generales y Parlamento Andino 2021

In [5]:

# Access the current URL

print('Current Page URL: ', driver.current_url)

Out[5]:

Current Page URL:  https://resultadoshistorico.onpe.gob.pe/EG2021/EleccionesPresidenciales/RePres/P/010000/010300

In [6]:

# Screenshot 
driver.save_screenshot('resultados_presidenciales.png')

Out[6]:

True

In [14]:

type(driver)

Out[14]:

selenium.webdriver.chrome.webdriver.WebDriver

In [8]:

#driver.quit()

In [7]:

dir(driver) # list the methods and attributes of the object

Out[7]:

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_authenticator_id',
 '_file_detector',
 '_get_cdp_details',
 '_is_remote',
 '_mobile',
 '_shadowroot_cls',
 '_switch_to',
 '_unwrap_value',
 '_web_element_cls',
 '_wrap_value',
 'add_cookie',
 'add_credential',
 'add_virtual_authenticator',
 'application_cache',
 'back',
 'bidi_connection',
 'capabilities',
 'caps',
 'close',
 'command_executor',
 'create_options',
 'create_web_element',
 'current_url',
 'current_window_handle',
 'delete_all_cookies',
 'delete_cookie',
 'delete_network_conditions',
 'desired_capabilities',
 'error_handler',
 'execute',
 'execute_async_script',
 'execute_cdp_cmd',
 'execute_script',
 'file_detector',
 'file_detector_context',
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 'find_element_by_link_text',
 'find_element_by_name',
 'find_element_by_partial_link_text',
 'find_element_by_tag_name',
 'find_element_by_xpath',
 'find_elements',
 'find_elements_by_class_name',
 'find_elements_by_css_selector',
 'find_elements_by_id',
 'find_elements_by_link_text',
 'find_elements_by_name',
 'find_elements_by_partial_link_text',
 'find_elements_by_tag_name',
 'find_elements_by_xpath',
 'forward',
 'fullscreen_window',
 'get',
 'get_cookie',
 'get_cookies',
 'get_credentials',
 'get_issue_message',
 'get_log',
 'get_network_conditions',
 'get_pinned_scripts',
 'get_screenshot_as_base64',
 'get_screenshot_as_file',
 'get_screenshot_as_png',
 'get_sinks',
 'get_window_position',
 'get_window_rect',
 'get_window_size',
 'implicitly_wait',
 'launch_app',
 'log_types',
 'maximize_window',
 'minimize_window',
 'mobile',
 'name',
 'orientation',
 'page_source',
 'pin_script',
 'pinned_scripts',
 'port',
 'print_page',
 'quit',
 'refresh',
 'remove_all_credentials',
 'remove_credential',
 'remove_virtual_authenticator',
 'save_screenshot',
 'service',
 'session_id',
 'set_network_conditions',
 'set_page_load_timeout',
 'set_permissions',
 'set_script_timeout',
 'set_sink_to_use',
 'set_user_verified',
 'set_window_position',
 'set_window_rect',
 'set_window_size',
 'start_client',
 'start_desktop_mirroring',
 'start_session',
 'start_tab_mirroring',
 'stop_casting',
 'stop_client',
 'switch_to',
 'timeouts',
 'title',
 'unpin',
 'vendor_prefix',
 'virtual_authenticator_id',
 'window_handles']

driver is a selenium.webdriver.chrome.webdriver.WebDriver object. This object has attributes and methods that help us navigate the web.

7.4.1. HTML

HTML stands for HyperText Markup Language. You can deduce that it’s a language for creating web pages. It’s not a programming language like Python or Java, but it’s a markup language. It describes the elements of a page through tags characterized by angle brackets.

The document always begins and ends with <html> and </html>.
<body></body> contains the visible part of the HTML document.
<h1> to <h6> tags define the headings.

7.4.1.1. HTML Headings

HTML headings are defined with the <h1> to <h6> tags. <h1> defines the most important heading. <h6> defines the least important heading.

Markdown renders HTML tags, so we can try them out in text cells.

This is heading 1

This is heading 2

This is heading 3

7.4.1.2. HTML Paragraphs

HTML paragraphs are defined with the <p> tag. The <br> tag inserts a line break, similar to "\n".

My first paragraph.

This is another paragraph for this text cell.

7.4.1.3. HTML Links

HTML links are defined with the <a> tag.

This is a link for Judea Pearl Website

7.4.1.3. Unordered HTML List

An unordered list starts with the <ul> tag. Each list item starts with the <li> tag.

Coffee
Tea
Milk

7.4.1.4. Ordered HTML List

An ordered list starts with the <ol> tag. Each list item starts with the <li> tag.

Coffee
Tea
Milk

7.4.1.4. HTML Tables

A table in HTML consists of table cells inside rows and columns. Each table cell is defined by a <td> and a </td> tag. Each table row starts with a <tr> and ends with a </tr> tag.

Manager	Club	Nationality
Mikel Arteta	Arsenal	Spain
Thomas Tuchel	Chelsea	Germany
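
To make the <tr>/<td> structure concrete, here is a small sketch that rebuilds the rows of the table above from its markup using only the standard library. The markup string is a reconstruction (the use of <th> for the header row is an assumption), and TableParser is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Reconstructed markup for the table above; <th> for the header row is an assumption.
TABLE_HTML = """
<table>
  <tr><th>Manager</th><th>Club</th><th>Nationality</th></tr>
  <tr><td>Mikel Arteta</td><td>Arsenal</td><td>Spain</td></tr>
  <tr><td>Thomas Tuchel</td><td>Chelsea</td><td>Germany</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")      # a new empty cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data     # text belongs to the current cell


table_parser = TableParser()
table_parser.feed(TABLE_HTML)
for row in table_parser.rows:
    print(row)
```

Later in the notebook, pd.read_html does this kind of work automatically and returns DataFrames.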

7.4.1.5. HTML Iframes

An HTML iframe is used to display a web page within a web page.

1.0 HTML address

Diploma

HTML iframe

Add personal information

Written by Jon Doe.
Visit us at:
Example.com
Box 564, Disneyland
USA

2.0 The td element

The td element defines a cell in a table:

Cell A	Cell B
Cell C	Cell D

3.0 Button

Click the button below to display the hidden content from the template element.

Click here

4.0 The form element

First name:

Last name:

Submit

Click the "Submit" button and the form-data will be sent to a page on the server called "action_page.php".

5.0 The label element

Click on one of the text labels to toggle the related radio button:

HTML
CSS
JavaScript

Submit

6.0 The select element

The select element is used to create a drop-down list.

Choose a car:

Show hidden content

Click the "Submit" button and the form-data will be sent to a page on the server called "action_page.php".

7.0 Class attribute

CHINA

China has the largest population in the world.

INDIA

India has the second largest population in the world.

UNITED STATES

United States has the third largest population in the world.

8.0 Style

This is a heading

This is a paragraph.

9.0 The id Attribute

Use CSS to style an element with the id "myHeader":

My Header

10.0 Div tagname

This is a heading in a div element

This is some text in a div element.

7.4.1.6. HTML Tags - Key

Tag	Description
`<h1>` to `<h6>`	Defines HTML headings
`<ul>`	Defines an unordered list
`<ol>`	Defines an ordered list
`<p>`	Defines a paragraph
`<a>`	Anchor tag; creates a hyperlink.
`<div>`	Defines a division or section within an HTML document.
`<strong>`	Defines important text.
`<table>`	Presents data in tabular form.
`<td>`	Defines a cell of an HTML table, which contains table data.
`<iframe>`	Defines an inline frame

7.4. Identifying elements in a web page

To identify elements of a webpage, we need to inspect the webpage. Open the browser window controlled by the driver and press Ctrl + Shift + I.

One Element

Method	Description
find_element_by_id	Use id.
find_element_by_name	Use name.
find_element_by_xpath	Use Xpath.
find_element_by_tag_name	Use HTML tag.
find_element_by_class_name	Use class name.
find_element_by_css_selector	Use css selector.

Multiple elements

Method	Description
find_elements_by_id	Use id.
find_elements_by_name	Use name.
find_elements_by_xpath	Use Xpath.
find_elements_by_tag_name	Use HTML tag.
find_elements_by_class_name	Use class name.
find_elements_by_css_selector	Use css selector.
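
The find_element_by_* helpers listed above were deprecated in Selenium 4 and removed in later 4.x releases in favor of find_element(By.<STRATEGY>, value), a form this notebook also uses. The sketch below is a plain lookup table for translating one spelling into the other. LEGACY_TO_BY and modern_call are hypothetical names used only for illustration, and the result is just text, not a live Selenium call.

```python
# Maps the suffix of a legacy method name to the matching attribute of
# selenium.webdriver.common.by.By (stored here as plain strings).
LEGACY_TO_BY = {
    "id": "By.ID",
    "name": "By.NAME",
    "xpath": "By.XPATH",
    "tag_name": "By.TAG_NAME",
    "class_name": "By.CLASS_NAME",
    "css_selector": "By.CSS_SELECTOR",
}


def modern_call(legacy_name, value):
    """Return the By-based spelling of a legacy find_element(s)_by_* call as text."""
    plural = legacy_name.startswith("find_elements_by_")
    prefix = "find_elements_by_" if plural else "find_element_by_"
    if not legacy_name.startswith(prefix):
        raise ValueError(f"not a legacy locator method: {legacy_name}")
    strategy = LEGACY_TO_BY[legacy_name[len(prefix):]]
    method = "find_elements" if plural else "find_element"
    return f"driver.{method}({strategy}, {value!r})"


print(modern_call("find_element_by_id", "cod_dist"))
print(modern_call("find_elements_by_xpath", "//div[@class='pic']"))
```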

7.4.1. Xpath

XPath in Selenium is an XML path used for navigation through the HTML structure of the page. It is a syntax or language for finding any element on a web page using XML path expression.

The basic format of XPath in selenium is explained below with screen shot.

DO NOT COMPLICATE! Finding the XPath of an element:

Go to the element
Right click
Inspect - You may have to do it twice.
Go to the selected line
Right click
Copy
Copy Full Xpath

Example

Use find_element_by_xpath and click.

1.0. Using the full path (FULL XPATH) to find HTML elements

In [9]:

driver = webdriver.Chrome("chromedriver_07.exe")
driver.maximize_window()

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'

driver.get( url ) # load the URL in the browser

In [10]:


time.sleep(4) # wait time

driver.refresh() # reload or refresh the browser
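
A fixed time.sleep(4) either waits longer than needed or not long enough. Selenium provides WebDriverWait for waiting on a condition; the idea behind it can be sketched with the standard library alone. wait_until and the fake page_is_ready condition below are hypothetical names for illustration, not Selenium APIs.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Stand-in for "the element is now on the page": succeeds on the third check.
attempts = {"count": 0}


def page_is_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_is_ready, timeout=2.0, poll=0.01)
print(attempts["count"])
```

With a live driver, the condition could be a lambda that returns driver.find_elements(By.ID, "select_ambito"), since an empty list is falsy.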

In [ ]:

/html/body/onpe-root/onpe-home-onpe/div[1]/div/div/div/div[2]/div[2]/div/div/a/div[1]/img

I want to click on the Resumen general (general summary) option.

To do so, right-click and select Inspect (you may need to do it twice).
Then identify the HTML element corresponding to the image. Right-click, copy the path, and you are done.

In [11]:

# Select the general summary image

results = driver.find_element(By.XPATH, '/html/body/onpe-root/onpe-home-onpe/div[1]/div/div/div/div[2]/div[1]/div/div/a/div[1]/img')

In [12]:

# click on the image
results.click()

In [13]:

time.sleep(4) # wait time
driver.back() # go back

A long full XPath is sensitive to changes in the page, so other methods are preferable.
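
The fragility of a position-based path can be reproduced offline. The sketch below uses xml.etree.ElementTree, which supports only a limited subset of XPath, on simplified, hypothetical markup loosely modeled on the page's image boxes (the second image name is invented).

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup loosely modeled on the page's image boxes.
PAGE = """
<html><body>
  <div class="row">
    <div class="pic"><img src="./assets/imagenes/resumen_general.jpg"/></div>
    <div class="pic"><img src="./assets/imagenes/otra_opcion.jpg"/></div>
  </div>
</body></html>
"""

FULL_PATH = "./body/div[1]/div[1]/img"  # position-based, like a copied full XPath
RELATIVE_PATH = ".//img[@src='./assets/imagenes/resumen_general.jpg']"  # attribute-based

root = ET.fromstring(PAGE)
print(root.find(FULL_PATH) is root.find(RELATIVE_PATH))  # both locate the same <img>

# Simulate a redesign: a new banner <div> is inserted at the top of <body>.
body = root.find("body")
body.insert(0, ET.Element("div", {"class": "banner"}))

full_after = root.find(FULL_PATH)          # div[1] is now the banner, so nothing matches
relative_after = root.find(RELATIVE_PATH)  # still found through its src attribute
print(full_after, relative_after.get("src"))
```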

1.1. Using a relative path (XPath)

class is the most common attribute of an HTML element.

class assigns styling (font, color, size, shape, spacing, etc.).
Remember that the goal is to find a way to identify the HTML element.
However, more than one HTML element can share the same styling, which makes it harder to identify the HTML element.
Remember that HTML is not a programming language like Python or R; it only describes the layout of the web page.

In [69]:

# let's see how many HTML elements have class = "pic"
# To do so we use By, which lets us locate HTML elements

In [ ]:

### The element for the Resumen General icon

<div _ngcontent-ejv-c64="" class="pic">

<img _ngcontent-ejv-c64="" src="./assets/imagenes/resumen_general.jpg">  # inner html 
</div>

#"//img[@src='./assets/imagenes/resumen_general.jpg']"

In [16]:

driver.find_elements(By.XPATH, "//img[@src='./assets/imagenes/resumen_general.jpg']")

Out[16]:

[<selenium.webdriver.remote.webelement.WebElement (session="cdf155dbb66ad20f8b7a8364597e89ab", element="ea4050c2-97aa-483c-b682-3d2ea4ba2f98")>]

In [15]:

driver.find_element(By.XPATH, "//img[@src='./assets/imagenes/resumen_general.jpg']").click()

1. The tag is div
attribute name: class
attribute value: pic

The resulting XPath is "//div[@class = 'pic']"

However, there may be other HTML elements with the same attribute class = "pic"

Hopefully not; otherwise it will be harder to identify the HTML element

In [40]:

lista_despegable = driver.find_element(By.ID, "select_ambito")
i = 1  # index of the option to click: 0 = TODOS, 1 = PERÚ, 2 = EXTRANJERO
Select(lista_despegable).options[i].click()

In [41]:


# for i in range(3):
#     Select(lista_despegable).options[i].click()

In [237]:

# We use the find_elements method of the driver object we created for the browser

opciones = driver.find_elements(By.XPATH, "//div[@class='pic']")

# Note that the result is a list

In [238]:

opciones

Out[238]:

[<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="f8b9c79f-8b08-4f79-9bc8-6c5dcf5d2f0d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="bb5065e5-920c-4ebb-bdf7-51ffc77120fb")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="77416960-49bb-4627-ab8a-0956dde2f8e9")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="0ff6a22e-3750-4a8d-badc-8845b2bcec99")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="a6bcc2ee-67a6-4f9e-b7d8-92d9fce8c3d2")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="33e61681-5f48-4d0a-8773-104c775dd4dd")>]

!!! Note that we have 6 HTML elements with the same attributes

These are the 6 image-box options

So be careful to choose the element you are interested in. In this example, it is the first option, Resumen general

In [141]:

# take the first element of the list

dir(opciones[0] ) # list its attributes and methods

Out[141]:

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_execute',
 '_id',
 '_parent',
 '_upload',
 'accessible_name',
 'aria_role',
 'clear',
 'click',
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 'find_element_by_link_text',
 'find_element_by_name',
 'find_element_by_partial_link_text',
 'find_element_by_tag_name',
 'find_element_by_xpath',
 'find_elements',
 'find_elements_by_class_name',
 'find_elements_by_css_selector',
 'find_elements_by_id',
 'find_elements_by_link_text',
 'find_elements_by_name',
 'find_elements_by_partial_link_text',
 'find_elements_by_tag_name',
 'find_elements_by_xpath',
 'get_attribute',
 'get_dom_attribute',
 'get_property',
 'id',
 'is_displayed',
 'is_enabled',
 'is_selected',
 'location',
 'location_once_scrolled_into_view',
 'parent',
 'rect',
 'screenshot',
 'screenshot_as_base64',
 'screenshot_as_png',
 'send_keys',
 'shadow_root',
 'size',
 'submit',
 'tag_name',
 'text',
 'value_of_css_property']

In [142]:

# let's explore its attributes

opciones[0].text # there is no text

Out[142]:

''

In [143]:

opciones[0].tag_name # name of the tag that defines the HTML element

Out[143]:

'div'

In [144]:

opciones[0].size  # size (height and width of the div element)

Out[144]:

{'height': 119, 'width': 230}

In [145]:

opciones[0].location  # location in terms of coordinates

Out[145]:

{'x': 210, 'y': 390}

In [146]:

opciones[0].rect  # width, height, and coordinates

Out[146]:

{'height': 119, 'width': 230, 'x': 209.87265014648438, 'y': 389.8302001953125}

In [126]:

opciones[0].value_of_css_property

Out[126]:

''

In [147]:

opciones[0].get_attribute('innerHTML')  # returns what the div tag contains

Out[147]:

'<img _ngcontent-ejv-c64="" src="./assets/imagenes/resumen_general.jpg">'

In [154]:

print(opciones[0].get_attribute('id')) # there is no unique identifier (id is an attribute that identifies an HTML element)
print(opciones[0].get_attribute('name'))
print(opciones[0].get_attribute('value'))
print(opciones[0].get_attribute('class')) # value of the class attribute

Out[154]:

None
None
pic

In [239]:

# Apply the click method to the presidential results option

opciones[1].click()

We select the scope (ámbito). It has 3 options: TODOS, PERÚ, and EXTRANJERO

In [111]:

# Full path

# /html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[1]/div[1]/div/div/div/select


# Relative path

# //*[@id="select_ambito"]

# EUREKA: the select element has an ID (unique identifier), so we can locate it by its identifier

In [ ]:

<select _ngcontent-ejv-c105="" id="select_ambito" name="cod_ambito" class="select_ubigeo ng-pristine ng-valid ng-touched">
<option _ngcontent-ejv-c105="" value="T">TODOS</option>
<option _ngcontent-ejv-c105="" value="P">PERÚ</option>
<option _ngcontent-ejv-c105="" value="E">EXTRANJERO</option>
<!---->
</select>

# The attributes are id, name, and class, plus value inside each option

In [240]:

# Let's check whether the <select></select> tag has attributes that make it unique

#  "//select[@name='cod_ambito']"

driver.find_elements(By.XPATH, "//select[@name='cod_ambito']")


# !!! Good, the select element can also be identified by its name attribute
# So let's use just find_element
driver.find_element(By.XPATH, "//select[@name='cod_ambito']")

# Alternatives

# we can use By.NAME in this case
driver.find_element(By.NAME, "cod_ambito")

Out[240]:

<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="4471d3fd-f687-4413-980a-e3e4666ca990")>

In [241]:

driver.find_elements(By.CLASS_NAME, "select_ubigeo") 

# By.CLASS_NAME takes a single class name, not the full text of the class attribute

driver.find_element(By.CLASS_NAME, "select_ubigeo") # it is unique

Out[241]:

<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="4471d3fd-f687-4413-980a-e3e4666ca990")>
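
The class attribute of this <select> holds several space-separated class names (select_ubigeo ng-pristine ng-valid ng-touched), and By.CLASS_NAME expects exactly one of them. The standard-library sketch below mimics that matching rule. has_class is a hypothetical helper, not a Selenium function.

```python
def has_class(class_attribute, class_name):
    """True if class_name is one of the space-separated names in class_attribute."""
    return class_name in class_attribute.split()


CLASS_ATTRIBUTE = "select_ubigeo ng-pristine ng-valid ng-touched"

print(has_class(CLASS_ATTRIBUTE, "select_ubigeo"))              # a single name matches
print(has_class(CLASS_ATTRIBUTE, "ng-valid"))                   # any one of the names works
print(has_class(CLASS_ATTRIBUTE, "select_ubigeo ng-pristine"))  # the combined text does not
```

Names such as ng-pristine or ng-touched change as the user interacts with the form, so select_ubigeo is the safer name to rely on.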

In [242]:

driver.find_elements(By.CLASS_NAME, "row") # searching by the class name "row" returns several elements, so it is not unique

Out[242]:

[<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="f2c35b31-ee80-4429-a907-04a70c11669a")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="ffa9c82c-56a2-4be2-9491-e237212c89e5")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="4f456b44-9804-414c-8b0a-b670aad5d249")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="90bd73c1-bd05-4c43-b478-486249b98b79")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="1f6c39d9-96f3-4a82-9f85-89148f9ce633")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="cf1a0068-0fe6-4266-8acc-3fcf676ddcae")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="7fd25db2-cbc8-4ee4-b348-87e0ba7b982d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="1f558db5-4d4e-41fb-a9b8-0ff4d6ee61fe")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="158a1988-5045-442b-b76d-f32fc8a69c4d")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="31400c3b-c92e-4fda-ad8a-238f6903c6c7")>,
 <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="986ef5b7-ef29-45ec-8a33-c759d915e2f4")>]

In [256]:

# create the object for the drop-down list

bottom = driver.find_element(By.ID, 'select_ambito')

In [244]:

bottom.text  # the options

Out[244]:

'TODOS\nPERÚ\nEXTRANJERO'

In [245]:

bottom.tag_name

Out[245]:

'select'

In [246]:

bottom.get_attribute('innerHTML')  # inner html

Out[246]:

'<option _ngcontent-pyn-c101="" value="T">TODOS</option><option _ngcontent-pyn-c101="" value="P">PERÚ</option><option _ngcontent-pyn-c101="" value="E">EXTRANJERO</option><!---->'
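
The innerHTML value is an ordinary string, so it can also be inspected without the browser. As a rough sketch, the standard-library parser below pulls the (value, label) pairs out of the string shown in Out[246]. OptionParser is a hypothetical helper name.

```python
from html.parser import HTMLParser

# The string returned by bottom.get_attribute('innerHTML') above.
INNER_HTML = (
    '<option _ngcontent-pyn-c101="" value="T">TODOS</option>'
    '<option _ngcontent-pyn-c101="" value="P">PERÚ</option>'
    '<option _ngcontent-pyn-c101="" value="E">EXTRANJERO</option><!---->'
)


class OptionParser(HTMLParser):
    """Collects (value, label) pairs from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = []
        self.in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True
            self.options.append((dict(attrs).get("value"), ""))

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        if self.in_option:
            value, label = self.options[-1]
            self.options[-1] = (value, label + data)  # append text to the current label


option_parser = OptionParser()
option_parser.feed(INNER_HTML)
print(option_parser.options)
```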

In [247]:

bottom.get_attribute('value')  # 'T' (TODOS) is the initial, default selection

Out[247]:

'T'

In [209]:

# the option tags always sit inside the drop-down list
# one option tag for each choice
# - <option _ngcontent-ejv-c105="" value="T">TODOS</option>
# - <option _ngcontent-ejv-c105="" value="P">PERÚ</option>
# - <option _ngcontent-ejv-c105="" value="E">EXTRANJERO</option>

In [251]:

print(bottom.get_attribute('id')) # here there is a unique identifier (id is an attribute that identifies an HTML element)
print(bottom.get_attribute('name'))
print(bottom.get_attribute('value'))
print(bottom.get_attribute('class')) # value of the class attribute

Out[251]:

select_ambito
cod_ambito
T
select_ubigeo ng-untouched ng-pristine ng-valid

In [252]:

# Access each option

bottom.find_elements(By.TAG_NAME, 'option') # possible because the inner HTML of bottom contains HTML elements

# Three, one for each option (Todos, Perú, Extranjero)

print(bottom.find_elements(By.TAG_NAME, 'option')[0].text)

print(bottom.find_elements(By.TAG_NAME, 'option')[1].text)

bottom.find_elements(By.TAG_NAME, 'option')[2].text

Out[252]:

TODOS
PERÚ

'EXTRANJERO'

In [253]:

bottom.find_elements(By.TAG_NAME, 'option')[1].click() # click on Perú

# Options for department, province, and district appear.
# The same logic applies to drill down and extract information for each district

In [257]:

# Use Select for easier handling of the options inside the <select></select> tag

ambito = Select(bottom)

ambito.options[0] # we can access each option of the <select></select> tag

Out[257]:

<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="2c01b473-4d39-4a1c-ae67-2691f518f313")>

In [258]:

ambito.options[1] # Perú

Out[258]:

<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="01fcc312-f92b-45d2-a974-2c6989de6c57")>

In [259]:

ambito.options[2]

Out[259]:

<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="712c21b0-434e-4565-8567-372df896f045")>

In [260]:

# click Perú
ambito.options[1].click()
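
Besides indexing into options, Selenium's Select class can pick an option by its value attribute, which does not depend on the order of the list. A minimal sketch, assuming the page still uses the id select_ambito and the values T, P and E shown earlier. choose_ambito is a hypothetical helper name.

```python
def choose_ambito(driver, value="P"):
    """Pick an option of the select_ambito drop-down by its value attribute."""
    # Imported here so the function can be defined without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    dropdown = Select(driver.find_element(By.ID, "select_ambito"))
    dropdown.select_by_value(value)             # "T", "P" or "E" on this page
    return dropdown.first_selected_option.text  # label of the option now selected


# Usage with the live driver from the cells above (commented out: needs a browser):
# choose_ambito(driver, "P")
```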

In [261]:

# empty dictionary to store the tables

all_tables = {}

In [263]:

# four iterations: the --TODOS-- entry (skipped below) plus three departments

for dpt_idx in tqdm(range( 4 )):

    # Locate the department list again on every iteration, since the page
    # re-renders its elements after each click
    dpt = Select( driver.find_element(By.ID, "select_departamento" ) ) # department drop-down list
    department = dpt.options[ dpt_idx ]  # pick one department

    # extract the text (department name)
    dpt_name = department.text

    print(dpt_name)

    # skip the first option, which is --TODOS--

    if dpt_name != "--TODOS--" :

        # click on the department

        department.click()

        time.sleep(1)  # wait for the information to load
        prov = Select( driver.find_element(By.ID, "cod_prov" ) )  # province drop-down list
        num_prov_options = len( prov.options )  # count the province options

        for prov_idx in range( num_prov_options ): # for each province, extract the data of its districts

            # Locate the province list again, since the HTML is refreshed

            prov = Select( driver.find_element(By.ID, "cod_prov" ) ) # province drop-down list
            province = prov.options[ prov_idx ] # pick one province

            # Get province name
            prov_name = province.text

            if prov_name != "--TODOS--" :

                # click on province
                province.click()

                time.sleep(1)  # wait for the information to load

                # Get all elements from district
                dist = Select( driver.find_element(By.ID, "cod_dist" ) )
                num_dist_options = len( dist.options )

                for dist_idx in range( num_dist_options ):

                    # Locate the district list again, since the HTML is refreshed
                    dist = Select( driver.find_element(By.ID, "cod_dist" ) ) # district drop-down list
                    district = dist.options[ dist_idx ] # pick one district

                    # Get district name
                    dist_name = district.text

                    if dist_name != "-- SELECCIONE --" :

                        # click on district
                        district.click()

                        time.sleep(1)  # wait for the information to load

                        # Get UBIGEO (last segment of the URL)
                        ubigeo = driver.current_url.split("/")[ -1 ]

                        ## Extract the presidential election table

                        table_path = driver.find_element( By.ID, "table-scroll" )
                        table_html = table_path.get_attribute( 'innerHTML' ) # extract the HTML tags
                        # read_html reads <table></table> elements into DataFrames
                        table = pd.read_html( table_html )

                        #/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[4]/div[1]/div[3]/div

                        # select the variable names
                        row_new_columns = table[ 0 ].iloc[ 0 , 2: ] # first row, from the third column onward
                        clean_columns = row_new_columns \
                                              .str.replace( " ", "_") \
                                              .str.lower().str.replace( "%", "share_") \
                                              .apply( lambda x : unidecode.unidecode( x ) ) \
                                              .tolist()

                        # Keep the columns with relevant information
                        table_clean = table[0].iloc[ 1:, 2: ].copy() # from the second row and the third column onward

                        # rename columns
                        table_clean.columns = clean_columns

                        # add columns with the department, province, district, and ubigeo
                        table_clean[ 'department' ] = dpt_name
                        table_clean[ 'province' ]   = prov_name
                        table_clean[ 'district' ]   = dist_name
                        table_clean[ 'ubigeo' ]     = ubigeo

                        # Store each table in the dictionary
                        all_tables[ ubigeo ] = table_clean

                        # the ubigeo is the key: {"010203": tabla1, "030201": tabla2 }, where tabla1 and tabla2 are DataFrames

Out[263]:

--TODOS--
AMAZONAS
ANCASH
APURIMAC

In [233]:

a = "https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/P/010000/010200/010202"
    
a.split("/") # split on /

Out[233]:

['https:',
 '',
 'resultadoshistorico.onpe.gob.pe',
 'EG2021',
 'ResumenGeneral',
 '10',
 'P',
 '010000',
 '010200',
 '010202']

In [234]:

a.split("/")[-1] # ubigeo

Out[234]:

'010202'
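
A slightly more defensive variant of the same idea is sketched below: urllib.parse separates the path from the rest of the URL first, so a query string or trailing slash would not end up in the result. Reading the last three segments as department, province, and district codes is an assumption based on this example URL.

```python
from urllib.parse import urlsplit

url = "https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/P/010000/010200/010202"

# Keep only the path, drop leading/trailing slashes, then split into segments.
path_parts = urlsplit(url).path.strip("/").split("/")

# Assumed meaning of the last three segments (inferred from this URL only).
department_code, province_code, ubigeo = path_parts[-3:]
print(department_code, province_code, ubigeo)
```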

/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[4]/div[1]/div[3]/div

In [109]:

all_tables['010202']

Out[109]:

In [110]:

# concatenate all the tables
# all_tables.values() yields the DataFrame stored under each key of the dictionary
final_data = pd.concat( all_tables.values() ).reset_index( drop = True )

In [111]:

final_data

Out[111]:

In [113]:

final_data.to_excel( r'first_round.xlsx' , index = False ) # export to Excel

In [264]:

all_tables['010202']

Out[264]:

In [266]:

# pd.read_html returns a list with all the tables found in the inner HTML.

table[0]

Out[266]:

In [267]:

table[1] # there is only one table, so index 1 is out of range

Out[267]:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [267], in <cell line: 1>()
----> 1 table[1]
IndexError: list index out of range

References

https://pypi.org/project/webdriver-manager/

https://selenium-python.readthedocs.io/installation.html#drivers

HTML simulator

https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_button_test

Class - HTML

https://www.geeksforgeeks.org/html-class-attribute/#:~:text=Class in html%3A,with the specified class name.