GitHub Repository: robertopucp/1eco35_2022_2
Path: blob/main/Lab11/Web_scrapping.ipynb
Kernel: Python 3 (ipykernel)
from IPython.display import display, HTML

display(HTML(data="""
<style>
    div#notebook-container    { width: 85%; }
    div#menubar-container     { width: 65%; }
    div#maintoolbar-container { width: 80%; }
</style>
"""))

7. Web Scraping

Web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract needed information.
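As a minimal sketch of the "parse that data" step, the block below extracts the text of the <title> tag from a page using only Python's standard library. The PAGE string is a hypothetical stand-in for HTML that a real scraper would first download from a web server.

```python
from html.parser import HTMLParser

# Hypothetical page source; a real scraper would request this from a web server.
PAGE = """
<html>
  <head><title>Example page</title></head>
  <body><p>Hello</p></body>
</html>
"""


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Keep only the text that appears between <title> and </title>
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(PAGE)
print(parser.title)
```

Selenium, introduced next, covers the case where the page must first be rendered and clicked through in a real browser before its HTML can be read.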

7.1 Selenium

Selenium automates browsers. That's it!
Selenium is a browser-automation tool with a Python library. One of the tasks it can automate is web scraping: extracting data and information that may otherwise be unavailable.
For this course, we use Chrome.

7.1 Installing Libraries

We need to install these two libraries

#!pip install selenium
#!pip install webdriver-manager

  • Selenium library

LINK

  • URL to check which version of Chrome is installed on your computer

chrome://version

  • Link to download ChromeDriver

LINK

7.2 Calling Libraries

from selenium import webdriver                             # driver control
from webdriver_manager.chrome import ChromeDriverManager   # handles different driver versions
from selenium.webdriver.support.ui import Select           # works with the <select></select> tag
from selenium.webdriver.common.by import By                # strategies for locating elements in the HTML
from selenium.webdriver.common.keys import Keys            # type information into the page (names, dates)
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import ActionChains                # move around the web page

import re         # regular expressions
import time       # pauses
import os
import sys
import pandas as pd
import numpy as np
import unidecode  # used to strip accents
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')  # suppress warning messages

7.3 Launch/Set the Driver

This code opens a Chrome driver. We will use it to browse the web.

# Case 1 - Download the driver
driver = webdriver.Chrome("chromedriver_07.exe")
driver.maximize_window()

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'
driver.get(url)  # open the URL in the browser

Chrome is being controlled by automated test software!
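As an alternative to downloading the driver by hand, here is a hedged sketch that lets webdriver-manager (imported above as ChromeDriverManager) fetch a matching ChromeDriver. The function name launch_chrome is hypothetical, and the Service-based constructor assumes Selenium 4 is installed.

```python
def launch_chrome(url):
    """Start Chrome with a driver fetched by webdriver-manager and open `url`."""
    # Imports live inside the function so that defining it does not
    # require Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # install() downloads a matching ChromeDriver if needed and returns its path
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    driver.maximize_window()
    driver.get(url)
    return driver


# Usage (opens a real browser window, so it is left commented out):
# driver = launch_chrome('https://resultadoshistorico.onpe.gob.pe/EG2021/')
```

This avoids keeping a file such as chromedriver_07.exe in step with the installed Chrome version, at the cost of needing network access the first time it runs.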

# Access the content of the <title></title> tag
print('Title: ', driver.title)
Title: Presentación de Resultados Elecciones Generales y Parlamento Andino 2021
# Access the current URL
print('Current Page URL: ', driver.current_url)
Current Page URL: https://resultadoshistorico.onpe.gob.pe/EG2021/EleccionesPresidenciales/RePres/P/010000/010300
# Screenshot
driver.save_screenshot('resultados_presidenciales.png')
True
type(driver)
selenium.webdriver.chrome.webdriver.WebDriver
#driver.quit()
dir(driver)  # list the object's methods and attributes
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_authenticator_id', '_file_detector', '_get_cdp_details', '_is_remote', '_mobile', '_shadowroot_cls', '_switch_to', '_unwrap_value', '_web_element_cls', '_wrap_value', 'add_cookie', 'add_credential', 'add_virtual_authenticator', 'application_cache', 'back', 'bidi_connection', 'capabilities', 'caps', 'close', 'command_executor', 'create_options', 'create_web_element', 'current_url', 'current_window_handle', 'delete_all_cookies', 'delete_cookie', 'delete_network_conditions', 'desired_capabilities', 'error_handler', 'execute', 'execute_async_script', 'execute_cdp_cmd', 'execute_script', 'file_detector', 'file_detector_context', 'find_element', 'find_element_by_class_name', 'find_element_by_css_selector', 'find_element_by_id', 'find_element_by_link_text', 'find_element_by_name', 'find_element_by_partial_link_text', 'find_element_by_tag_name', 'find_element_by_xpath', 'find_elements', 'find_elements_by_class_name', 'find_elements_by_css_selector', 'find_elements_by_id', 'find_elements_by_link_text', 'find_elements_by_name', 'find_elements_by_partial_link_text', 'find_elements_by_tag_name', 'find_elements_by_xpath', 'forward', 'fullscreen_window', 'get', 'get_cookie', 'get_cookies', 'get_credentials', 'get_issue_message', 'get_log', 'get_network_conditions', 'get_pinned_scripts', 'get_screenshot_as_base64', 'get_screenshot_as_file', 'get_screenshot_as_png', 'get_sinks', 'get_window_position', 'get_window_rect', 'get_window_size', 'implicitly_wait', 'launch_app', 'log_types', 'maximize_window', 'minimize_window', 'mobile', 'name', 'orientation', 'page_source', 
'pin_script', 'pinned_scripts', 'port', 'print_page', 'quit', 'refresh', 'remove_all_credentials', 'remove_credential', 'remove_virtual_authenticator', 'save_screenshot', 'service', 'session_id', 'set_network_conditions', 'set_page_load_timeout', 'set_permissions', 'set_script_timeout', 'set_sink_to_use', 'set_user_verified', 'set_window_position', 'set_window_rect', 'set_window_size', 'start_client', 'start_desktop_mirroring', 'start_session', 'start_tab_mirroring', 'stop_casting', 'stop_client', 'switch_to', 'timeouts', 'title', 'unpin', 'vendor_prefix', 'virtual_authenticator_id', 'window_handles']

driver is a selenium.webdriver.chrome.webdriver.WebDriver object. Its attributes and methods will help us navigate the web.
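The dir() listing is long because it includes internal names. A small sketch that keeps only the public ones is shown below; FakeDriver is a hypothetical stand-in used so the example runs without a browser, and with Selenium you would pass driver instead.

```python
def public_names(obj):
    """Return the attribute and method names of `obj` that do not start with '_'."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Stand-in class used only for illustration; with Selenium, pass `driver`.
class FakeDriver:
    title = "Example"

    def get(self, url):
        pass

    def _private(self):
        pass


print(public_names(FakeDriver()))
```

Applied to driver, this would leave names such as get, title, current_url, find_element and quit, which are the ones used in the rest of this section.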

7.4.1. HTML

HTML stands for HyperText Markup Language. You can deduce that it’s a language for creating web pages. It’s not a programming language like Python or Java, but it’s a markup language. It describes the elements of a page through tags characterized by angle brackets.

  1. The document always begins and ends using <html> and </html>.

  2. <body></body> constitutes the visible part of HTML document.

  3. <h1> to <h6> tags define the headings.

7.4.1.1. HTML Headings

HTML headings are defined with the <h1> to <h6> tags. <h1> defines the most important heading. <h6> defines the least important heading.

We can use text cells since markdown reads html tags.

This is heading 1

This is heading 2

This is heading 3

7.4.1.2. HTML Paragraphs

HTML paragraphs are defined with the <p> tag. <br> tag is similar to "\n".


My first paragraph.


This is another paragraph for this text cell.

HTML links are defined with the <a> tag.

7.4.1.3. Unordered HTML List

An unordered list starts with the <ul> tag. Each list item starts with the <li> tag.

  • Coffee
  • Tea
  • Milk

7.4.1.4. Ordered HTML List

An ordered list starts with the <ol> tag. Each list item starts with the <li> tag.

  1. Coffee
  2. Tea
  3. Milk

7.4.1.4. HTML Tables

A table in HTML consists of table cells inside rows and columns. Each table cell is defined by a <td> and a </td> tag. Each table row starts with a <tr> and ends with a </tr> tag.

| Manager | Club | Nationality |
|---|---|---|
| Mikel Arteta | Arsenal | Spain |
| Thomas Tuchel | Chelsea | Germany |
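To make the <tr>/<td> structure concrete, here is a small sketch that turns the table above into a list of rows with the standard library. The markup in TABLE_HTML is an assumption about how the table was written; in particular, the use of <th> for the header row is a guess.

```python
from html.parser import HTMLParser

# Assumed markup for the table above (<th> for the header row is a guess).
TABLE_HTML = """
<table>
  <tr><th>Manager</th><th>Club</th><th>Nationality</th></tr>
  <tr><td>Mikel Arteta</td><td>Arsenal</td><td>Spain</td></tr>
  <tr><td>Thomas Tuchel</td><td>Chelsea</td><td>Germany</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []            # a new row starts
        elif tag in ("td", "th"):
            self._in_cell = True      # a new cell starts
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr":
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text between tags is kept only while we are inside a cell
        if self._in_cell:
            self._row[-1] += data


table_parser = TableParser()
table_parser.feed(TABLE_HTML)
for row in table_parser.rows:
    print(row)
```

Later in this section, pd.read_html does the same job in one call and returns DataFrames.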

7.4.1.5. HTML Iframes

An HTML iframe is used to display a web page within a web page.

1.0 HTML address

Diploma

HTML iframe

Add personal information

Written by Jon Doe.
Visit us at:
Example.com
Box 564, Disneyland
USA

2.0 The td element

The td element defines a cell in a table:

Cell A Cell B
Cell C Cell D

3.0 Bottom

Click the button below to display the hidden content from the template element.

Click here

4.0 The form element

First name:

Last name:

Submit

Click the "Submit" button and the form-data will be sent to a page on the server called "action_page.php".

5.0 The label element

Click on one of the text labels to toggle the related radio button:

    HTML
    CSS
    JavaScript

Submit

6.0 The select element

The select element is used to create a drop-down list.

Choose a car:

Show hidden content

Click the "Submit" button and the form-data will be sent to a page on the server called "action_page.php".

7.0 Class attribute

CHINA

China has the largest population in the world.

INDIA

India has the second largest population in the world.

UNITED STATES

United States has the third largest population in the world.

8.0 Style

This is a heading

This is a paragraph.

9.0 The id Attribute

Use CSS to style an element with the id "myHeader":

My Header

10.0 Div tagname

This is a heading in a div element

This is some text in a div element.

7.4.1.6. HTML Tags - Key

| Tag | Description |
|---|---|
| <h1> to <h6> | Defines HTML headings. |
| <ul> | Defines an unordered list. |
| <ol> | Defines an ordered list. |
| <p> | Defines a paragraph. |
| <a> | The anchor tag; it creates a hyperlink. |
| <div> | Defines a division or section within an HTML document. |
| <strong> | Defines important text. |
| <table> | Presents data in tabular form within an HTML document. |
| <td> | Defines a cell of an HTML table, which contains table data. |
| <iframe> | Defines an inline frame. |

7.4. Identifying elements in a web page

To identify the elements of a web page, we need to inspect it. Open the browser launched by the driver and press Ctrl + Shift + I.

One Element

| Method | Description |
|---|---|
| find_element_by_id | Use id. |
| find_element_by_name | Use name. |
| find_element_by_xpath | Use XPath. |
| find_element_by_tag_name | Use HTML tag. |
| find_element_by_class_name | Use class name. |
| find_element_by_css_selector | Use CSS selector. |

Multiple elements

| Method | Description |
|---|---|
| find_elements_by_id | Use id. |
| find_elements_by_name | Use name. |
| find_elements_by_xpath | Use XPath. |
| find_elements_by_tag_name | Use HTML tag. |
| find_elements_by_class_name | Use class name. |
| find_elements_by_css_selector | Use CSS selector. |
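The find_element_by_* and find_elements_by_* names were deprecated in Selenium 4 and dropped in later 4.x releases, so the rest of this section mostly uses find_element(By.<STRATEGY>, value). The sketch below shows how the two spellings correspond. The helper modern_call is hypothetical; the strings in BY_VALUES are the values behind Selenium's By constants (By.ID is "id", By.CLASS_NAME is "class name", and so on).

```python
# String values behind Selenium's By constants.
BY_VALUES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def modern_call(legacy_name, value):
    """Translate a legacy method name into (method, by, value).

    Example: ('find_element_by_id', 'x') -> ('find_element', 'id', 'x')
    """
    method, _, suffix = legacy_name.partition("_by_")
    return method, BY_VALUES[suffix], value


print(modern_call("find_element_by_id", "select_ambito"))
print(modern_call("find_elements_by_class_name", "row"))
```

In other words, driver.find_element_by_id("select_ambito") and driver.find_element(By.ID, "select_ambito") locate the same element.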

7.4.1. Xpath

XPath in Selenium is an XML path used for navigation through the HTML structure of the page. It is a syntax or language for finding any element on a web page using XML path expression.

The basic format of XPath in selenium is explained below with screen shot.

DO NOT COMPLICATE! Finding the XPath of an element:

  1. Go to the element

  2. Right click

  3. Inspect - You may have to do it twice.

  4. Go to the selected line

  5. Right click

  6. Copy

  7. Copy Full Xpath

Example

Use find_element_by_xpath and click.

driver = webdriver.Chrome("chromedriver_07.exe")
driver.maximize_window()

url = 'https://resultadoshistorico.onpe.gob.pe/EG2021/'
driver.get(url)  # open the URL in the browser
time.sleep(4)     # wait for the page to load
driver.refresh()  # reload the browser
/html/body/onpe-root/onpe-home-onpe/div[1]/div/div/div/div[2]/div[2]/div/div/a/div[1]/img

I want to click on the Resumen general option.

  • To do so, right-click the option and select Inspect (you may have to do it twice).

  • Then identify the HTML element that corresponds to the image, right-click it and copy its path. Done.

# Select the Resumen general image
results = driver.find_element_by_xpath('/html/body/onpe-root/onpe-home-onpe/div[1]/div/div/div/div[2]/div[1]/div/div/a/div[1]/img')
# Click on the image
results.click()
time.sleep(4)  # wait for the page to load
driver.back()  # go back
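A fixed time.sleep(4) always waits the full four seconds and may still be too short on a slow connection. A common alternative is to poll until a condition holds. The helper wait_until below is a hypothetical, library-free sketch of that idea; Selenium ships its own version of it as WebDriverWait together with the expected_conditions module.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Illustration: a counter stands in for "the page has finished loading"
attempts = {"n": 0}


def page_ready():
    attempts["n"] += 1
    return attempts["n"] >= 3


wait_until(page_ready, timeout=2.0, poll=0.01)
print(attempts["n"])
```

With a real driver, the condition could be a function that returns driver.find_elements(By.ID, "select_ambito"), which is an empty (falsy) list until the element exists.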

A long full XPath is sensitive to changes in the page, so other methods are preferable.
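The sketch below illustrates why, using the limited XPath support in Python's xml.etree.ElementTree on two hypothetical, heavily simplified pages (the real ONPE markup is much deeper, and the file names are made up). A position-based path stops matching as soon as one extra <div> is added, while an attribute-based path keeps working.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical page.
ORIGINAL = """
<html><body>
  <div><img src="resumen_general.jpg"/></div>
  <div><img src="presidencial.jpg"/></div>
</body></html>
"""

# The same page after a banner <div> is added at the top.
REDESIGNED = """
<html><body>
  <div><p>banner</p></div>
  <div><img src="resumen_general.jpg"/></div>
  <div><img src="presidencial.jpg"/></div>
</body></html>
"""

FULL_PATH = "./body/div[1]/img"                      # position-based
BY_ATTRIBUTE = ".//img[@src='resumen_general.jpg']"  # attribute-based


def src_found(page, path):
    """Return the src of the first node matching `path`, or None."""
    node = ET.fromstring(page.strip()).find(path)
    return None if node is None else node.get("src")


print(src_found(ORIGINAL, FULL_PATH))
print(src_found(REDESIGNED, FULL_PATH))
print(src_found(REDESIGNED, BY_ATTRIBUTE))
```

ElementTree paths are written relative to the root element, which is why they start with "./" or ".//" instead of "/html" or "//" as in the browser.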

class is the most common attribute of an HTML element.

  • class assigns styling (font, color, size, shape, spacing, etc.).

  • Remember that the goal is to find a way to identify the HTML element.

  • However, more than one HTML element can share the same styling, which makes it harder to identify the one we want.

  • Remember that HTML is not a programming language like Python or R; it only describes the layout of the web page.

# Let's see how many HTML elements have the styling class = "pic"
# To do so we use By, which lets us locate HTML elements
### The element for the Resumen General icon
<div _ngcontent-ejv-c64="" class="pic">
    <img _ngcontent-ejv-c64="" src="./assets/imagenes/resumen_general.jpg">   # inner HTML
</div>
# "//img[@src='./assets/imagenes/resumen_general.jpg']"
driver.find_elements(By.XPATH, "//img[@src='./assets/imagenes/resumen_general.jpg']")
[<selenium.webdriver.remote.webelement.WebElement (session="cdf155dbb66ad20f8b7a8364597e89ab", element="ea4050c2-97aa-483c-b682-3d2ea4ba2f98")>]
driver.find_element(By.XPATH, "//img[@src='./assets/imagenes/resumen_general.jpg']").click()


    1. The tag is div

  • attribute name: class

  • attribute value: pic

The resulting XPath is "//div[@class = 'pic']".

However, there may be other HTML elements with the same attribute class = "pic".

Hopefully not, since that would make the HTML element harder to identify.
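The tag / attribute / value pattern can be tried without a browser. The sketch below runs the same kind of expression with xml.etree.ElementTree on a hypothetical fragment in which two elements share class="pic"; the second image name and the menu block are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two boxes share the same class attribute.
PAGE = """
<body>
  <div class="pic"><img src="./assets/imagenes/resumen_general.jpg"/></div>
  <div class="pic"><img src="./assets/imagenes/otra_opcion.jpg"/></div>
  <div class="menu"><a href="#">Inicio</a></div>
</body>
"""

root = ET.fromstring(PAGE.strip())

# tag = div, attribute name = class, attribute value = pic
pics = root.findall(".//div[@class='pic']")
print(len(pics))

# Several elements match, so we still have to choose one, here the first
first_src = pics[0].find("img").get("src")
print(first_src)
```

When the expression matches more than one element, the result is a list and the element of interest has to be picked by position or by inspecting its content.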

lista_despegable = driver.find_element(By.ID, "select_ambito")
i = 1  # position of the option to click (assumed; any valid index works)
Select(lista_despegable).options[i].click()
# for i in range(5):
#     Select(lista_despegable).options[i].click()
# Use the find_elements method of the driver object we created for the browser
opciones = driver.find_elements(By.XPATH, "//div[@class='pic']")
# Note that the result is a list
opciones
[<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="f8b9c79f-8b08-4f79-9bc8-6c5dcf5d2f0d")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="bb5065e5-920c-4ebb-bdf7-51ffc77120fb")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="77416960-49bb-4627-ab8a-0956dde2f8e9")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="0ff6a22e-3750-4a8d-badc-8845b2bcec99")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="a6bcc2ee-67a6-4f9e-b7d8-92d9fce8c3d2")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="33e61681-5f48-4d0a-8773-104c775dd4dd")>]

!!! We have 6 HTML elements with the same attributes

These are the 6 image-box options

So be careful to choose the element you are interested in. In this example, it is the first option, Resumen general.

# Take the first element of the list
dir(opciones[0])  # list its attributes and methods
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_execute', '_id', '_parent', '_upload', 'accessible_name', 'aria_role', 'clear', 'click', 'find_element', 'find_element_by_class_name', 'find_element_by_css_selector', 'find_element_by_id', 'find_element_by_link_text', 'find_element_by_name', 'find_element_by_partial_link_text', 'find_element_by_tag_name', 'find_element_by_xpath', 'find_elements', 'find_elements_by_class_name', 'find_elements_by_css_selector', 'find_elements_by_id', 'find_elements_by_link_text', 'find_elements_by_name', 'find_elements_by_partial_link_text', 'find_elements_by_tag_name', 'find_elements_by_xpath', 'get_attribute', 'get_dom_attribute', 'get_property', 'id', 'is_displayed', 'is_enabled', 'is_selected', 'location', 'location_once_scrolled_into_view', 'parent', 'rect', 'screenshot', 'screenshot_as_base64', 'screenshot_as_png', 'send_keys', 'shadow_root', 'size', 'submit', 'tag_name', 'text', 'value_of_css_property']
# Explore its attributes
opciones[0].text  # there is no text
''
opciones[0].tag_name  # name of the tag that defines the HTML element
'div'
opciones[0].size  # size (height and width of the div element)
{'height': 119, 'width': 230}
opciones[0].location  # position, in coordinates
{'x': 210, 'y': 390}
opciones[0].rect  # width, height and coordinates
{'height': 119, 'width': 230, 'x': 209.87265014648438, 'y': 389.8302001953125}
opciones[0].value_of_css_property
''
opciones[0].get_attribute('innerHTML')  # returns what the div tag contains
'<img _ngcontent-ejv-c64="" src="./assets/imagenes/resumen_general.jpg">'
print(opciones[0].get_attribute('id'))     # no unique identifier (id is an attribute that identifies an HTML element)
print(opciones[0].get_attribute('name'))
print(opciones[0].get_attribute('value'))
print(opciones[0].get_attribute('class'))  # value of the class attribute
None None pic
# Apply the click method to the presidential-results option
opciones[1].click()
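Picking an option by its position (opciones[1]) breaks if the page reorders its boxes. A hedged alternative is to choose the box by what its inner HTML contains. The function first_containing is hypothetical and relies only on get_attribute('innerHTML'), which was used above; FakeElement stands in for a Selenium WebElement so the sketch runs without a browser.

```python
def first_containing(elements, needle):
    """Return the first element whose innerHTML contains `needle`, else None."""
    for element in elements:
        inner = element.get_attribute("innerHTML") or ""
        if needle in inner:
            return element
    return None


class FakeElement:
    """Stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, inner_html):
        self._inner_html = inner_html

    def get_attribute(self, name):
        return self._inner_html if name == "innerHTML" else None


# Hypothetical boxes; only the second image name comes from the page above
boxes = [
    FakeElement('<img src="./assets/imagenes/otra_opcion.jpg">'),
    FakeElement('<img src="./assets/imagenes/resumen_general.jpg">'),
]
match = first_containing(boxes, "resumen_general")
print(match is boxes[1])
```

With the real page, first_containing(opciones, "resumen_general") would return the Resumen general box regardless of its position in the list.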

Web Element

We select the scope (ámbito). It has 3 alternatives: TODOS, PERÚ and EXTRANJERO.

# Full path
# /html/body/onpe-root/onpe-layout-container/onpe-onpe-rgen-rsgr/div/div[2]/div[1]/div[1]/div/div/div/select
# Relative path
# //*[@id="select_ambito"]
# EUREKA: select_ambito has an id (a unique identifier), so we can locate this element by its id
<select _ngcontent-ejv-c105="" id="select_ambito" name="cod_ambito" class="select_ubigeo ng-pristine ng-valid ng-touched">
    <option _ngcontent-ejv-c105="" value="T">TODOS</option>
    <option _ngcontent-ejv-c105="" value="P">PERÚ</option>
    <option _ngcontent-ejv-c105="" value="E">EXTRANJERO</option>
    <!---->
</select>
# The attributes are id, name and class, plus value inside each option
# Check whether the <select></select> tag has attributes that make it unique
# "//select[@name='cod_ambito']"
driver.find_elements(By.XPATH, "//select[@name='cod_ambito']")
# !!! Perfect, the select element can also be identified by its name attribute
# So find_element is enough
driver.find_element(By.XPATH, "//select[@name='cod_ambito']")
# Alternative: By.NAME also works in this case
driver.find_element(By.NAME, "cod_ambito")
<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="4471d3fd-f687-4413-980a-e3e4666ca990")>
driver.find_elements(By.CLASS_NAME, "select_ubigeo")
# With By.CLASS_NAME, pass a single class name (here the first one), not the whole class string
driver.find_element(By.CLASS_NAME, "select_ubigeo")  # it is unique
<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="4471d3fd-f687-4413-980a-e3e4666ca990")>
driver.find_elements(By.CLASS_NAME, "row")  # the class "row" is not unique: several elements share it
[<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="f2c35b31-ee80-4429-a907-04a70c11669a")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="ffa9c82c-56a2-4be2-9491-e237212c89e5")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="4f456b44-9804-414c-8b0a-b670aad5d249")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="90bd73c1-bd05-4c43-b478-486249b98b79")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="1f6c39d9-96f3-4a82-9f85-89148f9ce633")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="cf1a0068-0fe6-4266-8acc-3fcf676ddcae")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="7fd25db2-cbc8-4ee4-b348-87e0ba7b982d")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="1f558db5-4d4e-41fb-a9b8-0ff4d6ee61fe")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="158a1988-5045-442b-b76d-f32fc8a69c4d")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="31400c3b-c92e-4fda-ad8a-238f6903c6c7")>, <selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="986ef5b7-ef29-45ec-8a33-c759d915e2f4")>]
# Create the object for the drop-down list
bottom = driver.find_element(By.ID, 'select_ambito')
bottom.text  # the options
'TODOS\nPERÚ\nEXTRANJERO'
bottom.tag_name
'select'
bottom.get_attribute('innerHTML')  # inner HTML
'<option _ngcontent-pyn-c101="" value="T">TODOS</option><option _ngcontent-pyn-c101="" value="P">PERÚ</option><option _ngcontent-pyn-c101="" value="E">EXTRANJERO</option><!---->'
bottom.get_attribute('value')  # 'T' (TODOS) is the initial, default selection
'T'
# The option tags always sit inside the drop-down list
# One option tag per choice:
# - <option _ngcontent-ejv-c105="" value="T">TODOS</option>
# - <option _ngcontent-ejv-c105="" value="P">PERÚ</option>
# - <option _ngcontent-ejv-c105="" value="E">EXTRANJERO</option>
print(bottom.get_attribute('id'))     # here there is a unique identifier (id is an attribute that identifies an HTML element)
print(bottom.get_attribute('name'))
print(bottom.get_attribute('value'))
print(bottom.get_attribute('class'))  # value of the class attribute
select_ambito cod_ambito T select_ubigeo ng-untouched ng-pristine ng-valid
# Access each option
bottom.find_elements(By.TAG_NAME, 'option')
# This works because the inner HTML of bottom contains HTML elements:
# three, one per option (TODOS, PERÚ, EXTRANJERO)
print(bottom.find_elements(By.TAG_NAME, 'option')[0].text)
print(bottom.find_elements(By.TAG_NAME, 'option')[1].text)
bottom.find_elements(By.TAG_NAME, 'option')[2].text
TODOS PERÚ
'EXTRANJERO'
bottom.find_elements(By.TAG_NAME, 'option')[1].click()  # click on PERÚ
# Options for department, province and district appear.
# The same logic applies to drill down and extract the information for each district
# Use Select to handle the options inside the <select></select> tag more conveniently
ambito = Select(bottom)
ambito.options[0]  # each option of the <select></select> tag is accessible by index
<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="2c01b473-4d39-4a1c-ae67-2691f518f313")>
ambito.options[1]  # PERÚ
<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="01fcc312-f92b-45d2-a974-2c6989de6c57")>
ambito.options[2]
<selenium.webdriver.remote.webelement.WebElement (session="b5a230f4c5749e1bf5a18a8c61b67008", element="712c21b0-434e-4565-8567-372df896f045")>
# Click on PERÚ
ambito.options[1].click()
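Clicking options[1] assumes PERÚ is always the second entry. A sketch that selects by visible text instead is shown below. The function click_option_by_text is hypothetical and uses only the options, text and click() members seen above; FakeOption and FakeSelect are stand-ins so it runs without a browser. Selenium's Select class also offers select_by_visible_text for the same purpose.

```python
def click_option_by_text(select, text):
    """Click the first option whose visible text equals `text` and return it."""
    for option in select.options:
        if option.text.strip() == text:
            option.click()
            return option
    raise ValueError("no option with text %r" % text)


class FakeOption:
    """Stand-in for an <option> WebElement, for illustration only."""

    def __init__(self, text):
        self.text = text
        self.clicked = False

    def click(self):
        self.clicked = True


class FakeSelect:
    """Stand-in for Select(bottom), for illustration only."""

    def __init__(self, texts):
        self.options = [FakeOption(t) for t in texts]


fake_ambito = FakeSelect(["TODOS", "PERÚ", "EXTRANJERO"])
chosen = click_option_by_text(fake_ambito, "PERÚ")
print(chosen.text, chosen.clicked)
```

With the real page, click_option_by_text(ambito, "PERÚ") would replace ambito.options[1].click().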
# Empty dictionary to store the tables
all_tables = {}
# Four iterations: the "--TODOS--" placeholder plus the first three departments
for dpt_idx in tqdm(range(4)):

    # Locate the department drop-down again on every pass,
    # because the page refreshes all of its HTML elements after each click
    dpt = Select(driver.find_element(By.ID, "select_departamento"))  # list of departments
    department = dpt.options[dpt_idx]                                # pick one department

    # Extract the text (department name)
    dpt_name = department.text
    print(dpt_name)

    # Skip the first option, which is --TODOS--
    if dpt_name != "--TODOS--":

        # Click on the department
        department.click()
        time.sleep(1)  # wait for the information to load

        prov = Select(driver.find_element(By.ID, "cod_prov"))  # list of provinces
        num_prov_options = len(prov.options)                   # number of province options

        # For each province, extract the data of its districts
        for prov_idx in range(num_prov_options):

            # Locate the province drop-down again, since the HTML is refreshed
            prov = Select(driver.find_element(By.ID, "cod_prov"))
            province = prov.options[prov_idx]  # pick one province

            # Get province name
            prov_name = province.text

            if prov_name != "--TODOS--":

                # Click on the province
                province.click()
                time.sleep(1)  # wait for the information to load

                # Get all elements from district
                dist = Select(driver.find_element(By.ID, "cod_dist"))
                num_dist_options = len(dist.options)

                for dist_idx in range(num_dist_options):

                    # Locate the district drop-down again, since the HTML is refreshed
                    dist = Select(driver.find_element(By.ID, "cod_dist"))  # list of districts
                    district = dist.options[dist_idx]                      # pick one district

                    # Get district name
                    dist_name = district.text

                    if dist_name != "-- SELECCIONE --":

                        # Click on the district
                        district.click()
                        time.sleep(1)  # wait for the information to load

                        # Get UBIGEO: the last segment of the current URL
                        ubigeo = driver.current_url.split("/")[-1]

                        # Extract the presidential-election table
                        table_path = driver.find_element(By.ID, "table-scroll")
                        table_html = table_path.get_attribute('innerHTML')  # the HTML tags inside

                        # read_html reads the <table></table> tags
                        table = pd.read_html(table_html)
                        #/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[4]/div[1]/div[3]/div

                        # Variable names: first row, from the third column onwards
                        row_new_columns = table[0].iloc[0, 2:]
                        clean_columns = row_new_columns \
                            .str.replace(" ", "_") \
                            .str.lower().str.replace("%", "share_") \
                            .apply(lambda x: unidecode.unidecode(x)) \
                            .tolist()

                        # Keep the relevant cells: from the second row and the third column onwards
                        table_clean = table[0].iloc[1:, 2:].copy()

                        # Rename columns
                        table_clean.columns = clean_columns

                        # Add columns with the department, province, district and ubigeo
                        table_clean['department'] = dpt_name
                        table_clean['province'] = prov_name
                        table_clean['district'] = dist_name
                        table_clean['ubigeo'] = ubigeo

                        # Store each table in the dictionary, keyed by ubigeo:
                        # {"010203": tabla1, "030201": tabla2}, where tabla1 and tabla2 are DataFrames
                        all_tables[ubigeo] = table_clean
--TODOS-- AMAZONAS ANCASH APURIMAC
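The column-name cleaning inside the loop (spaces to underscores, lower case, "%" to "share_", accents removed) can be tried on its own. The sketch below reproduces those steps on plain strings. It swaps unidecode for the standard-library unicodedata module, which is enough for Spanish accents but is not a full replacement, and the header names are hypothetical examples rather than the real ONPE headers.

```python
import unicodedata


def strip_accents(text):
    """Remove accent marks by dropping combining characters after decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_column(name):
    """Apply the same steps as the loop: spaces, lower case, '%', accents."""
    return strip_accents(name.replace(" ", "_").lower().replace("%", "share_"))


# Hypothetical header names, for illustration only
headers = ["ORGANIZACIÓN POLÍTICA", "TOTAL", "% VÁLIDOS"]
cleaned = [clean_column(h) for h in headers]
print(cleaned)
```

Note that a header such as "% VÁLIDOS" ends up with a double underscore ("share__validos"), because the space is replaced before the percent sign.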
a = "https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/P/010000/010200/010202"
a.split("/")  # split on "/"
['https:', '', 'resultadoshistorico.onpe.gob.pe', 'EG2021', 'ResumenGeneral', '10', 'P', '010000', '010200', '010202']
a.split("/")[-1] # ubigeo
'010202'
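A slightly more structured variant, sketched with urllib.parse from the standard library, is shown below. The function parse_result_url is hypothetical, and reading the last three path segments as department, province and district codes is an assumption based on this one example URL.

```python
from urllib.parse import urlsplit


def parse_result_url(url):
    """Split a results URL into its last three path segments.

    Assumption: those segments are the department, province and district
    (ubigeo) codes, as the example URL suggests.
    """
    parts = urlsplit(url).path.strip("/").split("/")
    department, province, district = parts[-3:]
    return {"department": department, "province": province, "ubigeo": district}


a = "https://resultadoshistorico.onpe.gob.pe/EG2021/ResumenGeneral/10/P/010000/010200/010202"
codes = parse_result_url(a)
print(codes)
```

Using urlsplit keeps the scheme and host out of the way, so the empty string produced by "https://" in the plain split never appears.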

/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[4]/div[1]/div[3]/div

all_tables['010202']
# Concatenate all the tables
# all_tables.values() yields the DataFrame stored under each key of the dictionary
final_data = pd.concat(all_tables.values()).reset_index(drop=True)
final_data
final_data.to_excel(r'first_round.xlsx', index=False)  # export to Excel
all_tables['010202']
# pd.read_html returns a list with every table found in the inner HTML
table[0]
table[1]  # raises IndexError: there is only one table
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [267], in <cell line: 1>()
----> 1 table[1]

IndexError: list index out of range