7. Web Scraping
Web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of the HTML and other files that comprise web pages), and then parses that data to extract the needed information.
7.1 Selenium
Selenium automates browsers. That's it!
Selenium is a browser-automation tool with Python bindings. It can drive a web browser through many tasks; one of them is web scraping, which lets us extract data that may otherwise be unavailable.
For this course, we use Chrome.
7.1 Installing Libraries
We need to install these two libraries
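The install cell is not shown here, so the package names below are an assumption: `selenium` itself plus `webdriver-manager`, a helper that is often paired with it. A minimal sketch that checks which of them still need installing:

```python
import importlib.util

# Assumed packages (import name -> pip name); the original cell is missing.
REQUIRED = {"selenium": "selenium", "webdriver_manager": "webdriver-manager"}

def missing_packages(required=REQUIRED):
    """Return the pip names of the packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```

In a notebook, the printed command can be run in a cell by prefixing it with `!`.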
7.2 Calling Libraries
7.3 Launch/Set the Driver
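The launch cell is missing from this export. A minimal sketch of what it might look like, assuming Selenium 4.6 or later (where Selenium Manager locates a matching chromedriver automatically) and a local Chrome installation. `launch_chrome` is a hypothetical helper name, not part of Selenium:

```python
def launch_chrome(headless=False):
    """Start Chrome under Selenium control and return the WebDriver."""
    # Imported here so the function can be defined before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # Run without opening a visible window (useful on servers).
        options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)

# Typical use in a notebook cell (opens a browser window):
# driver = launch_chrome()
# driver.get("https://www.example.com")
# driver.quit()
```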
This code opens a Chrome driver. We will use it to browse the web.
Chrome is being controlled by automated test software!
driver is a selenium.webdriver.chrome.webdriver.WebDriver object. This object has attributes and methods that help us navigate the web.
7.4.1. HTML
HTML stands for HyperText Markup Language. You can deduce that it’s a language for creating web pages. It’s not a programming language like Python or Java, but it’s a markup language. It describes the elements of a page through tags characterized by angle brackets.
The document always begins with <html> and ends with </html>. The <body> and </body> tags enclose the visible part of the HTML document. The <h1> to <h6> tags define the headings.
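To make this structure concrete, here is a small sketch that parses a minimal page with Python's standard-library `html.parser` and records the tags in the order they open. The page content is made up for illustration.

```python
from html.parser import HTMLParser

PAGE = """
<html>
  <body>
    <h1>This is heading 1</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""

class TagCollector(HTMLParser):
    """Collect the name of every opening tag, in document order."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(PAGE)
print(collector.tags)  # ['html', 'body', 'h1', 'p']
```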
7.4.1.1. HTML Headings
HTML headings are defined with the <h1> to <h6> tags. <h1> defines the most important heading. <h6> defines the least important heading.
We can use text cells since markdown reads html tags.
This is heading 1
This is heading 2
This is heading 3
7.4.1.2. HTML Paragraphs
HTML paragraphs are defined with the <p> tag. The <br> tag inserts a line break, much like "\n" in a Python string.
My first paragraph.
This is another paragraph for this text cell.
7.4.1.3. HTML Links
HTML links are defined with the <a> tag. The link's destination is given in the href attribute.
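Links matter for scraping because the destination sits in the `href` attribute of the `<a>` tag rather than in the visible text. A minimal sketch with the standard library's `html.parser`; the snippet and URL are placeholders, and `LinkCollector` is a hypothetical helper:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<p>Visit us at: <a href="https://www.example.com">Example.com</a></p>')
print(collector.links)  # ['https://www.example.com']
```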
7.4.1.3. Unordered HTML List
An unordered list starts with the <ul> tag. Each list item starts with the <li> tag.
- Coffee
- Tea
- Milk
7.4.1.4. Ordered HTML List
An ordered list starts with the <ol> tag. Each list item starts with the <li> tag.
1. Coffee
2. Tea
3. Milk
7.4.1.4. HTML Tables
A table in HTML consists of table cells arranged in rows and columns. Each table cell is defined by a <td> and a </td> tag. Each table row starts with a <tr> and ends with a </tr> tag.
| Manager | Club | Nationality |
|---|---|---|
7.4.1.5. HTML Iframes
An HTML iframe is used to display a web page within a web page.
Diploma
HTML iframe
Add personal information
Written by Jon Doe.
Visit us at:
Example.com
Box 564, Disneyland
USA
2.0 The td element
The td element defines a cell in a table:
| Cell A | Cell B |
| Cell C | Cell D |
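When scraping, a table like this one is usually turned back into rows of values. A minimal sketch using only the standard library (`html.parser`); the `TableParser` class is a hypothetical helper written for this example:

```python
from html.parser import HTMLParser

TABLE = """
<table>
  <tr><td>Cell A</td><td>Cell B</td></tr>
  <tr><td>Cell C</td><td>Cell D</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Turn <tr>/<td> markup into a list of rows, each a list of cell texts."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row starts
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")  # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # text outside cells is ignored

parser = TableParser()
parser.feed(TABLE)
print(parser.rows)  # [['Cell A', 'Cell B'], ['Cell C', 'Cell D']]
```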
3.0 Button
Click the button below to display the hidden content from the template element.
Click here
4.0 The form element
Last name:
Submit
Click the "Submit" button and the form-data will be sent to a page on the server called "action_page.php".
5.0 The label element
Click on one of the text labels to toggle the related radio button:
CSS
JavaScript
Submit
6.0 The select element
The select element is used to create a drop-down list.
Show hidden content
Click the "Submit" button and the form-data will be sent to a page on the server called "action_page.php".
7.0 Class attribute
CHINA
China has the largest population in the world.
INDIA
India has the second largest population in the world.
UNITED STATES
United States has the third largest population in the world.
8.0 Style
This is a heading
This is a paragraph.
9.0 The id Attribute
Use CSS to style an element with the id "myHeader":
My Header
10.0 Div tagname
This is a heading in a div element
This is some text in a div element.
7.4.1.6. HTML Tags - Key
| Tag | Description |
|---|---|
| <h1> to <h6> | Defines HTML headings |
| <ul> | Defines an unordered list |
| <ol> | Defines an ordered list |
| <p> | Defines a paragraph |
| <a> | Anchor tag; defines a hyperlink |
| <div> | Defines a division or section within an HTML document |
| <strong> | Defines important text |
| <table> | Defines a table, used to present data in tabular form |
| <td> | Defines a cell of an HTML table, which holds the table data |
| <iframe> | Defines an inline frame |
7.4. Identifying elements in a web page
To identify the elements of a web page, we need to inspect it. Open the driver's browser window and press Ctrl + Shift + I to open the developer tools.
One Element
| Method | Description |
|---|---|
| find_element_by_id | Use id. |
| find_element_by_name | Use name. |
| find_element_by_xpath | Use XPath. |
| find_element_by_tag_name | Use HTML tag. |
| find_element_by_class_name | Use class name. |
| find_element_by_css_selector | Use CSS selector. |
Multiple elements
| Method | Description |
|---|---|
| find_elements_by_id | Use id. |
| find_elements_by_name | Use name. |
| find_elements_by_xpath | Use XPath. |
| find_elements_by_tag_name | Use HTML tag. |
| find_elements_by_class_name | Use class name. |
| find_elements_by_css_selector | Use CSS selector. |
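Note that the `find_element_by_*` and `find_elements_by_*` helpers were deprecated in Selenium 4 and removed in Selenium 4.3; newer versions use `driver.find_element(By.<STRATEGY>, value)` and `driver.find_elements(...)` instead. The sketch below maps the old names onto the locator strings that `By` defines, so it runs without Selenium installed. `LEGACY_TO_BY`, `find_one`, and `FakeDriver` are hypothetical names used only for illustration.

```python
# Locator strategy strings; these are the values of By.ID, By.NAME, etc.
# in selenium.webdriver.common.by.By.
LEGACY_TO_BY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}

def find_one(driver, legacy_name, value):
    """Call the Selenium 4 style find_element for a legacy method name."""
    return driver.find_element(LEGACY_TO_BY[legacy_name], value)

class FakeDriver:
    """Stand-in for a WebDriver so the example runs without a browser."""
    def find_element(self, by, value):
        return (by, value)

print(find_one(FakeDriver(), "find_element_by_xpath", "//div[@class='pic']"))
# ('xpath', "//div[@class='pic']")
```

With a real driver, the equivalent call is `driver.find_element(By.XPATH, "//div[@class='pic']")` after `from selenium.webdriver.common.by import By`.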
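7.4.1. XPath
XPath (XML Path Language) is a syntax for locating elements in a document by describing a path through its structure. Selenium uses XPath expressions to find any element in the HTML of a page.
The basic format of an XPath expression is //tagname[@attribute='value']: // searches anywhere in the document, tagname is the HTML tag, and the part in brackets requires the attribute to have the given value.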
Keep it simple! To find the XPath of an element:
1. Go to the element.
2. Right-click.
3. Choose Inspect. You may have to do it twice.
4. Go to the highlighted line.
5. Right-click.
6. Choose Copy.
7. Choose Copy full XPath.
Example
Use find_element_by_xpath and click.
I want to click the Resumen general (general summary) option.
To do so, right-click and select Inspect (you may need to do it twice).
Then identify the HTML element that corresponds to the image. Right-click it, copy the path, and you are done.
A long full XPath is sensitive to changes in the page, so other methods are preferable.
class is the most common attribute of an HTML element.
class assigns styling (font, color, size, shape, spacing, etc.).
Remember that the goal is to find a way to identify the HTML element.
However, more than one HTML element can share the same styling, which makes it harder to identify the one we want.
Remember that HTML is not a programming language like Python or R; it only describes the layout of the web page.
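Why a full XPath is fragile can be seen without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath, on two made-up versions of a page: the second wraps the content in one extra `div`, which is enough to break the positional path while the attribute-based one keeps working.

```python
import xml.etree.ElementTree as ET

BEFORE = "<html><body><div><div class='pic'>Resumen general</div></div></body></html>"
# Same page after a redesign: one extra wrapper <div> around the content.
AFTER = "<html><body><div><div><div class='pic'>Resumen general</div></div></div></body></html>"

FULL_PATH = "./body/div/div"           # positional, like a copied full XPath
BY_ATTRIBUTE = ".//div[@class='pic']"  # relative, keyed on the class attribute

def text_at(page, path):
    """Return the text of the first element matching path, or None."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text

print(text_at(BEFORE, FULL_PATH))     # Resumen general
print(text_at(AFTER, FULL_PATH))      # None (the path now lands on the wrapper div)
print(text_at(BEFORE, BY_ATTRIBUTE))  # Resumen general
print(text_at(AFTER, BY_ATTRIBUTE))   # Resumen general
```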
The tag is div.
Attribute name: class
Attribute value: pic
The resulting XPath is "//div[@class = 'pic']"
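The same recipe (tag, attribute name, attribute value) can be written as a tiny helper. This is only a sketch: `build_xpath` is a hypothetical function, and it assumes the value contains no single quotes.

```python
def build_xpath(tag, attribute, value):
    """Build a relative XPath of the form //tag[@attribute='value']."""
    return f"//{tag}[@{attribute}='{value}']"

print(build_xpath("div", "class", "pic"))  # //div[@class='pic']
```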
However, there may be other HTML elements with the same attribute class = "pic".
Let's hope not, since that would make it harder to identify the HTML element.
!!! There are 6 HTML elements with the same attributes.
These are the 6 image-box options.
So be careful to choose the element you are interested in. In this example, that is the first option, Resumen general.
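The same situation can be reproduced offline. In this sketch (standard library only, with made-up markup standing in for the real page) six `div` elements share `class="pic"`, so the XPath returns a list and the wanted element has to be picked by position. Only the first label, Resumen general, comes from the example; the others are placeholders.

```python
import xml.etree.ElementTree as ET

# Six boxes with identical attributes; labels 2 to 6 are placeholders.
labels = ["Resumen general"] + [f"Option {i}" for i in range(2, 7)]
page = "<body>" + "".join(f"<div class='pic'>{t}</div>" for t in labels) + "</body>"

matches = ET.fromstring(page).findall(".//div[@class='pic']")
print(len(matches))     # 6
print(matches[0].text)  # Resumen general
```

With Selenium 4 the analogous call would be `driver.find_elements(By.XPATH, "//div[@class='pic']")[0].click()`, assuming the wanted option is the first match.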

We select the scope (ámbito). It has 3 alternatives: TODOS, PERU and EXTRANJERO.
XPath of the drop-down:
//*[@id="select_ambito"]
One of its options, as it appears in the page source:
<option _ngcontent-ejv-c105="" value="T">TODOS</option>
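Pasting an XPath or an HTML snippet directly into a code cell raises a `SyntaxError`, because neither is Python; the XPath has to be passed to Selenium as a string. A hedged sketch of choosing the TODOS option, assuming Selenium 4, an already opened `driver`, and that the element with id `select_ambito` is a standard `<select>`. `choose_scope` is a hypothetical helper name:

```python
def choose_scope(driver, value="T"):
    """Select an option of the 'select_ambito' drop-down by its value attribute."""
    # Imported lazily so the function can be defined without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    dropdown = driver.find_element(By.XPATH, '//*[@id="select_ambito"]')
    Select(dropdown).select_by_value(value)  # "T" is the value of TODOS

# Typical use, once the page is loaded:
# choose_scope(driver)  # picks TODOS
```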
/html/body/onpe-root/onpe-layout-container/onpe-onpe-epres-re/div[1]/div[4]/div[1]/div[3]/div
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Input In [267], in <cell line: 1>()
----> 1 table[1]
IndexError: list index out of range
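An `IndexError` like this one means the list has fewer items than the index asks for; for example, Selenium's `find_elements` returns an empty list when nothing matches. How `table` was built is not shown, so the sketch below only assumes it is a Python list. `item_or_none` is a hypothetical helper:

```python
def item_or_none(items, index):
    """Return items[index] if it exists, otherwise None instead of raising."""
    if -len(items) <= index < len(items):
        return items[index]
    return None

table = ["only one table found"]  # placeholder for the scraped result
print(len(table))                 # 1, so table[1] would raise IndexError
print(item_or_none(table, 1))     # None
print(item_or_none(table, 0))     # only one table found
```

Checking `len(table)` first, or using a guard like this, makes it clear whether the page returned fewer elements than expected.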