Webscraping With Selenium

What is Selenium? Selenium is an open-source tool for automating web browsers. It is primarily used in industry for testing, but it can also be used for web scraping. We’ll use the Chrome browser, but other browsers work almost the same way. Now let us see how to use Selenium for web scraping.

Scraping data from the web is a common technique in data analysis, and it can yield a unique data set that no one else has analyzed before. It can be done from several programming languages, including Python and, as in this post, R.
Web scraping, also known as web harvesting or web data extraction, is the automated extraction of data from websites. A scraping script can either request a URL directly over HTTP or drive a web browser. Selenium takes the second approach: it automates a real web browser.

In this post I’ll talk about the RSelenium package as a tool to navigate websites and how it can be combined with the rvest package to scrape dynamic web pages. To understand this post, you’ll need basic knowledge of rvest, HTML and CSS. You can download the full R script HERE!

Observation: Even if you are not familiar with them, I explain everything I did in as much detail as possible. For that reason, those who already know this material might find some parts of the post redundant. Feel free to read what you need and skip what you already know!

Let’s compare the following websites:

On IMDb, if you search for a particular movie (for example, this one), you can see that the URL changes, and that URL is different from any other movie (for example, this one). The same behavior is shown if you search for different actors.

On the other hand, if you go to Premier League Player Stats, you will notice that modifying the filters or clicking the pagination button to access more data doesn’t produce changes on the URL.

As I understand it, the first website is an example of a static web page, while the second one is an example of a dynamic webpage.

The following definitions were taken from https://www.pcmag.com/.

Static Web Page: A Web page (HTML page) that contains the same information for all users. Although it may be periodically updated from time to time, it does not change with each user retrieval.
Dynamic Web Page: A Web page that provides custom content for the user based on the results of a search or some other request. Also known as “dynamic HTML” or “dynamic content”, the “dynamic” term is used when referring to interactive Web pages created for each user.

rvest is a great tool to scrape data from static web pages (check out Creating a Movies Dataset to see an example!).

But when it comes to dynamic web pages, rvest alone can’t get the job done. This is when RSelenium joins the party…

Java

You need to have Java installed. You can use Windows’ Command Prompt to check this. Just type java -version and press Enter. You should see something that looks like this:

If it throws an error, it might mean that you don’t have Java installed. You can download it from HERE.

R Packages

The following packages need to be installed and loaded in order to run the code written in this post.
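
The original package list is not reproduced here, so the sketch below is an assumption based on the functions used later in the post: RSelenium, rvest, and the tidyverse (for str_to_title() and reduce()). The helper name missing_packages() is hypothetical.

```r
# Packages assumed from the functions used later in this post;
# adjust the vector if your script needs others.
required <- c("RSelenium", "rvest", "tidyverse")

# Hypothetical helper: returns the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  # Only print the command; installing is left to the reader
  message("Install with: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  # Load everything only when all packages are available
  invisible(lapply(required, library, character.only = TRUE))
}
```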

Start Selenium

Starting a Selenium server and browser is pretty straightforward using rsDriver().
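
A minimal sketch of that call, wrapped in a function so that nothing opens until you ask for it. The wrapper name start_selenium() and the port value are assumptions, not the post's original code; rD and remDr are the object names used throughout the post.

```r
# Sketch: start a Selenium server plus a Chrome browser and return both handles.
# The port and chromever defaults are illustrative assumptions.
start_selenium <- function(port = 4567L, chromever = "latest") {
  rD <- RSelenium::rsDriver(
    browser   = "chrome",
    port      = port,
    chromever = chromever
  )
  # rD holds the server and the client; the client is what the post calls remDr
  list(rD = rD, remDr = rD[["client"]])
}

# Usage (opens a real browser, so it is left commented out):
# session <- start_selenium()
# rD      <- session$rD
# remDr   <- session$remDr
```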

However, when you run the code above it may produce the following error:

This error is addressed in this StackOverflow post. Basically, it means that there is a mismatch between the ChromeDriver and the Chrome Browser versions. As mentioned in the post, each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.3683.20 supports all Chrome versions that start with 73.0.3683.

The parameter chromever defined this way always uses the latest compatible ChromeDriver version (the code was edited from this StackOverflow post).
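
The matching rule can be sketched in base R. This is not the StackOverflow code itself: here the installed Chrome version and the available ChromeDriver versions are passed in as plain strings, and pick_chromedriver() is a hypothetical name. In practice the list of drivers would come from your own machine.

```r
# Hypothetical helper: choose the newest ChromeDriver whose
# major.minor.build matches the installed Chrome version.
pick_chromedriver <- function(chrome_version, available) {
  # Keep major.minor.build, e.g. "73.0.3683.103" -> "73.0.3683"
  parts  <- strsplit(chrome_version, ".", fixed = TRUE)[[1]]
  prefix <- paste(parts[1:3], collapse = ".")

  # Compatible drivers start with that prefix followed by a dot
  compatible <- available[startsWith(available, paste0(prefix, "."))]
  if (length(compatible) == 0) {
    stop("No ChromeDriver found for Chrome ", chrome_version)
  }

  # Compare as version numbers, not as text, so that ".68" beats ".9"
  as.character(max(numeric_version(compatible)))
}

# Example with made-up inputs
drivers   <- c("73.0.3683.20", "73.0.3683.68", "74.0.3729.6")
chromever <- pick_chromedriver("73.0.3683.103", drivers)
```

The result can then be passed as the chromever argument of rsDriver().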

After you run rD <- RSelenium::rsDriver(...), if everything worked correctly, a new Chrome window will open. This window should look like this:

You can find more information about rsDriver() in the Basics Vignette.

In this section I’ll apply different methods to the remDr object created above. I’m only going to describe the methods that I think will be used most frequently. For a complete reference, check the package documentation.

navigate(url): Navigate to a given url.

goBack(): Equivalent to hitting the back button on the browser.
goForward(): Equivalent to hitting the forward button on the browser.

refresh(): Reload the current page.

getCurrentUrl(): Retrieve the url of the current page.

maxWindowSize(): Set the size of the browser window to maximum. By default, the browser window size is small, and some elements of the website you navigate to might not be available right away (I’ll talk more about this in the next section).

getPageSource()[[1]]: Get the current page source. This method, combined with rvest, is what makes it possible to scrape dynamic web pages. The XML document returned by the method can then be read using rvest::read_html(). The method returns a list, which is the reason behind [[1]].

open(silent = FALSE): Send a request to the remote server to instantiate the browser. I use this method when the browser closes for some reason (for example, inactivity). If you have already started the Selenium server, you should run this instead of rD <- RSelenium::rsDriver(...) to re-open the browser.

close(): Close the current session.
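
The methods above can be combined as in the sketch below. It assumes remDr is the client object created by rsDriver(); the function name tour_page() and the example URL are hypothetical.

```r
# Sketch: visit a page, grab its rendered HTML, then exercise the history methods.
tour_page <- function(remDr, url) {
  remDr$navigate(url)                     # load the page
  remDr$maxWindowSize()                   # larger window, fewer hidden elements
  current <- remDr$getCurrentUrl()[[1]]   # returned as a list, hence [[1]]
  html    <- remDr$getPageSource()[[1]]   # rendered source as one string
  remDr$goBack()
  remDr$goForward()
  remDr$refresh()
  list(url = current, html = html)
}

# With a live session the HTML can be handed to rvest:
# page <- tour_page(remDr, "https://www.r-project.org")
# doc  <- rvest::read_html(page$html)
# remDr$close()
```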

Working with Elements

findElement(using, value): Search for an element on the page, starting from the document root. The located element is returned as an object of the webElement class. To use this function you need some basic knowledge of HTML and CSS (or XPath, etc.). This Chrome extension, called SelectorGadget, might help.
highlightElement(): Utility function to highlight the current element. This helps to check that you selected the intended element.
sendKeysToElement(): Send a sequence of keystrokes to an element. The keystrokes are sent as a list. Plain text is entered as an unnamed element of the list. Keyboard entries are defined in selKeys and should be listed with the name key.
clearElement(): Clear a TEXTAREA or text INPUT element’s value.
clickElement(): Click the element. You can click links, check boxes, dropdown lists, etc.
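
A short sketch of these element methods working together. The function names and the default selector are placeholders, so replace them with selectors taken from the page you are scraping.

```r
# Sketch: type a query into a search box and submit it.
# box_selector is a placeholder, not a selector from a real site.
search_for <- function(remDr, query, box_selector = "input[type='search']") {
  box <- remDr$findElement(using = "css selector", value = box_selector)
  box$highlightElement()                              # visual check of the match
  box$clearElement()                                  # remove any previous text
  box$sendKeysToElement(list(query, key = "enter"))   # plain text, then Enter
  invisible(box)
}

# Sketch: find one element by CSS selector and click it.
click_css <- function(remDr, selector) {
  remDr$findElement(using = "css selector", value = selector)$clickElement()
}

# Usage with a live session (commented out because it needs a browser):
# search_for(remDr, "web scraping")
# click_css(remDr, "a.some-link")
```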

Other Methods

Even though I have never used them, I believe these methods are worth mentioning. For more information, check the package documentation.

In this example, I’ll scrape data from Premier League Player Stats. This is what the website looks like:

You will notice that when you modify the Filters, the URL does not change. So you can’t use rvest alone to dynamically scrape this website. Also, if you scroll down to the end of the table you’ll see that there are pagination buttons. If you click them, you get more data, but again, the URL does not change. Here you can see what those pagination buttons look like:

Observation: Even though choosing a different stat does change the URL, I’ll work as if it didn’t.

Target Dataset

The dataset I want will have the following variables:

Player: Indicates the player name.
Nationality: Indicates the nationality of the player.
Season: Indicates the season the stats correspond to.
Club: Indicates the club the player belonged to in the season.
Position: Indicates the player position in the season.
Stats: One column for each Stat.

For simplicity, I’ll scrape data from seasons 2017/18 and 2018/19, and only from the Goals, Assists, Minutes Played, Passes, Shots and Fouls stats. This means that our dataset will have a total of 11 columns.

Before we start…

In order to run the code below, you have to start a Selenium server and browser, and create the remDr object. This procedure was described in the Start Selenium section.

First Steps

The code chunk below navigates to the website, increases the window size to find elements that might be hidden (for example, when the window is small I can’t see the Filters) and then clicks the “Accept Cookies” button.

You might notice two things:

The use of the Sys.sleep() function. Here, this function is used to give the website enough time to load. Sometimes, if the element you want to find isn’t loaded when you search for it, it will produce an error.
The use of CSS selectors. To select an element using CSS you can press F12 and inspect the page source (right-clicking the element and selecting Inspect will show you which part of that code refers to the element) and/or use this Chrome extension, called SelectorGadget. I recommend learning a little about HTML and CSS and using these two approaches together. SelectorGadget helps, but sometimes you will need to inspect the source to get exactly what you want. In the next subsection I’ll show how I selected certain elements by inspecting the page source.
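
A sketch of these first steps. The post relies on fixed Sys.sleep() pauses; the variant here swaps in a polling helper that retries findElement() until the element appears or a timeout passes. wait_for_element() and open_site() are hypothetical names, and the URL and cookie-button selector are supplied by the caller because the real values are not reproduced here.

```r
# Hypothetical helper: poll for an element instead of sleeping a fixed time.
wait_for_element <- function(remDr, value, using = "css selector",
                             timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElement() raises an error while the element is not in the page
    elem <- tryCatch(remDr$findElement(using = using, value = value),
                     error = function(e) NULL)
    if (!is.null(elem)) return(elem)
    if (Sys.time() >= deadline) stop("Timed out waiting for: ", value)
    Sys.sleep(interval)
  }
}

# First steps: open the site, enlarge the window, accept cookies.
open_site <- function(remDr, url, cookie_selector) {
  remDr$navigate(url)
  remDr$maxWindowSize()
  wait_for_element(remDr, cookie_selector)$clickElement()
}

# Usage with a live session (selector is a placeholder):
# open_site(remDr, "https://www.premierleague.com/stats/top/players/goals",
#           "button.accept-cookies")
```

Polling avoids guessing how long a page needs to load, at the cost of a slightly longer helper.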

Getting Values to Iterate Over

I know that in order to get the data, I’ll have to iterate over different lists of values. In particular, I need a list of stats, seasons, and player positions.

We can use rvest to scrape the website and get these lists. To do so, we need to find the corresponding nodes. As an example, after the code I’ll show where I searched for the required information in the page source for the stats and seasons lists.

The code below uses rvest to create the lists we’ll use in the loops.

Observation: Even though in the source we don’t see each word with its first letter uppercased, that is exactly what the dropdown list shows (for example, we have “Clean Sheets” instead of “Clean sheets”). I was getting an error when trying to scrape these types of stats, and making them look like the dropdown list solved the issue. That’s the reason behind str_to_title().

Stats

This is my view when I open the stats dropdown list and right click and inspect the Clean Sheets stat.

Taking a closer look at the source where that element is present we get:

Seasons

This is my view when I open the seasons dropdown list and right click and inspect the 2016/17 season.

Taking a closer look at the source where that element is present we get:

As you can see, we have an attribute named data-dropdown-list whose value is FOOTBALL_COMPSEASON and inside we have li tags where the attribute data-option-name changes for each season. This will be useful when defining how to iterate using RSelenium.
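
As a sketch of how those attributes can be used, the snippet below reads the season names with rvest and builds a CSS selector for clicking one of them. The HTML string is a simplified stand-in for remDr$getPageSource()[[1]] (the ul wrapper is an assumption), and season_selector() is a hypothetical helper.

```r
library(rvest)

# Simplified stand-in for the real page source
page_source <- '
<ul data-dropdown-list="FOOTBALL_COMPSEASON">
  <li data-option-name="2018/19">2018/19</li>
  <li data-option-name="2017/18">2017/18</li>
</ul>'

# Read the season names from the data-option-name attribute of each li
seasons <- read_html(page_source) %>%
  html_elements("[data-dropdown-list=FOOTBALL_COMPSEASON] li") %>%
  html_attr("data-option-name")

# Hypothetical helper: CSS selector that targets one season in the dropdown,
# suitable for remDr$findElement(using = "css selector", value = ...)
season_selector <- function(season) {
  sprintf("[data-dropdown-list=FOOTBALL_COMPSEASON] li[data-option-name='%s']",
          season)
}

# Stat names are title-cased so they match the dropdown labels
stat_label <- stringr::str_to_title("Clean sheets")
```

Older rvest versions use html_nodes() in place of html_elements().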

Positions

The logic behind getting the CSS for the positions is similar to the one described above, so I won’t be showing it.

Webscraping Loop

The code has comments on each step, so you can check it out! But before that, I’ll give an overview of the loop.

Preallocate stats vector. This list will have a length equal to the number of stats to be scraped.
For each stat:
1. Click the stat dropdown list
2. Click the corresponding stat
3. Preallocate seasons vector. This list will have a length equal to the number of seasons to be scraped.
4. For each season inside stat:
  1. Click the seasons dropdown list
  2. Click the corresponding season
  3. Preallocate positions vector. This list will have length = 4 (positions are fixed: GOALKEEPER, DEFENDER, MIDFIELDER and FORWARD).
  4. For each position inside season inside stat
    1. Click the position dropdown list
    2. Click the corresponding position
    3. Check that there is a table with data (if not, go to next position)
    4. Scrape the first table
    5. While “Next Page” button exists
      1. Click “Next Page” button
      2. Scrape new table
      3. Append new table to table
    6. Change stat colname and add position data
    7. Go to the top of the website
  5. Rowbind each position table
  6. Add season data
5. Rowbind each season table
6. Assign the table to the corresponding stat element.

The result of this loop is a populated list with a number of elements equal to the number of stats scraped. Each of these elements is a tibble.
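
The structure of the loop can be sketched in base R without a browser. This is a skeleton under assumptions, not the post's code: the clicking and table parsing are hidden behind callback arguments (scrape_one, read_table, has_next, click_next), all of which are hypothetical names, and the step that renames the stat column is left out.

```r
# Collect every page of a table. The three callbacks stand in for the
# RSelenium/rvest code that reads the table and handles "Next Page".
scrape_paginated <- function(read_table, has_next, click_next) {
  tbl <- read_table()
  while (has_next()) {
    click_next()
    tbl <- rbind(tbl, read_table())   # append the new page
  }
  tbl
}

# Nested loops over stats, seasons and positions with preallocated lists.
# scrape_one(stat, season, position) returns a non-empty data frame,
# or NULL when there is no table for that combination.
scrape_all <- function(stats, seasons, positions, scrape_one) {
  data_stats <- vector("list", length(stats))
  names(data_stats) <- stats
  for (stat in stats) {
    data_seasons <- vector("list", length(seasons))
    for (j in seq_along(seasons)) {
      data_positions <- vector("list", length(positions))
      for (k in seq_along(positions)) {
        tbl <- scrape_one(stat, seasons[j], positions[k])
        if (is.null(tbl)) next                 # no data: go to next position
        tbl$Position <- positions[k]           # add position data
        data_positions[[k]] <- tbl
      }
      season_tbl <- do.call(rbind, data_positions)   # row-bind positions
      if (!is.null(season_tbl)) {
        season_tbl$Season <- seasons[j]              # add season data
        data_seasons[[j]] <- season_tbl
      }
    }
    data_stats[[stat]] <- do.call(rbind, data_seasons)   # row-bind seasons
  }
  data_stats
}
```

In a real run, scrape_one() would click the three dropdowns and call scrape_paginated() with functions that parse remDr$getPageSource()[[1]].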

This may take some time to run, so you can choose fewer stats to try it out.

As I mentioned, you can check the code!

Observation: Be careful when you add more stats to the loop. For example, Clean Sheets has the Position filter hidden, so the code should be modified (for example, by adding some “if” statement).

Data Wrangling

Finally, some data wrangling is needed to create our dataset. data_topStats is a list with 6 elements, each one of those elements is a tibble. The next code chunk removes the Rank column from each tibble, reorders the columns and then makes a full join by all the non-stat variables using reduce() (the reason behind this full join is that not all players have all stats). In the last line of code I replace NA values with zero in the stats variables.
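
A sketch of that wrangling step on toy data. The two small tibbles stand in for data_topStats and their values are made up; the column names follow the table shown in this section, and id_cols is a hypothetical name for the non-stat variables.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-in for data_topStats: one tibble per stat (values are made up)
data_topStats <- list(
  Goals = tibble::tibble(
    Rank = 1:2, Player = c("Player A", "Player B"),
    Nationality = c("X", "Y"), Club = c("Club 1", "Club 2"),
    Goals = c(5, 4), Position = "DEFENDER", Season = "2018/19"
  ),
  Assists = tibble::tibble(
    Rank = 1L, Player = "Player A", Nationality = "X", Club = "Club 1",
    Assists = 1, Position = "DEFENDER", Season = "2018/19"
  )
)

# Non-stat variables used as join keys
id_cols <- c("Season", "Position", "Club", "Player", "Nationality")

dataset <- data_topStats %>%
  map(~ select(.x, -Rank)) %>%                            # drop the Rank column
  map(~ select(.x, all_of(id_cols), everything())) %>%    # identifiers first
  reduce(full_join, by = id_cols) %>%                     # one column per stat
  mutate(across(-all_of(id_cols), ~ replace_na(.x, 0)))   # missing stat -> 0
```

Player B has no row in the Assists tibble, so the full join produces an NA there, which the last line turns into zero.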

This is what the data looks like.

Season	Position	Club	Player	Nationality	Goals	Assists	Minutes Played	Passes	Shots	Fouls
2018/19	DEFENDER	Brighton and Hove Albion	Shane Duffy	Ireland	5	1	3088	1305	37	22
2018/19	DEFENDER	AFC Bournemouth	Nathan Aké	Netherlands	4	0	3412	1696	25	28
2018/19	DEFENDER	Cardiff City	Sol Bamba	Cote D’Ivoire	4	1	2475	550	22	35
2018/19	DEFENDER	Wolverhampton Wanderers	Willy Boly	France	4	0	3168	1715	24	29
2018/19	DEFENDER	Everton	Lucas Digne	France	4	4	2966	1457	34	39
2018/19	DEFENDER	Wolverhampton Wanderers	Matt Doherty	Ireland	4	5	3147	1399	46	30

The framework described here is an approach to working in parallel with RSelenium.

First, we load the libraries we need.

The function defined below stops Selenium on each core.

We determine the number of cores we’ll use. In this example, I use four cores.

We have to list the ports that are going to be used to start Selenium.

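We use clusterApply() to start Selenium on each core. Pay attention to the use of the superassignment operator (<<-), which stores the Selenium objects in each worker’s global environment so that later calls can reuse them. When you run this function, you will see that four Chrome windows are opened.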

This is an example of pages that we will open in parallel. This list will change depending on the particular scenario.

Use parLapply() to work in parallel. When you run this, you will see that each browser opens one website, and one is still blank. This is a simple example, I haven’t defined any scraping, but of course you can!

When you are done, stop Selenium on each core and stop the cluster.
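
The mechanism behind this framework can be tried without Selenium. In the sketch below each worker only stores its port number with <<- instead of starting a browser; the comments mark where rsDriver() and remDr$navigate() would go. The port numbers, the URLs and the name worker_port are assumptions, and two workers are used instead of four to keep it light.

```r
library(parallel)

n_cores <- 2                      # the post uses four
ports   <- list(4567L, 4568L)     # one port per worker (example values)
clust   <- makeCluster(n_cores)

# Each worker saves its own state with <<-, so later calls can reuse it.
# In the real workflow this is where rsDriver(port = port) would be started
# and the client saved with remDr <<- rD[["client"]].
invisible(clusterApply(clust, ports, function(port) {
  worker_port <<- port
  NULL
}))

pages <- list("https://www.r-project.org", "https://cran.r-project.org")

# Each worker reads the state it saved above. With Selenium running, this
# would be remDr$navigate(url) followed by the scraping code.
visited <- parLapply(clust, pages, function(url) {
  sprintf("port %d -> %s", worker_port, url)
})

stopCluster(clust)
```

Run this at the top level of a script or console. Inside another function, the worker functions would carry that function's environment with them and the superassignment would not land in the worker's global environment.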

Observation: Sometimes, when working in parallel some of the browsers close for no apparent reason (or at least a reason that I don’t understand).

Workaround browser closing for no reason

Consider the following scenario: your loop navigates to a certain website, clicks some elements and then gets the page source to scrape using rvest. If in the middle of that loop the browser closes, you will get an error (for example, it won’t navigate to the website, or the element won’t be found). You can work around these errors using tryCatch(), but when you skip the iteration where the error occurred, when you try to navigate to the website in the following iteration, an error would occur again (because there is no browser open!).

You could, for example, use remDr$open() at the beginning of the loop and remDr$close() at the end, but I think that would open and close many browsers and make the process slower.

So I created this function that handles part of the problem (even though the iteration where the browser closed will not finish, the next one will and the process won’t stop).

It basically tries to get the current URL using remDr$getCurrentUrl(). If no browser is open, this will throw an error, and if we get an error, it will open a browser.
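
A minimal sketch of such a function, assuming remDr is the client object from rsDriver(). The name ensure_browser() is hypothetical and this is not necessarily the post's exact implementation.

```r
# Hypothetical helper: reopen the browser if the session no longer responds.
ensure_browser <- function(remDr) {
  alive <- tryCatch({
    remDr$getCurrentUrl()   # errors when no browser is open
    TRUE
  }, error = function(e) FALSE)
  if (!alive) remDr$open(silent = TRUE)
  invisible(alive)          # TRUE if the browser was already open
}

# Usage at the start of each loop iteration (needs a live session):
# for (url in urls) {
#   ensure_browser(remDr)
#   remDr$navigate(url)
# }
```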

Closing Selenium

Sometimes, even if the browser window is closed, when you re-run rD <- RSelenium::rsDriver(...) you might encounter an error like:

This means that the connection was not completely closed. You can execute the lines of code below to stop Selenium.
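
One way to do this is sketched below, assuming rD is the object returned by rsDriver(). The function name stop_selenium() is hypothetical. The optional taskkill step is Windows-only and ends every Java process, so it is off by default.

```r
# Sketch: close the browser session and stop the Selenium server.
stop_selenium <- function(rD, kill_java = FALSE) {
  try(rD[["client"]]$close(), silent = TRUE)   # the browser may already be gone
  rD[["server"]]$stop()                        # stop the Selenium server
  if (kill_java && .Platform$OS.type == "windows") {
    # Last resort: kills all Java processes, not only Selenium
    system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
  }
  invisible(TRUE)
}

# Usage (needs a running session):
# stop_selenium(rD)
# rm(rD, remDr); gc()
```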

You can check this StackOverflow post for more information.

Wrapper Functions

You can create functions in order to type less. Suppose that you navigate to a certain website where you have to click one link that sends you to a site with different tabs. You can use something like this:

Observation: this function is theoretical, it won’t work if you run it.
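
A sketch of what such a wrapper could look like. Like the original, it is theoretical: navigate_to_tab() is a hypothetical name and both selectors are placeholders that must come from the site you are working with.

```r
# Theoretical wrapper: open a site, follow one link, then select a tab.
navigate_to_tab <- function(remDr, url, link_selector, tab_selector, pause = 2) {
  remDr$navigate(url)
  Sys.sleep(pause)                                   # let the page load
  remDr$findElement("css selector", link_selector)$clickElement()
  Sys.sleep(pause)                                   # let the new page load
  remDr$findElement("css selector", tab_selector)$clickElement()
  invisible(remDr$getCurrentUrl()[[1]])              # where we ended up
}

# Usage with a live session (placeholders throughout):
# navigate_to_tab(remDr, "https://example.com", "a.section-link", "li.tab-stats")
```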

I won’t show it here, but you can create functions to find elements, check if an element exists in the DOM (Document Object Model), try to click an element if it exists, parse the data table you are interested in, etc. You can check this StackOverflow post for examples.
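
Two of those helpers can be sketched with findElements(), which returns an empty list when nothing matches instead of raising an error. The function names are hypothetical and the selectors in the usage lines are placeholders.

```r
# Hypothetical helper: TRUE when at least one element matches the selector.
element_exists <- function(remDr, value, using = "css selector") {
  length(remDr$findElements(using = using, value = value)) > 0
}

# Hypothetical helper: click the first match, or do nothing if there is none.
click_if_exists <- function(remDr, value, using = "css selector") {
  found <- remDr$findElements(using = using, value = value)
  if (length(found) == 0) return(invisible(FALSE))
  found[[1]]$clickElement()
  invisible(TRUE)
}

# Usage with a live session:
# while (element_exists(remDr, ".next-page")) {
#   click_if_exists(remDr, ".next-page")
# }
```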

The following list contains different videos, posts and StackOverflow posts that I found useful when learning and working with RSelenium.

The ultimate online collection toolbox: Combining RSelenium and Rvest (Part I and Part II). If you know about rvest and just want to learn about RSelenium, I’d recommend watching Part II. It gives an overview of what you can do when combining RSelenium and rvest, and it has nice and practical examples. As a final comment regarding these videos, I wouldn’t pay too much attention to setting up Docker, because I didn’t need to work that way to get RSelenium going. In fact, at least now, getting it going is pretty straightforward.
RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium. I found this post really useful when trying to set up RSelenium. The solution given in this StackOverflow post, which is mentioned in the article, seems to be enough.
Dungeons and Dragons Web Scraping with rvest and RSelenium. This is a great post! It starts with a general tutorial for scraping with rvest and then dives into RSelenium. If you are not familiar with rvest, you can start here.
RSelenium Tutorial. This post might be helpful too.
RSelenium Package Website. It has more advanced and detailed content. I only took a look at the Basics Vignette.
These StackOverflow posts helped me when working with dropdown lists:
RSelenium: server signals port is already in use. This post gives a solution to the “port already in use” problem. Even though it is not marked as best, the last line of code of the second answer is useful.
Data Scraping in R. Thanks to this post I found the Premier League Stats website, which was exactly what I was looking for to write a post about RSelenium. Also, I took some hints from the answer marked as best.
CSS Tutorials: