Selenium was not initially developed for web scraping – it was built for testing web applications, but it has found its usage in web scraping. In technical terms, Selenium or, more appropriately, Selenium WebDriver is a portable framework for testing web applications. In simple terms, all Selenium does is automate web browsers.
- What this is for: Scraping web pages to collect review data and storing the data into a CSV
- Requirements: Python Anaconda distribution, Basic knowledge of HTML structure and Chrome Inspector tool
- Concepts covered: Selenium, Error exception handling
- Download the entire Python file
In an earlier blog post, I wrote a brief tutorial on web scraping with BeautifulSoup. This is a great tool but has some limitations, particularly if you need to scrape a page with content loaded via AJAX.
Enter Selenium. This is a Python library that is capable of scraping AJAX generated content. Before we continue, it is important to note that Selenium is technically a testing tool, not a scraper.
That said, Selenium is simple to use and can get the job done. In this tutorial, we'll set up code similar to what you would need to scrape review data from a website and store it in a CSV file.
Install Selenium library
First, we’ll install the Selenium library in Anaconda.
Click on your Start menu and search for Anaconda Prompt. Open a new Anaconda Prompt window.
Change the directory to where you have Anaconda installed. For example:
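Assuming a default per-user install (swap in your own username and install path):

```
cd C:\Users\<your-username>\Anaconda3
```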
Next, type the install command.
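Assuming you use pip (which ships with Anaconda), that's simply:

```
pip install selenium
```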
It will take a moment to load and ask for consent to install. Once installed, open Anaconda Navigator and go to the Environments tab. Search the installed packages to make sure Selenium is there.
We’ll also need to install Chromedriver for the code to work. This essentially lets the code take control of a Chrome browser window.
Chromedriver is available for download here. Extract the ZIP file and save the .EXE somewhere on your computer.
Getting started in Python
First we’ll import our libraries and establish our CSV and Pandas dataframe.
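A minimal sketch of that setup. The column names and the CSV file name used here are placeholders for illustration, not from the original post:

```python
import pandas as pd
from selenium import webdriver

# Columns we plan to collect for each business page (placeholder names)
columns = ['name', 'type', 'stars', 'reviews']
df = pd.DataFrame(columns=columns)
```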
Next we’ll define the URLs we want to scrape as an array. We’ll also define the location of our web driver EXE file.
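For example, with placeholder URLs and an assumed driver location:

```python
# Pages to scrape (placeholder URLs) and the path to the ChromeDriver executable
urls = [
    'https://www.example.com/business-1',
    'https://www.example.com/business-2',
]
driver_path = 'C:/webdrivers/chromedriver.exe'
```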
Because we’re scraping multiple pages, we’ll create a for loop to repeat our data gathering steps for each site.
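A sketch of that loop, using the Selenium 3-style executable_path argument to launch Chrome:

```python
# Launch Chrome once, then visit each URL in turn
driver = webdriver.Chrome(executable_path=driver_path)

for url in urls:
    driver.get(url)
    # the scraping steps below run once per page
```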
Selenium has the ability to grab elements by their ID, class, tag, or other properties. To find the ID, class, tag or other property you want to scrape, right click within Chrome browser and select Inspect (or you can press F12 to open the Inspector window).
In this case we’ll start with collecting the H1 data. This is simple with the find_element_by_tag_name method.
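For instance, assuming the business name sits in the page's H1:

```python
# Grab the page's H1 text, e.g. the business name
business_name = driver.find_element_by_tag_name('h1').text
```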
Cleaning strings
Next, we’ll collect the type of business. For this example, the site I was scraping needed this data cleaned a little bit because of how the data was stored. You may run into a similar situation, so let’s do some basic text cleaning.
When I looked at the section markup with Chrome Inspector, the business types sat inside a single element whose text started with a “Categories:” label, with each type on its own line.
In order to send clean data to the CSV, we’ll need to remove the “Categories:” text and replace line breaks with a pipe character to store data like this: “Type1|Type2”. This is how we can accomplish that:
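The sketch below assumes the categories live in an element with the class name categories; use whatever class or ID the Inspector actually shows on your page.

```python
# Raw text looks roughly like: "Categories:\nType1\nType2"
raw = driver.find_element_by_class_name('categories').text
business_type = raw.replace('Categories:', '').strip()
business_type = business_type.replace('\n', '|')  # -> "Type1|Type2"
```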
Scraping other elements
For the other elements, we’ll use Selenium’s other methods to capture by class.
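For example, grabbing the star rating and review count by class name (both class names are assumptions for illustration):

```python
stars = driver.find_element_by_class_name('starrating').text
review_count = driver.find_element_by_class_name('reviewcount').text
```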
Now, let’s piece all the data together and add it to our dataframe. Using the variables we created, we’ll populate a new row in the dataframe.
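One simple way to do that is to assign a new row by index:

```python
# Append this page's values as a new row at the end of the dataframe
df.loc[len(df)] = [business_name, business_type, stars, review_count]
```

Once the loop has finished, df.to_csv('business_data.csv', index=False) writes the collected rows out to the CSV (the file name here is a placeholder).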
Handling errors
One error you may encounter is if data is missing. For example, if a business doesn’t have any reviews or comments, the site may not render the div that contains this info on the page.
If you attempt to scrape a div that doesn’t exist, you’ll get an error. But Python lets you handle errors with the try block.
So let’s assume our business may not have a star rating. In the try: block we’ll write the code for what to do if the “starrating” class exists. In the except: block, we’ll write code for what to do if the try: block returns an error.
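A sketch of that pattern, catching Selenium's NoSuchElementException when the class isn't on the page:

```python
from selenium.common.exceptions import NoSuchElementException

try:
    stars = driver.find_element_by_class_name('starrating').text
except NoSuchElementException:
    # No star rating on this page: record a dash rather than a misleading 0
    stars = '-'
```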
A word of caution: If you are planning to do statistical analysis of the data, be careful how you replace error data in the “except” block. For example, if your code cannot find the number of stars, entering this data as “0” will skew any analysis, because there is a difference between having a 0-star rating and not having a star rating. So for this example, data that returns an error will produce a “-” in the dataframe and CSV file instead of a 0.
Beginner's guide to web scraping with Python's Selenium
In the first part of this series, we introduced ourselves to the concept of web scraping using two Python libraries, requests and BeautifulSoup, and stored the results in a JSON file. In this walkthrough, we'll tackle web scraping with a slightly different approach using the Selenium Python library. We'll then store the results in a CSV file using the pandas library.
The code used in this example is on GitHub.
Why use Selenium
Selenium is a framework designed to automate tests for web applications. You can write a Python script to control browser interactions automatically, such as link clicks and form submissions. In addition to all this, Selenium comes in handy when we want to scrape data from JavaScript-generated content on a webpage – that is, when the data only shows up after a number of AJAX requests. Nonetheless, both BeautifulSoup and Scrapy are perfectly capable of extracting data from a webpage. The choice of library boils down to how the data in that particular webpage is rendered.
Another problem one might encounter while web scraping is the possibility of your IP address being blacklisted. I partnered with Scraper API, a startup specializing in strategies that'll ease the worry of your IP address being blocked while web scraping. They utilize IP rotation so you can avoid detection, boasting over 20 million IP addresses and unlimited bandwidth.
In addition to this, they provide CAPTCHA handling for you as well as a headless browser so that you'll appear to be a real user and not get detected as a web scraper. For more on its usage, check out my post on web scraping with Scrapy, although you can use it with both BeautifulSoup and Selenium.
If you want more info as well as an intro to the Scrapy library, check out my post on the topic.
Using this Scraper API link and the code lewis10, you'll get a 10% discount off your first purchase!
For additional resources to understand the Selenium library and best practices, check out this article by Towards Data Science and this one by AccordBox.
Setting up
We'll be using two Python libraries, selenium and pandas. To install them, simply run pip install selenium pandas
In addition to this, you'll need a browser driver to simulate browser sessions. Since I am on Chrome, we'll be using that for the walkthrough.
Driver downloads
- Chrome.
Getting started
For this example, we'll be extracting data from Quotes to Scrape, a site specifically made to practise web scraping on. We'll then extract all the quotes and their authors and store them in a CSV file.
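Here's the setup, using the Selenium 3-style constructor that takes the driver path directly; the path is a placeholder, so adjust it to wherever you saved ChromeDriver:

```python
from selenium.webdriver import Chrome
import pandas as pd

webdriver = "C:/webdrivers/chromedriver.exe"  # path to the ChromeDriver executable
driver = Chrome(webdriver)
```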
The code above imports the Chrome driver and pandas libraries. We then make an instance of Chrome using driver = Chrome(webdriver).
Note that the webdriver variable will point to the driver executable we downloaded previously for our browser of choice. If you happen to prefer Firefox, the import changes accordingly, as shown below.
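A sketch of the Firefox equivalent (the geckodriver path is a placeholder):

```python
from selenium.webdriver import Firefox

driver = Firefox(executable_path="C:/webdrivers/geckodriver.exe")
```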
Main script
On close inspection of the site's URL, we'll notice that the pagination URL is http://quotes.toscrape.com/js/page/{{current_page_number}}/
where the last part is the current page number. Armed with this information, we can proceed to make a page variable to store the exact number of web pages to scrape data from. In this instance, we'll be extracting data from just 10 web pages in an iterative manner.
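One way to write that loop, assuming a pages variable holds the page count:

```python
pages = 10

for page in range(1, pages + 1):
    url = 'http://quotes.toscrape.com/js/page/' + str(page) + '/'
    driver.get(url)
    # extraction for this page happens here (shown below)
```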
The driver.get(url) command makes an HTTP GET request to our desired webpage. From here, it's important to know the exact number of items to extract from the webpage. From our previous walkthrough, we defined web scraping as
This is the process of extracting information from a webpage by taking advantage of patterns in the web page's underlying code.
We can use web scraping to gather unstructured data from the internet, process it and store it in a structured format.
On inspecting each quote element, we observe that each quote is enclosed within a div with the class name of quote. By running the directive driver.find_elements_by_class_name('quote')
we get a list of all elements within the page exhibiting this pattern.
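In code, that's a single call once the page has loaded:

```python
quotes = driver.find_elements_by_class_name('quote')
```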
Final step
To begin extracting the information from the webpages, we'll take advantage of the aforementioned patterns in the web pages' underlying code.
We'll start by iterating over the quote elements; this allows us to go over each quote and extract a specific record. Inspecting the markup, we notice that the quote text is enclosed within a span with a class of text and the author within a small tag with a class name of author.
Finally, we store the quote_text and author variables in a tuple, which we proceed to append to the Python list named total.
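Putting those steps together inside the page loop:

```python
total = []  # created once, before the loop over pages

for quote in quotes:
    quote_text = quote.find_element_by_class_name('text').text
    author = quote.find_element_by_class_name('author').text
    total.append((quote_text, author))
```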
Using the pandas library, we'll initialize a dataframe to store all the records (the total list) and specify the column names as quote and author. Finally, we export the dataframe to a CSV file, which we named quoted.csv in this case.
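The last few lines then look like this:

```python
df = pd.DataFrame(total, columns=['quote', 'author'])
df.to_csv('quoted.csv', index=False)
```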
Don't forget to close the Chrome driver using driver.close().
Additional resources
1. Finding elements
You'll notice that I used the find_elements_by_class_name method in this walkthrough. This is not the only way to find elements. This tutorial by Klaus explains in detail how to use other selectors.
2. Video
If you prefer to learn using videos, this series by Lucid Programming was very useful to me: https://www.youtube.com/watch?v=zjo9yFHoUl8
3. Best practices while using Selenium
4. Toptal's guide to modern web scraping with Selenium
And with that, hopefully, you too can make a simple web scraper using selenium 😎.
If you enjoyed this post subscribe to my newsletter to get notified whenever I write new posts.
Open to collaboration
I recently made a collaborations page on my website. Have an interesting project in mind or want to fill a part-time role? You can now book a session with me directly from my site.
Thanks.