Web Scraping using Selenium in Ruby | Inkoop Blog

We talk about how to use the selenium-webdriver gem to scrape websites and get data, using ChromeDriver.

Posted by Ameena on 05 Jan 2017
Not every website offers an API or another mechanism to access its data programmatically. In those cases, web scraping may be the only way to extract the information from the website.

There are different tools available for scraping information from a website, and one of them is selenium-webdriver. The rest of this post deals exclusively with Selenium.

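Before doing anything, make sure the selenium-webdriver gem is installed. Install it directly with gem install or, if the project uses Bundler, add the gem to the Gemfile and run bundle install.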

gem install selenium-webdriver
bundle install
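
If the project is managed with Bundler, the gem would be declared in the Gemfile before running bundle install. A minimal sketch, assuming a fresh project with no other dependencies (no version constraint is given here; pin one if your project needs it):

```ruby
# Gemfile (hypothetical minimal example)
source "https://rubygems.org"

gem "selenium-webdriver"
```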

Let’s Get to Scraping Now...

You should be familiar with at least the basic HTML tags in order to scrape basic information from a website. Once you know the basics, you are good to go.

The first step is to start a webdriver. Selenium supports Mozilla Firefox by default; if you want to run the webdriver in Chrome instead, it takes two steps:

Download the latest version of the ChromeDriver server, then copy the chromedriver executable into a bin directory that is on your PATH so that Selenium can find it and drive Chrome.
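
A quick way to confirm that the second step worked is to check whether chromedriver can be found on the PATH. The following is a small sketch using only the Ruby standard library; find_executable_on_path is a hypothetical helper name, and it assumes the executable is named exactly chromedriver (on Windows the file would be chromedriver.exe).

```ruby
# Returns the full path of the first executable called `name` found in the
# directories listed in `path_env`, or nil when there is none.
# (Hypothetical helper, not part of selenium-webdriver.)
def find_executable_on_path(name, path_env = ENV.fetch("PATH", ""))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

path = find_executable_on_path("chromedriver")
puts(path ? "chromedriver found at #{path}" : "chromedriver is not on PATH")
```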

Scraping with Selenium is mainly about retrieving the page and locating the UI elements that hold the content you want.

# scraping.rb

require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome

Navigate the driver to the URL of the page that you need to scrape. I will be scraping the Wikipedia page for Yukihiro Matsumoto and will concentrate on fetching the person's name, birthplace and image URL.

# scraping.rb

require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"

Now that we have loaded the page, we can start locating elements. Before that, define an explicit wait with a timeout of 20 seconds: the wait keeps retrying its block for up to 20 seconds and raises a timeout error (Selenium::WebDriver::Error::TimeoutError in current versions of the gem) if the block has not succeeded by then.

To locate elements on the page, use the find_element method, which returns a single WebElement, or the find_elements method, which returns a list of WebElements.

# scraping.rb

require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
wait = Selenium::WebDriver::Wait.new(:timeout => 20)
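
To make the idea of an explicit wait more concrete, here is a simplified pure-Ruby sketch of the polling it performs. It is an illustration only, not the gem's implementation: poll_until and WaitTimeout are hypothetical names, and the real Selenium::WebDriver::Wait additionally ignores element-not-found errors by default while it polls.

```ruby
# Simplified illustration of explicit-wait polling (hypothetical names).
class WaitTimeout < StandardError; end

# Calls the block repeatedly until it returns a truthy value, sleeping
# `interval` seconds between attempts. Raises WaitTimeout once `timeout`
# seconds have passed without success.
def poll_until(timeout:, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise WaitTimeout, "timed out after #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for a page element that only becomes available on the third check.
tries = 0
value = poll_until(timeout: 2, interval: 0.05) do
  tries += 1
  "loaded" if tries == 3
end
puts value
# loaded
```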

To get the name, wait for the element with the class firstHeading, then call the text method on the returned element. This gives just its text, Yukihiro Matsumoto.

# scraping.rb

require "selenium-webdriver"
driver = Selenium::WebDriver.for :chrome
driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
wait = Selenium::WebDriver::Wait.new(:timeout => 20)
name = wait.until {
  # The block's return value (the heading element) becomes the value of wait.until.
  driver.find_element(:class, "firstHeading")
}
puts name.text
# Yukihiro Matsumoto
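
The snippet above uses find_element, which returns one element and raises an error when nothing matches. As a contrast, find_elements returns an array (empty when nothing matches), which suits repeated items such as section headings. A small sketch, where texts_for is a hypothetical helper and the "h2" selector in the usage comment is only an assumed example:

```ruby
# Collects the non-empty visible text of every element matching a CSS selector.
# `driver` is expected to be the Selenium driver from scraping.rb; any object
# that responds to find_elements(:css, selector) works.
# (Hypothetical helper, not part of selenium-webdriver.)
def texts_for(driver, css)
  driver.find_elements(:css, css).map { |element| element.text.strip }.reject(&:empty?)
end

# Usage with the driver created in scraping.rb:
#   puts texts_for(driver, "h2")
```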

To get the birthplace, first locate the infobox, then find the element with the class birthplace inside it. Calling the text method on the result, born, returns the birthplace.

# scraping.rb

...
born = wait.until {
  # Scope the search to the infobox, then look for the birthplace inside it.
  infobox = driver.find_element(:css, ".infobox.biography.vcard")
  infobox.find_element(:class, "birthplace")
}
puts born.text
# Osaka Prefecture, Japan
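
Not every page has every field, and find_element raises an error when an element is missing. One possible way to treat a field as optional is to use find_elements and take the first match, if any. This is a sketch under that assumption; optional_text is a hypothetical helper name, and the combined selector in the usage comment assumes the same class names as the snippet above.

```ruby
# Returns the text of the first element matching `css` inside `scope`,
# or nil when nothing matches. `scope` can be the driver or an element,
# since both respond to find_elements.
# (Hypothetical helper, not part of selenium-webdriver.)
def optional_text(scope, css)
  element = scope.find_elements(:css, css).first
  element&.text
end

# Usage with the driver created in scraping.rb:
#   born = optional_text(driver, ".infobox.biography.vcard .birthplace")
#   puts born || "birthplace not listed"
```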

The final step is to get the image URL. Look for the element with the class name image and call the attribute method with href as the argument; this returns the URL that the image links to.

# scraping.rb

...
image_url = wait.until {
  # href is the link target, which here is the image's file page.
  driver.find_element(:class, "image").attribute("href")
}
puts image_url
# https://en.wikipedia.org/wiki/File:Yukihiro_Matsumoto.JPG
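
Note that href is the address of the file page, not of the picture itself. If the image file is what you need, one option is to also read the src attribute of the img tag inside the link. A sketch, assuming the link wraps an img element; image_details is a hypothetical helper, and the "a.image" selector assumes the markup used in this post, which may differ on the live page:

```ruby
# Returns both the link target and the image source for a linked image.
# `scope` is the driver (or any element) and `link_css` selects the <a> tag
# that wraps the <img>.
# (Hypothetical helper, not part of selenium-webdriver.)
def image_details(scope, link_css)
  link = scope.find_element(:css, link_css)
  {
    page_url: link.attribute("href"),
    image_src: link.find_element(:tag_name, "img").attribute("src")
  }
end

# Usage with the driver and wait created in scraping.rb:
#   details = wait.until { image_details(driver, "a.image") }
#   puts details[:image_src]
```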

After everything is scraped, close the driver.

# scraping.rb
...
puts image_url
# https://en.wikipedia.org/wiki/File:Yukihiro_Matsumoto.JPG
driver.quit
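
If any step raises an error (a timeout, for example), the script stops before reaching driver.quit and the browser window stays open. One way to guard against that is an ensure block. A minimal sketch, where with_driver is a hypothetical helper name:

```ruby
# Yields the driver to the block and always quits it afterwards,
# even when the block raises an error.
# (Hypothetical helper, not part of selenium-webdriver.)
def with_driver(driver)
  yield driver
ensure
  driver.quit
end

# Usage in scraping.rb:
#   with_driver(Selenium::WebDriver.for(:chrome)) do |driver|
#     driver.navigate.to "https://en.wikipedia.org/wiki/Yukihiro_Matsumoto"
#     # ... scraping steps ...
#   end
```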

Enjoy Scraping!!!

Ameena

