Servage Magazine

How to scrape content from the web

Wednesday, February 1st, 2017 by Servage

Web scraping is a technique for extracting data from a web page. The data can be text, images, hyperlinks or anything else found on websites. Today we will look at what web scraping is and in which situations it is useful. We will also meet Goutte, a PHP web scraping library.

How Web Scraping Works

Web scraping starts by sending a GET request to a URL, just as a browser does when you visit a website. The HTML response is then saved so that it can be parsed and the data extracted from it.
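The steps above can be sketched in plain PHP. This is only a minimal illustration using PHP's built-in DOM extension, not a full scraper; the function names fetchHtml and extractLinks and the sample HTML are made up for this example.

```php
<?php
declare(strict_types=1);

/**
 * Fetch the raw HTML of a page with a plain GET request.
 * Assumes allow_url_fopen is enabled in php.ini.
 */
function fetchHtml(string $url): string
{
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $html;
}

/**
 * Parse an HTML string and return the href of every link in it.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Inline sample HTML so the parsing step can be tried without network access.
$sample = '<html><body><a href="/docs">Docs</a><a href="/download">Download</a><a>no href</a></body></html>';
print_r(extractLinks($sample));
```

To run it against a live page, pass the result of fetchHtml('http://www.php.net') to extractLinks instead of the sample string.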

When to do Scraping

Ideally, the website you want to get data from has a public API that you can communicate with. However, not all websites have one. Imagine you were building a price comparison website for online stores. You would very likely want to include a web shop that does not have an API. In fact, many online shops do not have one.

In situations like this, the only way to get data from the store is to load its pages and find the pricing information in the HTML. This is where web scrapers come into play.

Introducing Goutte

Goutte is a popular scraping and crawling tool written in PHP. It allows you to send a GET request, retrieve the response and work with the received data. Data can be extracted using CSS selectors, DOM elements and XPath.

Sending a GET request is easy, and here is a quick introduction. First, we create a new Goutte client:

$client = new \Goutte\Client();

Now we are ready to send GET requests:

$crawler = $client->request('GET', 'http://www.php.net');

The request returns a Crawler object from Symfony's DomCrawler component. The crawler lets you filter the HTML and extract data from it, for example:

$crawler->filter('.product-1 > .price > span');
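To show what a selector like the one above matches, here is a rough equivalent written as an XPath query. It deliberately uses only PHP's built-in DOMDocument and DOMXPath rather than the crawler itself, so it runs without installing anything; the function name extractPrices and the sample markup are assumptions made for this illustration.

```php
<?php
declare(strict_types=1);

/**
 * Return the text of every <span> that ".product-1 > .price > span"
 * would match, using a hand-written XPath translation of the selector.
 */
function extractPrices(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings quietly; fragments and sloppy HTML are common.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    // XPath test for "element has this class among its class names".
    $hasClass = static fn (string $name): string =>
        "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";

    // "/" selects direct children only, mirroring ">" in the CSS selector.
    $query = '//*[' . $hasClass('product-1') . ']/*[' . $hasClass('price') . ']/span';

    $xpath = new DOMXPath($dom);
    $prices = [];
    foreach ($xpath->query($query) as $node) {
        $prices[] = trim($node->textContent);
    }
    return $prices;
}

// Hypothetical shop markup: only the first product carries the product-1 class.
$sample = '<div class="product-1 featured"><div class="price"><span>19.99</span></div></div>'
        . '<div class="product-2"><div class="price"><span>5.00</span></div></div>';
print_r(extractPrices($sample));
```

An XPath expression like this can also be handed to the crawler's filterXPath method when a CSS selector is not expressive enough.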

To get more familiar with filtering the received data, read through the DomCrawler component documentation on www.symfony.com.

Things to Note

Before using any web scraping technique on a website, make sure it does not violate the site's terms of use. Not every website allows data to be extracted with automated tools. In addition, every request, and especially the download of heavy objects such as images, puts load on the website you are scraping. Keep these things in mind and use web scraping sparingly.
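One simple way to keep the load low is to pause between requests. The sketch below is only an illustration of that idea: the function names and the two-second default interval are arbitrary choices for this example, not a recommendation from any particular website.

```php
<?php
declare(strict_types=1);

/**
 * Seconds to wait before the next request so that consecutive requests
 * are at least $minInterval seconds apart.
 */
function secondsToWait(float $lastRequestAt, float $now, float $minInterval): float
{
    return max(0.0, $minInterval - ($now - $lastRequestAt));
}

/**
 * Call $fetch once per URL, sleeping between calls when needed.
 * $fetch is any callable that takes a URL and returns the page content.
 */
function fetchPolitely(array $urls, callable $fetch, float $minInterval = 2.0): array
{
    $results = [];
    $lastRequestAt = null;
    foreach ($urls as $url) {
        if ($lastRequestAt !== null) {
            $wait = secondsToWait($lastRequestAt, microtime(true), $minInterval);
            if ($wait > 0) {
                // usleep takes microseconds.
                usleep((int) round($wait * 1000000));
            }
        }
        $results[$url] = $fetch($url);
        $lastRequestAt = microtime(true);
    }
    return $results;
}
```

The fetching itself is passed in as a callable, so the same loop works with file_get_contents or with a scraping client.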

Categories: Guides & Tutorials
