Servage Magazine


Collect data from other websites

Tuesday, June 9th, 2015 by Servage

Sometimes web developers need to fetch data from other sources for their own databases or websites. This can be as simple as including some external content on your own page, or as complex as feeding data into your own algorithms that perform advanced calculations. In practical terms, the simple scenario is collecting content from other pages, while the complex one is systematically crawling the web for relevant information. News aggregators, for example, continuously collect and display content from other pages in their own feeds. The intention is not to steal external content, but to provide an overview of current news and link users back to the original content. Another example is travel websites that show prices for various hotels or airlines. They continuously collect data from many sources, so their users can always be redirected to the cheapest offer at any time.

The number of examples is endless. There are many websites today which provide an aggregation service based on other sites’ data and content. As long as you adhere to the legal principles required to use such data legitimately, there is great potential in the valuable information you can gather.

Structured vs. unstructured data

Many websites nowadays actually want you to use their data, and they even provide APIs to make the process of getting (and even submitting) data seamless. However, there are also more reserved content providers who may not officially want others to collect their information. Fortunately, most of the time content providers are interested in getting their information out there, because it leads users back to them somehow; they simply never got around to making an API. Data collected from regular websites is therefore more or less unstructured. There may be a structure on the website, but there is rarely a good and guaranteed stable way of continuously collecting information. This is where the fun begins for web scrapers, and where good content grabbers are separated from poor ones. If you manage to gather unstructured data systematically from crucial sources, you may give yourself a very valuable competitive advantage.
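To illustrate the structured case, here is a minimal sketch of reading data from an API with plain PHP. The URL and the field name are just placeholders for whatever service you are talking to.

// Minimal sketch: consuming structured data from a hypothetical JSON API.
// The URL and the "title" field are placeholders, not a real service.
$json = file_get_contents('https://api.example.com/articles');

if ($json !== false) {
    $articles = json_decode($json, true);
    foreach ($articles as $article) {
        echo $article['title'] . "\n";
    }
}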

Collect data from regular websites

The key to collecting data successfully from HTML pages is to build a stable scraping system which can gather the information reliably. There are many caveats to be aware of, the largest one being that the HTML structure changes on a regular basis, i.e. the target website changes its underlying structure. This problem requires HTML scrapers to be monitored continuously for such changes.
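One simple safeguard, sketched below with PHP's built-in DOM extension, is to check on every run whether the elements your scraper expects are still found, and raise a warning when they are not. The URL and the expected h1 selector are placeholders for your own target.

// Minimal sketch: detect when a target page's structure has changed.
// The URL and the expected h1 selector are illustrative placeholders.
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://www.example.com/'));

$xpath = new DOMXPath($doc);
$headlines = $xpath->query('//h1');

if ($headlines->length === 0) {
    // The expected elements are gone - the page layout probably changed.
    error_log('Scraper warning: no h1 elements found on example.com');
}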

Using SimpleDOM with PHP

There is a library which provides the underlying functionality to extract information from HTML. It is called SimpleDOM and works with PHP. The library enables you to write CSS selectors to grab content out of HTML sources. It is easy to use and offers a range of methods to aid the process of scraping websites. Consider the sample code below.

// Include the library first (the file name may vary with your installation)
include_once 'simple_html_dom.php';

// Load HTML from a URL
$html = file_get_html('http://www.google.com/');

// Show the inner text of all h1 headlines
foreach ($html->find('h1') as $headline) {
    echo $headline->innertext . "\n";
}

This example shows how easily you can load an HTML source from a target URL and process it. In this case all h1 tags are selected and their inner text displayed. Refer to the SimpleDOM documentation for further information on all available functions.
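Selectors are not limited to single tag names. As a further sketch, assuming the library is already loaded as above, the following grabs link texts and URLs from elements with a certain class; the target URL and the class name are made up for illustration.

// Further sketch with the same library. The URL and the "teaser" class
// are illustrative placeholders.
$html = file_get_html('http://www.example.com/news');

// Select all links inside elements carrying the class "teaser"
foreach ($html->find('div.teaser a') as $link) {
    echo $link->plaintext . ' -> ' . $link->href . "\n";
}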

Categories: Business

