Friday, 3 July 2015

ECJ clarifies Database Directive scope in screen scraping case

The Court of Justice of the European Union (ECJ) has ruled on the interpretation of Directive 96/9/EC on the legal protection of databases (Database Directive) in a case concerning the extraction of data from a third party's website by means of automated systems or software for commercial purposes (so-called 'screen scraping').

Flight data extracted

The case, Ryanair Ltd vs. PR Aviation BV, C-30/14, is of interest to a range of companies such as price comparison websites. It stemmed from Dutch company PR Aviation's operation of a website where consumers can search through the flight data of low-cost airlines (including Ryanair), compare prices and, on payment of a commission, book a flight. The relevant flight data is extracted from third parties' websites by means of 'screen scraping'.

Ryanair claimed that PR Aviation’s activity:

• amounted to infringement of copyright (relating to the structure and architecture of the database) and of the so-called sui generis database right (i.e. the right granted to the ‘maker’ of the database where certain investments have been made to obtain, verify, or present the contents of a database) under the Netherlands law implementing the Database Directive;

• constituted breach of contract. In this respect, Ryanair claimed that a contract existed with PR Aviation for the use of its website. Access to the latter requires acceptance, by clicking a box, of the airline's general terms and conditions which, among other things, prohibit unauthorized 'screen scraping' practices for commercial purposes.

Ryanair asked Dutch courts to prohibit the infringement and order damages. In recent years the company has been engaged in several legal cases against web scrapers across Europe.

The Local Court, Utrecht, and the Court of Appeals of Amsterdam dismissed Ryanair's claims on different grounds. The Court of Appeals held, in particular, that PR Aviation's screen scraping of Ryanair's website amounted to "normal use" of that website within the meaning of the lawful user exceptions under Articles 6 and 8 of the Database Directive, which cannot be derogated from by contract (Article 15).

Ryanair appealed

Ryanair appealed the decision before the Netherlands Supreme Court (Hoge Raad der Nederlanden), which decided to refer the following question to the ECJ for a preliminary ruling: “Does the application of [Directive 96/9] also extend to online databases which are not protected by copyright on the basis of Chapter II of said directive or by a sui generis right on the basis of Chapter III, in the sense that the freedom to use such databases through the (whether or not analogous) application of Article[s] 6(1) and 8, in conjunction with Article 15 [of Directive 96/9] may not be limited contractually?.”

The ECJ’s ruling

The ECJ (ruling without an Advocate General's opinion) held that the Database Directive is not applicable to databases which are protected neither by copyright nor by the sui generis database right. Therefore, the exceptions to restricted acts set out in Articles 6 and 8 of the Directive do not prevent the database owner from establishing contractual limitations on its use by third parties. In other words, the restrictions on freedom of contract set out in the Database Directive do not apply to unprotected databases. Whether Ryanair's website is entitled to copyright or sui generis database right protection must be determined by the competent national court.

The ECJ's decision is not particularly striking from a legal standpoint. Yet it could have a significant impact on the business model of price comparison websites, aggregators, and similar businesses. Owners of databases that cannot rely on intellectual property protection may contractually prevent the extraction and use ("scraping") of content from their online databases. Thus, unprotected databases could end up receiving greater protection than that granted by IP law.

Antitrust implications

However, the lawfulness of contractual restrictions prohibiting access to and reuse of data through screen scraping should also be assessed from an antitrust perspective. In this respect, in 2013 the Court of Milan ruled that Ryanair's refusal to grant the online travel agency Viaggiare S.r.l. access to its database amounted to an abuse of dominant position in the downstream market of information and intermediation on flights (decision of June 4, 2013, Viaggiare S.r.l. vs Ryanair Ltd). Indeed, a balance should be struck between the need to compensate the efforts and investments made by the creator of the database and the interest of third parties in being granted access to information (especially where that information is not entitled to copyright protection).

Additionally, web scraping raises other issues which were not considered in the ECJ's ruling. These include, but are not limited to, trademark law (i.e., whether the use of a company's names/logos by the web scraper without consent may amount to trademark infringement), data protection (e.g., where the scraping involves personal data), and unfair competition.

Source: http://www.globallegalpost.com/blogs/global-view/ecj-clarifies-database-directive-scope-in-screen-scraping-case-128701/

Friday, 26 June 2015

Data Scraping - Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping, a method by which computer programs extract data from the output of another program. Put simply, it is the automated collection of information from different resources, including web pages inside HTML files, PDFs and other documents, followed by the gathering of the pertinent pieces of information into databases or spreadsheets so that users can retrieve them later.

Most websites today have text that can be accessed easily in the source code. However, many businesses now choose to use Adobe PDF (Portable Document Format) files instead. This is a file type that can be viewed with the free Adobe Acrobat software, which almost any operating system supports. Using PDF files has several advantages; among them is that a document looks exactly the same on any computer on which it is viewed, which makes the format ideal for business documents and specification sheets. There are disadvantages as well. One is that the text contained in the file is effectively converted into an image, in which case you will often have problems copying and pasting it.

This is why some people start scraping information from PDF files, a process often called PDF scraping: it works just like data scraping, except that the information you collect is contained in PDF files. To begin scraping information from PDF, you must choose a tool specifically designed for this process. However, finding the right tool to perform PDF scraping effectively is not easy, because most of the tools available today have trouble extracting exactly the data you want without customization.
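As a rough illustration of what such a tool does under the hood, the short Python sketch below pulls the text out of a PDF with the pdfminer.six library and then filters it like any other scraped text. The file name and the keyword being searched for are placeholders invented for the example; this is not tied to any particular product mentioned above.

# A minimal sketch of PDF text extraction in Python using pdfminer.six
# (pip install pdfminer.six). File name and keyword are placeholders.
from pdfminer.high_level import extract_text

text = extract_text('specification-sheet.pdf')  # whole document as one string

# Once the text is ordinary Python strings, it can be filtered, split into
# rows, and written out to a spreadsheet or database like any scraped data.
for line in text.splitlines():
    if 'Price' in line:
        print(line)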

Nevertheless, if you search well enough, you will find the program you are looking for. You do not need any knowledge of a programming language to use these tools: you simply specify your preferences and the software does the rest of the work for you. There are also companies you can contact that will perform the task for you, since they have the right tools for the job. Doing things manually is tedious and complicated, whereas professionals can finish the work in very little time. Scraping information from PDF is, in the end, just a process of collecting information that is already available, and done properly it need not infringe copyright laws.

Source: http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Saturday, 20 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted-together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.
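As a minimal sketch of that request-and-parse cycle (written in Python here, whereas the course itself works in R), the snippet below geocodes an address with the Google Maps Geocoding API. The address is arbitrary, the API key is a placeholder, and a valid key is typically required for this endpoint.

import requests

# Build the query string, send the request, and decode the JSON response.
params = {'address': '10 Downing Street, London', 'key': 'YOUR_API_KEY'}
response = requests.get('https://maps.googleapis.com/maps/api/geocode/json',
                        params=params)
data = response.json()   # JSON body decoded into Python dicts and lists

if data.get('results'):
    location = data['results'][0]['geometry']['location']
    print(location['lat'], location['lng'])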

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Tuesday, 9 June 2015

Three Common Methods For Web Data Extraction

Probably the most common technique traditionally used to extract data from web pages is to cook up some regular expressions that match the pieces you want (e.g., URLs and link titles). Our screen-scraper software actually started out as an application written in Perl for this very reason. In addition to regular expressions, you might also use some code written in something like Java or Active Server Pages to parse out larger chunks of text. Using raw regular expressions to pull out the data can be a little intimidating to the uninitiated and can get a bit messy when a script contains a lot of them. At the same time, if you're already familiar with regular expressions and your scraping project is relatively small, they can be a great solution.
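To make the idea concrete, here is a tiny Python illustration of the regular-expression approach. The HTML snippet is invented; real pages are messier, which is exactly the maintenance problem discussed below.

import re

html = '<a href="/news/1">Headline one</a> <a href="/news/2">Headline two</a>'

# Capture the href value and the link text of each anchor tag.
for url, title in re.findall(r'<a href="([^"]+)">([^<]+)</a>', html):
    print(url, title)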

Other techniques for getting the data out can get very sophisticated as algorithms that make use of artificial intelligence and such are applied to the page. Some programs will actually analyze the semantic content of an HTML page, then intelligently pull out the pieces that are of interest. Still other approaches deal with developing "ontologies", or hierarchical vocabularies intended to represent the content domain.

There are a number of companies (including our own) that offer commercial applications specifically intended to do screen-scraping. The applications vary quite a bit, but for medium to large-sized projects they're often a good solution. Each one will have its own learning curve, so you should plan on taking time to learn the ins and outs of a new application. Especially if you plan on doing a fair amount of screen-scraping it's probably a good idea to at least shop around for a screen-scraping application, as it will likely save you time and money in the long run.

So what's the best approach to data extraction? It really depends on what your needs are, and what resources you have at your disposal. Here are some of the pros and cons of the various approaches, as well as suggestions on when you might use each one:

Raw regular expressions and code

Advantages:

- If you're already familiar with regular expressions and at least one programming language, this can be a quick solution.

- Regular expressions allow for a fair amount of "fuzziness" in the matching such that minor changes to the content won't break them.

- You likely don't need to learn any new languages or tools (again, assuming you're already familiar with regular expressions and a programming language).

- Regular expressions are supported in almost all modern programming languages. Heck, even VBScript has a regular expression engine. It's also nice because the various regular expression implementations don't vary too significantly in their syntax.

Disadvantages:

- They can be complex for those that don't have a lot of experience with them. Learning regular expressions isn't like going from Perl to Java. It's more like going from Perl to XSLT, where you have to wrap your mind around a completely different way of viewing the problem.

- They're often confusing to analyze. Take a look through some of the regular expressions people have created to match something as simple as an email address and you'll see what I mean.

- If the content you're trying to match changes (e.g., they change the web page by adding a new "font" tag) you'll likely need to update your regular expressions to account for the change.

- The data discovery portion of the process (traversing various web pages to get to the page containing the data you want) will still need to be handled, and can get fairly complex if you need to deal with cookies and such.

When to use this approach: You'll most likely use straight regular expressions in screen-scraping when you have a small job you want to get done quickly. Especially if you already know regular expressions, there's no sense in getting into other tools if all you need to do is pull some news headlines off of a site.

Ontologies and artificial intelligence

Advantages:

- You create it once and it can more or less extract the data from any page within the content domain you're targeting.

- The data model is generally built in. For example, if you're extracting data about cars from web sites the extraction engine already knows what the make, model, and price are, so it can easily map them to existing data structures (e.g., insert the data into the correct locations in your database).

- There is relatively little long-term maintenance required. As web sites change you likely will need to do very little to your extraction engine in order to account for the changes.

Disadvantages:

- It's relatively complex to create and work with such an engine. The level of expertise required to even understand an extraction engine that uses artificial intelligence and ontologies is much higher than what is required to deal with regular expressions.

- These types of engines are expensive to build. There are commercial offerings that will give you the basis for doing this type of data extraction, but you still need to configure them to work with the specific content domain you're targeting.

- You still have to deal with the data discovery portion of the process, which may not fit as well with this approach (meaning you may have to create an entirely separate engine to handle data discovery). Data discovery is the process of crawling web sites such that you arrive at the pages where you want to extract data.

When to use this approach: Typically you'll only get into ontologies and artificial intelligence when you're planning on extracting information from a very large number of sources. It also makes sense to do this when the data you're trying to extract is in a very unstructured format (e.g., newspaper classified ads). In cases where the data is very structured (meaning there are clear labels identifying the various data fields), it may make more sense to go with regular expressions or a screen-scraping application.

Screen-scraping software

Advantages:

- Abstracts most of the complicated stuff away. You can do some pretty sophisticated things in most screen-scraping applications without knowing anything about regular expressions, HTTP, or cookies.

- Dramatically reduces the amount of time required to set up a site to be scraped. Once you learn a particular screen-scraping application the amount of time it requires to scrape sites vs. other methods is significantly lowered.

- Support from a commercial company. If you run into trouble while using a commercial screen-scraping application, chances are there are support forums and help lines where you can get assistance.

Disadvantages:

- The learning curve. Each screen-scraping application has its own way of going about things. This may imply learning a new scripting language in addition to familiarizing yourself with how the core application works.

- A potential cost. Most ready-to-go screen-scraping applications are commercial, so you'll likely be paying in dollars as well as time for this solution.

- A proprietary approach. Any time you use a proprietary application to solve a computing problem (and proprietary is obviously a matter of degree) you're locking yourself into using that approach. This may or may not be a big deal, but you should at least consider how well the application you're using will integrate with other software applications you currently have. For example, once the screen-scraping application has extracted the data how easy is it for you to get to that data from your own code?

When to use this approach: Screen-scraping applications vary widely in their ease-of-use, price, and suitability to tackle a broad range of scenarios. Chances are, though, that if you don't mind paying a bit, you can save yourself a significant amount of time by using one. If you're doing a quick scrape of a single page you can use just about any language with regular expressions. If you want to extract data from hundreds of web sites that are all formatted differently you're probably better off investing in a complex system that uses ontologies and/or artificial intelligence. For just about everything else, though, you may want to consider investing in an application specifically designed for screen-scraping.

As an aside, I thought I should also mention a recent project we've been involved with that has actually required a hybrid approach of two of the aforementioned methods. We're currently working on a project that deals with extracting newspaper classified ads. The data in classifieds is about as unstructured as you can get. For example, in a real estate ad the term "number of bedrooms" can be written about 25 different ways. The data extraction portion of the process is one that lends itself well to an ontologies-based approach, which is what we've done. However, we still had to handle the data discovery portion. We decided to use screen-scraper for that, and it's handling it just great. The basic process is that screen-scraper traverses the various pages of the site, pulling out raw chunks of data that constitute the classified ads. These ads then get passed to code we've written that uses ontologies in order to extract out the individual pieces we're after. Once the data has been extracted we then insert it into a database.

Source: http://ezinearticles.com/?Three-Common-Methods-For-Web-Data-Extraction&id=165416

Tuesday, 2 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game, and through my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007, which are available as a single PDF file. I've scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file that I had made from the original PDF using an online converting service. Looking back at the code, it is fiendishly complicated and cluttered with the boilerplate required to build a GUI. ScraperWiki includes a pdftoxml function, so I thought I'd see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below: each member (or Fellow) record spans two lines, with the member name in the leftmost column on the first line, and information on their birth date and the date they died, the class of their Fellowship, and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an XML file in which each piece of text is located on the page using spatial coordinates. An individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward: you simply need to select elements with particular values of the "left" attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list for A-J with that for K-Z, but this can easily be resolved by accepting a small range of positions for each column.
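A minimal Python sketch of that idea (not the author's actual ScraperWiki code, and with an invented column position) might use lxml to walk the pdftoxml output:

from lxml import etree

tree = etree.parse('royal_society.xml')        # output of pdftoxml

names = []
for text in tree.iter('text'):                 # every positioned text element
    left = int(text.get('left', '-1'))
    if 130 <= left <= 140:                     # accept a small range of positions
        names.append(''.join(text.itertext()).strip())

print(names[:5])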

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another Fellow is classed as a "Felllow", and whilst most of the dates of birth and death are separated by a dash, some are separated by an en dash, which as far as the code is concerned is something completely different, and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn't edit the source document, I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered as 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately, so someone may have a birth date described as "circa 1782" or "c 1782"; even more vaguely, they may be described as having "flourished 1663-1778" or "fl. 1663-1778". Python's default datetime module does not capture this subtlety, and if it did, the database used to store the dates would need to support it too to be useful. I've addressed this by storing the original life-span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than as text strings, means we can query the database using date-based queries.
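One rough way to handle such fuzzy life-span strings (illustrative only, not the scraper's actual date handling) is to keep the original text but also pull out any four-digit years with a regular expression so they can still be queried:

import re

for span in ['circa 1782', 'c 1782', 'fl. 1663-1778', '1810-1882']:
    years = [int(y) for y in re.findall(r'\b(\d{4})\b', span)]
    print('%s -> %s' % (span, years))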

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it's easy to write queries that tell you that the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 who "flourished", and 450 whose birth dates are "circa" some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society:

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Friday, 29 May 2015

Data Scraping Services - Login to Website Programmatically using C# for Web Scraping

In many scenarios the data you want to scrape is available only after login. To reach the page where the data is located, you need to implement code in the web scraper that automatically submits a username/email and password to log in to the website; once login is done, you can crawl and parse as required.

Many third-party web scraping applications provide functionality that lets you specify the login URL and set the login parameters, so that the login step is performed when the scraper starts its work.

Below is a C# example of programmatically logging in to the demo login page:

http://demo.webdata-scraping.com/login.php

Below is the HTML code of the login form:

<form class="form-signin" id="login" method="post" role="form">

            <h3 class="form-signin-heading">Please sign in</h3>

            <a href="#" id="flipToRecover" class="flipLink">

              <div id="triangle-topright"></div>

            </a>

            <input type="email" class="form-control" name="loginEmail" id="loginEmail" placeholder="Email address" required autofocus>

            <input type="password" class="form-control" name="loginPass" id="loginPass" placeholder="Password" required>

            <button class="btn btn-lg btn-primary btn-block" name="login_submit" id="login_submit" type="submit">Sign in</button>

</form>

In this code you can see that the ID of the email input box is id="loginEmail" and the ID of the password input box is id="loginPass".

Using these IDs, we call the GetElementById method of the WebBrowser control to fill in the value of each input box with the following code:

webBrowser1.Document.GetElementById("loginEmail").InnerText =textBox1.Text.ToString();

webBrowser1.Document.GetElementById("loginPass").InnerText = textBox2.Text.ToString();

After filling in the values of the email and password input boxes, we simply call the click event of the submit button, which is labelled Sign in:

webBrowser1.Document.GetElementById("login_submit").InvokeMember("click");

So this is a very basic example of how you can log in to a website programmatically when you need to access data that is only available after login. It is a very simple way of working with the WebBrowser control, but there are other ways as well in which you can do the same thing.

Source: http://webdata-scraping.com/login-website-programmatically-using-c-web-scraping/

Tuesday, 26 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, use screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much more quickly than a human could, which can make it difficult for the servers to handle the load and can degrade the website's performance. Malicious hackers use this tactic in what's known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Monday, 25 May 2015

Screen Scraping with BeautifulSoup and lxml

Please enjoy this — a free Chapter of the Python network programming book that I revised for Apress in 2010!

I completely rewrote this chapter for the book's second edition, to feature two powerful libraries that have appeared since the book first came out. I show how to screen-scrape a real-life web page using both BeautifulSoup and also the powerful lxml library (their web sites are here and here).

I chose this chapter for release because screen scraping is often the first network task that a novice Python programmer tackles. Because this material is oriented towards beginners, it explains the entire process — from fetching web pages, to understanding HTML, to querying for specific elements in the document.

Program listings are available for this chapter in both Python 2 and also in Python 3. Let me know if you have any questions!

Most web sites are designed first and foremost for human eyes. While well-designed sites offer formal APIs by which you can construct Google maps, upload Flickr photos, or browse YouTube videos, many sites offer nothing but HTML pages formatted for humans. If you need a program to be able to fetch its data, then you will need the ability to dive into densely formatted markup and retrieve the information you need—a process known affectionately as screen scraping.

In one's haste to grab information from a web page sitting open in your browser in front of you, it can be easy for even experienced programmers to forget to check whether an API is provided for data that they need. So try to take a few minutes investigating the site in which you are interested to see if some more formal programming interface is offered to their services. Even an RSS feed can sometimes be easier to parse than a list of items on a full web page.

Also be careful to check for a “terms of service” document on each site. YouTube, for example, offers an API and, in return, disallows programs from trying to parse their web pages. Sites usually do this for very important reasons related to performance and usage patterns, so I recommend always obeying the terms of service and simply going elsewhere for your data if they prove too restrictive.

Regardless of whether terms of service exist, always try to be polite when hitting public web sites. Cache pages or data that you will need for several minutes or hours, rather than hitting their site needlessly over and over again. When developing your screen-scraping algorithm, test against a copy of their web page that you save to disk, instead of doing an HTTP round-trip with every test. And always be aware that excessive use can result in your IP being temporarily or permanently blocked from a site if its owners are sensitive to automated sources of load.
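As a small illustration of that advice (this snippet is mine, not the book's, written in the same Python 2 style as the chapter's listings), you can fetch a page once, cache it on disk, and develop your scraper against the local copy:

import os, urllib2

def fetch_cached(url, filename):
    """Return the page body, downloading it only if no cached copy exists."""
    if not os.path.exists(filename):
        open(filename, 'w').write(urllib2.urlopen(url).read())
    return open(filename).read()

page = fetch_cached('http://www.weather.gov/', 'weather-front-page.html')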

Fetching Web Pages

Before you can parse an HTML-formatted web page, you of course have to acquire some. Chapter 9 provides the kind of thorough introduction to the HTTP protocol that can help you figure out how to fetch information even from sites that require passwords or cookies. But, in brief, here are some options for downloading content.

From the Future

If you need a simple way to fetch web pages before scraping them, try Kenneth Reitz's requests library!

The library was not released until after the book was published, but has already taken the Python world by storm. The simplicity and convenience of its API has made it the tool of choice for making web requests from Python.
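As a quick taste (my example, not the book's), the same form submission performed in Listing 10-1 below takes only a couple of lines with requests:

import requests

# POST the city/state to the NWS form handler and save the resulting page.
response = requests.post('http://forecast.weather.gov/zipcity.php',
                         data={'inputstring': 'Phoenix, AZ'})
open('phoenix.html', 'wb').write(response.content)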

    You can use urllib2, or the even lower-level httplib, to construct an HTTP request that will return a web page. For each form that has to be filled out, you will have to build a dictionary representing the field names and data values inside; unlike a real web browser, these libraries will give you no help in submitting forms.

    You can instead install mechanize and write a program that fills out and submits web forms much as you would do when sitting in front of a web browser. The downside is that, to benefit from this automation, you will need to download the page containing the form HTML before you can then submit it—possibly doubling the number of web requests you perform!

    If you need to download and parse entire web sites, take a look at the Scrapy project, hosted at scrapy.org, which provides a framework for implementing your own web spiders. With the tools it provides, you can write programs that follow links to every page on a web site, tabulating the data you want extracted from each page.

    When web pages wind up being incomplete because they use dynamic JavaScript to load data that you need, you can use the QtWebKit module of the PyQt4 library to load a page, let the JavaScript run, and then save or parse the resulting complete HTML page.

    Finally, if you really need a browser to load the site, both the Selenium and Windmill test platforms provide a way to drive a standard web browser from inside a Python program. You can start the browser up, direct it to the page of interest, fill out and submit forms, do whatever else is necessary to bring up the data you need, and then pull the resulting information directly from the DOM elements that hold them.

These last two options both require third-party components or Python modules that are built against large libraries, and so we will not cover them here, in favor of techniques that require only pure Python.

For our examples in this chapter, we will use the site of the United States National Weather Service, which lives at www.weather.gov.

Among the better features of the United States government is its having long ago decreed that all publications produced by their agencies are public domain. This means, happily, that I can pull all sorts of data from their web site and not worry about the fact that copies of the data are working their way into this book.

Of course, web sites change, so the online source code for this chapter includes the downloaded web page on which the scripts in this chapter are designed to work. That way, even if their site undergoes a major redesign, you will still be able to try out the code examples in the future. And, anyway—as I recommended previously—you should be kind to web sites by always developing your scraping code against a downloaded copy of a web page to help reduce their load.

Downloading Pages Through Form Submission

The task of grabbing information from a web site usually starts by reading it carefully with a web browser and finding a route to the information you need. Figure 10–1 shows the site of the National Weather Service; for our first example, we will write a program that takes a city and state as arguments and prints out the current conditions, temperature, and humidity. If you will explore the site a bit, you will find that city-specific forecasts can be visited by typing the city name into the small “Local forecast” form in the left margin.

Figure 10–1. The National Weather Service web site

When using the urllib2 module from the Standard Library, you will have to read the web page HTML manually to find the form. You can use the View Source command in your browser, search for the words “Local forecast,” and find the following form in the middle of the sea of HTML:

<form method="post" action="http://forecast.weather.gov/zipcity.php" ...>

  ...

  <input type="text" id="zipcity" name="inputstring" size="9"

    value="City, St" onfocus="this.value='';" />

  <input type="submit" name="Go2" value="Go" />

</form>

The only important elements here are the <form> itself and the <input> fields inside; everything else is just decoration intended to help human readers.

This form does a POST to a particular URL with, it appears, just one parameter: an inputstring giving the city name and state. Listing 10–1 shows a simple Python program that uses only the Standard Library to perform this interaction, and saves the result to phoenix.html.

Listing 10–1. Submitting a Form with “urllib2”

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_urllib2.py

# Submitting a form and retrieving a page with urllib2

import urllib, urllib2

data = urllib.urlencode({'inputstring': 'Phoenix, AZ'})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

open('phoenix.html', 'w').write(content)

On the one hand, urllib2 makes this interaction very convenient; we are able to download a forecast page using only a few lines of code. But, on the other hand, we had to read and understand the form ourselves instead of relying on an actual HTML parser to read it. The approach encouraged by mechanize is quite different: you need only the address of the opening page to get started, and the library itself will take responsibility for exploring the HTML and letting you know what forms are present. Here are the forms that it finds on this particular page:

>>> import mechanize

>>> br = mechanize.Browser()

>>> response = br.open('http://www.weather.gov/')

>>> for form in br.forms():

...     print '%r %r %s' % (form.name, form.attrs.get('id'), form.action)

...     for control in form.controls:

...         print '   ', control.type, control.name, repr(control.value)

None None http://search.usa.gov/search

    hidden v:project 'firstgov'

    text query ''

    radio affiliate ['nws.noaa.gov']

    submit None 'Go'

None None http://forecast.weather.gov/zipcity.php

    text inputstring 'City, St'

    submit Go2 'Go'

'jump' 'jump' http://www.weather.gov/

    select menu ['http://www.weather.gov/alerts-beta/']

    button None None

Here, mechanize has helped us avoid reading any HTML at all. Of course, pages with very obscure form names and fields might make it very difficult to look at a list of forms like this and decide which is the form we see on the page that we want to submit; in those cases, inspecting the HTML ourselves can be helpful, or—if you use Google Chrome, or Firefox with Firebug installed—right-clicking the form and selecting “Inspect Element” to jump right to its element in the document tree.

Once we have determined that we need the zipcity.php form, we can write a program like that shown in Listing 10–2. You can see that at no point does it build a set of form fields manually itself, as was necessary in our previous listing. Instead, it simply loads the front page, sets the one field value that we care about, and then presses the form's submit button. Note that since this HTML form did not specify a name, we had to create our own filter function—the lambda function in the listing—to choose which of the three forms we wanted.

Listing 10–2. Submitting a Form with mechanize

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_mechanize.py

# Submitting a form and retrieving a page with mechanize

import mechanize

br = mechanize.Browser()

br.open('http://www.weather.gov/')

br.select_form(predicate=lambda(form): 'zipcity' in form.action)

br['inputstring'] = 'Phoenix, AZ'

response = br.submit()

content = response.read()


open('phoenix.html', 'w').write(content)

Many mechanize users instead choose to select forms by the order in which they appear in the page—in which case we could have called select_form(nr=1). But I prefer not to rely on the order, since the real identity of a form is inherent in the action that it performs, not its location on a page.

You will see immediately the problem with using mechanize for this kind of simple task: whereas Listing 10–1 was able to fetch the page we wanted with a single HTTP request, Listing 10–2 requires two round-trips to the web site to do the same task. For this reason, I avoid using mechanize for simple form submission. Instead, I keep it in reserve for the task at which it really shines: logging on to web sites like banks, which set cookies when you first arrive at their front page and require those cookies to be present as you log in and browse your accounts. Since these web sessions require a visit to the front page anyway, no extra round-trips are incurred by using mechanize.

The Structure of Web Pages

There is a veritable glut of online guides and published books on the subject of HTML, but a few notes about the format would seem to be appropriate here for users who might be encountering the format for the first time.

The Hypertext Markup Language (HTML) is one of many markup dialects built atop the Standard Generalized Markup Language (SGML), which bequeathed to the world the idea of using thousands of angle brackets to mark up plain text. Inserting bold and italics into a format like HTML is as simple as typing eight angle brackets:

The <b>very</b> strange book <i>Tristram Shandy</i>.

In the terminology of SGML, the strings <b> and </b> are each tags—they are, in fact, an opening and a closing tag—and together they create an element that contains the text very inside it. Elements can contain text as well as other elements, and can define a series of key/value attribute pairs that give more information about the element:

<p content="personal">I am reading <i document="play">Hamlet</i>.</p>

There is a whole subfamily of markup languages based on the simpler Extensible Markup Language (XML), which takes SGML and removes most of its special cases and features to produce documents that can be generated and parsed without knowing their structure ahead of time. The problem with SGML languages in this regard—and HTML is one particular example—is that they expect parsers to know the rules about which elements can be nested inside which other elements, and this leads to constructions like this unordered list <ul>, inside which are several list items <li>:

<ul><li>First<li>Second<li>Third<li>Fourth</ul>

At first this might look like a series of <li> elements that are more and more deeply nested, so that the final word here is four list elements deep. But since HTML in fact says that <li> elements cannot nest, an HTML parser will understand the foregoing snippet to be equivalent to this more explicit XML string:

<ul><li>First</li><li>Second</li><li>Third</li><li>Fourth</li></ul>
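A quick check (not part of the original chapter text) confirms that an HTML parser really does treat the terse markup as four sibling <li> elements rather than nested ones; here it is with lxml's HTML parser:

import lxml.html

ul = lxml.html.fromstring('<ul><li>First<li>Second<li>Third<li>Fourth</ul>')
print([li.text for li in ul])              # ['First', 'Second', 'Third', 'Fourth']
print([li.getparent().tag for li in ul])   # every item's parent is the <ul>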

And beyond this implicit understanding of HTML that a parser must possess are the twin problems that, first, various browsers over the years have varied wildly in how well they can reconstruct the document structure when given very concise or even deeply broken HTML; and, second, most web page authors judge the quality of their HTML by whether their browser of choice renders it correctly. This has resulted not only in a World Wide Web that is full of sites with invalid and broken HTML markup, but also in the fact that the permissiveness built into browsers has encouraged different flavors of broken HTML among their different user groups.

If HTML is a new concept to you, you can find abundant resources online. Here are a few documents that have been longstanding resources in helping programmers learn the format:

    http://www.w3.org/MarkUp/Guide/

    http://www.w3.org/MarkUp/Guide/Advanced.html

    http://www.w3.org/MarkUp/Guide/Style

The brief Bare Bones Guide, and the long and verbose HTML standard itself, are good resources to have when trying to remember an element name or the name of a particular attribute value:

    http://werbach.com/barebones/barebones.html

    http://www.w3.org/TR/REC-html40/

When building your own web pages, try to install a real HTML validator in your editor, IDE, or build process, or test your web site once it is online by submitting it to

    http://validator.w3.org/

You might also want to consider using the tidy tool, which can also be integrated into an editor or build process:

    http://tidy.sourceforge.net/

We will now turn to that weather forecast for Phoenix, Arizona, that we downloaded earlier using our scripts (note that we will avoid creating extra traffic for the NWS by running our experiments against this local file), and we will learn how to extract actual data from HTML.

Three Axes

Parsing HTML with Python requires three choices:

    The parser you will use to digest the HTML, and try to make sense of its tangle of opening and closing tags

    The API by which your Python program will access the tree of concentric elements that the parser built from its analysis of the HTML page

    What kinds of selectors you will be able to write to jump directly to the part of the page that interests you, instead of having to step into the hierarchy one element at a time

The issue of selectors is a very important one, because a well-written selector can unambiguously identify an HTML element that interests you without your having to touch any of the elements above it in the document tree. This can insulate your program from larger design changes that might be made to a web site; as long as the element you are selecting retains the same ID, name, or whatever other property you select it with, your program will still find it even if after the redesign it is several levels deeper in the document.

I should pause for a second to explain terms like “deeper,” and I think the concept will be clearest if we reconsider the unordered list that was quoted in the previous section. An experienced web developer looking at that list rearranges it in her head, so that this is what it looks like:

<ul>

  <li>First</li>

  <li>Second</li>

  <li>Third</li>

  <li>Fourth</li>

</ul>

Here the <ul> element is said to be a “parent” element of the individual list items, which “wraps” them and which is one level “above” them in the whole document. The <li> elements are “siblings” of one another; each is a “child” of the <ul> element that “contains” them, and they sit “below” their parent in the larger document tree. This kind of spatial thinking winds up being very important for working your way into a document through an API.

In brief, here are your choices along each of the three axes that were just listed:

    The most powerful, flexible, and fastest parser at the moment appears to be the HTMLParser that comes with lxml; the next most powerful is the longtime favorite BeautifulSoup (I see that its author has, in his words, “abandoned” the new 3.1 version because it is weaker when given broken HTML, and recommends using the 3.0 series until he has time to release 3.2); and coming in dead last are the parsing classes included with the Python Standard Library, which no one seems to use for serious screen scraping.

    The best API for manipulating a tree of HTML elements is ElementTree, which has been brought into the Standard Library for use with the Standard Library parsers, and is also the API supported by lxml; BeautifulSoup supports an API peculiar to itself; and a pair of ancient, ugly, event-based interfaces to HTML still exist in the Python Standard Library.

    The lxml library supports two of the major industry-standard selectors: CSS selectors and XPath query language; BeautifulSoup has a selector system all its own, but one that is very powerful and has powered countless web-scraping programs over the years.

Given the foregoing range of options, I recommend using lxml when doing so is at all possible—installation requires compiling a C extension so that it can accelerate its parsing using libxml2—and using BeautifulSoup if you are on a machine where you can install only pure Python. Note that lxml is available as a pre-compiled package named python-lxml on Ubuntu machines, and that the best approach to installation is often this command line:

STATIC_DEPS=true pip install lxml

And if you consult the lxml documentation, you will find that it can optionally use the BeautifulSoup parser to build its own ElementTree-compliant trees of elements. This leaves very little reason to use BeautifulSoup by itself unless its selectors happen to be a perfect fit for your problem; we will discuss them later in this chapter.

But the state of the art may advance over the years, so be sure to consult its own documentation as well as recent blogs or Stack Overflow questions if you are having problems getting it to compile.

From the Future

The BeautifulSoup project has recovered! While the text below — written in late 2010 — has to carefully avoid the broken 3.2 release in favor of 3.0, BeautifulSoup has now released a rewrite named beautifulsoup4 on the Python Package Index that works with both Python 2 and 3. Once installed, simply import it like this:

from bs4 import BeautifulSoup

I just ran a test, and it reads the malformed phoenix.html page perfectly.

Diving into an HTML Document

The tree of objects that a parser creates from an HTML file is often called a Document Object Model, or DOM, even though this is officially the name of one particular API defined by the standards bodies and implemented by browsers for the use of JavaScript running on a web page.

The task we have set for ourselves, you will recall, is to find the current conditions, temperature, and humidity in the phoenix.html page that we have downloaded. Here is an excerpt, Listing 10–3, which focuses on the pane that we are interested in.

Listing 10–3. Excerpt from the Phoenix Forecast Page

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"><html><head>

<title>7-Day Forecast for Latitude 33.45&deg;N and Longitude 112.07&deg;W (Elev. 1132 ft)</title><link rel="STYLESHEET" type="text/css" href="fonts/main.css">

...

<table cellspacing="0" cellspacing="0" border="0" width="100%"><tr align="center"><td><table width='100%' border='0'>

<tr>

<td align ='center'>

<span class='blue1'>Phoenix, Phoenix Sky Harbor International Airport</span><br>

Last Update on 29 Oct 7:51 MST<br><br>

</td>
</tr>
<tr>

<td colspan='2'>

<table cellspacing='0' cellpadding='0' border='0' align='left'>

<tr>

<td class='big' width='120' align='center'>

<font size='3' color='000066'>

A Few Clouds<br>

<br>71&deg;F<br>(22&deg;C)</td>

</font><td rowspan='2' width='200'><table cellspacing='0' cellpadding='2' border='0' width='100%'>

<tr bgcolor='#b0c4de'>

<td><b>Humidity</b>:</td>

<td align='right'>30 %</td>

</tr>

<tr bgcolor='#ffefd5'>

<td><b>Wind Speed</b>:</td><td align='right'>SE 5 MPH<br>

</td>
</tr>

<tr bgcolor='#b0c4de'>

<td><b>Barometer</b>:</td><td align='right' nowrap>30.05 in (1015.90 mb)</td></tr>

<tr bgcolor='#ffefd5'>

<td><b>Dewpoint</b>:</td><td align='right'>38&deg;F (3&deg;C)</td>

</tr>
</tr>

<tr bgcolor='#ffefd5'>

<td><b>Visibility</b>:</td><td align='right'>10.00 Miles</td>

</tr>

<tr><td nowrap><b><a href='http://www.wrh.noaa.gov/total_forecast/other_obs.php?wfo=psr&zone=AZZ023' class='link'>More Local Wx:</a></b> </td>

<td nowrap align='right'><b><a href='http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=psr&sid=KPHX&num=72' class='link'>3 Day History:</a></b> </td></tr>

</table>

...

There are two approaches to narrowing your attention to the specific area of the document in which you are interested. You can either search the HTML for a word or phrase close to the data that you want, or, as we mentioned previously, use Google Chrome or Firefox with Firebug to “Inspect Element” and see the element you want embedded in an attractive diagram of the document tree. Figure 10–2 shows Google Chrome with its Developer Tools pane open following an Inspect Element command: my mouse is poised over the <font> element that was brought up in its document tree, and the element itself is highlighted in blue on the web page itself.

Figure 10–2. Examining Document Elements in the Browser

Note that Google Chrome does have an annoying habit of filling in “conceptual” tags that are not actually present in the source code, like the <tbody> tags that you can see in every one of the tables shown here. For that reason, I look at the actual HTML source before writing my Python code; I mainly use Chrome to help me find the right places in the HTML.

We will want to grab the text “A Few Clouds” as well as the temperature before turning our attention to the table that sits to this element's right, which contains the humidity.

A properly indented version of the HTML page that you are scraping is good to have at your elbow while writing code. I have included phoenix-tidied.html with the source code bundle for this chapter so that you can take a look at how much easier it is to read!

You can see that the element displaying the current conditions in Phoenix sits very deep within the document hierarchy. Deep nesting is a very common feature of complicated page designs, and that is why simply walking a document object model can be a very verbose way to select part of a document—and, of course, a brittle one, because it will be sensitive to changes in any of the target element's parents. This will break your screen-scraping program not only if the target web site does a redesign, but also simply because changes in the time of day or the need for the site to host different kinds of ads can change the layout subtly and ruin your selector logic.

To see how direct document-object manipulation would work in this case, we can load the raw page directly into both the lxml and BeautifulSoup systems.

>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> tree = lxml.etree.parse('phoenix.html', parser)

The need for a separate parser object here is because, as you might guess from its name, lxml is natively targeted at XML files.

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

Traceback (most recent call last):
  ...
HTMLParseError: malformed start tag, at line 96, column 720

What on earth? Well, look, the National Weather Service does not check or tidy its HTML! I might have chosen a different example for this book if I had known, but since this is a good illustration of the way the real world works, let's press on. Jumping to line 96, column 720 of phoenix.html, we see that there does indeed appear to be some broken HTML:

<a href="http://www.weather.gov"<u>www.weather.gov</u></a>

You can see that the <u> tag starts before a closing angle bracket has been encountered for the <a> tag. But why should BeautifulSoup care? I wonder what version I have installed.

>>> BeautifulSoup.__version__

'3.1.0'

Well, drat. I typed too quickly and was not careful to specify a working version when I ran pip to install BeautifulSoup into my virtual environment. Let's try again:

$ pip install BeautifulSoup==3.0.8.1

And now the broken document parses successfully:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

That is much better!

Now, if we were to take the approach of starting at the top of the document and digging ever deeper until we find the node that we are interested in, we are going to have to generate some very verbose code. Here is the approach we would have to take with lxml:
>>> fonttag = tree.find('body').find('div').findall('table')[3] \

...     .findall('tr')[1].find('td').findall('table')[1].find('tr') \

...     .findall('td')[1].findall('table')[1].find('tr').find('td') \

...     .find('table').findall('tr')[1].find('td').find('table') \

...     .find('tr').find('td').find('font')

>>> fonttag.text

'\nA Few Clouds'

An attractive syntactic convention lets BeautifulSoup handle some of these steps more beautifully:

>>> fonttag = soup.body.div('table', recursive=False)[3] \

...     ('tr', recursive=False)[1].td('table', recursive=False)[1].tr \

...     ('td', recursive=False)[1]('table', recursive=False)[1].tr.td \

...     .table('tr', recursive=False)[1].td.table \

...     .tr.td.font

>>> fonttag.text

u'A Few Clouds71&deg;F(22&deg;C)'

BeautifulSoup lets you choose the first child element with a given tag by simply selecting the attribute .tagname, and lets you receive a list of child elements with a given tag name by calling an element like a function—you can also explicitly call the method findAll()—with the tag name and a recursive option telling it to pay attention just to the children of an element; by default, this option is set to True, and BeautifulSoup will run off and find all elements with that tag in the entire sub-tree beneath an element!

Anyway, two lessons should be evident from the foregoing exploration.

First, both lxml and BeautifulSoup provide attractive ways to quickly grab a child element based on its tag name and position in the document.

Second, we clearly should not be using such primitive navigation to try descending into a real-world web page! I have no idea how code like the expressions just shown can easily be debugged or maintained; they would probably have to be re-built from the ground up if anything went wrong with them—they are a painful example of write-once code.

And that is why the selectors that each screen-scraping library supports are so critically important: they are how you can ignore the many layers of elements that might surround a particular target, and dive right in to the piece of information you need.

Figuring out how HTML elements are grouped, by the way, is much easier if you either view HTML with an editor that prints it as a tree, or if you run it through a tool like HTML tidy from W3C that can indent each tag to show you which ones are inside which other ones:
$ tidy phoenix.html > phoenix-tidied.html

You can also use either of these libraries to try tidying the code, with a call like one of these:

lxml.html.tostring(html)

soup.prettify()

See each library's documentation for more details on using these calls.
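
For instance, a minimal sketch along those lines, assuming the same phoenix.html file used above and writing the readable copies under file names of my own choosing, could look like this:

import lxml.html
from BeautifulSoup import BeautifulSoup

# lxml: parse the raw file and write back an indented copy.
doc = lxml.html.parse('phoenix.html')
with open('phoenix-lxml.html', 'w') as f:
    f.write(lxml.html.tostring(doc, pretty_print=True))

# BeautifulSoup: prettify() returns the whole document, re-indented.
soup = BeautifulSoup(open('phoenix.html'))
with open('phoenix-soup.html', 'w') as f:
    f.write(soup.prettify())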

Selectors

A selector is a pattern that is crafted to match document elements on which your program wants to operate. There are several popular flavors of selector, and we will look at each of them as possible techniques for finding the current-conditions <font> tag in the National Weather Service page for Phoenix. We will look at three:

•    People who are deeply XML-centric prefer XPath expressions, which are a companion technology to XML itself and let you match elements based on their ancestors, their own identity, and textual matches against their attributes and text content. They are very powerful as well as quite general.

•    If you are a web developer, then you probably think of CSS selectors as the most natural choice for examining HTML. These are the same patterns used in Cascading Style Sheets documents to describe the set of elements to which each set of styles should be applied.

•    Both lxml and BeautifulSoup, as we have seen, provide a smattering of their own methods for finding document elements.

Here are standards and descriptions for each of the selector styles just described— first, XPath:

    http://www.w3.org/TR/xpath/

    http://codespeak.net/lxml/tutorial.html#using-xpath-to-find-text

    http://codespeak.net/lxml/xpathxslt.html

And here are some CSS selector resources:

    http://www.w3.org/TR/CSS2/selector.html

    http://codespeak.net/lxml/cssselect.html

And, finally, here are links to documentation that looks at selector methods peculiar to lxml and BeautifulSoup:

    http://codespeak.net/lxml/tutorial.html#elementpath

    http://www.crummy.com/software/BeautifulSoup/documentation.html#Searching the Parse Tree

The National Weather Service has not been kind to us in constructing this web page. The area that contains the current conditions seems to be constructed entirely of generic untagged elements; none of them have id or class values like currentConditions or temperature that might help guide us to them.

Well, what are the features of the elements that contain the current weather conditions in Listing 10–3? The first thing I notice is that the enclosing <td> element has the class "big". Looking at the page visually, I see that nothing else seems to be of exactly that font size; could it be so simple as to search the document for every <td> with this CSS class? Let us try, using a CSS selector to begin with:

>>> from lxml.cssselect import CSSSelector

>>> sel = CSSSelector('td.big')

>>> sel(tree)

[<Element td at b72ec0a4>]

Perfect! It is also easy to grab elements with a particular class attribute using the peculiar syntax of BeautifulSoup:

>>> soup.find('td', 'big')

<td class="big" width="120" align="center">

<font size="3" color="000066">

A Few Clouds<br />

<br />71&deg;F<br />(22&deg;C)</font></td>

Writing an XPath selector that can find CSS classes is a bit difficult since the class="" attribute contains space-separated values and we do not know, in general, whether the class will be listed first, last, or in the middle.

>>> tree.xpath(".//td[contains(concat(' ', normalize-space(@class), ' '), ' big ')]")

[<Element td at a567fcc>]

This is a common trick when using XPath against HTML: by prepending and appending spaces to the class attribute, the selector assures that it can look for the target class name with spaces around it and find a match regardless of where in the list of classes the name falls.

Selectors, then, can make it simple, elegant, and also quite fast to find elements deep within a document that interest us. And if they break because the document is redesigned or because of a corner case we did not anticipate, they tend to break in obvious ways, unlike the tedious and deep procedure of walking the document tree that we attempted first.

Once you have zeroed in on the part of the document that interests you, it is generally a very simple matter to use the ElementTree or the old BeautifulSoup API to get the text or attribute values you need. Compare the following code to the actual tree shown in Listing 10–3:

>>> td = sel(tree)[0]

>>> td.find('font').text

'\nA Few Clouds'

>>> td.find('font').findall('br')[1].tail

u'71°F'

If you are annoyed that the first string did not return as a Unicode object, you will have to blame the ElementTree standard; the glitch has been corrected in Python 3! Note that ElementTree thinks of text strings in an HTML file not as entities of their own, but as either the .text of its parent element or the .tail of the previous element. This can take a bit of getting used to, and works like this:

<p>

  My favorite play is    # the <p> element's .text

  <i>

    Hamlet                 # the <i> element's .text

  </i>

  which is not really      # the <i> element's .tail

  <b>

    Danish                 # the <b> element's .text

  </b>

  but English.             # the <b> element's .tail

</p>


This can be confusing because you would think of the three words favorite and really and English as being at the same “level” of the document—as all being children of the <p> element somehow—but lxml considers only the first word to be part of the text attached to the <p> element, and considers the other two to belong to the tail texts of the inner <i> and <b> elements. This arrangement can require a bit of contortion if you ever want to move elements without disturbing the text around them, but leads to rather clean code otherwise, if the programmer can keep a clear picture of it in her mind.
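
To make the distinction concrete, here is a quick interpreter sketch that parses that snippet as a small XML fragment (rather than the weather page) and shows where each string ends up:

>>> from lxml import etree
>>> p = etree.fromstring('<p>My favorite play is <i>Hamlet</i>'
...                      ' which is not really <b>Danish</b> but English.</p>')
>>> p.text
'My favorite play is '
>>> p.find('i').text, p.find('i').tail
('Hamlet', ' which is not really ')
>>> p.find('b').text, p.find('b').tail
('Danish', ' but English.')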

BeautifulSoup, by contrast, considers the snippets of text and the <br> elements inside the <font> tag to all be children sitting at the same level of its hierarchy. Strings of text, in other words, are treated as phantom elements. This means that we can simply grab our text snippets by choosing the right child nodes:

>>> td = soup.find('td', 'big')

>>> td.font.contents[0]

u'\nA Few Clouds'

>>> td.font.contents[4]

u'71&deg;F'

Through a similar operation, we can direct either lxml or BeautifulSoup to the humidity datum. Since the word Humidity: will always occur literally in the document next to the numeric value, this search can be driven by a meaningful term rather than by something as random as the big CSS class. See Listing 10–4 for a complete screen-scraping routine that does the same operation first with lxml and then with BeautifulSoup.

This complete program, which hits the National Weather Service web page for each request, takes the city name on the command line:

$ python weather.py Springfield, IL

Condition:

Traceback (most recent call last):

  ..

AttributeError: 'NoneType' object has no attribute 'text'

And here you can see, superbly illustrated, why screen scraping is always an approach of last resort and should always be avoided if you can possibly get your hands on the data some other way: because presentation markup is typically designed for one thing—human readability in browsers—and can vary in crazy ways depending on what it is displaying.

What is the problem here? A short investigation suggests that the NWS page includes only a <font> element inside of the <tr> if—and this is just a guess of mine, based on a few examples—the description of the current conditions is several words long and thus happens to contain a space. The conditions in Phoenix as I have written this chapter are “A Few Clouds,” so the foregoing code has worked just fine; but in Springfield, the weather is “Fair” and therefore does not need a <font> wrapper around it, apparently.

Listing 10–4. Completed Weather Scraper

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - weather.py

# Fetch the weather forecast from the National Weather Service.

import sys, urllib, urllib2

import lxml.etree

from lxml.cssselect import CSSSelector

from BeautifulSoup import BeautifulSoup

if len(sys.argv) < 2:

    print >>sys.stderr, 'usage: weather.py CITY, STATE'

    exit(2)

data = urllib.urlencode({'inputstring': ' '.join(sys.argv[1:])})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

# Solution #1

parser = lxml.etree.HTMLParser(encoding='utf-8')

tree = lxml.etree.fromstring(content, parser)

big = CSSSelector('td.big')(tree)[0]

if big.find('font') is not None:

    big = big.find('font')

print 'Condition:', big.text.strip()

print 'Temperature:', big.findall('br')[1].tail

tr = tree.xpath('.//td[b="Humidity"]')[0].getparent()

print 'Humidity:', tr.findall('td')[1].text

print

# Solution #2

soup = BeautifulSoup(content)  # doctest: +SKIP

big = soup.find('td', 'big')

if big.font is not None:

    big = big.font

print 'Condition:', big.contents[0].string.strip()

temp = big.contents[3].string or big.contents[4].string  # can be either

print 'Temperature:', temp.replace('°', ' ')

tr = soup.find('b', text='Humidity').parent.parent.parent

print 'Humidity:', tr('td')[1].string

print

If you look at the final form of Listing 10–4, you will see a few other tweaks that I made as I noticed changes in format with different cities. It now seems to work against a reasonable selection of locations; again, note that it gives the same report twice, generated once with lxml and once with BeautifulSoup:

$ python weather.py Springfield, IL

Condition: Fair

Temperature: 54 °F

Humidity: 28 %

Condition: Fair

Temperature: 54  F

Humidity: 28 %

$ python weather.py Grand Canyon, AZ

Condition: Fair

Temperature: 67°F

Humidity: 28 %

Condition: Fair

Temperature: 67 F

Humidity: 28 %

You will note that some cities have spaces between the temperature and the F, and others do not. No, I have no idea why. But if you were to parse these values to compare them, you would have to learn every possible variant and your parser would have to take them into account.
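
As a rough illustration, a tolerant little helper, a sketch that only covers the spacing variants shown above and gives up politely on anything without a digit, might look like this:

import re

def parse_fahrenheit(reading):
    # Sketch: pull the number out of readings like '54 F' or '67F',
    # with or without a degree sign or spaces in between; returns None
    # when there is no digit at all, so odd readings can be treated as
    # special cases by the caller.
    match = re.search(r'-?\d+', reading)
    return int(match.group()) if match else None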

I leave it as an exercise to the reader to determine why the web page currently displays the word “NULL”—you can even see it in the browser—for the temperature in Elk City, Oklahoma. Maybe that location is too forlorn to even deserve a reading? In any case, it is yet another special case that you would have to treat sanely if you were actually trying to repackage this HTML page for access from an API:

$ python weather.py Elk City, OK

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

I also leave as an exercise to the reader the task of parsing the error page that comes up if a city cannot be found, or if the Weather Service finds it ambiguous and prints a list of more specific choices!

Summary

Although the Python Standard Library has several modules related to SGML and, more specifically, to HTML parsing, there are two premier screen-scraping technologies in use today: the fast and powerful lxml library that supports the standard Python “ElementTree” API for accessing trees of elements, and the quirky BeautifulSoup library that has powerful API conventions all its own for querying and traversing a document.

If you use BeautifulSoup before 3.2 comes out, be sure to download the most recent 3.0 version; the 3.1 series, which unfortunately will install by default, is broken and chokes easily on HTML glitches.

Screen scraping is, at bottom, a complete mess. Web pages vary in unpredictable ways even if you are browsing just one kind of object on the site—like cities at the National Weather Service, for example.

To prepare to screen scrape, download a copy of the page, and use HTML tidy, or else your screen-scraping library of choice, to create a copy of the file that your eyes can more easily read. Always run your program against the ugly original copy, however, lest HTML tidy fix something in the markup that your program will need to repair!

Once you find the data you want in the web page, look around at the nearby elements for tags, classes, and text that are unique to that spot on the screen. Then, construct a Python command using your scraping library that looks for the pattern you have discovered and retrieves the element in question. By looking at its children, parents, or enclosed text, you should be able to pull out the data that you need from the web page intact.

When you have a basic script working, continue testing it; you will probably find many edge cases that have to be handled correctly before it becomes generally useful. Remember: when possible, always use true APIs, and treat screen scraping as a technique of last resort!

Source: http://rhodesmill.org/brandon/chapters/screen-scraping/


Friday, 22 May 2015

5 tips for scraping big websites

Scraping bigger websites can be a challenge if done the wrong way.

Bigger websites have more data, more security and more pages. We’ve learned a lot from our years of crawling such large, complex websites, and these tips could help you solve some of your own challenges.

1. Cache pages visited for scraping

When scraping big websites, it’s always a good idea to cache the data you have already downloaded, so you don’t have to put load on the website again if you have to start over or if a page is needed again during scraping. It’s effortless to cache to a key-value store like Redis, but databases and filesystem caches are also good.
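
As a rough sketch, a page cache keyed by a hash of the URL can be tiny; this one stores plain files on disk and uses the third-party requests library only for brevity (Redis or a database would slot in the same way):

import hashlib, os
import requests  # third-party HTTP client, used here only for brevity

CACHE_DIR = 'page_cache'

def fetch(url):
    # Return a cached copy of the page if one exists; otherwise download
    # it once and store it, so a re-run does not hit the website again.
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode('utf-8')).hexdigest())
    if os.path.exists(path):
        return open(path, 'rb').read()
    data = requests.get(url).content
    with open(path, 'wb') as f:
        f.write(data)
    return data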

2. Don’t flood the website with a large number of parallel requests; take it slow

Big websites possess algorithms to detect web scraping; a large number of parallel requests from the same IP address will look like a Denial of Service attack on their website and get your IPs blacklisted immediately. A better idea is to time your requests properly, one after the other, giving the crawl some human-like behavior. Oh!.. but scraping like that will take ages. So balance requests using the average response time of the website, and play around with the number of parallel requests until you find the right number.
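
One simple way to sketch that pacing is to scale the pause to the site’s last response time; the delay factor below is an arbitrary starting point, not a recommendation:

import time
import requests  # third-party HTTP client, used here only for brevity

def polite_get(url, delay_factor=2.0):
    # Wait roughly delay_factor times as long as the last request took,
    # so a slow (possibly overloaded) site automatically gets more breathing room.
    start = time.time()
    response = requests.get(url)
    elapsed = time.time() - start
    time.sleep(delay_factor * elapsed)
    return response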

3. Store the URLs that you have already fetched

You may want to keep a list of URLs you have already fetched, in a database or a key-value store. What would you do if your scraper crashes after scraping 70% of the website? If you need to complete the remaining 30%, without this list of URLs you’ll waste a lot of time and bandwidth. Make sure you store this list of URLs somewhere permanent until you have all the required data. This could also be combined with the cache. This way, you’ll be able to resume scraping.
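
A minimal sketch of that bookkeeping, using a plain text file of finished URLs that is reloaded at start-up (the example URLs below are hypothetical), could look like this:

import os

DONE_FILE = 'fetched_urls.txt'

def load_done():
    # Reload the set of URLs finished in earlier runs, if any.
    if not os.path.exists(DONE_FILE):
        return set()
    return set(line.strip() for line in open(DONE_FILE))

def mark_done(url):
    # Append the URL as soon as it has been scraped, so a crash loses nothing.
    with open(DONE_FILE, 'a') as f:
        f.write(url + '\n')

done = load_done()
for url in ['http://example.com/page1', 'http://example.com/page2']:  # hypothetical URLs
    if url in done:
        continue
    # ... fetch and scrape the page here ...
    mark_done(url)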

4. Split scraping into different phases

It’s easier and safer if you split the scraping into multiple smaller phases. For example, you could split scraping a huge site into two: one phase for gathering links to the pages from which you need to extract data, and another for downloading those pages and scraping their content.
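
A rough two-phase sketch follows; the extract_links() and scrape_page() helpers are placeholders for whatever site-specific parsing you actually need:

def extract_links(listing_url):
    # Placeholder: parse one listing page and return the detail-page URLs on it.
    return []

def scrape_page(detail_url):
    # Placeholder: download and parse one detail page.
    pass

def gather_links(listing_urls, out_file='detail_links.txt'):
    # Phase 1: walk the listing pages and persist every detail-page link found.
    with open(out_file, 'w') as f:
        for listing_url in listing_urls:
            for link in extract_links(listing_url):
                f.write(link + '\n')

def scrape_details(in_file='detail_links.txt'):
    # Phase 2 (can be run separately, even much later): scrape each saved link.
    for line in open(in_file):
        scrape_page(line.strip())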

5. Take only what’s required

Don’t grab or follow every link unless it’s required. You can define a proper navigation scheme to make the scraper visit only the pages required. It’s always tempting to grab everything, but that’s just a waste of bandwidth, time and storage.
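
One way to sketch such a navigation scheme is a small allow-list of URL patterns that the crawler consults before queuing a link; the patterns below are purely hypothetical:

import re

# Hypothetical patterns: only category listings and product pages are wanted.
ALLOWED_PATTERNS = [
    re.compile(r'^https?://www\.example\.com/category/'),
    re.compile(r'^https?://www\.example\.com/product/\d+$'),
]

def should_follow(url):
    # Queue a link only if it matches one of the patterns we care about.
    return any(pattern.match(url) for pattern in ALLOWED_PATTERNS)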

Source: http://learn.scrapehero.com/5-tips-for-scraping-big-websites/

Wednesday, 20 May 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge that the lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, which is then structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but its efficacy has dwindled over time. Web scraping is fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identify the leads that would make a difference for your business.

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links is preserved. The stored pages can then be viewed and scoured for information at any later time.

•    Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you are to determine whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping websites

During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding: linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects that can be constructively used while designing sales campaigns.

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Sunday, 17 May 2015

Metadata Scraping Service

As mentioned in Robert's last blog post we set up a scraping service which supports users working with citations by automatically extracting references from digital library or publisher websites. We use a very similar service in BibSonomy to support our users while posting a new reference. However, the service is independent of BibSonomy. Our main goal is to make the metadata of other websites easily accessible to every user who needs bibliographic metadata. Therefore we offer the extracted information in BibTeX format. Most tools can import BibTeX, so it should be very easy for everyone to get the data into their own tool. The service is running under the following URL:

http://scraper.bibsonomy.org/

Currently we support more than 60 different websites (here the full list) and we are working on further extensions. In the near future we will make the source code of our scrapers publicly available under GPL and we hope that other people will find it useful and start to help us by implementing their own scrapers.

How does the service work?

In principle there are two ways to use the service. One uses a so-called bookmarklet and the other is simply based on the URL. If you have a webpage of a supported site, e.g. from the ACM digital library, the following page:

Logsonomy - social information retrieval with logdata

then you can copy this URL into the form on the service homepage and the service will return the extracted BibTeX information. As this is not a very convenient way to access the data, we provide a ScrapePublication button. This button is a small piece of JavaScript and can be copied to the toolbar of the browser. By pressing this button while visiting a digital library webpage, the URL is automatically sent to the scraping service and the metadata is extracted.

The service has three options which can be used to customize it and to make it useful for other systems. Obviously one parameter is the URL itself, which is used by the bookmarklet too. The next is the selection parameter, which allows you to send text to the service, and the last parameter allows you to change the output format from HTML to plain BibTeX. This last parameter makes integration with other systems very simple.
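
For instance, a client script might call the service roughly like this; note that the endpoint path and the parameter names (url, format) are assumptions based on the description above and may differ from the actual service, so check the homepage for the exact details:

import requests  # third-party HTTP client, used only to keep the example short

# NOTE: the '/service' path and the 'url'/'format' parameter names are assumed
# from the description above; the real service may expect something different.
publication_url = 'http://dl.acm.org/citation.cfm?id=123456'  # hypothetical publication page
response = requests.get('http://scraper.bibsonomy.org/service',
                        params={'url': publication_url, 'format': 'bibtex'})
print(response.text)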

If needed we can provide the metadata in other formats as well but currently we support only BibTeX.

Source: http://blog.bibsonomy.org/2008/11/metadata-scraping-service.html

Thursday, 7 May 2015

4 Web Scraping Tools To Save You Time On Data Extraction

Whether you are working on a product website, struggling to add a live data feed to your app or merely need to pull out a huge amount of online data for analysis, an accurate web scraping tool can save you loads of time and keep you sane. Here are four powerful web scraping tools to save you from copy-pasting or from spending time writing your own scripts.

1. Uipath

Uipath specializes in developing various process automation software, including web scraping and screen scraping software for desktop and web. The Uipath web scraper is perfect for non-coders and easily surpasses most common data extraction challenges, including page navigation, digging through Flash and even scraping PDF files. All you need to do is open the web scraping wizard and simply highlight the data you need to extract. The tool will scrape all the data following this pattern on every page you’ve chosen and sort it accordingly. You can add as many items for scraping as you like and have them sorted into respective columns. As a result, you receive a neat Excel or CSV document with duplicates eliminated from the data.

Moreover, Uipath isn’t just about scraping. This software can be used not only for extracting data, but also to manipulate the interface of another app, thus establishing data transfers between the two of them. Basically, this tool could be used to conduct any repetitive task a human could do, yet much faster and with higher accuracy.

Pros: You can automate form filling, clicking buttons, navigation etc. Uipath scraper is impressively accurate, fast and simple to use. It “reads” all types of data on screen (JS, HTML, Silverlight and more), plus you can train the software to emulate human actions of various complexity.

Cons: Premium software runs at a premium price. Uipath is an affordable professional solution, but may be a bit too pricey for personal use.

2. Import.io

Import.io offers you a free desktop app to help you scrape all the data you need from an unlimited number of web pages. The service treats each page as a potential data source from which to generate an API. If the page you’ve submitted has been processed before, you can access its API and get some of the data. Otherwise, Import.io will guide you through the process of creating the scraping matrix by building connectors (for navigation) or extractors (to pull out the needed data). Afterwards, you submit a request for extraction, which is typically processed within 24 hours. All the data is private and you can schedule automatic refreshes at any chosen interval.

Pros: The service is easy to use with no tech skills needed. It can also handle pages that require a login/password, plus it’s free. A minimalistic, effective design and simple navigation come along with it.

Cons: Import.io has a hard time navigating through combinations of JavaScript/POST and cannot navigate from one page to another (e.g. click next, second page, etc.). Sometimes it takes over 24 hours to receive the report. Besides, it’s a browser-only app, incompatible with other applications.

3. Kimono

Kimono is a popular web scraper among app developers who prefer to power up their products with live data and no additional code. It saves you tons of time when you need to fill up your app with mashed-up data. Install the Kimono browser bookmarklet, highlight the page elements you need to extract and provide some positive/negative examples to train the tool. After labeling all the data you can download it as CSV, JSON or a web endpoint. The APIs created for your pages are stored in the cloud and you can run them on a schedule. So far, Kimono is free to use, with pro and enterprise plans to be launched soon.

Pros: The tool works pretty fast and works great with scraping newsfeeds and prices. The data is rather accurate.

Cons: No page navigation is available, and you need to spend quite a lot of time training Kimono before it starts to pull out multi-item data accurately enough. In general, I’d say Kimono is more of an app mash-up creator than a full-scale web scraper.

4. Screen Scraper

Screen Scraper is pretty neat and tackles a lot of difficult tasks, including navigation and precise data extraction; however, it requires a bit of programming/tokenization skill if you’d like to run it super smoothly. Launch the software, add a proxy, start recording the list of your actions and create extraction patterns (some coding required). It works great with HTML and JavaScript, but you should test it with Citrix and other platforms. Basically, Screen Scraper helps you write simple web scraping scripts and lets you download the extracted data in TXT/CSV/Excel format.

Pros: When set up correctly, there is no data extraction task Screen Scraper fails to handle.

Cons: The tool is pricey and you’ll have to go through documentation and have basic coding skills to use it.

Source: http://tech.co/4-web-scraping-tools-save-time-data-extraction-2015-03