eBay Product Scraping, Manta Data Scraping, Website Screen Scraping, Website Screen Scraping, Website Scraper, Scraping Data from Websites, Website Information Scraping, Web Scraping Services, Scraping Data from Websites, Website Information Scraping

Friday, 27 December 2013

Basic web scraping and data visualization using Google Spreadsheets

Google Spreadsheets provides a free, one-stop solution for journalists and researchers to retrieve tabular data from a web page, visualize the data, and embed the visualizations in a news or research report.

There are times when a journalist or researcher needs to incorporate visualized data in a report, and the data reside on a third-party website, in the form of a table. It is tedious and time-consuming to manually copy/paste the table, take care of any formatting issues, find a way to visualize the data, and upload the visualizations somewhere on the web for sharing and embedding.

With a free Google account, in three steps, we can create a spreadsheet, let the spreadsheet retrieve data in that table, perform various visualizations with the retrieved data, and then share the visualizations through different venues.

Step 1: Retrieve (scrape) tabular data

Google Spreadsheets provides a large collection of “functions” that do various things. This tutorial is based on one particular function that retrieves tabular data on a web page; for other spreadsheet functions, please refer to the Google Spreadsheet function list.

Let’s say that in a report about U.S. population, we want to quote state-by-state population data, in the form of an interactive chart, off of the Wikipedia list of U.S. states, which is a table with 12 columns and dozens of rows.

With a new blank spreadsheet, type in the A1 cell (the upper-left cell) the following function code:

This formula tells Google Spreadsheets to go to the said web page, look for tables on that page, and fetch data in the second table. Once I hit the enter key, in a few moments, the blank spreadsheet is populated with data as shown in the screenshot below:

U.S. population

A few notes about this step:

    To scrape data from any other page, copy that web address and paste it within the quotation marks.

    For that number “2″ in the formula – a web page may contain several tables; some are visible, some invisible. If you don’t know how to read HTML source codes and pinpoint the exact table you want to target, then start with 1, and work your way up until the spreadsheet returns the data you want.

    You cannot edit this spreadsheet as it is dynamically linked to the source on that Wiki page. If you delete a column or cell, the spreadsheet will refresh itself with the deleted contents back in place.

    If you do need to edit the data, for instance to create a customized chart of selected columns/rows, you need to make a copy of the spreadsheet. Remember, when you copy/paste to a new spreadsheet, choose edit>paste special>paste values only, so that the new spreadsheet is not linked to the source.

Step 2: Visualize data

Google Spreadsheets provides various visualization tools that we can choose and customize to suit our needs.

For instance, I want to visualize and compare the 2012 estimated population of the top-10 states. To do that, I need to specify the range of cells for the visualization; in this case, it is columns C and D, from row 1 down to row 11 (we need to include the header in row 1).

This range is thus C1:D11, which is a standard way to specify spreadsheet ranges.

(Note: read a Goggle Spreadsheet tutorial for working with cell range and other spreadsheet features)

On the spreadsheet menu bar, click on the “Insert chart” button (third from right); in the Chart Editor pop-up window, there are three tabs where we can customize the chart:

    chart-editorUnder the Start tab, change the data range to Sheet1!C1:D11; leave other options as is, and make sure “use row 1 as headers” is checked. Based on the data type, the spreadsheets has a few recommended charts; I selected the bar chart.

    Click the Chart tab and explore other types of charts; charts not compatible with the current data set are not available (grayed out).

    Click the Customize tab; here, among others, give the chart a title, and scroll down to change the default names of the vertical axis and the horizontal axis.

    Click Insert and the chart will be shown on top of the spreadsheet; click upper-right corner of the chart window, click the drop-down menu and choose “move to own sheet.” The chart will be shown in a separate sheet with a default name of “Chart 1.”

Step 3: Share chart visualizations

By default, the spreadsheet is private, meaning it is only visible to its creator. For others to view our chart, we need to change the sharing settings: Click on the blue “Share” button on the upper-right corner; in the pop-up window, under “who has access,” click on “change” next to the “private” option, choose either “public on the web” or “anyone with the link,”  click Save then Done.

Click File>Publish to the web; in the pop-up window, under “Get a link to the published data,” there are a few options:

    Click in the first box and you will see a group of options for how you can share the data/chart. For our purpose, we want to embed the interactive chart in this blog post, so I selected “HTML to embed in a page.”

    Click in the “All sheets” box and choose “Chart 1,” which is the one that holds the chart.

In the next box, select and copy the codes, paste them to the HTML editor of a blog post or web page, and an interactive chart will be shown as the one below; notice that in the original codes, the default dimensions are 500 pixels wide and 300 pixels tall; I changed it to 650*450 for display in this blog post.

It should be noted that in some browsers, an embedded chart may have unsightly scroll bars along the right and the bottom borders. I have not been able to find a fix, but we can always embed a chart image which is automatically generated and doesn’t have scroll bars. All we need to do is to click “publish chart” on the sheet that holds the chart; in the pop-up window, select “image” as the publish format, copy the link and paste it to a blog post or web page.

Source:http://www.mulinblog.com/basic-web-scraping-data-visualization-using-google-spreadsheets/

No comments:

Post a Comment