eBay Product Scraping, Manta Data Scraping, Website Screen Scraping, Website Screen Scraping, Website Scraper, Scraping Data from Websites, Website Information Scraping, Web Scraping Services, Scraping Data from Websites, Website Information Scraping

Friday 13 December 2013

How To Build Agile SEO Tools Using Google Spreadsheets

In the past few years innovations on the web have made it incredibly easy for regular people like you and I to enter the world of coding. For example, right at the end of 2010 I started dabbling with Google Appengine and shipped a fully functional interactive site in 4 weeks. (Read how I built 7books in 4 weeks)

Of course, advances in technology have also made it easier for the pros to build awesome applications. Just look at SEOmoz’s Open Site Explorer which relies on Amazon Web Services.

So as SEOs we have a huge arsenal of tools that we can call upon for various different functions. A lot of these tools, services and platforms however either require learning a large amount of code or take a long time to build something bespoke. So in this post I’m going to talk about using Google Spreadsheets to build small, agile tools which can be built to match your exact needs.

Agile vs Scaleable

Before I dive into the technical details, a quick word on what I use Google Docs for. In my SEO-ninja toolset Google Docs are used for quick, agile tools. That means that if there’s a specific problem I need to overcome or some weird thing I’m testing I always turn to Google Docs first. That’s because I can build things quickly. They aren’t always robust, but if I’m only building a tool to solve a unique problem (as opposed to a problem I encounter all the time) then speed is of the essence. I don’t want to have to spend a lot of time building a tool I’m only going to use once. Or running a test that turns out to not give me the expected results. If you want to build scaleable tools then I suggest you leave it to the pros (though Appengine is a great place to start with building “real” tools).

Let’s start with the complete beginner

Ok, so you might be scared. I’m going to talk about writing functions and building tools. You’re going to get your hands dirty. But literally anyone can do this. You need no prior knowledge. None. This post should take the complete beginner to ninja in 5 easy steps. The steps I’m going to cover:

    Simple Web Scraping
    Advanced Web Scraping
    Google Docs Scripts - The Secret Sauce
    Script Triggers
    Putting It All Together

Lesson 1 - Simple Web Scraping

Ok, so the bedrock of using Google Spreadsheets for fun and profit is their nifty little function called ImportXML. Richard Baxter wrote a great intro post on ImportXML which you can check out. The basic premise is that using a function like this:

=importxml(“https://www.distilled.net/blog/”,“//h2[@class=’entry-title’]”)

(Note: this importxml function has been modified from the original to handle the move from distilled.co.uk to distilled.net)

Gets us a list of blog post titles into a Google Spreadsheet like this:

Try it for yourself! Copy and paste that code into a blank Google Spreadsheet and see what happens :)

Don’t get scared! There’s lots of things you probably don’t understand so let’s walk through them for you.

A standard function looks like this =importxml(“url”, “query”). So the URL can be explicit (like I typed above) or a reference file like this =importxml(A1, “query”) just like you would with a regular spreadsheet function. The query is an XPATH query. For a tutorial reference on XPATH here’s a good guide.

If you can’t be bothered reading that then here’s a few quick definitions (warning! hand-wavey!)

    // - this means select all elements of the type
    //h3 - this means select all h3 elements
    [@class=’’] - this means only select those elements that meet the criteria given
    //h3[@class=’storytitle’] - this means only select elements that look like: <h3 class=“storytitle”>Title</h3>

Walkthrough Example for Simple Web Scraping

So, now we’re getting to grips with the code let’s step through a practical example. A common SEO task is “how can I find as many blogs on niche X as possible”. So I google around and find a list of the top 25 blogs on Technorati: http://technorati.com/blogs/top100/. It’s manual and time consuming having to click on each one to copy the link. I want to get the list of URLs into a spreadsheet as quick as possible.

1) First we take a look at the source code of the page and we see something like this:

1tcpost

2) We load up a Google Docs and fire up the importxml function. We can see that all the blogs are in h3 elements within a list with class of “even” so let’s try something like

=importxml(A2,“//li[@class=’even’]//h3”))

(where A1 is the cell with the URL of the page). We get this back:

Check out the sheet here.

3) As you can see, it contains the blog names so we’re getting there. But our query is also getting a whole load of other stuff we don’t want. So let’s look in the code and see if we can isolate the list of blog items. I find the “inspect element” control in Google Chrome excellent for visualising this. As you hover over the code, it highlights the section of the page that applies to it.

2tcpost

4) We refine our guess to limit ourselves to the a tag within the h3 using a query like

=importxml(A2,“//li[@class=’even’]//h3//a”)

Which loosely translated says “fetch the anchor text within h3’s that appear within the list with class=”even“ which results in:

Get the sheet here.

5) We’re nearly there! We now have a list of all the blog elements. The next step is to pull the URL. We still want the blog names, but we also want the links, so we add in another importxml call:

=importxml(A2,”//li[@class=’even’]//a[@class=’offsite’]/@href“)

Which says, from the li elements select the href contents from the a element. This /a/@href is a very common thing to tag on the end of importxml functions so I suggest you memorise it. This results in:

And we’re done! If you want to look at the spreadsheet within Google Docs go here and make a copy then you can play around to your heart’s content :)

Lesson 2 - More Advanced Web Scraping

Ok, now we have the basics down let’s move on to some more fun activities. Of course, as soon as I get computers involved my thoughts turn to rank checking... This is a common task that we might want to do so let’s quickly discuss how to do that. Firstly we construct the search URL like this:

=concatenate(”http://www.google.co.uk/search?q=“&A2&”&pws=0&gl=UK&num=50“)

Where the query to search is in cell A2. Then we parse the Google URL using importxml like this:

=importxml(B2,”//h3[@class=’r’]/a/@href“)

I’m not going to break that down, hopefully you can figure out what I’m getting off the page. Again, check the source code for the page if you’re not sure what to write in your importxml function. Output like this:

As before, Grab your own copy here.

You’ll notice that the results returned are less than pretty, and it’s just because this is how Google structures their HTML. We need to turn this: /url?q=http://www.wickedweb.co.uk/search-engine-optimisation/&sa=U&ei=eyIpUtfyHKLi2wWplYHgAg&ved=0CC4QFjAB&usg=AFQjCNHy63s0tmfxC5njtJ8Yj7v-VHz9yA into http://www.wickedweb.co.uk/. There’s probably 10 different ways of doing this, but here’s what I’m using:

=arrayformula(mid(C2:C51,search(”q=“,C2:C51,1)+2,search(”&sa“,C2:C51,6)-8))

. Using this formula, I can extract exactly what I need from all the returned results, hopefully you can pick out what I’m doing!

Lastly, we want to find out where distilled.net ranks for ”seo agencies london“ so we’ll use this formula:

=ArrayFormula(MATCH(1, FIND(”https://www.distilled.net“,D2:D51),0))

I was going to add some explanation here as to what this formula does but actually it gets pretty complicated. Either you already know what an arrayforumla does (in which case it should be straightforward) or you don’t. In which case you probably just want to copy and paste for now :)

I should note at this stage that there is a limit for 50 importxml calls per spreadsheet which limits us from building a full web crawler but for most agile tools this is sufficient (especially when combined with scripts, see lesson 3).

Lesson 3 - Google Docs Scripts - The Secret Sauce

Now, all this is very well - we have functions which pull in data but it’s all a little ”flat“ if you know what I mean? Let’s try and jazz things up a little by making it MOVE. For anyone familiar with macros in excel, scripts function in a very similar way. Two big advantages here however are the ability to crawl URLs and also the ability to email you. Nice.

Google Scripts are very powerful and essentially allow you to build fully featured programs so I’m not going to go into massive detail here. There are great tutorials from Google already for example:

    Your First Script - a walkthrough of how to use the service
    Sending emails from a spreadsheet - does what it says on the tin

You can easily lose days of your life browsing through and playing with all the things that Google Scripts do. Here, I’m going to present a simple example to show you how agile this is. That’s the key here, building tools that fit your exact needs quickly and easily. Let’s imagine I want to quickly check a bunch of URLs for their tweet count to produce something like this:

Check out the sheet here.

What’s happening here is that I have a list of URLs that I want to check tweet counts for. I’ve created my own function which takes one parameter: =twitter(URL) where URL is the reference to the cell with the link I want to check. Here’s the code:

function twitter(url) {
  var jsondata = UrlFetchApp.fetch(”http://urls.api.twitter.com/1/urls/count.json?url=“+url);
  var object = Utilities.jsonParse(jsondata.getContentText());
  return object.count;
}

Once you’ve read through the Google Scripts tutorials above you should be fairly comfortable with how this works so I’m not going to step through it in detail. The parsing XML tutorial will likely come in handy.

Lesson 4 - Google Scripts Triggers

Ok, now for the magic. Google scripts are nice, but the real power comes from triggering these scripts in different situations. You can cause a script to trigger on any of the following:

    The spreadsheet is opened
    A form is submitted
    A button is pressed
    A specific time happens

Read more about script triggers here.

The most useful here is the time-based trigger I think. Let’s take a quick look at writing a time-based script.

Walkthrough example of time-based trigger

Let’s again take a simple example. As I’m writing this post I know that I’m going to put it live soon, so let’s build a spreadsheet to check the keyword ”seo tools“ and see if perhaps QDF will push this post onto the first page at any point. How viciously referential :)

Step 1 - we write a simple spreadsheet to rank check distilled.co.uk against a particular keyword Step 2 - we write a script that tracks the rank and logs it in a new cell:

Step 3 - we create a time-based trigger to run the script every 30mins:

A few things to note:

    I’ve used

    =int(now())

    in the URL to generate a unique URL each time. Otherwise Google caches the data and you won’t get fresh data each time
    Note the getRange and setValue functions - these are very useful to get your head around. See this tutorial.

The final result (you might have to scroll down for a while depending how long after I wrote this post you’re reading this!):

Grab a copy of the spreadsheet here to see how it works.

Lesson 5 - putting it all together

So, finally let’s put it all together in a fun example. I’ve created a form here where you can enter your city and your email address and my script will fetch some data and email it to you. Just like magic! Go ahead, try it out :)

Taking it further

The sky really is the limit when it comes to Google Scripts but I think that if you start doing any more heavy lifting than what I’ve done in this post you almost certainly want to start building in exception handling and learning to code properly (which I, to be clear, have very much not done!). That said, if you do fancy using Google Scripts there are all kinds of funky things it can do:

    Build a GUI
    Use full oAuth to build a twitter app
    Create & edit new spreadsheets on the fly
    Publish & edit Google sites
    Converting spreadsheets into JSON
    Using Google Docs as a database

But for me, the real power is in hacking together things in a few minutes which gather the data I need so I can get back to getting stuff done. I’ll leave building real SEO tools to the pros for now

Source: https://www.distilled.net/blog/seo/how-to-build-agile-seo-tools-using-google-docs/

No comments:

Post a Comment