
Wednesday, 25 December 2013

Tools for Data Scraping and Visualization

Over the last few weeks I co-taught a short course on data scraping and data presentation.  It was a pleasure to get a chance to teach with Ethan Zuckerman (my boss) and interact with the creative group of students! You can peruse the syllabus outline if you like.

In my Data Therapy work I don’t usually introduce tools, because there are already loads of YouTube and written tutorials.  However, while co-teaching a short course for incoming students in the Comparative Media Studies program here at MIT, I led two short “lab” sessions on tools for data scraping, interrogation, and visualization.

There are myriad tools that support these efforts, so I was forced to pick just a handful to introduce to these students.  I want to share the short lists of tools I chose.

Data Scraping:

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  New tools are constantly being built, but I recommend these:

   1. Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
   2. Import.io: Still nascent, but this is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s very early, and buggy, but on many simple webpages it works well!
   3. Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  It lets you define a pattern and find it in any large document.  Sure, the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
   4. jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing.  From there you can use a basic understanding of JavaScript and the JavaScript console (in most browsers) to pull parts of a webpage into an array.
   5. ScraperWiki: There are a few things this makes really easy – getting recent tweets, getting Twitter followers, and a few others.  Otherwise this is a good engine for software coding.
   6. Software Development: If you are a coder, and the website you need to scrape has JavaScript and logins and such, then you might need to go this route (ugh).  If so, here’s a functioning example of a scraper built in Python (with Beautiful Soup and Mechanize).  I would use Watir if you want to do this in Ruby.
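
If you are curious what “Super Find and Replace” looks like in practice, here is a minimal sketch using Python’s built-in re module.  The sample text, the two patterns, and the function names are all made up for illustration, and the email pattern is deliberately loose: fine for a quick scrape, not a real validator.

```python
import re

# Hypothetical sample text; in practice this is whatever you pasted from a webpage.
text = """
Contact Ana at ana@example.org or call 617-555-0101.
Bob's address is bob.smith@example.com (office: 617-555-0199).
"""

# A simple (assumed) email pattern: name characters, an @, then dotted domain parts.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# A simple (assumed) pattern for phone numbers shaped like 123-456-7890.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_emails(document):
    """The 'find' half: return every match of the email pattern, in order."""
    return EMAIL_PATTERN.findall(document)


def mask_phones(document):
    """The 'replace' half: swap each phone-shaped match for a placeholder."""
    return PHONE_PATTERN.sub("[phone]", document)


if __name__ == "__main__":
    print(find_emails(text))
    print(mask_phones(text))
```

The same patterns can be typed straight into the find box of a text editor that supports regular expressions, so you can try them there before writing any code at all.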

Data Interrogation and Visualization:

There are even more tools that help you here.  I picked a handful of single-purpose tools, and some generic ones to share.

   1. Tabula: There are a few PDF-cleaning tools, but this one has worked particularly well for me.  If your data is in a PDF, and selectable, then I recommend this! (Disclosure: the Knight Foundation funds much of my paycheck, and contributed to Tabula’s development as well.)

   2. OpenRefine: This data cleaning tool lets you do things like cluster rows in your data that are spelled similarly, look for correlations at a high level, and more!  The School of Data has written well about this – read their OpenRefine handbook.

   3. Wordle: As maligned as word clouds have been, I still believe in their role as a proxy for deep text analysis.  They give a nice visual representation of how frequently words appear in quotes, writing, etc.

   4. Quartz ChartBuilder: If you need to make clean and simple charts, this is the tool for you. Its output is much nicer than Excel’s.

   5. TimelineJS: Need an online timeline?  This is an awesome tool. Disclosure: another Knight-funded project.

   6. Google Fusion Tables: This tool has empowered loads of folks to create maps online.  I’m not a big user, but lots of folks recommend it to me.

   7. TileMill: Google Maps isn’t the only way to make a map.  TileMill lets you create beautiful interactive maps that fit your needs. Disclosure: another Knight-funded project.

   8. Tableau Public: Tableau is a much nicer way to explore your data than Excel pivot tables.  You can drag and drop columns onto a grid and it suggests visualizations that might be revealing in your attempts to find stories.
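
A word cloud is really just a picture of word counts, so here is a rough sketch of the counting step in Python.  This is not how Wordle itself works internally; the stop-word list, the function name, and the sample sentence are all assumptions for illustration.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "for", "on"}


def word_frequencies(text, top_n=5):
    """Count how often each non-stop word appears and return the top_n as (word, count) pairs."""
    # Lowercase first so "Data" and "data" count as the same word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = "The data is the story. Data and a map of the data tell the story."
    for word, count in word_frequencies(sample):
        print(word, count)
```

A word cloud tool would take pairs like these and draw the more frequent words in a larger font.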

I hope those are helpful in your data scraping and story-finding adventures!

