eBay Product Scraping, Manta Data Scraping, Website Screen Scraping, Website Scraper, Scraping Data from Websites, Website Information Scraping, Web Scraping Services

Saturday 28 December 2013

Simple Steps to Find an Article Writing Service

Many businesses understand the importance of content marketing and quality web content writing. However, some businesses are simply too busy and don’t have the resources to complete this work on their own. These companies either suffer the consequences of not having a solid online presence or they hire an article writing service to help them. Hiring a service that matches you with a quality content writer will help your business succeed online without costing a lot of your time, effort or money.

Check Their Background

A good article writing service will have a solid background that shows you their expertise in the area of web content writing. While the service is capable of telling you more about their background when you speak with a representative, it can also be useful to do some of the research on your own. Look for reviews of their services and look over the work they have completed for their own website, blog and other marketing materials. If a company is dedicated to their work, they will put the work into helping their own company, as well as yours.

In addition to checking into the background of the service itself, you should find out as much as you can about the content writer with whom you will be working. Look for someone who has a proven track record of success with their work. There are plenty of amateurs on the Internet who feel they can sufficiently string words together to please their clients. You need someone who can write on a college level and provide you with the quality you deserve.

Ask for Examples

Actions speak much louder than words. As you consider which article writing service you want to hire for your web content writing, make sure you ask them for examples of the work they have already performed, particularly within your industry. This will give you a clear picture of what they will be able to do for you. For instance, if you hire them to write your blog posts, seeing examples of their press releases or white papers will not give you the evidence you need to make your decision.

Do They Format Their Work?

One of the biggest factors in the success of your content marketing is how well your content writer formats the work. Subheadings are particularly important for your blog posts and other web content writing because they break up the text and allow readers to scan the piece to determine whether they want to read it. Hyperlinks, bold text, italics and other features are important to formatting. If the article writing service isn’t able to make your content look great, it isn’t worth your time or your money to invest in their work. Make sure you ask about the type of formatting they use and let them know if you have any preferences.

Evaluate Turnaround Time

When you have a content writer working on content for your site, your blog or other marketing materials, you need to know how fast you can expect the work. Some companies claim they can consistently turn your work around in as little as 24 hours and maintain a high level of quality. While this may be possible, it certainly isn’t typical. However, a good article writing service should be able to submit work to you in less than seven days, giving you the content you need quickly. If you are working with them on a consistent basis, this can mean receiving new content daily.

Get a Price Quote

The final important piece to the puzzle is to find out how much the company charges. Make sure you find a company that can create a custom package just for your business. They may offer a package that fits your needs, but if not, they should be happy to work with you to find an alternative that works with your needs and your budget. Remember, you don’t have to spend a lot of money to get what you need.

An article writing service can be the perfect way to get the content you need without spending your own time and resources on this important task. When you take the time to find a content writer who has the experience necessary to complete quality work for you on a regular basis, you can boost your content marketing and see the results you are looking for.

Do you need help with your content marketing plan? Contact us to find out what our article writing service can offer.

Source: http://www.iwebcontent.com/simple-steps-to-find-an-article-writing-service/

Friday 27 December 2013

Basic web scraping and data visualization using Google Spreadsheets

Google Spreadsheets provides a free, one-stop solution for journalists and researchers to retrieve tabular data from a web page, visualize the data, and embed the visualizations in a news or research report.

There are times when a journalist or researcher needs to incorporate visualized data in a report, and the data reside on a third-party website, in the form of a table. It is tedious and time-consuming to manually copy/paste the table, take care of any formatting issues, find a way to visualize the data, and upload the visualizations somewhere on the web for sharing and embedding.

With a free Google account, in three steps, we can create a spreadsheet, have it retrieve the data in that table, perform various visualizations with the retrieved data, and then share the visualizations through different venues.

Step 1: Retrieve (scrape) tabular data

Google Spreadsheets provides a large collection of “functions” that do various things. This tutorial is based on one particular function that retrieves tabular data on a web page; for other spreadsheet functions, please refer to the Google Spreadsheet function list.

Let’s say that in a report about U.S. population, we want to quote state-by-state population data, in the form of an interactive chart, off of the Wikipedia list of U.S. states, which is a table with 12 columns and dozens of rows.

In a new blank spreadsheet, type the following function into cell A1 (the upper-left cell):
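
(The original post shows this formula only as a screenshot; a representative version, in which the exact Wikipedia URL is an assumption, would be:)

=ImportHTML("http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population", "table", 2)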

This formula tells Google Spreadsheets to go to the said web page, look for tables on that page, and fetch data in the second table. Once I hit the enter key, in a few moments, the blank spreadsheet is populated with data as shown in the screenshot below:

U.S. population

A few notes about this step:

    To scrape data from any other page, copy that web address and paste it within the quotation marks.

    For that number “2” in the formula – a web page may contain several tables; some are visible, some invisible. If you don’t know how to read HTML source code and pinpoint the exact table you want to target, then start with 1 and work your way up until the spreadsheet returns the data you want.

    You cannot edit this spreadsheet as it is dynamically linked to the source on that Wiki page. If you delete a column or cell, the spreadsheet will refresh itself with the deleted contents back in place.

    If you do need to edit the data, for instance to create a customized chart of selected columns/rows, you need to make a copy of the spreadsheet. Remember, when you copy/paste to a new spreadsheet, choose edit>paste special>paste values only, so that the new spreadsheet is not linked to the source.

Step 2: Visualize data

Google Spreadsheets provides various visualization tools that we can choose and customize to suit our needs.

For instance, I want to visualize and compare the 2012 estimated population of the top-10 states. To do that, I need to specify the range of cells for the visualization; in this case, it is columns C and D, from row 1 down to row 11 (we need to include the header in row 1).

This range is thus C1:D11, which is a standard way to specify spreadsheet ranges.

(Note: read a Google Spreadsheets tutorial for working with cell ranges and other spreadsheet features.)

On the spreadsheet menu bar, click on the “Insert chart” button (third from right); in the Chart Editor pop-up window, there are three tabs where we can customize the chart:

    Under the Start tab, change the data range to Sheet1!C1:D11; leave other options as is, and make sure “use row 1 as headers” is checked. Based on the data type, the spreadsheet has a few recommended charts; I selected the bar chart.

    Click the Chart tab and explore other types of charts; charts not compatible with the current data set are not available (grayed out).

    Click the Customize tab; here, among others, give the chart a title, and scroll down to change the default names of the vertical axis and the horizontal axis.

    Click Insert and the chart will be shown on top of the spreadsheet; click the upper-right corner of the chart window, open the drop-down menu and choose “move to own sheet.” The chart will be shown in a separate sheet with a default name of “Chart 1.”

Step 3: Share chart visualizations

By default, the spreadsheet is private, meaning it is only visible to its creator. For others to view our chart, we need to change the sharing settings: click on the blue “Share” button in the upper-right corner; in the pop-up window, under “who has access,” click on “change” next to the “private” option, choose either “public on the web” or “anyone with the link,” then click Save and Done.

Click File>Publish to the web; in the pop-up window, under “Get a link to the published data,” there are a few options:

    Click in the first box and you will see a group of options for how you can share the data/chart. For our purpose, we want to embed the interactive chart in this blog post, so I selected “HTML to embed in a page.”

    Click in the “All sheets” box and choose “Chart 1,” which is the one that holds the chart.

In the next box, select and copy the code, paste it into the HTML editor of a blog post or web page, and an interactive chart will be shown like the one below. Notice that in the original code the default dimensions are 500 pixels wide and 300 pixels tall; I changed them to 650×450 for display in this blog post.

It should be noted that in some browsers, an embedded chart may have unsightly scroll bars along the right and the bottom borders. I have not been able to find a fix, but we can always embed a chart image which is automatically generated and doesn’t have scroll bars. All we need to do is to click “publish chart” on the sheet that holds the chart; in the pop-up window, select “image” as the publish format, copy the link and paste it to a blog post or web page.

Source: http://www.mulinblog.com/basic-web-scraping-data-visualization-using-google-spreadsheets/

Data Cleansing Services Launched by WinPure

The new range of services is aimed at companies and individuals who have more complex needs or simply do not have the time to use our award-winning data cleansing software.

The WinPure Data Cleansing Service is available to clean, suppress and enhance consumer or business data. It features pay-as-you-go pricing, so you only pay for what you need. WinPure can check your data against the latest UK Royal Mail PAF file and the National Change of Address (NCOA) database to determine if any of the individuals or companies/organizations in the database have relocated, and will update the database with new addresses as needed. Also included are checks against the Bereavement Register, MPS and TPS.

For names and addresses, the data cleansing service will ensure that your list or database doesn’t contain duplicate records, that all the records in your file are up to date and accurate, and that addresses are maintained in the Royal Mail’s standard format.

Using the data cleansing service also helps businesses and consumers stay compliant with the Data Protection Act by ensuring that the data they hold is current and accurate. Businesses using this service will also help protect the environment and their bottom line by avoiding the waste of sending out duplicate or unwanted mailings.

WinPure can undertake bureau work on either a one-off basis or as a regular service. For more information on the data cleansing services, visit the WinPure website and ensure you maximise your business performance in future with clean, accurate and up-to-date data.

About WinPure

WinPure is a leading worldwide provider of data quality and data cleansing software solutions that are powerful, simple to use, inexpensive and, most importantly, can be used by anyone rather than just IT specialists or data cleansing experts. Businesses around the world are now using WinPure software to help improve the quality of their information, helping them to increase profitability through more accurate data and to reduce costs by eliminating duplications, spelling errors and mistakes. WinPure products are relied upon by thousands of international companies, non-profit and government agencies, educational organizations and individuals in over 40 countries around the world.

Source: http://www.sbwire.com/press-releases/data-cleansing-services-launched-by-winpure-286331.htm

Methods for making lucrative article writing routines

Coffee can be a wonderful help when you work at home and need a change of air. Coffee houses tend to be equipped with Wi-Fi, so you can work while drinking that cup of joe in a different environment. Alternatively, many restaurants offer the same.

Parajumpers Jakke Dame

Can you afford to quit your job and homeschool? Have you drawn up a budget to find out? Draft a budget of your current income and expenses. Now, remove the income of the person who will be staying home. Also, include the cost of supplies, like lesson materials, writing tools, paper, and so on. Can you afford it now?

Canada Goose Jassen

Some women’s gums become very sensitive and bleed when they experience their monthly period or hormonal changes during puberty. If this sounds like your situation, you may be able to resolve the problem by taking an oral contraceptive. Visit the dentist to make sure the bleeding is not the result of gum disease.

Canada Goose Jassen

Of course, there is a stigma that accompanies the word “organic,” but that’s because most people fail to realize that the word organic generally means natural. In truth, growing organically is as natural as you can possibly get. So be sure to use these gardening tips when you’re ready to grow organic food.

Organic Horticulture 101: Everything You Need To Know

Uggs

If you must concentrate on two things in particular, focus on reading and mathematics. These subjects can be hard for many children to learn. They are also the most-used subjects in their day-to-day lives. A strong math and reading foundation will help them throughout the rest of their education and life. Try using neck stretches and good posture for an appealing neck. A healthy neck is as vital as a healthy face. Do not treat the neck area just like the face, because the two age differently. Over time the muscles of the neck shrink, and the thin skin does not respond to peels and lasers in the same way the skin of your face does.

Source: http://www.escortlyfe.com/showthread.php?tid=57073

Wednesday 25 December 2013

Tools for Data Scraping and Visualization

Over the last few weeks I co-taught a short course on data scraping and data presentation for incoming students in MIT’s Comparative Media Studies program.  It was a pleasure to get a chance to teach with Ethan Zuckerman (my boss) and interact with the creative group of students! You can peruse the syllabus outline if you like.

In my Data Therapy work I don’t usually introduce tools, because there are loads of YouTube tutorials and written tutorials.  However, while co-teaching a short-course for incoming students in the Comparative Media Studies program here at MIT, I led two short “lab” sessions on tools for data scraping, interrogation, and visualization.

There are a myriad of tools that support these efforts, so I was forced to pick just a handful to introduce to these students.  I wanted to share the short lists of tools I chose.

Data Scraping:

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  There are constantly new tools being built, but I recommend these:

   1.Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
  
   2.Import.io: Still nascent, but this is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s very early, and buggy, but on many simple webpages it works well!
  
   3.Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  It lets you define a pattern and find it in any large document.  Sure the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
  
   4.jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing.  From there you can use a basic understanding of JavaScript and the JavaScript console (in most browsers) to pull parts of a webpage into an array (see the sketch after this list).
  
   5.ScraperWiki: There are a few things this makes really easy – getting recent tweets, getting twitter followers, and a few others.  Otherwise this is a good engine for software coding.
  
   6.Software Development: If you are a coder, and the website you need to scrape has javascript and logins and such, then you might need to go this route (ugh).  If so, here’s a functioning example of a scraper built in Python (with Beautiful Soup and Mechanize).  I would use Watir if you want to do this in Ruby.
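
To make item 4 concrete, here is a minimal sketch of the kind of snippet you might run in the JavaScript console once the jQuery bookmarklet has loaded; the selector and class name are assumptions and would need to match the page you are scraping:

var titles = $('h2.entry-title').map(function() {
  // collect the text of each matching element into a plain JavaScript array
  return $(this).text();
}).get();
console.log(titles);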

Data Interrogation and Visualization:

There are even more tools that help you here.  I picked a handful of single-purpose tools, and some generic ones to share.

   1.Tabula: There are a few PDF-cleaning tools, but this one has worked particularly well for me.  If your data is in a PDF, and selectable, then I recommend this! (disclosure: the Knight Foundation funds much of my paycheck, and contributed to Tabula’s development as well)

   2.OpenRefine: This data cleaning tool lets you do things like cluster rows in your data that are spelled similarly, look for correlations at a high level, and more!  The School of Data has written well about this – read their OpenRefine handbook.

   3.Wordle: As maligned as word clouds have been, I still believe in their role as a proxy for deep text analysis.  They give a nice visual representation of how frequently words appear in quotes, writing, etc.

   4.Quartz ChartBuilder: If you need to make clean and simple charts, this is the tool for you. Much nicer than the output of Excel.

   5.TimelineJS: Need an online timeline?  This is an awesome tool. Disclosure: another Knight-funded project.

   6.Google Fusion Tables: This tool has empowered loads of folks to create maps online.  I’m not a big user, but lots of folks recommend it to me.

   7.TileMill: Google maps isn’t the only way to make a map.  TileMill lets you create beautiful interactive maps that fit your needs. Disclosure: another Knight-funded project.

   8.Tableau Public: Tableau is a much nicer way to explore your data than Excel pivot tables.  You can drag and drop columns onto a grid and it suggests visualizations that might be revealing in your attempts to find stories.

I hope those are helpful in your data scraping and story-finding adventures!

Source: http://datatherapy.wordpress.com/2013/10/24/tools-for-data-scraping-and-presentation/

Tuesday 17 December 2013

Role of web scraper in extraction of data

Web harvesting, also referred to as web scraping or web data extraction, is an approach employed to extract large amounts of data from a website. Data from third-party sites on the web can usually be viewed only with a web browser. Examples are data listings on real estate websites, yellow pages directories, industrial inventory sites, social networks, shopping sites and many more. Most websites do not offer the functionality to save a copy of the data they display to your local storage. The only alternative then is to manually copy and paste the data into a local file on your computer, which is a tricky and tedious job that can take many hours to complete. Web scraping is the method of automating this procedure, so that instead of manually copying the data from a website, a web scraper performs the same task in a fraction of the time.

A web scraper is software, or a scraping tool, used to extract data from websites in an easy and hassle-free manner. Web scrapers are programs capable of collecting information from the Internet. They can go online, access the content of a website, and then pull out the data points and place them in a structured database or spreadsheet. Many services and companies use this software to scrape the web, for example to carry out online research, track changes to web content and compare prices. A web scraper interacts with websites in the same way as your web browser, but instead of displaying the data served by the website on screen, it saves the desired data from the web page to a local database or file.

A web scraper works in a similar manner to web indexing performed by a web robot, which is the method employed by most search engines. This software is very user friendly, as the main aim of the tool is to make the process of web data extraction easier. If you wish to use such a tool and want to buy one, there are various websites that offer this software to individuals who want to extract data from the internet. So what are you waiting for? Simply go online and search for the most reputable and trusted provider for your needs.

Source: http://justarticlessite.com/role-of-web-scraper-in-extraction-of-data.html

Monday 16 December 2013

Product Feed Integration and Scraping Products From Supplier Web Sites

This is an old post. The information it contains is probably out of date or inaccurate.

This is a post that was written a long time ago and is only being kept here for posterity. You should probably look up more recent blog posts related to the subject you are researching.

One of the big tasks that any ecommerce retail business must undertake is the continual updating and inserting of products into the catalogue. Done one by one, this task can take a ridiculous amount of time. In some instances there is no better option, but in the vast majority of cases there is!

Product Feed

The ideal scenario is that your supplier makes available an up-to-date product feed which is regularly refreshed and contains all of the information you need to insert those products into your catalogue. The challenge is that it is highly unlikely you will literally be able to upload this data as is. The reason is that each ecommerce system has its own quirks and separate ways of doing things. Before you can upload this data into your catalogue, it is highly likely that it will need to be altered and prepared for insertion.

You could do the updating by hand – but that brings us back to our first point. Doing things by hand can take a ridiculously large amount of time. Instead, we recommend that you have a script which does all this preparation for you.

In fact this task is something that Edmonds Commerce specialises in, not least because it is something we have done plenty of, so we have a good understanding of how to do the job. Furthermore, we understand how to do the job well.

Spidering and Scraping Products from Supplier Web Site

If your supplier does not provide a feed, or if the feed they supply does not have all of the information that you want, you might think you are stuck. You are not!

It is perfectly possible to build a system which will visit every product on your supplier’s web site, grab all of the information and pictures, and then save them into a format that you can insert into your catalogue system. It is even possible to extend the scraping system so that it goes all the way and inserts the products into your site for you.

Again this is something that Edmonds Commerce specialises in.

Conclusion

If you find that you or your staff are spending large amounts of time manually copying and pasting information from supplier web sites – you need to ask yourself if that is really cost effective. Whilst developing a script to process a feed or scrape a supplier web site might involve a significant initial outlay – the humongous saving in staff time ensures that you will quickly recoup this cost and then will be straight into a profitable scenario. Furthermore your catalogue will be absolutely up to date with the latest pictures, information and prices meaning that you have the best chance to sell those products.

If you want to discuss how Edmonds Commerce could help you achieve these great goals of cost reduction and a totally up-to-date catalogue – please do get in touch.

Source: http://edmondscommerce.github.io/spidering/ecommerce/product%20catalogue/product%20feed/scraping/product-feed-integration-and-scraping-products-from-supplier-web-sites.html

Sunday 15 December 2013

Web Page Change Tracking

Often, you want to detect changes in some eBay offerings or get notified of the latest items of interest from craigslist in your area. Or, you want to monitor updates on a website (your competitor’s, for example) where no RSS feed is available. How would you do it, by visiting it over and over again? No, there are now handy tools for website change monitoring. We’ve evaluated some tools and would like to recommend the most useful ones that will make your monitoring job easy. These tools nicely complement web scraping software, services and plugins.

General categories

Website tracking utilities may be put into 3 categories:

    Browser plugin/add-on
    Service
    Application

When using a browser plugin/add-on or a desktop application to detect changes, the pages are tracked only while your computer is on. Email/RSS notifications from a monitoring service work well if you are out of the office and want to receive changes (often with a delay, as with Page2RSS) on your mobile device. Applications and plugins are the most powerful tools since they’re likely to report with minimum delay; they inform you of the changes right away. Being able to compare changes in the same window is a convenient feature, although most of the time you will need only the new values without the previous ones. Page Monitor excels in that aspect, posting both the new and previous values side-by-side on the screen.

Page Monitor

Google Chrome extension | Free | Easy to use | Popups | Page area selector | Target word check

Page Monitor is a Google Chrome extension that works perfectly for detecting changes in any number of websites. It uses a smart comparison system that ignores ads and code changes. The tracking interval is customizable for each page, starting from 5 sec. It sends a popup to the system tray with a link to the website as soon as it changes. Notifications may also be set to sound an alarm. You may select an area of a website to monitor, check whether a given word (or words) appears on a page, and also apply a regex. Both old (in red) and new (in green) data are highlighted:

Update Scanner

Firefox add-on | Easy to use | Popups | Free | Filter

This free monitoring tool is a Firefox browser add-on. The user may select how often each site will be scanned. New content that has been added to a page is highlighted. The sound alert may be turned on or off. Autoscan time is adjustable, starting from 5 min. The change threshold, including whether changes in numbers should be ignored, may be selected, depending on your needs. But, compared to previous tools, this add-on doesn’t show what has been removed from a website.

Page2rss

Service | Free | Easy to use | RSS

Page2RSS is a service that helps you monitor websites that don’t publish feeds. It’s also available as a Google Chrome extension that allows you to quickly add a browsed website’s changes as an RSS feed when no RSS button is present on the site. This RSS feed service incorporates the changes into the feed, relieving you of the work of revisiting a page of interest to see whether it has been updated.

InfoMinder

Service | Paid | Usability | Email alerts | Filter

This service is easy to set up and works smoothly. I like it because the alert email contains a link to the changed page with the changes highlighted, as well as the initial text, so it’s immediately clear what has been changed. The free trial plan tracks up to 10 pages free of charge for 30 days. InfoMinder offers extensions for browsers (IE or FF) to make it easy to track any website while browsing. It’s possible to filter which changes to detect. After the trial period, you may upgrade to the Pro or Premium plans to increase your website quota and monitoring frequency (checks as often as every 6 hours for each page), with prices starting from $30/year per 100 pages. See the webpage snippet with changes highlighted in red text on a yellow background:

The free analogue to this service is ChangeDetection. With a simple UI, it allows monitoring on a daily basis with email notifications. The drawback is that it tracks and then logs only text updates, which is not convenient to check later.

Femtoo

Service | Free & Paid | Email alerts | Filter | Page area selection | Target word check

This service allows you to have up to 10 “trackers” (only one of which has a 24-hour check interval) and a maximum of 5000 characters tracked for changes for free. Alerts can be delivered by email, instant messages, text messages and a personal tracker RSS feed. Femtoo provides advanced content filters, which allow you to track only a specific element of a website (text or numeric), if desired. The service limits a free account user to only 30 checks per month. There are a number of plans that allow more trackers, more frequent checks (up to every 30 minutes) and additional options. This service (in paid plans only) allows you to share a tracker with other subscribers (for example, family or colleagues). In the image, only the upper tracker is on; the other 4 are paused, due to use of the free plan.

Website Watcher

Desktop application | Paid | Filter | Updates archiving | Usability | Two-column compare | Auto login | Binary files | Local files

This professional, fully-featured monitoring application allows website change tracking, including comparison of the new and old pages side-by-side. It can notify you when a monitored website has been updated and when chosen expressions or text are detected on a web page. Website Watcher checks websites automatically with the AutoWatch feature, or you may manually invoke a check for updates. Filters can be created automatically with the Auto-Filter system, or manually. Local files may also be placed under its oversight, with changes highlighted. The Basic edition costs about €29.95 (approximately US$40), with Personal and Business editions priced up to about €10,000. See the versions compared in two columns in the application’s inbuilt browser:

Its free analogue is NotiPage, a Windows utility for monitoring websites with basic monitoring features that warns you with a visual and audible alert. The difference between NotiPage and Website Watcher is that NotiPage has fewer features and, specifically, it doesn’t show you what has been removed from a page.

Source: http://scraping.pro/web-page-change-tracking/

Fourth Workshop on Data Extraction and Object Search

The Fourth International Workshop on “Data Extraction and Object Search” (DEOS 2013) will take place as a satellite event of WWW 2014 in Seoul, South Korea, on April 7th, 2014. Web data extraction is witnessing a renaissance. In an increasing number of applications such as price intelligence or predictive analytics, the value of data-driven approaches has been conclusively proven. However, the necessary data is often available only as HTML, e.g., in form of online shops of competitors that can serve as sources for pricing and offer data. DEOS is a regular forum for researchers and practitioners in data extraction and object search, to present and discuss ongoing work on data extraction and object search for products, events, reviews, and other types of structured data on the web.

This year’s DEOS focuses on the challenges in scaling data extraction to the variety and volume of different data sources available only as HTML on the web. Classical data extraction has been largely site-specific, requiring some manual supervision for every site. Where data is to be sourced from more than a handful of websites, this approach fails. To address this challenge, we are witnessing a paradigm shift in data extraction away from manual supervision by experts.

This shift has seen two primary directions emerge: Some approaches have considered how to allow non-experts to provide the necessary per-site supervision and turned to crowdsourcing. Some approaches employ automatic entity extraction to replace human annotation of data to be extracted and techniques to deal with the noise in such automatic annotations. Either direction poses major challenges and changes to existing data extraction technology. In this workshop, we bring together researchers from both directions.

Source: http://diadem.cs.ox.ac.uk/deos14/

Friday 13 December 2013

Are Louisiana Tech Startups Vulnerable to Scraping?

New tech startups in Louisiana often forget one of the most basic necessities: keeping their website safe from the hands of content scrapers, and they often end up losing their hard-earned industry databases. Content duplication is a big industry, estimated to be worth around $1 billion. A narrow mindset may hamper the growth of tech startups in Louisiana if proper steps are not taken to safeguard their information and databases from the hands of scrapers.

The World of Scraping

Getting a large sampling of information from the internet can be a long process. However, web scraping software dramatically reduces data collection time. It has been said that ‘information is money.’ Apparently that saying must still hold some meaning, as websites and internet companies are facing an ongoing battle to protect their online information. Even those sites that are protected by firewalls and other anti-theft devices are still sometimes susceptible to having their data scraped.

Web Scraping Defined

Over the past decades, numerous new programs have arrived on the market that aid in the collection of data from websites. It used to be the job of a team of individuals who spent day after day manually extracting information from websites. But now, new programs are able to gather that same information in a matter of hours or even minutes. The process is called web scraping, and it is a highly debated topic in today’s cyber community.

An example of web scraping is when a company has an online shopping site that has become targeted by a web scraping program. The software can enter the shopping web site and copy all of the item names, descriptions, prices, and shipping details. This information can then be used by a competitor to create an online catalogue with identical products but at slightly reduced prices. This puts the first company at a distinct disadvantage because they now have to examine all of their web content to adjust for the new competition. In essence, they need to web scrape the new competitor’s site to determine how much of a price reduction they need to make in order to stay ahead of their new competition.

Legal or Illegal?

At the heart of the controversy surrounding information retrieval through web scraping is whether or not the activity is legal. The courts have been back and forth on the issue. Some argue that the information is the private property of the company who runs the website, while others say that if it is available on the net, it’s free for the taking. Yet others say that it’s okay to take the information, but it’s what you do with it afterward that makes it legal or illegal. At some point, the courts are going to need to come up with a definitive response to web scraping. Until then, many companies feel a need to check for web scraping activity on their internet sites so that can determine whether or not their information has been compromised by a web scraping program.

Web Site Protection

Companies that want to protect their data from web collection ‘bots’ do have some alternatives which may help them. Web scrapers may have some sophisticated qualities about them, but their ability to solve certain problems is usually outside of their programming. Web designers can add certain code to their web pages that will stop or confound web scraping programs. So if you want to prevent your website content from being stolen, the best way is to use an anti-scraping service like ScrapeSentry; beyond that, here are some possible solutions for stopping web scraping software.

    User registration. By requiring all users of your site to become registered members, you are limiting the ease with which bots can access your information. It is possible for individuals to get access prior to launching web scraping software, but it does at least slow the process.

    Captcha codes. If the information on your web site is protected behind a captcha code device, bots and various types of web scraping software are prevented from entering the system. Captcha codes can be placed in several locations throughout the website which will stall their efforts to collect data.

    Javascript. By including Javascript in your programming, you can effectively block most web scrapers. Web scraper software that requires the ability to read Javascript throughout the website is very difficult to write. Including even a few Javascript actions at the beginning of your web site can cripple web scraping software.

    Monitor download rates. When individuals visit websites, they are only able to download a certain amount of information at a time. Web scrapers, however, can gather a vast array of data and download it very quickly. By installing monitoring systems that measure download rates, you can identify users who are employing web scrapers and block them from your site.

Web scraping is one of those activities that will probably become completely illegal in the near future. That won’t stop some people from creating and employing web scraping programs. Organizations in off-shore locations operate outside the law and can be employed to run web scraping services anywhere in the world. Web site owners will still need to install blocking software and monitor visitors. As new web scraping software is developed, new anti-scraping software will need to follow right behind. It is important that you stay abreast of these issues to protect you business and the information contained on your web site.

Source: http://siliconbayounews.com/2013/09/24/are-louisiana-tech-startups-vulnerable-to-scraping/

How To Build Agile SEO Tools Using Google Spreadsheets

In the past few years innovations on the web have made it incredibly easy for regular people like you and I to enter the world of coding. For example, right at the end of 2010 I started dabbling with Google Appengine and shipped a fully functional interactive site in 4 weeks. (Read how I built 7books in 4 weeks)

Of course, advances in technology have also made it easier for the pros to build awesome applications. Just look at SEOmoz’s Open Site Explorer which relies on Amazon Web Services.

So as SEOs we have a huge arsenal of tools that we can call upon for various different functions. A lot of these tools, services and platforms however either require learning a large amount of code or take a long time to build something bespoke. So in this post I’m going to talk about using Google Spreadsheets to build small, agile tools which can be built to match your exact needs.

Agile vs Scaleable

Before I dive into the technical details, a quick word on what I use Google Docs for. In my SEO-ninja toolset Google Docs are used for quick, agile tools. That means that if there’s a specific problem I need to overcome or some weird thing I’m testing I always turn to Google Docs first. That’s because I can build things quickly. They aren’t always robust, but if I’m only building a tool to solve a unique problem (as opposed to a problem I encounter all the time) then speed is of the essence. I don’t want to have to spend a lot of time building a tool I’m only going to use once. Or running a test that turns out to not give me the expected results. If you want to build scaleable tools then I suggest you leave it to the pros (though Appengine is a great place to start with building “real” tools).

Let’s start with the complete beginner

Ok, so you might be scared. I’m going to talk about writing functions and building tools. You’re going to get your hands dirty. But literally anyone can do this. You need no prior knowledge. None. This post should take the complete beginner to ninja in 5 easy steps. The steps I’m going to cover:

    Simple Web Scraping
    Advanced Web Scraping
    Google Docs Scripts - The Secret Sauce
    Script Triggers
    Putting It All Together

Lesson 1 - Simple Web Scraping

Ok, so the bedrock of using Google Spreadsheets for fun and profit is their nifty little function called ImportXML. Richard Baxter wrote a great intro post on ImportXML which you can check out. The basic premise is that using a function like this:

=importxml("https://www.distilled.net/blog/", "//h2[@class='entry-title']")

(Note: this importxml function has been modified from the original to handle the move from distilled.co.uk to distilled.net)

Gets us a list of blog post titles into a Google Spreadsheet like this:

Try it for yourself! Copy and paste that code into a blank Google Spreadsheet and see what happens :)

Don’t get scared! There’s lots of things you probably don’t understand so let’s walk through them for you.

A standard function looks like this: =importxml("url", "query"). So the URL can be explicit (like I typed above) or a cell reference like this: =importxml(A1, "query"), just like you would use in a regular spreadsheet function. The query is an XPATH query. For a tutorial reference on XPATH here’s a good guide.

If you can’t be bothered reading that then here’s a few quick definitions (warning! hand-wavey!)

    // - this means select all elements of the type
    //h3 - this means select all h3 elements
    [@class=''] - this means only select those elements that meet the criteria given
    //h3[@class='storytitle'] - this means only select elements that look like: <h3 class="storytitle">Title</h3>

Walkthrough Example for Simple Web Scraping

So, now we’re getting to grips with the code let’s step through a practical example. A common SEO task is “how can I find as many blogs on niche X as possible”. So I google around and find a list of the top 25 blogs on Technorati: http://technorati.com/blogs/top100/. It’s manual and time consuming having to click on each one to copy the link. I want to get the list of URLs into a spreadsheet as quick as possible.

1) First we take a look at the source code of the page and we see something like this:


2) We load up a Google Doc and fire up the importxml function. We can see that all the blogs are in h3 elements within a list with class of "even" so let’s try something like

=importxml(A2, "//li[@class='even']//h3")

(where A2 is the cell with the URL of the page). We get this back:

Check out the sheet here.

3) As you can see, it contains the blog names so we’re getting there. But our query is also getting a whole load of other stuff we don’t want. So let’s look in the code and see if we can isolate the list of blog items. I find the “inspect element” control in Google Chrome excellent for visualising this. As you hover over the code, it highlights the section of the page that applies to it.


4) We refine our guess to limit ourselves to the a tag within the h3 using a query like

=importxml(A2, "//li[@class='even']//h3//a")

Which, loosely translated, says “fetch the anchor text within h3s that appear within the list with class='even'”, which results in:

Get the sheet here.

5) We’re nearly there! We now have a list of all the blog elements. The next step is to pull the URL. We still want the blog names, but we also want the links, so we add in another importxml call:

=importxml(A2, "//li[@class='even']//a[@class='offsite']/@href")

Which says, from the li elements select the href contents from the a element. This /a/@href is a very common thing to tag on the end of importxml functions so I suggest you memorise it. This results in:

And we’re done! If you want to look at the spreadsheet within Google Docs go here and make a copy then you can play around to your heart’s content :)

Lesson 2 - More Advanced Web Scraping

Ok, now we have the basics down let’s move on to some more fun activities. Of course, as soon as I get computers involved my thoughts turn to rank checking... This is a common task that we might want to do so let’s quickly discuss how to do that. Firstly we construct the search URL like this:

=concatenate("http://www.google.co.uk/search?q="&A2&"&pws=0&gl=UK&num=50")

Where the query to search is in cell A2. Then we parse the Google URL using importxml like this:

=importxml(B2, "//h3[@class='r']/a/@href")

I’m not going to break that down, hopefully you can figure out what I’m getting off the page. Again, check the source code for the page if you’re not sure what to write in your importxml function. Output like this:

As before, Grab your own copy here.

You’ll notice that the results returned are less than pretty, and it’s just because this is how Google structures their HTML. We need to turn this: /url?q=http://www.wickedweb.co.uk/search-engine-optimisation/&sa=U&ei=eyIpUtfyHKLi2wWplYHgAg&ved=0CC4QFjAB&usg=AFQjCNHy63s0tmfxC5njtJ8Yj7v-VHz9yA into http://www.wickedweb.co.uk/. There’s probably 10 different ways of doing this, but here’s what I’m using:

=arrayformula(mid(C2:C51,search("q=",C2:C51,1)+2,search("&sa",C2:C51,6)-8))

Using this formula, I can extract exactly what I need from all the returned results; hopefully you can pick out what I’m doing!

Lastly, we want to find out where distilled.net ranks for “seo agencies london” so we’ll use this formula:

=ArrayFormula(MATCH(1, FIND("https://www.distilled.net",D2:D51),0))

I was going to add some explanation here as to what this formula does but actually it gets pretty complicated. Either you already know what an arrayforumla does (in which case it should be straightforward) or you don’t. In which case you probably just want to copy and paste for now :)

I should note at this stage that there is a limit of 50 importxml calls per spreadsheet, which keeps us from building a full web crawler, but for most agile tools this is sufficient (especially when combined with scripts, see lesson 3).

Lesson 3 - Google Docs Scripts - The Secret Sauce

Now, all this is very well - we have functions which pull in data but it’s all a little “flat” if you know what I mean. Let’s try and jazz things up a little by making it MOVE. For anyone familiar with macros in Excel, scripts function in a very similar way. Two big advantages here however are the ability to crawl URLs and also the ability to email you. Nice.

Google Scripts are very powerful and essentially allow you to build fully featured programs so I’m not going to go into massive detail here. There are great tutorials from Google already for example:

    Your First Script - a walkthrough of how to use the service
    Sending emails from a spreadsheet - does what it says on the tin

You can easily lose days of your life browsing through and playing with all the things that Google Scripts do. Here, I’m going to present a simple example to show you how agile this is. That’s the key here, building tools that fit your exact needs quickly and easily. Let’s imagine I want to quickly check a bunch of URLs for their tweet count to produce something like this:

Check out the sheet here.

What’s happening here is that I have a list of URLs that I want to check tweet counts for. I’ve created my own function which takes one parameter: =twitter(URL) where URL is the reference to the cell with the link I want to check. Here’s the code:

function twitter(url) {
  // fetch the tweet count for the given URL from Twitter's URL count endpoint
  var jsondata = UrlFetchApp.fetch("http://urls.api.twitter.com/1/urls/count.json?url="+url);
  // parse the JSON response and return just the count field
  var object = Utilities.jsonParse(jsondata.getContentText());
  return object.count;
}

Once you’ve read through the Google Scripts tutorials above you should be fairly comfortable with how this works so I’m not going to step through it in detail. The parsing XML tutorial will likely come in handy.

Lesson 4 - Google Scripts Triggers

Ok, now for the magic. Google scripts are nice, but the real power comes from triggering these scripts in different situations. You can cause a script to trigger on any of the following:

    The spreadsheet is opened
    A form is submitted
    A button is pressed
    A specific time happens

Read more about script triggers here.

The most useful here is the time-based trigger I think. Let’s take a quick look at writing a time-based script.

Walkthrough example of time-based trigger

Let’s again take a simple example. As I’m writing this post I know that I’m going to put it live soon, so let’s build a spreadsheet to check the keyword “seo tools” and see if perhaps QDF will push this post onto the first page at any point. How viciously referential :)

Step 1 - we write a simple spreadsheet to rank check distilled.co.uk against a particular keyword.

Step 2 - we write a script that tracks the rank and logs it in a new cell:
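
The script itself appears only as a screenshot in the original post. As a minimal sketch (not the author’s actual code), assuming the rank computed by the spreadsheet formulas sits in cell E2 of Sheet1 and the log goes into columns G and H, it might look something like this:

function logRank() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Sheet1");
  // read the current rank calculated by the spreadsheet formulas (cell reference is an assumption)
  var rank = sheet.getRange("E2").getValue();
  // append a timestamped entry below the last used row, using getRange and setValue
  var row = sheet.getLastRow() + 1;
  sheet.getRange(row, 7).setValue(new Date()); // column G: timestamp
  sheet.getRange(row, 8).setValue(rank);       // column H: rank
}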

Step 3 - we create a time-based trigger to run the script every 30mins:
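
In the original this is done through the script editor’s triggers dialog (shown as a screenshot). For reference, roughly the same trigger can also be created in code, using the hypothetical logRank function from the sketch above:

// one-off setup: run logRank every 30 minutes (equivalent to the time-based trigger set in the UI)
ScriptApp.newTrigger("logRank").timeBased().everyMinutes(30).create();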

A few things to note:

    I’ve used =int(now()) in the URL to generate a unique URL each time. Otherwise Google caches the data and you won’t get fresh data each time.
    Note the getRange and setValue functions - these are very useful to get your head around. See this tutorial.

The final result (you might have to scroll down for a while depending how long after I wrote this post you’re reading this!):

Grab a copy of the spreadsheet here to see how it works.

Lesson 5 - putting it all together

So, finally let’s put it all together in a fun example. I’ve created a form here where you can enter your city and your email address and my script will fetch some data and email it to you. Just like magic! Go ahead, try it out :)

Taking it further

The sky really is the limit when it comes to Google Scripts but I think that if you start doing any more heavy lifting than what I’ve done in this post you almost certainly want to start building in exception handling and learning to code properly (which I, to be clear, have very much not done!). That said, if you do fancy using Google Scripts there are all kinds of funky things it can do:

    Build a GUI
    Use full oAuth to build a twitter app
    Create & edit new spreadsheets on the fly
    Publish & edit Google sites
    Converting spreadsheets into JSON
    Using Google Docs as a database

But for me, the real power is in hacking together things in a few minutes which gather the data I need so I can get back to getting stuff done. I’ll leave building real SEO tools to the pros for now

Source: https://www.distilled.net/blog/seo/how-to-build-agile-seo-tools-using-google-docs/

Google data scraping for SEO

Whether you agree or not, search engine optimization (SEO) has come a long way since its origin, and it has revolutionized internet search methods. In SEO, a lot depends on planning and strategizing according to the historical data available. Competitor analysis is a significant part of SEO; it helps the analyst determine what is and is not working for their competitors and how to fine-tune their own approach.

Google data scraping, or data scraping, is about extracting data from web pages, i.e. HTML and XML files, to identify the key components that influence site performance. Data scraping focuses on extracting information such as keywords, meta tags, titles, content classifications and competitors’ ad copy.

Data extraction has been made possible by sophisticated scraping tools that can expertly extract information from different websites and web platforms. Data scraping helps you compare your site’s performance against your competitors’ across web platforms and is therefore crucial for SEO analysis. Ranking high in search results is imperative for your website to receive organic traffic, so rank monitoring is an important determinant of your website’s performance.

The type of data scraping tool you need, whether a Google Maps scraper, an Amazon screen scraper or a LinkedIn extractor, is determined by your particular requirements, but you can’t ignore its value in extracting the information that will help improve website performance and put your venture on the right path of growth.

Source: http://scrappingexperts.blogspot.in/2013/10/google-data-scraping-for-seo.html

How to make website scraping easy


We thought we'd also go 'back to basics' and explain how retailers can simplify their data extraction process.

Web scraping is a way of extracting data from websites. Rich data extraction ensures that the most comprehensive product information is extracted from the retailer’s ecommerce site.

This ensures that the data remains accurate and up-to-date and leaves less room for error.

Why is web scraping important?

If retailers want to increase their product visibility and display their product inventory across the various channels, data extraction is essential. There are several ways of extracting site data, but one of the most common is screen scraping.

Screen scraping is carried out by a crawler that is sent onto an ecommerce site to capture specific data. This extracted data is then put together to create a product data feed.

Why are websites scraped?

Scraping makes data extraction much easier for retailers. Most of them have complex CMS systems, so their website is usually the only place where all of their product information comes together.

How can retailers improve their site so it’s easy to scrape?

Use IDs and classes within tags

If a website uses IDs and classes within its page tags, it’s much easier to produce XPaths (XPath is the query language for selecting nodes), which are used to navigate through the HTML.
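
As a rough illustration (the class name and table positions here are hypothetical), a class-based XPath such as

//span[@class='product-price']

is shorter and far less fragile than a purely positional one such as

//table[2]//tr[4]/td[3]/span

because the second breaks as soon as a row or column is added to the page.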

Don’t use tables for structuring

If a site’s structure includes a lot of tables, it becomes more difficult to scrape. This is because there are unlikely to be IDs and classes within the table’s data.

Not only this, but when tables are used, the Xpaths can become much longer and are therefore more likely to break.

Don’t use unnecessary AJAX

AJAX (asynchronous JavaScript and XML) tends to load independently from HTML, meaning that it can be missed in the scraping process.

Although the browser does load the HTML, AJAX-driven content sometimes appears separately afterwards. Though a crawler can be set to wait for AJAX content to load before scraping, any AJAX can still dramatically increase the scraping time.

Avoid using sessions

Unnecessary sessions make it difficult to deep link products and can also make the website difficult to scrape.

This is most common on travel website pages, as search URLs sometimes use sessions, causing them to timeout or expire after a period of time.

Be consistent

Crawlers are programmed to recognise each type of webpage based on its structure; if site pages are inconsistent, the crawler will return invalid results.

So, for example, if the crawler is expecting to find the product price under a particular HTML tag class or id and the client introduces a new product page where the price is located under an unfamiliar HTML tag class or id, the product is likely to be overlooked.

Make sure your website is compliant

All websites should comply with W3C standards, which lay out rules that developers need to adhere to. It’s also best to have well-formed HTML so that XPaths can be created easily.

For example, if an HTML tag is not closed off properly, it can affect the structure of the site.

Keep your website accessible

Even in its simplest form, your website should be compatible with each of the various internet browsers.

So, even if a user has content blockers switched on, the website should still load. This also makes it much easier to scrape the product data.

Source: http://econsultancy.com/in/blog/63375-how-to-make-website-scraping-easy