eBay Product Scraping, Manta Data Scraping, Website Screen Scraping, Website Scraper, Scraping Data from Websites, Website Information Scraping, Web Scraping Services

Monday, 12 December 2016

Data Extraction Services - A Helpful Hand For Large Organization

Data extraction is the process of pulling out and structuring data from unstructured and semi-structured electronic documents, such as those found on the web and in various data warehouses. Data extraction is extremely useful for large organizations that deal with considerable amounts of data every day, data which must be transformed into meaningful information and stored for later use.

Your company may hold tons of data, but it is difficult to manage it and convert it into useful information. Without the right information at the right time, or when working from only partially accurate information, decision makers within a company waste time and make poor strategic decisions. In the highly competitive world of business, essential statistics such as customer information, competitors' operational figures and internal sales figures play a big role in shaping strategic decisions. Data extraction can help you take the strategic business decisions that shape your business goals.

Outsourcing companies provide services custom made to the client's requirements. A few of the areas where data extraction can be used are: generating better sales leads, extracting and harvesting product pricing data, capturing financial data, acquiring real estate data, conducting market research, surveys and analysis, conducting product research and analysis, and duplicating an online database.

The different types of Data Extraction Services:

    Database Extraction:
    Data from multiple databases, such as statistics about competitors' products, pricing and latest offers, and customer opinions and reviews, can be extracted, reorganized and stored as per the company's requirements.

    Web Data Extraction:
    Web data extraction, also known as web scraping, usually refers to the practice of extracting or reading text data from a targeted website.

Businesses have now realized the huge benefits they can get by outsourcing these services, which makes outsourcing a profitable option. Since all projects are custom built to suit the exact needs of the customer, huge savings in terms of time, money and infrastructure are among the many advantages that outsourcing brings.

Advantages of Outsourcing Data Extraction Services:

    Improved technology scalability
    Skilled and qualified technical staff who are proficient in English
    Advanced infrastructure resources
    Quick turnaround time
    Cost-effective prices
    Secure Network systems to ensure data safety
    Increased market coverage

By outsourcing, you can definitely increase your competitive advantage. Outsourcing these services helps businesses manage their data effectively, which in turn enables them to increase profits.

Outsourcing Web Research offers complete data extraction services and solutions to quickly collect data and information from multiple Internet sources for your business needs in a cost-efficient manner. For more info please visit us at: http://www.webscrapingexpert.com/ or send your requirements directly to: info@webscrapingexpert.com

Source:http://ezinearticles.com/?Data-Extraction-Services---A-Helpful-Hand-For-Large-Organization&id=2477589

Tuesday, 6 December 2016

Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping, a method used by computer programs to extract data from the output of another program. To put it simply, it is a process that involves automatically sorting through information found in different resources, including the internet, whether it sits inside an HTML file, a PDF or any other document, and collecting the pertinent pieces. That information is then stored in databases or spreadsheets so that users can retrieve it later.

Most websites today have text that can be read easily in the source code. However, many businesses now choose to use Adobe PDF (Portable Document Format) files instead. This is a type of file that can be viewed simply by using the free Adobe Acrobat software, which is supported on almost any operating system. There are many advantages to using PDF files. Among them is that the document looks exactly the same no matter which computer you view it on, which makes the format ideal for business documents and specification sheets. Of course there are disadvantages as well. One of them is that the text contained in the file is effectively converted into an image, which often causes problems when it comes to copying and pasting.

This is why some people have started scraping information from PDFs. The practice is often called PDF scraping, and it is just like data scraping except that the information you are extracting is contained in PDF files. To begin scraping information from PDFs, you need to choose a tool specifically designed for the job. However, you will find that it is not easy to locate the right tool for effective PDF scraping, because most of the tools available today struggle to obtain exactly the data you want without being customized.

Nevertheless, if you search well enough, you will find the program you are looking for. You do not need programming knowledge to use such tools; you simply specify your preferences and the software does the rest of the work for you. There are also companies you can contact that will perform the task for you, since they already have the right tools. Doing things manually is tedious and complicated, whereas professionals with the right tooling can finish the job in very little time. Scraping information from PDFs is simply a way of collecting information that is already available on the internet, and doing so does not infringe copyright laws.
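As an illustration only (the article does not name any particular tool), here is a minimal Python sketch that uses the open-source pdfminer.six library to pull the raw text out of a PDF so it can be searched or loaded into a spreadsheet later. The file names are placeholders.

# A minimal sketch using the open-source pdfminer.six library (pip install pdfminer.six).
# The file names below are placeholders; point them at your own files.
import csv

from pdfminer.high_level import extract_text

text = extract_text("specification_sheet.pdf")   # full plain text of the PDF

# Save each non-empty line as a row in a CSV file for later searching and filtering.
with open("extracted_text.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for line in text.splitlines():
        if line.strip():
            writer.writerow([line.strip()])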

Source:http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Friday, 2 December 2016

Web Data Extraction Services and Data Collection From Website Pages

For any business, market research and surveys play a crucial role in strategic decision making. Web scraping and data extraction techniques help you find relevant information and data for your business or personal use. Much of the time, professionals manually copy and paste data from web pages or download an entire website, which wastes time and effort.

Instead, consider using web scraping techniques that crawl through thousands of website pages to extract specific information and simultaneously save it into a database, CSV file, XML file or any other custom format for future reference.
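As a rough sketch of the idea (the URL pattern and CSS selectors below are placeholders, not a real target site), a crawler of this kind might look something like this in Python:

# A hedged sketch: crawl a paginated listing and save rows to a CSV file.
# The URL pattern and CSS selectors are placeholders for illustration only.
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://example.com/products?page={}"   # hypothetical paginated listing

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    for page in range(1, 6):                        # first five result pages
        html = requests.get(BASE_URL.format(page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.select(".product"):        # hypothetical item container
            name_el = item.select_one(".name")
            price_el = item.select_one(".price")
            if name_el and price_el:
                writer.writerow([name_el.get_text(strip=True),
                                 price_el.get_text(strip=True)])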

Examples of web data extraction process include:
• Spider a government portal, extracting names of citizens for a survey
• Crawl competitor websites for product pricing and feature data
• Use web scraping to download images from a stock photography site for website design

Automated Data Collection
Web scraping also allows you to monitor changes in website data over a stipulated period and to collect that data automatically on a schedule. Automated data collection helps you discover market trends, determine user behaviour and predict how the data will change in the near future. A minimal sketch of such a scheduled collector follows the examples below.

Examples of automated data collection include:
• Monitor price information for selected stocks on an hourly basis
• Collect mortgage rates from various financial firms on a daily basis
• Check weather reports on a constant basis, as and when required
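As a hedged illustration (the URL is a placeholder and fetch_value() is a stand-in for whatever page-parsing logic you actually need), a simple scheduled collector can be little more than a timed loop that appends timestamped rows to a CSV file:

# A hedged sketch of scheduled data collection. fetch_value() is a placeholder for
# your own scraping logic; here it just records the page length as a dummy metric.
import csv
import time
from datetime import datetime

import requests

URL = "http://example.com/rates"        # hypothetical page to monitor
INTERVAL_SECONDS = 60 * 60              # collect once an hour

def fetch_value():
    response = requests.get(URL, timeout=30)
    return len(response.text)           # replace with real parsing of the page

while True:
    with open("history.csv", "a", newline="") as f:
        csv.writer(f).writerow([datetime.utcnow().isoformat(), fetch_value()])
    time.sleep(INTERVAL_SECONDS)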

Using web data extraction services you can mine any data related to your business objective and download it into a spreadsheet so that it can be analysed and compared with ease.

In this way you get accurate results more quickly, saving hundreds of man-hours and a great deal of money!

With web data extraction services you can easily fetch product pricing information, sales leads, mailing databases, competitor data, profile data and much more on a consistent basis.

Source: http://ezinearticles.com/?Web-Data-Extraction-Services-and-Data-Collection-Form-Website-Pages&id=4860417

Tuesday, 29 November 2016

Get Started With Scraping – Extracting Simple Tables from PDF Documents

As anyone who has tried working with “real world” data releases will know, sometimes the only place you can find a particular dataset is as a table locked up in a PDF document, whether embedded in the flow of a document, included as an appendix, or representing a printout from a spreadsheet. Sometimes it can be possible to copy and paste the data out of the table by hand, although for multi-page documents this can be something of a chore. At other times, copy-and-pasting may result in something of a jumbled mess. Whilst there are several applications available that claim to offer reliable table extraction services (some free software, some open source software, some commercial software), it can be instructive to “View Source” on the PDF document itself to see what might be involved in scraping data from it.

In this post, we’ll look at a simple PDF document to get a feel for what’s involved with scraping a well-behaved table from it. Whilst this won’t turn you into a virtuoso scraper of PDFs, it should give you a few hints about how to get started. If you don’t count yourself as a programmer, it may be worth reading through this tutorial anyway! If nothing else, it may give you a feel for the sorts of thing that are possible when it comes to extracting data from a PDF document.

The computer language I’ll be using to scrape the documents is the Python programming language. If you don’t class yourself as a programmer, don’t worry – you can go a long way copying and pasting other people’s code and then just changing some of the decipherable numbers and letters!

So let’s begin, with a look at a PDF I came across during the recent School of Data data expedition on mapping the garment factories. Much of the source data used in that expedition came via a set of PDF documents detailing the supplier lists of various garment retailers. The image I’ve grabbed below shows one such list, from Varner-Gruppen.

If we look at the table (and looking at the PDF can be a good place to start!) we see that the table is a regular one, with a set of columns separated by white space, and rows that for the majority of cases occupy just a single line.

I’m not sure what the “proper” way of scraping the tabular data from this document is, but here’s the sort of approach I’ve arrived at, from a combination of copying things I’ve seen and a bit of my own problem solving.

The environment I’ll use to write the scraper is Scraperwiki. Scraperwiki is undergoing something of a relaunch at the moment, so the screenshots may differ a little from what’s there now, but the code should be the same once you get started. To be able to copy – and save – your own scrapers, you’ll need an account; but it’s free, for the moment (though there is likely to soon be a limit on the number of free scrapers you can run…) so there’s no reason not to…;-)

Once you create a new scraper:

you’ll be presented with an editor window, where you can write your scraper code (don’t panic!), along with a status area at the bottom of the screen. This area is used to display log messages when you run your scraper, as well as updates about the pages you’re hoping to scrape that you’ve loaded into the scraper from elsewhere on the web, and details of any data you have popped into the small SQLite database that is associated with the scraper (really, DON’T PANIC!…)

Give your scraper a name, and save it…

To start with, we need to load a couple of programme libraries into the scraper. These libraries provide programming tools that do a lot of the heavy lifting for us, and hide much of the nastiness of working with the raw PDF document data.

import scraperwiki
import urllib2, lxml.etree

No, I don’t really know everything these libraries can do either, although I do know where to find the documentation for them… lxml.etree, scraperwiki! (You can also download and run the scraperwiki library in your own Python programmes outside of scraperwiki.com.)

To load the target PDF document into the scraper, we need to tell the scraper where to find it. In this case, the web address/URL of the document is http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf, so that’s exactly what we’ll use:

url = 'http://cdn.varner.eu/cdn-1ce36b6442a6146/Global/Varner/CSR/Downloads_CSR/Fabrikklister_VarnerGruppen_2013.pdf'

The following three lines will load the file into the scraper, “parse” the data into an XML document format which represents the whole PDF in a way that resembles an HTML page (sort of), and then provide us with a handle on the “root” of that document.

pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.etree.fromstring(xmldata)

If you run this bit of code, you’ll see the PDF document gets loaded in:

Here’s how we can preview what some of the XML from the PDF we’ve just loaded looks like:

print lxml.etree.tostring(root, pretty_print=True)

We can see how many pages there are in the document using the following command:

pages = list(root)
print "There are",len(pages),"pages"

The scraperwiki.pdftoxml function I’m using converts each line of the PDF document to a separate grouped element. We can iterate through each page, and each element within each page, using the following nested loop:

for page in pages:
  for el in page:
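    # ...do something with each element el here (the body is filled in below)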

We can take a peek inside the elements using the following print statement within that nested loop:

if el.tag == "text":
  print el.text, el.attrib

Here’s the sort of thing we see from one of the table pages (the actual document has a cover page followed by several tabulated data pages):

Bangladesh {'font': '3', 'width': '62', 'top': '289', 'height': '17', 'left': '73'}
Cutting Edge {'font': '3', 'width': '71', 'top': '289', 'height': '17', 'left': '160'}
1612, South Salna, Salna Bazar {'font': '3', 'width': '165', 'top': '289', 'height': '17', 'left': '425'}
Gazipur {'font': '3', 'width': '44', 'top': '289', 'height': '17', 'left': '907'}
Dhaka Division {'font': '3', 'width': '85', 'top': '289', 'height': '17', 'left': '1059'}
Bangladesh {'font': '3', 'width': '62', 'top': '311', 'height': '17', 'left': '73'}

Looking again at the output from each row of the table, we see that there are regular position indicators, particularly the “top” and “left” values, which correspond to the co-ordinates of where the registration point of each block of text should be placed on the page.

If we imagine the PDF table marked up as shown below, we can overlay some of the co-ordinate values – the blue lines correspond to co-ordinates extracted from the document:

imaginary table lines

We can now construct a small default reasoning hierarchy that describes the contents of each row based on the horizontal (“x-axis”, or “left” co-ordinate) value. For convenience, we pick values that offer a clear separation between the x-co-ordinates defined in the document. In the diagram above, the red lines mark the threshold values I have used to distinguish one column from another:

if int(el.attrib['left']) < 100: print 'Country:', el.text,
elif int(el.attrib['left']) < 250: print 'Factory name:', el.text,
elif int(el.attrib['left']) < 500: print 'Address:', el.text,
elif int(el.attrib['left']) < 1000: print 'City:', el.text,
else:
  print 'Region:', el.text

Take a deep breath and try to follow the logic of it. Hopefully you can see how this works…? The data rows are ordered, stepping through each cell in the table (working left to right) for each table row in turn. The repeated if-else statement tries to find the leftmost column into which a text value might fall, based on the value of its “left” attribute. When we find the value of the rightmost column, we print out the data associated with each column in that row.

We’re now in a position to look at running a proper test scrape, but let’s optimise the code slightly first: we know that the data table starts on the second page of the PDF document, so we can ignore the first page when we loop through the pages. As with many programming languages, Python tends to start counting with a 0; to loop through the second page to the final page in the document, we can use this revised loop statement:

for page in pages[1:]:

Here, pages is a list of N items, which we can index explicitly from pages[0] to pages[N-1]. Python list indexing counts the first item in the list as item zero, so pages[1:] defines the sublist from the second item in the list (which has the index value 1, given that we start counting at zero) to the end of the list.

Rather than just printing out the data, what we really want to do is grab hold of it, a row at a time, and add it to a database.

We can use a simple data structure to model each row in a way that identifies which data element was in which column. We initiate this data element in the first cell of a row, and print it out in the last. Here’s some code to do that:

for page in pages[1:]:
  for el in page:
    if el.tag == "text":
      if int(el.attrib['left']) < 100: data = { 'Country': el.text }
      elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
      elif int(el.attrib['left']) < 500: data['Address'] = el.text
      elif int(el.attrib['left']) < 1000: data['City'] = el.text
      else:
        data['Region'] = el.text
        print data

And here’s the sort of thing we get if we run it:

starting to get structured data

That looks nearly there, doesn’t it, although if you peer closely you may notice that sometimes we catch a header row. There are a couple of ways we might be able to ignore the elements in the first, header row of the table on each page.

    We could keep track of the “top” co-ordinate value and ignore the header line based on the value of this attribute.
    We could take a hacky, lazy way out and explicitly ignore any text value that is one of the column header values.

The first is rather more elegant, and would also allow us to automatically label each column and retain its semantics, rather than explicitly labelling the columns using our own labels. (Can you see how? If we know we are in the header row based on the “top” co-ordinate value, we can associate the column headings with their “left” co-ordinate values – a rough sketch of this idea appears after the code below.) The second approach is a bit more of a blunt instrument, but it does the job…

skiplist=['COUNTRY','FACTORY NAME','ADDRESS','CITY','REGION']
for page in pages[1:]:
  for el in page:
    if el.tag == "text" and el.text not in skiplist:
      if int(el.attrib['left']) < 100: data = { 'Country': el.text }
      elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
      elif int(el.attrib['left']) < 500: data['Address'] = el.text
      elif int(el.attrib['left']) < 1000: data['City'] = el.text
      else:
        data['Region'] = el.text
        print data
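For completeness, here’s a rough sketch (not from the original tutorial) of what the first, co-ordinate based approach might look like. The HEADER_TOP value is a made-up placeholder – you would need to check the actual “top” value of the header row in your own document, and the +10 tolerance is just a guess at a sensible fudge factor:

colmap = {}           # maps a column's "left" co-ordinate to its heading text
HEADER_TOP = '260'    # hypothetical "top" value of the header row on each page

for page in pages[1:]:
  for el in page:
    if el.tag != "text" or not el.text:
      continue
    left = int(el.attrib['left'])
    if el.attrib['top'] == HEADER_TOP:
      colmap[left] = el.text                      # learn the heading and its position
    else:
      # label the cell with the nearest heading at or to the left of it
      candidates = [k for k in colmap if k <= left + 10]
      if candidates:
        print colmap[max(candidates)] + ':', el.text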

At the end of the day, it’s the data we’re after and the aim is not necessarily to produce a reusable, general solution – expedient means occasionally win out! As ever, we have to decide for ourselves the point at which we stop trying to automate everything and consider whether it makes more sense to hard code our observations rather than trying to write scripts to automate or generalise them.

http://xkcd.com/974/ - The General Problem

The final step is to add the data to a database. For example, instead of printing out each data row, we could add the data to a scraper database table using the command:

scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=data)

Scraped data preview

Note that the repeated database accesses can slow Scraperwiki down somewhat, so instead we might choose to build up a list of data records, one per row, for each page, and then add all the records scraped from a page to the database in one go, one page at a time.
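Here’s a rough sketch (not from the original tutorial) of what that batched version might look like, reusing the thresholds from above and assuming that scraperwiki.sqlite.save() will accept a list of records as well as a single dictionary:

for page in pages[1:]:
  records = []
  for el in page:
    if el.tag == "text" and el.text not in skiplist:
      if int(el.attrib['left']) < 100: data = { 'Country': el.text }
      elif int(el.attrib['left']) < 250: data['Factory name'] = el.text
      elif int(el.attrib['left']) < 500: data['Address'] = el.text
      elif int(el.attrib['left']) < 1000: data['City'] = el.text
      else:
        data['Region'] = el.text
        records.append(dict(data))     # copy the completed row
  if records:
    # one database write per page rather than one write per row
    scraperwiki.sqlite.save(unique_keys=[], table_name='fabvarn', data=records)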

If we need to remove a database table, this utility function may help – call it using the name of the table you want to clear…

def dropper(table):
  if table!='':
    try: scraperwiki.sqlite.execute('drop table "'+table+'"')
    except: pass

Here’s another handy utility routine I found somewhere a long time ago (I’ve lost the original reference?) that “flattens” a marked up element, returning its textual content with any nested tags written back in as text:

def gettext_with_bi_tags(el):
  res = [ ]
  if el.text:
    res.append(el.text)
  for lel in el:
    res.append("<%s>" % lel.tag)
    res.append(gettext_with_bi_tags(lel))
    res.append("</%s>" % lel.tag)
    if lel.tail:
      res.append(lel.tail)
  return "".join(res).strip()

If we pass this function an element parsed from markup such as <em>Some text</em>, it will return Some text; for <em>Some <strong>text</strong></em> it will return Some <strong>text</strong>, preserving the nested inline tags as text.

Having saved the data to the scraper database, we can download it or access it via a SQL API from the scraper homepage:

scraped data - db

You can find a copy of the scraper here and a copy of various stages of the code development here.

Finally, it is worth noting that there is a small number of “badly behaved” data rows that split over more than one table row on the PDF.

broken scraper row

Whilst we can handle these within the scraper script, the effort of creating the exception handlers sometimes exceeds the pain associated with identifying the broken rows and fixing the data associated with them by hand.

Summary

This tutorial has shown one way of writing a simple scraper for extracting tabular data from a simply structured PDF document. In much the same way as a sculptor may lock on to a particular idea when working a piece of stone, a scraper writer may find that they lock in to a particular way of parsing data out of a document, and develop a particular set of abstractions and exception handlers as a result. Writing scrapers can be infuriating at times, but may also prove very rewarding in the way that solving any puzzle can be. Compared to copying and pasting data from a PDF by hand, it may also be time well spent!

It is also worth remembering that sometimes it can be quicker to write a scraper that does most of the job, and then finish off the data cleansing or exception handling using another tool, such as OpenRefine or even just a simple text editor. On occasion, it may also make sense to throw the data into a database table as quickly as you can, and then develop code to manage a second pass that takes the raw data out of the database, tidies it up, and then writes it in a cleaner or more structured form into another database table.

Source: http://schoolofdata.org/2013/06/18/get-started-with-scraping-extracting-simple-tables-from-pdf-documents/

Tuesday, 15 November 2016

How to scrape search results from search engines like Google, Bing and Yahoo

Search giants like Google, Yahoo and Bing built their empires on scraping other people’s content. However, they don’t want you to scrape them. How ironic, isn’t it?

Search engine performance is a very important metric that all digital marketers want to measure and improve. I’m sure you are using some great SEO tools to check how your keywords perform. Every great SEO tool comes with a search keyword ranking feature, which tells you how your keywords are performing in Google, Yahoo, Bing etc.

How will you get data from search engines if you want to build a keyword ranking app?

These search engines have APIs, but the daily query limits are very low and not useful for commercial purposes. The only solution is to scrape search results. The search engine giants obviously know this :). Once they know that you are scraping, they will block your IP. Period!

How do search engines detect bots?

Here are the common methods used to detect bots.

* IP address: Search engines can detect if too many requests are coming from a single IP. If a high volume of traffic is detected, they will throw up a captcha.

* Search patterns: Search engines match your traffic patterns against an existing set of known patterns, and if there is a large variation they will classify the traffic as coming from a bot.

If you don’t have access to sophisticated technology, it is impossible to scrape search engines like Google, Bing or Yahoo.

 How to avoid detection

There are some things you can do to avoid detection; a minimal Python sketch of a couple of them follows the list below.

    Scrape slowly and don’t try to squeeze everything at once.
    Switch user agents between queries
    Scrape randomly and don’t follow the same pattern
    Use intelligent IP rotations
    Clear Cookies after each IP change or disable them completely
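As a hedged illustration of two of these points (random delays and rotating user agents), a polite request loop might look like this. The user-agent strings, proxy address and search URL below are placeholders, not real endpoints:

# A hedged sketch: random delays, rotating user agents and an optional proxy.
# The user-agent strings, proxy address and URLs are placeholders only.
import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = {"http": "http://proxy.example.com:8080"}   # hypothetical rotating proxy

queries = ["best running shoes", "cheap flights", "seo tools"]

for q in queries:
    headers = {"User-Agent": random.choice(USER_AGENTS)}   # switch user agent per query
    resp = requests.get("https://www.example-search.com/search",
                        params={"q": q}, headers=headers,
                        proxies=PROXIES, timeout=30)
    print(resp.status_code, q)
    time.sleep(random.uniform(20, 60))   # scrape slowly and avoid a fixed pattern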

Thanks for reading this blog post.

Source: http://blog.datahut.co/how-to-scrape-search-results-from-search-engines-like-google-bing-and-yahoo/

Friday, 14 October 2016

Scraping Yelp Data and How to Use It

We get a lot of requests to scrape data from Yelp. These requests come in on a daily basis, sometimes several times a day. At the same time, we have not seen a good business case for a commercial Yelp scraping project.

We have decided to release a simple example Yelp robot which anyone can run in Chrome on their own computer, tune to their own requirements and use to collect some data. With this robot you can save business contact information such as address, postal code, telephone number, website address etc. The robot is placed in our Demo space on the Web Robots portal for anyone to use: just sign up, find the robot and use it.

How to use it:

    Sign in to our portal here.
    Download our scraping extension from here.
    Find robot named Yelp_us_demo in the dropdown.
    Modify start URL to the first page of your search results. For example: http://www.yelp.com/search?find_desc=Restaurants&find_loc=Arlington,+VA,+USA
    Click Run.
    Let the robot finish its job and download the data from the portal.

Some things to consider:

This robot is placed in our Demo space, and is therefore accessible to anyone. Anyone will be able to modify and run it, and anyone will be able to download the collected data. The robot's code may be edited by someone else, but you can always restore it from the sample code below. Yelp limits the number of search results, so do not expect to scrape more results than you would normally see by searching.

In case you want to create your own version of such a robot, here is its full code:

// starting URL above must be the first page of search results.
// Example: http://www.yelp.com/search?find_desc=Restaurants&find_loc=Arlington,+VA,+USA

steps.start = function () {

   var rows = [];

   $(".biz-listing-large").each (function (i,v) {
     if ($("h3 a", v).length > 0)
       {
        var row = {};
        row.company = $(".biz-name", v).text().trim();
        row.reviews =$(".review-count", v).text().trim();
        row.companyLink = $(".biz-name", v)[0].href;
        row.location = $(".secondary-attributes address", v).text().trim();
        row.phone = $(".biz-phone", v).text().trim();
        rows.push (row);
      }
   });

   emit ("yelp", rows);
   if ($(".next").length === 1) {
     next ($(".next")[0].href, "start");
   }
   done();
};

Source: https://webrobots.io/scraping-yelp-data/

Monday, 3 October 2016

Has It Been Done Before? Optimize Your Patent Search Using Patent Scraping Technology

Since the US patent office opened in 1790, inventors across the United States have been submitting all sorts of great products and half-baked ideas to their database. Nowadays, many individuals get ideas for great products only to have the patent office do a patent search and tell them that their ideas have already been patented by someone else! Herein lies the question: How do I perform a patent search to find out if my invention has already been patented before I invest time and money into developing it?

The US patent office patent search database is available to anyone with internet access.

US Patent Search Homepage

Performing a patent search with the patent searching tools on the US Patent office webpage can prove to be a very time consuming process. For example, patent searching the database for "dog" and "food" yields 5745 patent search results. The straightforward approach to investigating the patent search results for your particular idea is to go through all 5745 results one at a time looking for yours. Get some munchies and settle in, this could take a while! The patent search database sorts results by patent number instead of relevancy. This means that if your idea was recently patented, you will find it near the top, but if it wasn't, you could be searching for quite a while. Also, most patent search results have images associated with them. Downloading and displaying these images over the internet can be very time consuming depending on your internet connection and the availability of the patent search database servers.

Because patent searches take such a long time, many companies and organizations are looking for ways to improve the process. Some organizations and companies will hire employees for the sole purpose of performing patent searches for them. Others contract the job out to small businesses that specialize in patent searches. The latest technology for performing patent searches is called patent scraping.

Patent scraping is the process of writing computer-automated scripts that analyze a website and copy only the content you are interested in into easily accessible databases or spreadsheets on your computer. Because it is a computerized script performing the patent search, you don't need a separate employee to get the data; you can let the patent scraping run while you perform other important tasks! Patent scraping technology can also extract text content from images. By saving the images and textual content to your computer, you can then very efficiently search them for content and relevancy, thus saving you lots of time that could be better spent actually inventing something! A rough sketch of the image-to-text step is shown below.
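The article doesn't name any particular tool, but as a hedged illustration of extracting text from images, open-source OCR libraries such as pytesseract (a Python wrapper around the Tesseract OCR engine) can pull searchable text out of a scanned patent page. The file names here are placeholders:

# A hedged sketch: OCR a locally saved patent page image into searchable text.
# Requires the Tesseract OCR engine plus: pip install pytesseract pillow
from PIL import Image
import pytesseract

image = Image.open("patent_page.png")          # placeholder file name
text = pytesseract.image_to_string(image)      # extract the text content from the image

# Save the recognised text alongside the image so it can be searched later.
with open("patent_page.txt", "w", encoding="utf-8") as f:
    f.write(text)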

To put a real-world face on this, let us consider the pharmaceutical industry. Many different companies are competing for the patent on the next big drug. It has become an indispensable tactic in the industry for one company to perform patent searches to see what patents the other companies are applying for, thus learning in which direction the other company's research and development team is heading. Using this information, the company can then choose to either pursue that direction heavily, or spin off in a different direction. It would quickly become very costly to maintain a team of researchers dedicated only to performing patent searches all day. Patent scraping technology is the means for figuring out what ideas and technologies are coming about before they make headline news. It is by utilizing patent scraping technology that the large companies stay up to date on the latest trends in technology.

Source: http://ezinearticles.com/?Has-It-Been-Done-Before?-Optimize-Your-Patent-Search-Using-Patent-Scraping-Technology&id=171000