How to create a basic web crawler with Scrapy

Programs that read information from web pages, known as web crawlers, have all sorts of useful applications.

Writing a web crawler is easier than you might think. Python has an excellent library, Scrapy, for writing scripts that extract information from web pages. This article shows how to create a web crawler with it.

Use Scrapy to create web crawlers

  1. Install Scrapy
  2. How to create a web crawler
    1. Turn off logging
    2. Use Chrome Inspector
    3. Extract title
    4. Find description
  3. Collect JSON data
  4. Scrape multiple elements

Install Scrapy

Scrapy is a Python library for scraping websites and building web crawlers. It is fast, simple, and can navigate through many websites without much effort.

Scrapy can be installed with pip, Python's package installer. For how to install pip, please refer to the article: Installing Python Package with PIP on Windows, Mac and Linux.

Using a Python virtual environment is preferred because it lets you install Scrapy in an isolated directory and leaves the system-wide Python packages untouched. The Scrapy documentation recommends doing this for best results.

Create directories and initialize a virtual environment.

 mkdir crawler
 cd crawler
 virtualenv venv
 . venv/bin/activate
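To confirm that the virtual environment is really active before installing anything, a small check along these lines may help. It is only a sketch that uses the standard library; the function name in_virtualenv is made up for this illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # venv and recent virtualenv releases point sys.prefix at the environment,
    # while sys.base_prefix keeps the original installation.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints False after activation, the shell is probably still using the system-wide interpreter.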

Now you can install Scrapy into that directory using the PIP command.

 pip install scrapy 

Quick check to make sure Scrapy is properly installed:

 scrapy
 # prints
 Scrapy 1.4.0 - no active project

 Usage:
   scrapy <command> [options] [args]

 Available commands:
   bench         Run quick benchmark test
   fetch         Fetch a URL using the Scrapy downloader
   genspider     Generate new spider using pre-defined templates
   runspider     Run a self-contained spider (without creating a project)
 ...
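The same check can be made from Python. The sketch below assumes Python 3.8 or later, where importlib.metadata is part of the standard library; installed_version is a helper name invented for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    print("Scrapy version:", version if version else "not installed")
```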

How to create a web crawler

Now that the environment is ready, you can start creating web crawlers. This tutorial scrapes some information from the Wikipedia page about batteries:

 https://en.wikipedia.org/wiki/Battery_(electricity) 

The first step in writing a crawler is to define a Python class that extends scrapy.Spider. This gives you access to Scrapy's spider functionality. Call this class spider1.

The spider class needs some information:

  1. A name ( name ) to identify the spider
  2. A start_urls variable containing a list of URLs to crawl (Wikipedia URLs will be the examples in this tutorial)
  3. A parse() method to process web pages and extract information

 import scrapy

 class spider1(scrapy.Spider):
     name = 'Wikipedia'
     start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

     def parse(self, response):
         pass

A quick test to make sure everything is running properly.

 scrapy runspider spider1.py
 # prints
 2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
 2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
 2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
 ['scrapy.extensions.memusage.MemoryUsage',
  'scrapy.extensions.logstats.LogStats',

Turn off logging

Running Scrapy with this class prints log output that is not helpful at the moment. Keep things simple by suppressing it. Add the following code to the beginning of the file.

 import logging
 logging.getLogger('scrapy').setLevel(logging.WARNING)

Now when you run the script again, the log information will not be printed.
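Why does a single call on the 'scrapy' logger silence messages from loggers such as scrapy.utils.log? Python loggers form a hierarchy based on their dotted names, and a logger without its own level inherits the level of its nearest configured ancestor. The following sketch shows this with the standard logging module alone, so it runs without Scrapy.

```python
import logging

# Same call as in the tutorial: raise the threshold of the parent "scrapy" logger.
logging.getLogger("scrapy").setLevel(logging.WARNING)

# "scrapy.utils.log" is one of the logger names seen in the earlier log output.
child = logging.getLogger("scrapy.utils.log")

# The child has no level of its own, so it inherits WARNING from "scrapy".
info_enabled = child.isEnabledFor(logging.INFO)
warning_enabled = child.isEnabledFor(logging.WARNING)

print(info_enabled, warning_enabled)  # prints: False True
```

Warnings and errors are still shown, which is usually what you want while developing a spider.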

Use Chrome Inspector

Everything on a web page is stored in HTML elements. Elements are organized in the Document Object Model (DOM). Understanding the DOM is important to make the most of the web crawler. Web crawlers search through all the HTML elements on a page to find information, so it's important to understand how they are organized.

Google Chrome has tools to help you find HTML elements faster. With the inspector, you can locate the HTML for any element you see on a page.

  1. Navigate to a page in Chrome
  2. Place your mouse over the element you want to see
  3. Right click and select Inspect from the menu

These steps will open the developer console with the Elements tab selected. At the bottom of the console, you will see a tree diagram containing the elements. This tree diagram tells you which selectors to use in the script.
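To get a feel for the kind of tree the inspector shows, the sketch below prints an indented outline of a small HTML snippet using Python's built-in html.parser module. The markup is a hypothetical, heavily simplified stand-in for the Wikipedia page, and the TreePrinter class is invented for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreePrinter(HTMLParser):
    """Collect an indented outline of the elements in an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        label = tag
        element_id = dict(attrs).get("id")
        if element_id:
            label += "#" + element_id
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical, simplified markup; the real page is far larger.
SAMPLE_HTML = """
<html><body>
  <h1 id="firstHeading">Battery (electricity)</h1>
  <div id="mw-content-text"><div><p>An electric battery is a device.</p></div></div>
</body></html>
"""

printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

Each level of indentation is one parent-child step, which is the relationship CSS selectors describe.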

Extract title

Now put the script to work. Start with something simple: getting the page's title text.

Add some code to the parse() method to extract the title.

 ...
     def parse(self, response):
         print(response.css('h1#firstHeading::text').extract())
 ...

The response argument supports a method called css() that selects elements from the page using the CSS selector you provide.

In this example, the selector is h1#firstHeading, which matches the h1 element whose id is firstHeading. Appending ::text to the selector gives you the text content of the element instead of the element itself. Finally, the extract() method returns the selected data as a list of strings.

Run this script with Scrapy to print the title as text.

 [u'Battery (electricity)'] 

Find description

Now that the title text has been extracted, the script can do more. The crawler will find the first paragraph after the title and extract its text.

Here is the path to that element in the tree, as shown in the Chrome Developer Console:

 div#mw-content-text>div>p 

The right arrow ( > ) indicates a parent-child relationship between the elements. This selector returns all matching p elements, which together make up the entire description. To get the first p element, you can write this code:

 response.css('div#mw-content-text>div>p')[0] 

As with the title, you add ::text to get the text content of the element.

 response.css('div#mw-content-text>div>p')[0].css('::text') 

The final expression uses extract() to return a list of strings. You can use Python's join() method to concatenate the list items into a single string.

     def parse(self, response):
         print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))

The result is the first paragraph of the text!

 An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is ...
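The join() step is needed because a paragraph usually contains links and other inline elements, so its text comes back in several pieces. The following sketch shows the idea with a made-up list of fragments; the exact split on the real page will differ.

```python
# Hypothetical fragments, similar to what extract() may return for a paragraph
# whose text is interrupted by links.
fragments = [
    "An ",
    "electric battery",
    " is a device consisting of one or more ",
    "electrochemical cells",
    " with external connections",
]

# An empty separator glues the fragments back together unchanged.
paragraph = "".join(fragments)
print(paragraph)
```

Because the fragments already carry their own spaces, joining on an empty string keeps the original spacing.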

Collect JSON data

Extracting information as text is useful, but Scrapy can also export the collected data as JavaScript Object Notation (JSON). JSON is a neat way to organize information and is widely used in web development. It also works well with Python.

When you need to collect data in JSON format, have the parse() method yield a dictionary for each item. Scrapy collects the yielded items and can write them to a file.

Here is a new version of the script that uses the yield statement. Instead of taking the first p element as text, it retrieves all the p elements and arranges them in JSON format.

 ...
     def parse(self, response):
         for e in response.css('div#mw-content-text>div>p'):
             yield { 'para' : ''.join(e.css('::text').extract()).strip() }
 ...

Now, you can run the spider by specifying the output JSON file:

 scrapy runspider spider3.py -o joe.json 

The script will now output all p elements.

 [
 {"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode.[2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work.[3] Historically the term \"battery\" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell.[4]"},
 {"para": "Primary (single-use or \"disposable\") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple ...
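The step from yielded dictionaries to a JSON file can be pictured with plain Python. The sketch below is only an illustration of the idea, not Scrapy's actual export code: a generator yields one dictionary per paragraph, and the json module turns the collected items into a JSON array. The sample strings are made up.

```python
import json


def parse(paragraphs):
    """Yield one dictionary per paragraph, much as the spider's parse() method does."""
    for text in paragraphs:
        yield {"para": text.strip()}


# Hypothetical input standing in for the text of each p element.
sample = ["  An electric battery is a device.  ", "Primary batteries are used once.\n"]

# Collecting the yielded dictionaries and serializing them gives a JSON array.
items = list(parse(sample))
output = json.dumps(items, indent=1)
print(output)
```

In a real crawl, Scrapy does the collecting and writing for you when the -o option is given.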

Scrape multiple elements

So far, the web crawler has scraped the title and one type of element from the page. Scrapy can also extract information from different types of elements in a single script.

As an example, extract the top IMDb Box Office hits for the weekend. This information is taken from http://www.imdb.com/chart/boxoffice , where it appears in a table with one row per movie and one column per metric.

The parse() method can extract multiple fields from a row. Using Chrome Developer Tools, you can find the nested elements in the table.

 ...
     def parse(self, response):
         for e in response.css('div#boxoffice>table>tbody>tr'):
             yield {
                 'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
                 'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
                 'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
                 'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
                 'image': e.css('td.posterColumn img::attr(src)').extract_first(),
             }
 ...

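The selector for the image field specifies that img is a descendant of td.posterColumn: a space, rather than >, matches an element at any depth below the parent. To extract the value of an attribute instead of text, use the expression ::attr(src).

Running the spider returns JSON: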

 [
 {"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYWVhZjZkYTItOGIwYS00NmRkLWJlYjctMWM0ZjFmMDU4ZjEzXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Justice League"},
 {"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYjFhOWY0OTgtNDkzMC00YWJkLTk1NGEtYWUxNjhmMmQ5ZjYyXkEyXkFqcGdeQXVyMjMxOTE0ODA@._V1_UX45_CR0,0,45,67_AL_.jpg", "title": "Wonder"},
 {"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDkzMzI1OF5BMl5BanBnXkFtZTgwODcxODg5MjI@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Thor: Ragnarok"},
 ...
 ]
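Every scraped value is a string, so some post-processing is usually needed before the data can be compared or sorted. The sketch below takes three records from the output above (with the image field left out for brevity) and ranks them by total gross. The helper to_millions is invented for this example and assumes every amount uses the '$...M' form seen in the sample.

```python
import json

# Three records copied from the spider's output, with the image field left out.
raw = """
[
 {"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "title": "Justice League"},
 {"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "title": "Wonder"},
 {"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "title": "Thor: Ragnarok"}
]
"""


def to_millions(amount):
    """Convert a string such as '$93.8M' to a float number of millions."""
    # Assumes the '$...M' form; other formats would need extra handling.
    return float(amount.strip().lstrip("$").rstrip("M"))


movies = json.loads(raw)

# Sort by total gross, highest first.
ranked = sorted(movies, key=lambda m: to_millions(m["gross"]), reverse=True)
for movie in ranked:
    print(movie["title"], to_millions(movie["gross"]))
```

The same conversion could instead be done inside parse(), so that the exported JSON already contains numbers.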