Scrapy



https://scrapy.org/


  • Creating a project:
scrapy startproject tutorial
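This creates a tutorial directory. For reference, the generated project typically has a layout like the following (the exact set of files can vary between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where you'll later put your spiders
            __init__.py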


  • Creating a Spider:
Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites). They must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)


As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:
  • name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
  • start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
  • parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
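As a minimal sketch of that pattern (assuming a pagination link shaped like the one on quotes.toscrape.com, i.e. li.next a), a callback can both yield data and follow the next page with response.follow():

def parse(self, response):
    # extract data from the current page
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').get()}
    # follow the "next page" link, if any, re-using this method as the callback;
    # response.follow() accepts relative URLs, unlike scrapy.Request
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)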


  • Running the spider:
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes


This command runs the spider named quotes that we just added, sending requests to the quotes.toscrape.com domain. You will get output similar to this:
... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...


Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.


  • Extracting data:
The best way to learn how to extract data with Scrapy is to try selectors using the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'


Using the shell, you can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]


To extract the text from the title above, you can do:
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
However, using .get() directly on the SelectorList avoids an IndexError: it returns None when no element matches the selection, whereas indexing with [0] raises when the list is empty.
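For instance, with a selector that matches nothing (div.does-not-exist is just a made-up class name for illustration):

>>> response.css('div.does-not-exist::text').get()     # returns None, prints nothing
>>> response.css('div.does-not-exist::text')[0].get()  # raises IndexError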


Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']


  • How to find the proper CSS selectors:
In order to find the proper CSS selectors to use, you might find it useful to open the response page from the shell in your web browser using view(response). You can use your browser’s developer tools to inspect the HTML and come up with a selector (see Using your browser’s Developer Tools for scraping).
Selector Gadget is also a nice tool for quickly finding CSS selectors for visually selected elements; it works in many browsers.
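For example, from the shell session started above:

>>> view(response)   # opens the downloaded page in your default browser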


  • XPath: a brief intro:
Besides CSS, Scrapy selectors also support using XPath expressions:
...
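For a taste, here are two shell queries equivalent to the CSS examples above:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'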


  • Extracting quotes and authors in our example:
Now that you know a bit about selection and extraction, let’s complete our spider by writing the code to extract the quotes from the web page.
Each quote in http://quotes.toscrape.com is represented by HTML elements that look like this:
<div class="quote">
    <span class="text">The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
Let’s open up scrapy shell and play a bit to find out how to extract the data we want:
scrapy shell 'http://quotes.toscrape.com'


>>> response.css("div.quote")


>>> quote = response.css("div.quote")[0]


>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'


>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']


Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into a Python dictionary:
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>


  • Extracting data in our spider:
Let’s get back to our spider. So far it doesn’t extract any data in particular; it just saves the whole HTML page to a local file. Let’s integrate the extraction logic above into it.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback, as you can see below:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
Note that this version replaces start_requests() with a start_urls class attribute; Scrapy generates the initial requests from those URLs automatically and uses parse() as the default callback. If you run this spider, it will output the extracted data in the log.


  • Storing the scraped data:
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
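This generates a quotes.json file containing all scraped items, serialized in JSON. Be aware that Scrapy appends to the given file rather than overwriting it, so running the command twice without removing the file first will leave you with broken JSON (newer Scrapy versions also accept -O to overwrite). The JSON Lines format avoids that problem, since each item is a standalone JSON object on its own line:

scrapy crawl quotes -o quotes.jl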



scrapy-amazon-robot-middleware

https://github.com/ziplokk1/scrapy-amazon-robot-middleware

https://pypi.org/project/scrapy-amazon-robot-middleware3/
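This is a third-party downloader middleware for getting past Amazon's robot-check pages. Like any downloader middleware, it is enabled via the DOWNLOADER_MIDDLEWARES setting in the project's settings.py; the sketch below shows the general pattern, but the dotted class path is only an assumption here, so check the project's README for the real one:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # illustrative path, not verified against the package
    'scrapy_amazon_robot_middleware.robotmiddleware.RobotMiddleware': 500,
}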