The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot

Published: December 16, 2025 at 01:55 PM EST
5 min read
Source: Dev.to


Prerequisites & Setup

This tutorial assumes you have:

  • Completed Part 1 (see the link above).
  • Visual Studio Code installed.
  • The Web Scraping Co‑pilot extension (which we’ll install now).

Step 1: Installing Web Scraping Co‑pilot

  1. Open VS Code and go to the Extensions tab.

  2. Search for Web Scraping Co‑pilot (published by Zyte) and install it.

  3. After installation you’ll see a new icon in the sidebar. Click it; the extension will automatically detect your Scrapy project.

  4. If prompted, allow it to install a few dependencies (e.g., pytest). This ensures your environment is ready for AI‑powered generation.
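
If the automatic install doesn't kick in (or you prefer to manage dependencies yourself), the same packages can be added manually. The exact list the extension installs may vary, but it is roughly:

$ pip install scrapy scrapy-poet web-poet pytest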

Step 2: Auto‑Generating our BookItem

We’ll start with the spider from Part 1 and let the Co‑pilot create a Page Object for BookItem, adding even more fields than we did in Part 2.

  1. Open the Co‑pilot chat window.

  2. Select “Web Scraping.”

  3. Write a prompt such as:

    Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html

The Co‑pilot will:

  • Check your project – confirming that scrapy‑poet and pytest are installed (offering to add them if not).
  • Add scrapy‑poet settings – automatically inserting the ADDONS and SCRAPY_POET_DISCOVER entries into settings.py (a sketch of these entries appears after the generated code below).
  • Create items.py – generating a new BookItem class with all fields it can discover on the page.
# tutorial/items.py (Auto‑Generated!)
import attrs

@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str
    availability: str
    number_of_reviews: int
    upc: str

It also generates the Page Object itself in bookstoscrape_com.py, with one @field parser per item field:

# bookstoscrape_com.py (Auto‑Generated!)
from web_poet import WebPage, field, handle_urls

from tutorial.items import BookItem


@handle_urls("books.toscrape.com")
class BookDetailPage(WebPage[BookItem]):

    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    @field
    def availability(self) -> str:
        # The second element contains the actual text
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()

In roughly 30 seconds the Co‑pilot has done everything we did manually in Part 2—and added extra fields.
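
Under the hood, the scrapy-poet wiring it added to settings.py looks roughly like this (a sketch: the add-on priority and the tutorial.pages module name are assumptions, so check your own settings.py for the exact entries):

# tutorial/settings.py (sketch of the entries the Co-pilot adds)

# Enable scrapy-poet through Scrapy's add-on mechanism.
ADDONS = {
    "scrapy_poet.Addon": 300,
}

# Modules scrapy-poet scans for @handle_urls-decorated Page Objects.
SCRAPY_POET_DISCOVER = ["tutorial.pages"]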

Step 3: Running the AI‑Generated Tests

The Co‑pilot also wrote unit tests for you. A tests folder now contains test_bookstoscrape_com.py.
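
The exact file the Co-pilot writes will differ (it records its own fixtures), but conceptually each test feeds a saved copy of a page into the Page Object and checks the extracted fields. A rough sketch, where the import path, fixture location, and expected values are all assumptions:

# tests/test_bookstoscrape_com.py -- illustrative sketch only
import asyncio
from pathlib import Path

from web_poet import HttpResponse

# Assumed module path for the auto-generated Page Object.
from tutorial.pages.bookstoscrape_com import BookDetailPage

SAMPLE_URL = "https://books.toscrape.com/catalogue/the-host_979/index.html"


def test_book_detail_fields():
    # Load a previously saved copy of the sample detail page (path is an assumption).
    html = Path("tests/fixtures/the-host_979.html").read_bytes()
    response = HttpResponse(SAMPLE_URL, body=html)

    # Build the Page Object directly from the saved response and extract the item.
    item = asyncio.run(BookDetailPage(response=response).to_item())

    assert item.name == "The Host"
    assert item.price.startswith("£")
    assert item.upc  # non-empty UPC string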

Run the tests via the Co‑pilot UI (“Run Tests”) or from the terminal:

$ pytest
=================== test session starts ===================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
=================== 8 passed in 0.10s ====================

Your parsing logic is fully tested, and you didn’t write a single line of test code.

Step 4: Refactoring the Spider (The Easy Way)

Now update tutorial/spiders/books.py to use the new architecture, just as we did in Part 2.

# tutorial/spiders/books.py

import scrapy
# Import our new, auto‑generated Item class
from tutorial.items import BookItem

class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy‑poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book

With the auto‑generated BookDetailPage handling all parsing, the spider becomes dramatically simpler: scrapy‑poet builds the BookItem behind the scenes and injects it straight into parse_book.
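
To sanity-check the refactor, run the spider as usual; the output file name here is just an example:

$ scrapy crawl books -O books.json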

🎉 That’s it!

In just a few minutes the Web Scraping Co‑pilot has:

  1. Created the BookItem schema with all relevant fields.
  2. Generated a fully‑featured Page Object (BookDetailPage).
  3. Produced fixtures and comprehensive unit tests.
  4. Simplified the spider code to a clean, async‑ready implementation.

You can now focus on crawling strategy, data pipelines, and scaling, letting the AI take care of the repetitive boilerplate. One piece of hand-written parsing remains, though: the listing page.

Step 5: Auto‑Generating our BookListPage

We can repeat the exact same process for our list page to finish the refactor.

Prompt the Co‑pilot:

Create a page object for the list item BookListPage using the sample URL

Result

  • The Co‑pilot creates the BookListPage item in items.py.
  • It creates the BookListPageObject in bookstoscrape_com.py with parsers for book_urls and next_page_url.
  • It writes and passes the tests.
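
The generated code will look roughly like the following; the class bodies, type hints, and module layout shown here are assumptions modelled on the detail-page output above:

# Illustrative sketch only: the item lives in tutorial/items.py,
# the Page Object in bookstoscrape_com.py.
import attrs
from web_poet import WebPage, field, handle_urls


@attrs.define
class BookListPage:
    """Links extracted from a catalogue *listing* page."""
    book_urls: list[str]
    next_page_url: str | None  # None on the last page


@handle_urls("books.toscrape.com")
class BookListPageObject(WebPage[BookListPage]):

    @field
    def book_urls(self) -> list[str]:
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> str | None:
        return self.response.css("li.next a::attr(href)").get()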

Now we can update our spider one last time to be fully architected.

# tutorial/spiders/books.py (FINAL VERSION)

import scrapy
from tutorial.items import BookItem, BookListPage   # Import both

class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):
        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book

Our spider is now just a crawler. It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co‑pilot.

Conclusion: The “Hybrid Developer”

The Web Scraping Co‑pilot doesn’t replace you. It accelerates you. It automates the 90 % of work that is “grunt work” (finding selectors, writing boilerplate, creating tests) so you can focus on the 10 % that matters: crawling logic, strategy, and handling complex sites.

This is how we, as the maintainers of Scrapy, build spiders professionally.

What’s Next? Join the Community

💬 Talk on Discord – Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
▶️ Watch on YouTube – This post was based on our video! Watch the full walkthrough on our channel.
📩 Read More – Want more? Subscribe to the Extract newsletter so you don’t miss the next post in this series.
