The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot
Prerequisites & Setup
This tutorial assumes you have:
- Completed Part 1 (see the link above).
- Visual Studio Code installed.
- The Web Scraping Co‑pilot extension (which we’ll install now).
Step 1: Installing Web Scraping Co‑pilot
- Open VS Code and go to the Extensions tab.
- Search for Web Scraping Co‑pilot (published by Zyte) and install it.
- After installation you’ll see a new icon in the sidebar. Click it; the extension will automatically detect your Scrapy project.
- If prompted, allow it to install a few dependencies (e.g., pytest). This ensures your environment is ready for AI‑powered generation.
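If you want to confirm those dependencies landed in the right environment, a quick import check from your project's virtualenv is enough. This is just an optional sanity check (web-poet is the underlying Page Object library that scrapy-poet pulls in):

# optional sanity check - run inside the project's virtual environment
import pytest        # test runner used by the generated tests
import scrapy_poet   # Scrapy integration for Page Objects
import web_poet      # Page Object framework that scrapy-poet builds on

print("All scraping dependencies are importable.")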
Step 2: Auto‑Generating our BookItem
We’ll start with the spider from Part 1 and let the Co‑pilot create a Page Object for BookItem, adding even more fields than we did in Part 2.
- Open the Co‑pilot chat window.
- Select “Web Scraping.”
- Write a prompt such as:
Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html
The Co‑pilot will:
- Check your project – confirming that scrapy-poet and pytest are installed (offering to add them if not).
- Add scrapy-poet settings – automatically inserting the ADDONS and SCRAPY_POET_DISCOVER entries into settings.py (a sketch of these entries follows this list).
- Create items.py – generating a new BookItem class with all fields it can discover on the page.
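The exact values the Co-pilot writes depend on your project, but the scrapy-poet configuration it adds to settings.py typically looks something like this (the addon priority and the package listed in SCRAPY_POET_DISCOVER are assumptions based on scrapy-poet's defaults):

# tutorial/settings.py (sketch - adjust the package name to wherever your Page Objects live)
ADDONS = {
    "scrapy_poet.Addon": 300,  # enables scrapy-poet's injection middleware
}

# Package(s) scanned for @handle_urls-decorated Page Objects
SCRAPY_POET_DISCOVER = ["tutorial.pages"]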
# tutorial/items.py (Auto‑Generated!)
import attrs


@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str
    availability: str
    number_of_reviews: int
    upc: str
It also generates the matching Page Object that knows how to fill each of those fields (the module path below follows scrapy-poet's discovery convention and may differ slightly in your project):
# tutorial/pages/bookstoscrape_com.py (Auto‑Generated!)
from web_poet import WebPage, field, handle_urls

from tutorial.items import BookItem


@handle_urls("books.toscrape.com")
class BookDetailPage(WebPage[BookItem]):
    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    @field
    def availability(self) -> str:
        # The second text element contains the actual availability text
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()
In roughly 30 seconds the Co‑pilot has done everything we did manually in Part 2—and added extra fields.
Step 3: Running the AI‑Generated Tests
The Co‑pilot also wrote unit tests for you. A tests folder now contains test_bookstoscrape_com.py.
Run the tests via the Co‑pilot UI (“Run Tests”) or from the terminal:
$ pytest
=================== test session starts ===================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
=================== 8 passed in 0.10s ====================
Your parsing logic is fully tested, and you didn’t write a single line of test code.
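The generated tests replay saved page fixtures, so you normally never edit them. If you ever want to poke at a field by hand, a Page Object is easy to exercise directly; the snippet below is only an illustrative sketch (the module path and the sample HTML are assumptions, not the Co-pilot's output):

# test_bookdetail_manual.py - hand-rolled sketch, not the generated fixture tests
from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage  # module path assumed

SAMPLE_HTML = """
<html><body>
  <h1>The Host</h1>
  <p class="price_color">£25.77</p>
</body></html>
"""


def test_name_and_price():
    response = HttpResponse(
        url="https://books.toscrape.com/catalogue/the-host_979/index.html",
        body=SAMPLE_HTML.encode("utf-8"),
        encoding="utf-8",
    )
    page = BookDetailPage(response=response)
    # Synchronous @field methods can be read like plain attributes
    assert page.name == "The Host"
    assert page.price == "£25.77"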
Step 4: Refactoring the Spider (The Easy Way)
Now update tutorial/spiders/books.py to use the new architecture, just as we did in Part 2.
# tutorial/spiders/books.py
import scrapy

# Import our new, auto‑generated Item class
from tutorial.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy‑poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book
Because the callback declares book: BookItem, scrapy-poet builds the auto‑generated BookDetailPage, runs it against the downloaded page, and hands the finished item to the spider. With all parsing handled by the Page Object, the spider becomes dramatically simpler.
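Under the hood, scrapy-poet is doing roughly what we wrote by hand in Part 2: instantiate the Page Object and await its to_item() method. A simplified sketch of that equivalent callback, shown for comparison only:

    # Inside the spider class - roughly what scrapy-poet now automates for us
    async def parse_book(self, response):
        # Let the Page Object do the heavy lifting
        page = BookDetailPage(response)
        item = await page.to_item()
        yield item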
🎉 That’s it!
In just a few minutes the Web Scraping Co‑pilot has:
- Created the BookItem schema with all relevant fields.
- Generated a fully‑featured Page Object (BookDetailPage).
- Produced fixtures and comprehensive unit tests.
- Simplified the spider code to a clean, async‑ready implementation.
You can now focus on crawling strategy, data pipelines, and scaling, while the AI takes care of the repetitive boilerplate.
Step 5: Auto‑Generating our BookListPage
We can repeat the exact same process for our list page to finish the refactor.
Prompt the Co‑pilot:
Create a page object for the list item BookListPage using the sample URL
Result
- The Co‑pilot creates the BookListPage item in items.py.
- It creates the BookListPageObject in bookstoscrape_com.py with parsers for book_urls and next_page_url (sketched below).
- It writes and passes the tests.
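The generated list-page code looks roughly like the sketch below. The selectors are the same ones our Part 1 spider used; the module path, field types, and class layout are assumptions rather than the literal generated files:

# tutorial/items.py (addition) and tutorial/pages/bookstoscrape_com.py - combined sketch
from typing import List, Optional

import attrs
from web_poet import WebPage, field, handle_urls


@attrs.define
class BookListPage:
    book_urls: List[str]
    next_page_url: Optional[str]


@handle_urls("books.toscrape.com")
class BookListPageObject(WebPage[BookListPage]):
    @field
    def book_urls(self) -> List[str]:
        # Same selector our Part 1 spider used for product links
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> Optional[str]:
        # None on the last page, which ends the crawl
        return self.response.css("li.next a::attr(href)").get()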
Now we can update our spider one last time to be fully architected.
# tutorial/spiders/books.py (FINAL VERSION)
import scrapy

from tutorial.items import BookItem, BookListPage  # Import both


class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):
        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book
Our spider is now just a crawler. It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co‑pilot.
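To watch the finished architecture in action, run scrapy crawl books from the project root, or drive it from Python with Scrapy's standard CrawlerProcess. This runner script is ours, not something the Co-pilot generates:

# run_books.py - run the finished spider from Python
# (equivalent to running "scrapy crawl books" on the command line)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.books import BooksSpider

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())  # picks up the scrapy-poet ADDONS entry
    process.crawl(BooksSpider)
    process.start()  # blocks until the crawl finishes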
Conclusion: The “Hybrid Developer”
The Web Scraping Co‑pilot doesn’t replace you. It accelerates you. It automates the 90% of work that is “grunt work” (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% that matters: crawling logic, strategy, and handling complex sites.
This is how we, as the maintainers of Scrapy, build spiders professionally.
What’s Next? Join the Community
💬 Talk on Discord – Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
▶️ Watch on YouTube – This post was based on our video! Watch the full walkthrough on our channel.
📩 Read More – Want more? Catch up on Part 1 and Part 2 of this series, and subscribe to the Extract newsletter so you don’t miss the next one.
