The Modern Scrapy Developer's Guide (Part 3): Auto-Generating Page Objects with Web Scraping Co-pilot
Prerequisites & Setup
This tutorial assumes you have:
- Completed Part 1 (see the link above).
- Visual Studio Code installed.
- The Web Scraping Co‑pilot extension (which we’ll install now).
Step 1: Installing Web Scraping Co‑pilot
- Open VS Code and go to the Extensions tab.
- Search for Web Scraping Co‑pilot (published by Zyte) and install it.
- After installation you’ll see a new icon in the sidebar. Click it; the extension will automatically detect your Scrapy project.
- If prompted, allow it to install a few dependencies (e.g., pytest). This ensures your environment is ready for AI‑powered generation.
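If you want to confirm those dependencies landed in the right environment, a quick import check from your project's virtualenv is enough. This is just an optional sanity check (web-poet is the underlying Page Object library that scrapy-poet pulls in):

# optional sanity check - run inside the project's virtual environment
import pytest        # test runner used by the generated tests
import scrapy_poet   # Scrapy integration for Page Objects
import web_poet      # Page Object framework that scrapy-poet builds on

print("All scraping dependencies are importable.")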
Step 2: Auto‑Generating our BookItem
We’ll start with the spider from Part 1 and let the Co‑pilot create a Page Object for BookItem, adding even more fields than we did in Part 2.
- Open the Co‑pilot chat window.
- Select “Web Scraping.”
- Write a prompt such as:
Create a page object for the item BookItem using the sample URL https://books.toscrape.com/catalogue/the-host_979/index.html
The Co‑pilot will:
- Check your project – confirming that scrapy-poet and pytest are installed (offering to add them if not).
- Add scrapy-poet settings – automatically inserting the ADDONS and SCRAPY_POET_DISCOVER entries into settings.py (a sketch of these entries follows this list).
- Create items.py – generating a new BookItem class with all fields it can discover on the page.
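The exact values the Co-pilot writes depend on your project, but the scrapy-poet configuration it adds to settings.py typically looks something like this (the addon priority and the package listed in SCRAPY_POET_DISCOVER are assumptions based on scrapy-poet's defaults):

# tutorial/settings.py (sketch - adjust the package name to wherever your Page Objects live)
ADDONS = {
    "scrapy_poet.Addon": 300,  # enables scrapy-poet's injection middleware
}

# Package(s) scanned for @handle_urls-decorated Page Objects
SCRAPY_POET_DISCOVER = ["tutorial.pages"]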
# tutorial/items.py (Auto‑Generated!)
import attrs


@attrs.define
class BookItem:
    """
    The structured data we extract from a book *detail* page.
    """
    name: str
    price: str
    url: str
    availability: str
    number_of_reviews: int
    upc: str
It also generates the matching Page Object that knows how to fill each of those fields (the module path below follows scrapy-poet's discovery convention and may differ slightly in your project):
# tutorial/pages/bookstoscrape_com.py (Auto‑Generated!)
from web_poet import WebPage, field, handle_urls

from tutorial.items import BookItem


@handle_urls("books.toscrape.com")
class BookDetailPage(WebPage[BookItem]):
    @field
    def name(self) -> str:
        return self.response.css("h1::text").get()

    @field
    def price(self) -> str:
        return self.response.css("p.price_color::text").get()

    @field
    def url(self) -> str:
        return self.response.url

    @field
    def availability(self) -> str:
        # The second text element contains the actual availability text
        return self.response.css("p.availability::text").getall()[1].strip()

    @field
    def number_of_reviews(self) -> int:
        return int(self.response.css("table tr:last-child td::text").get())

    @field
    def upc(self) -> str:
        return self.response.css("table tr:first-child td::text").get()
In roughly 30 seconds the Co‑pilot has done everything we did manually in Part 2—and added extra fields.
Step 3: Running the AI‑Generated Tests
The Co‑pilot also wrote unit tests for you. A tests folder now contains test_bookstoscrape_com.py.
Run the tests via the Co‑pilot UI (“Run Tests”) or from the terminal:
$ pytest
=================== test session starts ===================
...
tests/test_bookstoscrape_com.py::test_book_detail[book_0] PASSED
tests/test_bookstoscrape_com.py::test_book_detail[book_1] PASSED
...
=================== 8 passed in 0.10s ====================
Your parsing logic is fully tested, and you didn’t write a single line of test code.
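The generated tests replay saved page fixtures, so you normally never edit them. If you ever want to poke at a field by hand, a Page Object is easy to exercise directly; the snippet below is only an illustrative sketch (the module path and the sample HTML are assumptions, not the Co-pilot's output):

# test_bookdetail_manual.py - hand-rolled sketch, not the generated fixture tests
from web_poet import HttpResponse

from tutorial.pages.bookstoscrape_com import BookDetailPage  # module path assumed

SAMPLE_HTML = """
<html><body>
  <h1>The Host</h1>
  <p class="price_color">£25.77</p>
</body></html>
"""


def test_name_and_price():
    response = HttpResponse(
        url="https://books.toscrape.com/catalogue/the-host_979/index.html",
        body=SAMPLE_HTML.encode("utf-8"),
        encoding="utf-8",
    )
    page = BookDetailPage(response=response)
    # Synchronous @field methods can be read like plain attributes
    assert page.name == "The Host"
    assert page.price == "£25.77"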
Step 4: Refactoring the Spider (The Easy Way)
Now update tutorial/spiders/books.py to use the new architecture, just as we did in Part 2.
# tutorial/spiders/books.py
import scrapy

# Import our new, auto‑generated Item class
from tutorial.items import BookItem


class BooksSpider(scrapy.Spider):
    name = "books"
    # ... (rest of spider from Part 1) ...

    async def parse_listpage(self, response):
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # We just tell Scrapy to call parse_book
            yield response.follow(url, callback=self.parse_book)

        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # We ask for the BookItem, and scrapy‑poet does the rest!
    async def parse_book(self, response, book: BookItem):
        yield book
Because the callback declares book: BookItem, scrapy-poet builds the auto‑generated BookDetailPage, runs it against the downloaded page, and hands the finished item to the spider. With all parsing handled by the Page Object, the spider becomes dramatically simpler.
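Under the hood, scrapy-poet is doing roughly what we wrote by hand in Part 2: instantiate the Page Object and await its to_item() method. A simplified sketch of that equivalent callback, shown for comparison only:

    # Inside the spider class - roughly what scrapy-poet now automates for us
    async def parse_book(self, response):
        # Let the Page Object do the heavy lifting
        page = BookDetailPage(response)
        item = await page.to_item()
        yield item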
🎉 That’s it!
In just a few minutes the Web Scraping Co‑pilot has:
- Created the BookItem schema with all relevant fields.
- Generated a fully‑featured Page Object (BookDetailPage).
- Produced fixtures and comprehensive unit tests.
- Simplified the spider code to a clean, async‑ready implementation.
You can now focus on crawling strategy, data pipelines, and scaling, while the AI takes care of the repetitive boilerplate.
Step 5: Auto‑Generating our BookListPage
We can repeat the exact same process for our list page to finish the refactor.
Prompt the Co‑pilot:
Create a page object for the list item BookListPage using the sample URL
Result
- The Co‑pilot creates the BookListPage item in items.py.
- It creates the BookListPageObject in bookstoscrape_com.py with parsers for book_urls and next_page_url (sketched below).
- It writes and passes the tests.
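The generated list-page code looks roughly like the sketch below. The selectors are the same ones our Part 1 spider used; the module path, field types, and class layout are assumptions rather than the literal generated files:

# tutorial/items.py (addition) and tutorial/pages/bookstoscrape_com.py - combined sketch
from typing import List, Optional

import attrs
from web_poet import WebPage, field, handle_urls


@attrs.define
class BookListPage:
    book_urls: List[str]
    next_page_url: Optional[str]


@handle_urls("books.toscrape.com")
class BookListPageObject(WebPage[BookListPage]):
    @field
    def book_urls(self) -> List[str]:
        # Same selector our Part 1 spider used for product links
        return self.response.css("article.product_pod h3 a::attr(href)").getall()

    @field
    def next_page_url(self) -> Optional[str]:
        # None on the last page, which ends the crawl
        return self.response.css("li.next a::attr(href)").get()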
Now we can update our spider one last time to be fully architected.
# tutorial/spiders/books.py (FINAL VERSION)
import scrapy

from tutorial.items import BookItem, BookListPage  # Import both


class BooksSpider(scrapy.Spider):
    # ... (name, allowed_domains, url) ...

    async def start(self):
        yield scrapy.Request(self.url, callback=self.parse_listpage)

    # We now ask for the BookListPage item!
    async def parse_listpage(self, response, page: BookListPage):
        # All parsing logic is GONE from the spider.
        for url in page.book_urls:
            yield response.follow(url, callback=self.parse_book)

        if page.next_page_url:
            yield response.follow(page.next_page_url, callback=self.parse_listpage)

    async def parse_book(self, response, book: BookItem):
        yield book
Our spider is now just a crawler. It has zero parsing logic. All the hard work of finding selectors and writing parsers was automated by the Co‑pilot.
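To watch the finished architecture in action, run scrapy crawl books from the project root, or drive it from Python with Scrapy's standard CrawlerProcess. This runner script is ours, not something the Co-pilot generates:

# run_books.py - run the finished spider from Python
# (equivalent to running "scrapy crawl books" on the command line)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.books import BooksSpider

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())  # picks up the scrapy-poet ADDONS entry
    process.crawl(BooksSpider)
    process.start()  # blocks until the crawl finishes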
Conclusion: The “Hybrid Developer”
The Web Scraping Co‑pilot doesn’t replace you. It accelerates you. It automates the 90% of work that is “grunt work” (finding selectors, writing boilerplate, creating tests) so you can focus on the 10% that matters: crawling logic, strategy, and handling complex sites.
This is how we, as the maintainers of Scrapy, build spiders professionally.
What’s Next? Join the Community
💬 Talk on Discord – Stuck on this Scrapy code? Ask the maintainers and 5k+ devs in our Discord.
▶️ Watch on YouTube – This post was based on our video! Watch the full walkthrough on our channel.
📩 Read More – Want more? Catch up on Part 1 and Part 2 of this series, and subscribe to the Extract newsletter so you don’t miss the next one.
