Scraping a Website with a Symfony Console Command (Clean and Production-Ready)
Source: Dev.to
Introduction
Web scraping does not belong in a controller.
It is long-running, can fail, usually needs scheduling, and is essentially a form of automation.
Symfony console commands are a perfect fit for this scenario: they run inside the Symfony application, with full access to its services and the dependency injection container.
What the command does
- Scrapes country data from a free sandbox site.
- Parses the HTML with DomCrawler.
- Sorts the results alphabetically by country name.
- Displays a tidy CLI table.
The GitHub repository is available for reference:
The scraping sandbox used in the demo:
Prerequisites
composer require symfony/http-client
composer require symfony/dom-crawler
composer require symfony/css-selector
Create the console command
php bin/console make:command app:create-countries
Inject the HTTP client (DI)
use Symfony\Contracts\HttpClient\HttpClientInterface;
public function __construct(
private HttpClientInterface $client
) {}
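Putting the pieces together, the generated command class might look like the following sketch. The class name, description, and URL constant are illustrative assumptions (the article's actual sandbox URL and repository are linked above), and the body is filled in by the later snippets:

```php
<?php
// src/Command/CreateCountriesCommand.php — a minimal sketch, not the article's exact code
namespace App\Command;

use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;

#[AsCommand(name: 'app:create-countries', description: 'Scrape and list country data')]
class CreateCountriesCommand extends Command
{
    // Placeholder: use the sandbox URL from the demo link above
    private const URL = 'https://example.com/countries';

    public function __construct(private HttpClientInterface $client)
    {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // Fetching, parsing, sorting, and rendering go here (see the snippets below)
        return Command::SUCCESS;
    }
}
```

Calling `parent::__construct()` is required when you add your own constructor, otherwise the command is never registered with its name.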
Fetch the page
$response = $this->client->request('GET', self::URL);
$html = $response->getContent(); // raw HTML
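For production use you may want to fail fast on unexpected responses before parsing. A small sketch (note that `getContent()` already throws on 3xx–5xx status codes by default, so the explicit check is belt-and-braces):

```php
$response = $this->client->request('GET', self::URL);

// Guard against non-200 responses before handing HTML to the parser
if (200 !== $response->getStatusCode()) {
    throw new \RuntimeException(sprintf(
        'Unexpected HTTP status %d from %s',
        $response->getStatusCode(),
        self::URL
    ));
}

$html = $response->getContent();
```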
Parse with DomCrawler
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
Extract country information
$countryInfo = [];
$crawler->filter('.country')->each(function (Crawler $row) use (&$countryInfo) {
$countryInfo[] = [
$row->filter('.country-name')->text(),
$row->filter('.country-capital')->text(),
$row->filter('.country-population')->text(),
$row->filter('.country-area')->text(),
];
});
Sort alphabetically by country name
usort($countryInfo, function ($a, $b) {
return strcasecmp($a[0], $b[0]);
});
Display results in a formatted table
// Header
printf(
"%-45s | %-20s | %15s | %15s\n",
"Country name",
"Capital",
"Population",
"Area (km2)"
);
// Rows
foreach ($countryInfo as $row) {
printf(
"%-45s | %-20s | %15s | %15s\n",
$row[0],
$row[1],
$row[2],
$row[3]
);
}
Tip: Add a multibyte‑safe padding helper if you need proper alignment with Unicode characters.
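Such a helper could look like the sketch below. `mb_pad` is a hypothetical name, not a PHP built-in; it pads by character count via `mb_strlen()` instead of byte count, which is what breaks `str_pad()` and `printf("%-45s")` alignment for names like "Côte d'Ivoire". (It still does not account for double-width CJK glyphs.)

```php
// str_pad()/printf width specifiers count bytes; multibyte characters
// therefore misalign columns. This hypothetical helper pads by characters.
function mb_pad(string $value, int $width): string
{
    return $value . str_repeat(' ', max(0, $width - mb_strlen($value)));
}

// Possible usage per row:
// printf("%s | %s\n", mb_pad($row[0], 45), mb_pad($row[1], 20));
```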
Run the command:
php bin/console app:create-countries
You should see an output similar to a professional terminal table rather than raw debug data.
Benefits of this approach
- Separation of concerns – scraping logic lives outside controllers.
- Cron‑friendly – easy to schedule with cron or Symfony’s scheduler.
- Clean architecture – reusable, testable, and maintainable code.
- Scalable – can be refactored into asynchronous jobs if needed.
Scraping etiquette (before targeting real sites)
- Check the site’s Terms of Service.
- Respect robots.txt.
- Avoid aggressive request rates; add delays when scraping multiple pages:
sleep(1); // pause 1 second between requests
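In a multi-page scrape, that delay would sit inside the fetch loop. A sketch, where `$pageUrls` is an illustrative list of page URLs (not from the article):

```php
$pages = [];
foreach ($pageUrls as $url) {
    $response = $this->client->request('GET', $url);
    $pages[] = $response->getContent();
    sleep(1); // be polite: at most one request per second
}
```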
Full source code
The complete working project, including the command implementation, formatting helpers, and setup instructions, is available on GitHub:
When to use this pattern
- Data aggregation tools
- Monitoring systems
- Intelligence platforms
- Background automation jobs
Symfony Console Commands combined with DomCrawler provide an underrated yet powerful solution for these use cases.
Next steps
In part 2 of this series, the scraped results will be persisted to a database.