Scraping a Website with a Symfony Console Command (Clean and Production-Ready)
Source: Dev.to
Introduction
Web scraping does not belong in a controller.
It is long-running, can fail, usually needs scheduling, and is essentially a form of automation.
Symfony console commands are a perfect fit for this scenario: they run inside the Symfony application, with full access to its services and the dependency injection container.
What the command does
- Scrapes country data from a free sandbox site.
- Parses the HTML with DomCrawler.
- Sorts the results alphabetically by country name.
- Displays a tidy CLI table.
The GitHub repository is available for reference:
The scraping sandbox used in the demo:
Prerequisites
composer require symfony/http-client
composer require symfony/dom-crawler
composer require symfony/css-selector
Create the console command
php bin/console make:command app:create-countries
Inject the HTTP client (DI)
use Symfony\Contracts\HttpClient\HttpClientInterface;
public function __construct(
private HttpClientInterface $client
) {}
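Putting the pieces together, the generated command class might look like the following sketch. The class name, description, and URL constant are illustrative assumptions (the article's actual sandbox URL and repository are linked above), and the body is filled in by the later snippets:

```php
<?php
// src/Command/CreateCountriesCommand.php — a minimal sketch, not the article's exact code
namespace App\Command;

use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;

#[AsCommand(name: 'app:create-countries', description: 'Scrape and list country data')]
class CreateCountriesCommand extends Command
{
    // Placeholder: use the sandbox URL from the demo link above
    private const URL = 'https://example.com/countries';

    public function __construct(private HttpClientInterface $client)
    {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // Fetching, parsing, sorting, and rendering go here (see the snippets below)
        return Command::SUCCESS;
    }
}
```

Calling `parent::__construct()` is required when you add your own constructor, otherwise the command is never registered with its name.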
Fetch the page
$response = $this->client->request('GET', self::URL);
$html = $response->getContent(); // raw HTML
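For production use you may want to fail fast on unexpected responses before parsing. A small sketch (note that `getContent()` already throws on 3xx–5xx status codes by default, so the explicit check is belt-and-braces):

```php
$response = $this->client->request('GET', self::URL);

// Guard against non-200 responses before handing HTML to the parser
if (200 !== $response->getStatusCode()) {
    throw new \RuntimeException(sprintf(
        'Unexpected HTTP status %d from %s',
        $response->getStatusCode(),
        self::URL
    ));
}

$html = $response->getContent();
```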
Parse with DomCrawler
use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
Extract country information
$countryInfo = [];
$crawler->filter('.country')->each(function (Crawler $row) use (&$countryInfo) {
$countryInfo[] = [
$row->filter('.country-name')->text(),
$row->filter('.country-capital')->text(),
$row->filter('.country-population')->text(),
$row->filter('.country-area')->text(),
];
});
Sort alphabetically by country name
usort($countryInfo, function ($a, $b) {
return strcasecmp($a[0], $b[0]);
});
Display results in a formatted table
// Header
printf(
"%-45s | %-20s | %15s | %15s\n",
"Country name",
"Capital",
"Population",
"Area (km2)"
);
// Rows
foreach ($countryInfo as $row) {
printf(
"%-45s | %-20s | %15s | %15s\n",
$row[0],
$row[1],
$row[2],
$row[3]
);
}
Tip: Add a multibyte‑safe padding helper if you need proper alignment with Unicode characters.
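Such a helper could look like the sketch below. `mb_pad` is a hypothetical name, not a PHP built-in; it pads by character count via `mb_strlen()` instead of byte count, which is what breaks `str_pad()` and `printf("%-45s")` alignment for names like "Côte d'Ivoire". (It still does not account for double-width CJK glyphs.)

```php
// str_pad()/printf width specifiers count bytes; multibyte characters
// therefore misalign columns. This hypothetical helper pads by characters.
function mb_pad(string $value, int $width): string
{
    return $value . str_repeat(' ', max(0, $width - mb_strlen($value)));
}

// Possible usage per row:
// printf("%s | %s\n", mb_pad($row[0], 45), mb_pad($row[1], 20));
```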
Run the command:
php bin/console app:create-countries
You should see an output similar to a professional terminal table rather than raw debug data.
Benefits of this approach
- Separation of concerns – scraping logic lives outside controllers.
- Cron‑friendly – easy to schedule with cron or Symfony’s scheduler.
- Clean architecture – reusable, testable, and maintainable code.
- Scalable – can be refactored into asynchronous jobs if needed.
Scraping etiquette (before targeting real sites)
- Check the site’s Terms of Service.
- Respect robots.txt.
- Avoid aggressive request rates; add delays when scraping multiple pages:
sleep(1); // pause 1 second between requests
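In a multi-page scrape, that delay would sit inside the fetch loop. A sketch, where `$pageUrls` is an illustrative list of page URLs (not from the article):

```php
$pages = [];
foreach ($pageUrls as $url) {
    $response = $this->client->request('GET', $url);
    $pages[] = $response->getContent();
    sleep(1); // be polite: at most one request per second
}
```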
Full source code
The complete working project, including the command implementation, formatting helpers, and setup instructions, is available on GitHub:
When to use this pattern
- Data aggregation tools
- Monitoring systems
- Intelligence platforms
- Background automation jobs
Symfony Console Commands combined with DomCrawler provide an underrated yet powerful solution for these use cases.
Next steps
In part 2 of this series, the scraped results will be persisted to a database.