Scraping Websites with Symfony Console Commands (Clean and Production-Ready)

Published: February 26, 2026, 01:48 GMT+8
3 min read
Source: Dev.to

Introduction

Web scraping does not belong in a controller.
It is long-running, can fail, usually needs scheduling, and is essentially a form of automation.
Symfony console commands are a natural fit: they run inside the Symfony application with full access to its services and the dependency injection container.

What the command does

  • Scrapes country data from a free sandbox site.
  • Parses the HTML with DomCrawler.
  • Sorts the results alphabetically.
  • Prints a tidy CLI table.

GitHub repository for reference:

Scraping sandbox used in the demo:

Prerequisites

composer require symfony/http-client
composer require symfony/dom-crawler
composer require symfony/css-selector

Create the console command

php bin/console make:command app:create-countries
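The generated class can then be fleshed out. A minimal sketch of where the pieces from the following sections go (the class name, description, and the placeholder URL constant are illustrative, and the `#[AsCommand]` attribute assumes Symfony 6.1 or later):

```php
<?php

namespace App\Command;

use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

#[AsCommand(
    name: 'app:create-countries',
    description: 'Scrapes country data and prints it as a CLI table.',
)]
class CreateCountriesCommand extends Command
{
    // Placeholder — substitute the sandbox URL linked in the article.
    private const URL = 'https://example.com/countries';

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // Fetch, parse, sort, and print — filled in over the next sections.
        return Command::SUCCESS;
    }
}
```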

Inject the HTTP client (DI)

use Symfony\Contracts\HttpClient\HttpClientInterface;

public function __construct(
    private HttpClientInterface $client
) {}

Fetch the page

$response = $this->client->request('GET', self::URL);
$html = $response->getContent();   // raw HTML

Parse with DomCrawler

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);

Extract country information

$countryInfo = [];

$crawler->filter('.country')->each(function (Crawler $row) use (&$countryInfo) {
    $countryInfo[] = [
        $row->filter('.country-name')->text(),
        $row->filter('.country-capital')->text(),
        $row->filter('.country-population')->text(),
        $row->filter('.country-area')->text(),
    ];
});
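Note that `filter(...)->text()` throws when the selector matches nothing, so one malformed row can abort the whole crawl. A hedged variant (the `nodeText` helper is made up for this sketch, not part of DomCrawler) falls back to a placeholder instead:

```php
use Symfony\Component\DomCrawler\Crawler;

// Return the node's trimmed text, or a fallback when the selector
// matches nothing in this row. (Helper name is illustrative.)
function nodeText(Crawler $row, string $selector, string $fallback = 'n/a'): string
{
    $node = $row->filter($selector);

    return $node->count() > 0 ? trim($node->text()) : $fallback;
}
```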

Sort alphabetically by country name

usort($countryInfo, function ($a, $b) {
    return strcasecmp($a[0], $b[0]);
});
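As a quick sanity check: `strcasecmp` compares without regard to case, so lowercase names interleave correctly with capitalized ones (the sample rows below are made up, in the same `[name, capital, population, area]` shape as `$countryInfo`):

```php
<?php

// Sample rows in the same shape the extraction step produces.
$countryInfo = [
    ['canada', 'Ottawa', '37058856', '9984670'],
    ['Brazil', 'Brasilia', '211049527', '8515767'],
    ['albania', 'Tirana', '2862427', '28748'],
];

// Case-insensitive sort on the country name (index 0).
usort($countryInfo, function ($a, $b) {
    return strcasecmp($a[0], $b[0]);
});

print_r(array_column($countryInfo, 0)); // albania, Brazil, canada
```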

Display results in a formatted table

// Header
printf(
    "%-45s | %-20s | %15s | %15s\n",
    "Country name",
    "Capital",
    "Population",
    "Area (km2)"
);

// Rows
foreach ($countryInfo as $row) {
    printf(
        "%-45s | %-20s | %15s | %15s\n",
        $row[0],
        $row[1],
        $row[2],
        $row[3]
    );
}

Tip: Add a multibyte‑safe padding helper if you need proper alignment with Unicode characters.
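One possible helper, assuming the mbstring extension is available: pad by display width rather than byte length, since `printf`'s `%-45s` (like `str_pad`) counts bytes and under-pads multibyte strings:

```php
<?php

// str_pad() and printf() count bytes, so multibyte names such as
// "Côte d'Ivoire" get padded too little and break column alignment.
function mb_pad(string $value, int $width): string
{
    // mb_strwidth() measures terminal display width, not byte length.
    $padding = max(0, $width - mb_strwidth($value, 'UTF-8'));

    return $value . str_repeat(' ', $padding);
}

echo mb_pad("Côte d'Ivoire", 20) . "|\n"; // column edge lands at width 20
echo mb_pad('Canada', 20) . "|\n";
```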

Run the command:

php bin/console app:create-countries

You should see an output similar to a professional terminal table rather than raw debug data.
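If you would rather not manage column widths yourself, Symfony Console also ships a Table helper that draws borders and computes widths for you. A sketch of the same output inside the command's `execute()` method, with `$countryInfo` built as above:

```php
use Symfony\Component\Console\Helper\Table;

// Inside execute(InputInterface $input, OutputInterface $output):
$table = new Table($output);
$table->setHeaders(['Country name', 'Capital', 'Population', 'Area (km2)']);
$table->setRows($countryInfo);
$table->render();
```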

Benefits of this approach

  • Separation of concerns – scraping logic lives outside controllers.
  • Cron‑friendly – easy to schedule with cron or Symfony’s scheduler.
  • Clean architecture – reusable, testable, and maintainable code.
  • Scalable – can be refactored into asynchronous jobs if needed.

Scraping etiquette (before targeting real sites)

  • Check the site’s Terms of Service.
  • Respect robots.txt.
  • Avoid aggressive request rates; add delays when scraping multiple pages:
sleep(1);   // pause 1 second between requests
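For multi-page crawls, a small randomized delay is gentler on the target than a perfectly regular beat. A sketch (the helper name, jitter range, and `$pageUrls` list are all illustrative; the short delays keep the demo fast, while around a second per request is more realistic):

```php
<?php

// Pause between requests with a little jitter so the crawl
// does not hit the server at a fixed interval.
function politePause(int $baseMs = 1000, int $jitterMs = 500): void
{
    usleep(($baseMs + random_int(0, $jitterMs)) * 1000); // usleep takes microseconds
}

$pageUrls = [
    'https://example.com/page/1',
    'https://example.com/page/2',
];

foreach ($pageUrls as $url) {
    // ... fetch and parse $url here ...
    politePause(100, 50); // short values for the demo
}
```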

Full source code

The complete working project, including the command implementation, formatting helpers, and setup instructions, is available on GitHub:

When to use this pattern

  • Data aggregation tools
  • Monitoring systems
  • Intelligence platforms
  • Background automation jobs

Symfony Console Commands combined with DomCrawler provide an underrated yet powerful solution for these use cases.

Next steps

In part 2 of this series, the scraped results will be persisted to a database.
