How I scraped the CQC Care Register without hitting the API auth wall
Source: Dev.to
API authentication wall
CQC migrated their API to api.service.cqc.org.uk and added bearer‑token authentication. You must register at their developer portal, create an application, and include an Authorization: Bearer header on every request. This isn’t a deal‑breaker for enterprise use cases, but it adds friction for a data product because users have to register with CQC before they can run your actor.
The old API base (api.cqc.org.uk/public/v1) now returns HTTP 403 – fully blocked.
Open‑data file rescue
CQC publishes a monthly open‑data file called HSCA_Active_Locations.ods. It’s a 23 MB OpenDocument Spreadsheet containing every active regulated location in England (≈ 56 000). The file is free, requires no authentication, and is released under the Open Government Licence. The URL is date‑stamped and changes each month, but the transparency page always links to the current version.
Approach
- Scrape the transparency page to find the current ODS URL.
- Download the file.
- Parse it, filter rows, and push results.
No API, no auth wall.
ODS parsing challenge
ODS files (.ods) are ZIP archives containing XML. The standard tool for parsing them in Node.js is SheetJS (xlsx package, v0.18.5 – the last Apache 2.0 release).
Sheet selection
The workbook contains three sheets: README, HSCA_Active_Locations, and Dual_Registration_Locations. SheetJS defaults to the first sheet (README), which only has 34 rows. The code below selects the sheet with the most rows:
let bestSheet = null;
let bestCount = 0;
for (const name of workbook.SheetNames) {
const sheet = workbook.Sheets[name];
const probe = XLSX.utils.sheet_to_json(sheet, { header: 1 });
if (probe.length > bestCount) {
bestCount = probe.length;
bestSheet = sheet;
}
}
Column name handling
Column names are Title Case with spaces (e.g., Location Local Authority, Location Latest Overall Rating), not the snake_case shown in third‑party docs. A helper that tries multiple variants keeps the parser resilient to future renames:
function col(row, ...candidates) {
for (const name of candidates) {
if (row[name] !== undefined) return row[name];
}
return undefined;
}
Date handling
When you read the workbook with { cellDates: true }, dates are returned as JavaScript Date objects, while coordinates come back as strings. Both are normalised in the processing step.
const workbook = XLSX.read(buffer, { type: 'buffer', cellDates: true });
Bonus: service type and user‑band extraction
The ODS has 122 columns; about 60 are boolean flags (Y/null) for specific service types and user bands (e.g., Service type - Care home service with nursing). Instead of mapping each column individually, the active ones are extracted into arrays:
function extractTagged(row, prefix) {
return Object.entries(row)
.filter(([key, val]) => key.startsWith(prefix) && (val === 'Y' || val === true))
.map(([key]) => key.slice(prefix.length));
}
Resulting output
{
"serviceTypes": ["Care home service without nursing"],
"serviceUserBands": ["Dementia", "Older People"]
}
Result
The actor supports two modes:
searchLocations– filter by area, rating, service type.lookupProviders– fetch all locations for specific provider IDs.
It loads and parses the 23 MB ODS on every run (≈ 10–15 s) and then filters in memory. No proxy, no browser automation, no anti‑bot handling.
Available on Apify Store: CQC Care Register Scraper