XML / HTML tables / Web scraping (static HTML)

Published: December 11, 2025 at 02:49 PM EST
4 min read
Source: Dev.to

Quick workspace prep

# Check working directory
getwd()

# Set working directory (if you must)
setwd("/path/to/project")

# Start with a clean environment (use with care)
rm(list = ls())

# Show files in the current directory
list.files()

Base R vs modern tidyverse / high‑performance packages

  • Base R (read.table, read.csv, read.delim) – works everywhere, flexible.
  • readr (read_csv, read_delim) – faster, consistent parsing, tidyverse‑friendly.
  • data.table::fread – extremely fast for large CSV‑like files.
  • readxl – best for Excel (no external dependencies).
  • haven – best for SAS/SPSS/Stata (via ReadStat).
  • jsonlite – modern JSON parsing.
  • DBI + specific DB backends – standard interface for relational DBs.
  • arrow – Feather/Parquet and interop with Python/Spark; excellent for large, columnar data.

CSV / delimited text

  • Small‑medium files: readr::read_csv()
  • Very large files: data.table::fread() or vroom::vroom()

Examples

# readr – tidy defaults, prints parsing problems
install.packages("readr")
library(readr)

df  <- read_csv("data/sales.csv")               # comma‑separated
df2 <- read_delim("data/data.txt", "\t")        # tab‑separated
# data.table – blazing fast, automatic type detection
install.packages("data.table")
library(data.table)

dt <- fread("data/sales.csv")
# vroom – lazy loading, fast for many files
install.packages("vroom")
library(vroom)

df_v <- vroom("data/sales.csv")

Useful arguments

  • locale = locale(encoding = "UTF-8") – handle encodings.
  • col_types = cols(...) – force column types (avoid mis‑parsing).
  • na = c("", "NA", "NULL") – specify missing tokens.
  • n_max – read only the first N rows (quick inspect).
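
A minimal sketch combining several of these options (the file path and column names are hypothetical):

library(readr)

df <- read_csv(
  "data/sales.csv",
  col_types = cols(id = col_integer(), amount = col_double(), region = col_character()),
  na        = c("", "NA", "NULL"),
  locale    = locale(encoding = "UTF-8"),
  n_max     = 1000              # inspect only the first 1,000 rows
)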

Excel (XLS / XLSX)

  • readxl::read_excel() – no Java dependency, reliable.
  • tidyxl, openxlsx – for formulas or writing.

Example

install.packages("readxl")
library(readxl)

# Read the first sheet
df <- read_excel("data/book.xlsx")

# Read a specific sheet by name or index
df_sheet3 <- read_excel("data/book.xlsx", sheet = "Sales")
df_sheet2 <- read_excel("data/book.xlsx", sheet = 2)

# Read a range
df_range <- read_excel("data/book.xlsx", range = "A1:F100")

JSON

  • jsonlite::fromJSON() – robust; converts arrays to data frames where appropriate.
install.packages("jsonlite")
library(jsonlite)

# From a local file
j <- fromJSON("data/doc.json", flatten = TRUE)

# From a URL
j2 <- fromJSON("https://api.example.com/data")

Tip: If the JSON contains nested lists, use flatten = TRUE or tidy the remaining nested columns with tidyr::unnest().
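
If flattening is not enough, here is a sketch of unnesting a leftover list-column (the column name "orders" is hypothetical):

library(jsonlite)
library(tidyr)

j <- fromJSON("data/doc.json", flatten = TRUE)

# Expand a column that still holds nested data frames into regular rows/columns
j_tidy <- unnest(j, cols = orders)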

XML / HTML tables / Web scraping (static HTML)

  • xml2 + rvest for HTML scraping / table extraction.
  • XML for more advanced XML parsing (but xml2 is generally preferred).
install.packages(c("xml2", "rvest"))
library(xml2)
library(rvest)

page   <- read_html("https://example.com/page-with-tables")
tables <- html_table(html_elements(page, "table"))  # list of tibbles

Excel‑like & legacy formats (ODS, Google Sheets)

Google Sheets

install.packages("googlesheets4")
library(googlesheets4)

sheet <- read_sheet("https://docs.google.com/spreadsheets/...")

ODS

install.packages("readODS")
library(readODS)

ods_data <- read_ods("data/file.ods")

SAS / SPSS / Stata

  • haven (fast, preserves labelled variables).
  • foreign is older; haven is preferred.
install.packages("haven")
library(haven)

df_sas   <- read_sas("data/data.sas7bdat")
df_spss  <- read_sav("data/data.sav")
df_stata <- read_dta("data/data.dta")

Tip: haven keeps value labels (labelled class). Convert to factors with labelled::to_factor().
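
For example, converting labelled columns after import (the column name is hypothetical):

install.packages("labelled")
library(haven)
library(labelled)

df_spss <- read_sav("data/data.sav")

df_spss$gender <- labelled::to_factor(df_spss$gender)  # one labelled column
df_factors     <- haven::as_factor(df_spss)            # all labelled columns at once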

MATLAB / Octave

install.packages("R.matlab")
library(R.matlab)

m <- readMat("data/matrix.mat")

For Octave text data:

install.packages("foreign")
library(foreign)

oct_data <- read.octave("data/file.oct")

Parquet, Feather, Arrow — modern columnar formats

install.packages("arrow")
library(arrow)

# Read Parquet
tbl <- read_parquet("data/data.parquet")

# Write Parquet
write_parquet(df, "out/data.parquet")

Benefit: Arrow enables out‑of‑memory, zero‑copy reads and is ideal for large‑scale pipelines.
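
For larger-than-memory data, a sketch of the lazy dplyr workflow over a directory of Parquet files (the path and column names are hypothetical):

library(arrow)
library(dplyr)

ds <- open_dataset("data/parquet_dir")        # scans metadata only, no full read

result <- ds |>
  select(region, amount) |>                   # read just these columns
  filter(amount > 100) |>                     # filter is pushed down to the scan
  collect()                                   # materialize the result in memory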

ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)

  • DBI + driver package (RPostgres, RMySQL, odbc) – standard interface.
  • pool for connection pooling in applications.

Example: PostgreSQL via DBI

install.packages(c("DBI", "RPostgres"))
library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "mydb",
                 host   = "db.example.com",
                 port   = 5432,
                 user   = "me",
                 password = "pw")

# Run a query
df <- dbGetQuery(con, "SELECT * FROM schema.table LIMIT 1000")

# Fetch a big table in chunks
res   <- dbSendQuery(con, "SELECT * FROM big_table")
chunk <- dbFetch(res, n = 10000)

# Clean up
dbClearResult(res)
dbDisconnect(con)
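
To stream the whole result set rather than a single chunk, loop until DBI reports completion (a sketch; the per-chunk processing is up to you):

res <- dbSendQuery(con, "SELECT * FROM big_table")

while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000)
  # ... process `chunk` here (aggregate, write out, etc.) ...
}

dbClearResult(res)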

ODBC (Windows / DSN)

install.packages("odbc")
library(DBI)

con <- dbConnect(odbc::odbc(), DSN = "my_dsn", UID = "user", PWD = "pw")
dbListTables(con)
dbDisconnect(con)

Tip: Avoid SELECT * on huge tables—select only needed columns and filter on the SQL side.
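
For example, with a parameterised query (the table, columns, and placeholder syntax are illustrative; RPostgres uses $1, many ODBC drivers use ?):

df <- dbGetQuery(
  con,
  "SELECT order_id, amount FROM sales WHERE order_date >= $1",
  params = list("2024-01-01")
)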

Cloud storage (S3, GCS, Azure)

  • S3: aws.s3; Arrow can read directly from S3 URIs.
  • Google Cloud Storage: googleCloudStorageR, or Arrow's GCS support.
  • Azure: AzureStor.

Example with aws.s3

install.packages("aws.s3")
library(aws.s3)

# Configure AWS credentials (env vars or ~/.aws/credentials)
s3read_using(FUN = read.csv, object = "s3://my-bucket/data.csv")

Arrow can also read Parquet from S3:

arrow::read_parquet("s3://bucket/path/file.parquet")

Reading in chunks / streaming large files

  • Use database imports and query subsets.
  • data.table::fread(select = ...) to read only specific columns.
  • vroom + dplyr for delayed reads.
  • For complex chunking: readr::read_csv_chunked() / read_lines_chunked() (see the sketch below), or the LaF and iotools packages.
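
A sketch of chunked processing with readr (the file and column name are hypothetical):

library(readr)

total <- 0
cb <- function(chunk, pos) {
  total <<- total + sum(chunk$amount, na.rm = TRUE)   # aggregate without keeping rows
}

read_csv_chunked("big.csv", SideEffectChunkCallback$new(cb), chunk_size = 100000)
total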

Example with LaF

install.packages("LaF")
library(LaF)

laf   <- laf_open_csv("big.csv",
                      column_types = c("integer", "double", "string"),
                      nrows = 1e8)
block <- laf[1:100000, ]  # first 100k rows

Parsing dates, times, and types

Proper parsing avoids downstream surprises.

install.packages("lubridate")
library(lubridate)

df$date     <- ymd(df$date_string)          # "2023-07-01"
df$datetime <- ymd_hms(df$ts)                # "2023-07-01 08:45:00"

Use col_types in readr or colClasses in fread to enforce column types.
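
For example (column names are hypothetical):

# readr: compact spec ("i" = integer, "D" = date, "d" = double, "c" = character)
df <- readr::read_csv("data/sales.csv", col_types = "iDdc")

# data.table: colClasses by column name
dt <- data.table::fread("data/sales.csv",
                        colClasses = c(id = "integer", amount = "numeric", region = "character"))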

Encoding & locale issues

If you see garbled characters:

# readr
df <- read_csv("file.csv", locale = locale(encoding = "latin1"))

# base read.table
df <- read.table("file.txt", fileEncoding = "UTF-8")

Check with Encoding() and convert using iconv() if needed.
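
For example (the column name is hypothetical):

Encoding(df$name)                                    # "unknown", "latin1", or "UTF-8"
df$name <- iconv(df$name, from = "latin1", to = "UTF-8")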

Common import pitfalls and troubleshooting

  • Headers misaligned / comments: use skip = n, comment = "#", or skip_empty_rows = TRUE.
  • Inconsistent column types: pre‑specify column types (col_types or colClasses).
  • NA tokens: set na = c("", "NA", "n/a").
  • Thousands separators / decimal marks: locale = locale(decimal_mark = ",", grouping_mark = ".").
  • Memory errors: filter on the SQL side, load a subset, or use Arrow/Parquet to read columns only.
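
For example, a European-style CSV (";"-separated with comma decimal marks; the file name is hypothetical):

df <- readr::read_csv2("data/european.csv")          # read_csv2() assumes ";" and ","

# Or state the locale explicitly
df <- readr::read_delim("data/european.csv", delim = ";",
                        locale = readr::locale(decimal_mark = ",", grouping_mark = "."))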

Quick hacks & productivity tips

  • Clipboard import:

    df <- read.table("clipboard", header = TRUE, sep = "\t")              # Windows
    df <- read.table(pipe("pbpaste"), header = TRUE, sep = "\t")          # macOS
  • Peek first rows before full import:

    readr::read_lines("file.csv", n_max = 10)
  • Inspect parsing problems after a readr import:

    problems(df)   # pass the tibble returned by read_csv()