XML / HTML tables / Web scraping (static HTML)

Published: December 11, 2025 at 02:49 PM EST
4 min read
Source: Dev.to

Quick workspace prep

# Check working directory
getwd()

# Set working directory (if you must)
setwd("/path/to/project")

# Start with a clean environment (use with care)
rm(list = ls())

# Show files in the current directory
list.files()

Base R vs modern tidyverse / high‑performance packages

  • Base R (read.table, read.csv, read.delim) – works everywhere, flexible.
  • readr (read_csv, read_delim) – faster, consistent parsing, tidyverse‑friendly.
  • data.table::fread – extremely fast for large CSV‑like files.
  • readxl – best for Excel (no external dependencies).
  • haven – best for SAS/SPSS/Stata (via ReadStat).
  • jsonlite – modern JSON parsing.
  • DBI + specific DB backends – standard interface for relational DBs.
  • arrow – Feather/Parquet and interop with Python/Spark; excellent for large, columnar data.

CSV / delimited text

  • Small‑medium files: readr::read_csv()
  • Very large files: data.table::fread() or vroom::vroom()

Examples

# readr – tidy defaults, prints parsing problems
install.packages("readr")
library(readr)

df  <- read_csv("data/sales.csv")               # comma‑separated
df2 <- read_delim("data/data.txt", "\t")        # tab‑separated
# data.table – blazing fast, automatic type detection
install.packages("data.table")
library(data.table)

dt <- fread("data/sales.csv")
# vroom – lazy loading, fast for many files
install.packages("vroom")
library(vroom)

df_v <- vroom("data/sales.csv")

Useful arguments

  • locale = locale(encoding = "UTF-8") – handle encodings.
  • col_types = cols(...) – force column types (avoid mis‑parsing).
  • na = c("", "NA", "NULL") – specify missing tokens.
  • n_max – read only the first N rows (quick inspect).
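
A minimal sketch combining several of these options (the file path and column names are hypothetical):

library(readr)

df <- read_csv(
  "data/sales.csv",
  col_types = cols(id = col_integer(), amount = col_double(), region = col_character()),
  na        = c("", "NA", "NULL"),
  locale    = locale(encoding = "UTF-8"),
  n_max     = 1000              # inspect only the first 1,000 rows
)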

Excel (XLS / XLSX)

  • readxl::read_excel() – no Java dependency, reliable.
  • tidyxl, openxlsx – for formulas or writing.

Example

install.packages("readxl")
library(readxl)

# Read the first sheet
df <- read_excel("data/book.xlsx")

# Read a specific sheet by name or index
df_sheet3 <- read_excel("data/book.xlsx", sheet = "Sales")
df_sheet2 <- read_excel("data/book.xlsx", sheet = 2)

# Read a range
df_range <- read_excel("data/book.xlsx", range = "A1:F100")

JSON

  • jsonlite::fromJSON() – robust; converts arrays to data frames where appropriate.
install.packages("jsonlite")
library(jsonlite)

# From a local file
j <- fromJSON("data/doc.json", flatten = TRUE)

# From a URL
j2 <- fromJSON("https://api.example.com/data")

Tip: If the JSON contains nested lists, use flatten = TRUE or tidy the remaining nested columns with tidyr::unnest().
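
If flattening is not enough, here is a sketch of unnesting a leftover list-column (the column name "orders" is hypothetical):

library(jsonlite)
library(tidyr)

j <- fromJSON("data/doc.json", flatten = TRUE)

# Expand a column that still holds nested data frames into regular rows/columns
j_tidy <- unnest(j, cols = orders)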

XML / HTML tables / Web scraping (static HTML)

  • xml2 + rvest for HTML scraping / table extraction.
  • XML for more advanced XML parsing (but xml2 is generally preferred).
install.packages(c("xml2", "rvest"))
library(xml2)
library(rvest)

page   <- read_html("https://example.com/page-with-tables")
tables <- html_table(html_elements(page, "table"))  # list of tibbles

Excel‑like & legacy formats (ODS, Google Sheets)

Google Sheets

install.packages("googlesheets4")
library(googlesheets4)

sheet <- read_sheet("https://docs.google.com/spreadsheets/...")

ODS

install.packages("readODS")
library(readODS)

ods_data <- read_ods("data/file.ods")

SAS / SPSS / Stata

  • haven (fast, preserves labelled variables).
  • foreign is older; haven is preferred.
install.packages("haven")
library(haven)

df_sas   <- read_sas("data/data.sas7bdat")
df_spss  <- read_sav("data/data.sav")
df_stata <- read_dta("data/data.dta")

Tip: haven keeps value labels (labelled class). Convert to factors with labelled::to_factor().
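
For example, converting labelled columns after import (the column name is hypothetical):

install.packages("labelled")
library(haven)
library(labelled)

df_spss <- read_sav("data/data.sav")

df_spss$gender <- labelled::to_factor(df_spss$gender)  # one labelled column
df_factors     <- haven::as_factor(df_spss)            # all labelled columns at once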

MATLAB / Octave

install.packages("R.matlab")
library(R.matlab)

m <- readMat("data/matrix.mat")

For Octave text data:

install.packages("foreign")
library(foreign)

oct_data <- read.octave("data/file.oct")

Parquet, Feather, Arrow — modern columnar formats

install.packages("arrow")
library(arrow)

# Read Parquet
tbl <- read_parquet("data/data.parquet")

# Write Parquet
write_parquet(df, "out/data.parquet")

Benefit: Arrow enables out‑of‑memory, zero‑copy reads and is ideal for large‑scale pipelines.
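
For larger-than-memory data, a sketch of the lazy dplyr workflow over a directory of Parquet files (the path and column names are hypothetical):

library(arrow)
library(dplyr)

ds <- open_dataset("data/parquet_dir")        # scans metadata only, no full read

result <- ds |>
  select(region, amount) |>                   # read just these columns
  filter(amount > 100) |>                     # filter is pushed down to the scan
  collect()                                   # materialize the result in memory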

ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)

  • DBI + driver package (RPostgres, RMySQL, odbc) – standard interface.
  • pool for connection pooling in applications.

Example: PostgreSQL via DBI

install.packages(c("DBI", "RPostgres"))
library(DBI)

con <- dbConnect(RPostgres::Postgres(),
                 dbname = "mydb",
                 host   = "db.example.com",
                 port   = 5432,
                 user   = "me",
                 password = "pw")

# Run a query
df <- dbGetQuery(con, "SELECT * FROM schema.table LIMIT 1000")

# Fetch a big table in chunks
res   <- dbSendQuery(con, "SELECT * FROM big_table")
chunk <- dbFetch(res, n = 10000)

# Clean up
dbClearResult(res)
dbDisconnect(con)
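
To stream the whole result set rather than a single chunk, loop until DBI reports completion (a sketch; the per-chunk processing is up to you):

res <- dbSendQuery(con, "SELECT * FROM big_table")

while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10000)
  # ... process `chunk` here (aggregate, write out, etc.) ...
}

dbClearResult(res)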

ODBC (Windows / DSN)

install.packages("odbc")
library(DBI)

con <- dbConnect(odbc::odbc(), DSN = "my_dsn", UID = "user", PWD = "pw")
dbListTables(con)
dbDisconnect(con)

Tip: Avoid SELECT * on huge tables—select only needed columns and filter on the SQL side.
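
For example, with a parameterised query (the table, columns, and placeholder syntax are illustrative; RPostgres uses $1, many ODBC drivers use ?):

df <- dbGetQuery(
  con,
  "SELECT order_id, amount FROM sales WHERE order_date >= $1",
  params = list("2024-01-01")
)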

Cloud storage (S3, GCS, Azure)

  • S3: aws.s3; Arrow can read directly from S3 URIs.
  • Google Cloud Storage: googleCloudStorageR, or Arrow's GCS support.
  • Azure: AzureStor.

Example with aws.s3

install.packages("aws.s3")
library(aws.s3)

# Configure AWS credentials (env vars or ~/.aws/credentials)
s3read_using(FUN = read.csv, object = "s3://my-bucket/data.csv")

Arrow can also read Parquet from S3:

arrow::read_parquet("s3://bucket/path/file.parquet")

Reading in chunks / streaming large files

  • Use database imports and query subsets.
  • data.table::fread(select = ...) to read only specific columns.
  • vroom + dplyr for delayed reads.
  • For complex chunking: readr::read_csv_chunked() / read_lines_chunked() (see the sketch below), or the LaF and iotools packages.
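
A sketch of chunked processing with readr (the file and column name are hypothetical):

library(readr)

total <- 0
cb <- function(chunk, pos) {
  total <<- total + sum(chunk$amount, na.rm = TRUE)   # aggregate without keeping rows
}

read_csv_chunked("big.csv", SideEffectChunkCallback$new(cb), chunk_size = 100000)
total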

Example with LaF

install.packages("LaF")
library(LaF)

laf   <- laf_open_csv("big.csv",
                      column_types = c("integer", "double", "string"),
                      nrows = 1e8)
block <- laf[1:100000, ]  # first 100k rows

Parsing dates, times, and types

Proper parsing avoids downstream surprises.

install.packages("lubridate")
library(lubridate)

df$date     <- ymd(df$date_string)          # "2023-07-01"
df$datetime <- ymd_hms(df$ts)                # "2023-07-01 08:45:00"

Use col_types in readr or colClasses in fread to enforce column types.
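
For example (column names are hypothetical):

# readr: compact spec ("i" = integer, "D" = date, "d" = double, "c" = character)
df <- readr::read_csv("data/sales.csv", col_types = "iDdc")

# data.table: colClasses by column name
dt <- data.table::fread("data/sales.csv",
                        colClasses = c(id = "integer", amount = "numeric", region = "character"))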

Encoding & locale issues

If you see garbled characters:

# readr
df <- read_csv("file.csv", locale = locale(encoding = "latin1"))

# base read.table
df <- read.table("file.txt", fileEncoding = "UTF-8")

Check with Encoding() and convert using iconv() if needed.
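
For example (the column name is hypothetical):

Encoding(df$name)                                    # "unknown", "latin1", or "UTF-8"
df$name <- iconv(df$name, from = "latin1", to = "UTF-8")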

Common import pitfalls and troubleshooting

  • Headers misaligned / comments: use skip = n, comment = "#", or skip_empty_rows = TRUE.
  • Inconsistent column types: pre‑specify column types (col_types or colClasses).
  • NA tokens: set na = c("", "NA", "n/a").
  • Thousands separators / decimal marks: locale = locale(decimal_mark = ",", grouping_mark = ".").
  • Memory errors: filter on the SQL side, load a subset, or use Arrow/Parquet to read columns only.
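
For example, a European-style CSV (";"-separated with comma decimal marks; the file name is hypothetical):

df <- readr::read_csv2("data/european.csv")          # read_csv2() assumes ";" and ","

# Or state the locale explicitly
df <- readr::read_delim("data/european.csv", delim = ";",
                        locale = readr::locale(decimal_mark = ",", grouping_mark = "."))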

Quick hacks & productivity tips

  • Clipboard import:

    df <- read.table("clipboard", header = TRUE, sep = "\t")              # Windows
    df <- read.table(pipe("pbpaste"), header = TRUE, sep = "\t")          # macOS
  • Peek first rows before full import:

    readr::read_lines("file.csv", n_max = 10)
  • Inspect parsing problems after a readr import:

    problems(df)   # pass the tibble returned by read_csv()