Source: Dev.to
Quick workspace prep
# Check working directory
getwd()
# Set working directory (if you must)
setwd("/path/to/project")
# Start with a clean environment (use with care)
rm(list = ls())
# Show files in the current directory
list.files()
Base R vs modern tidyverse / high‑performance packages
- Base R (read.table, read.csv, read.delim) – works everywhere, flexible.
- readr (read_csv, read_delim) – faster, consistent parsing, tidyverse‑friendly.
- data.table::fread – extremely fast for large CSV‑like files.
- readxl – best for Excel (no external dependencies).
- haven – best for SAS/SPSS/Stata (via ReadStat).
- jsonlite – modern JSON parsing.
- DBI + specific DB backends – standard interface for relational DBs.
- arrow – Feather/Parquet and interop with Python/Spark; excellent for large, columnar data.
CSV / delimited text
Recommended
- Small‑medium files: readr::read_csv()
- Very large files: data.table::fread() or vroom::vroom()
Examples
# readr – tidy defaults, prints parsing problems
install.packages("readr")
library(readr)
df <- read_csv("data/sales.csv") # comma‑separated
df2 <- read_delim("data/data.txt", "\t") # tab‑separated
# data.table – blazing fast, automatic type detection
install.packages("data.table")
library(data.table)
dt <- fread("data/sales.csv")
# vroom – lazy loading, fast for many files
install.packages("vroom")
library(vroom)
df_v <- vroom("data/sales.csv")
Useful arguments
- locale = locale(encoding = "UTF-8") – handle encodings.
- col_types = cols(...) – force column types (avoid mis‑parsing).
- na = c("", "NA", "NULL") – specify missing tokens.
- n_max – read only the first N rows (quick inspect).
These can be combined in a single call, as sketched below.
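A minimal sketch combining these arguments with readr (the path and column names are placeholders):
library(readr)
df <- read_csv("data/sales.csv",                        # placeholder path
               locale    = locale(encoding = "UTF-8"),  # force the encoding
               col_types = cols(id     = col_integer(), # assumed columns
                                amount = col_double(),
                                region = col_character()),
               na        = c("", "NA", "NULL"),         # missing-value tokens
               n_max     = 100)                         # quick peek: first 100 rows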
Excel (XLS / XLSX)
Recommended
- readxl::read_excel() – no Java dependency, reliable.
- tidyxl, openxlsx – for formulas or writing.
Example
install.packages("readxl")
library(readxl)
# Read the first sheet
df <- read_excel("data/book.xlsx")
# Read a specific sheet by name or index
df_sheet3 <- read_excel("data/book.xlsx", sheet = "Sales")
df_sheet2 <- read_excel("data/book.xlsx", sheet = 2)
# Read a range
df_range <- read_excel("data/book.xlsx", range = "A1:F100")
JSON
Recommended
jsonlite::fromJSON() – robust; converts arrays to data frames where appropriate.
install.packages("jsonlite")
library(jsonlite)
# From a local file
j <- fromJSON("data/doc.json", flatten = TRUE)
# From a URL
j2 <- fromJSON("https://api.example.com/data")
Tip: If the JSON contains nested lists, use flatten = TRUE or tidy manually with tidyr::unnest().
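A small illustration with a made-up nested payload (the fields id and tags are hypothetical); nested arrays come back as list-columns, which tidyr::unnest() expands:
library(jsonlite)
library(tidyr)

# Hypothetical payload: each record carries a vector of tags
json_txt <- '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]'

j <- fromJSON(json_txt)          # "tags" becomes a list-column
tidy <- unnest(j, cols = tags)   # one row per id/tag combination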
XML / HTML tables / Web scraping (static HTML)
Recommended
- xml2 + rvest for HTML scraping / table extraction.
- XML for more advanced XML parsing (but xml2 is generally preferred).
install.packages(c("xml2", "rvest"))
library(xml2)
library(rvest)
page <- read_html("https://example.com/page-with-tables")
tables <- html_table(html_nodes(page, "table")) # list of tibbles
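For plain XML documents (as opposed to HTML tables), xml2 follows the same read-then-query pattern; a minimal sketch with a made-up snippet:
library(xml2)

# Made-up XML snippet for illustration
doc <- read_xml("<orders><order id='1'>10.5</order><order id='2'>7.2</order></orders>")

nodes   <- xml_find_all(doc, "//order")   # XPath query
amounts <- as.numeric(xml_text(nodes))    # node text
ids     <- xml_attr(nodes, "id")          # attribute values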
Excel‑like & legacy formats (ODS, Google Sheets)
Google Sheets
install.packages("googlesheets4")
library(googlesheets4)
sheet <- read_sheet("https://docs.google.com/spreadsheets/...")
ODS
install.packages("readODS")
library(readODS)
ods_data <- read_ods("data/file.ods")
SAS / SPSS / Stata
Recommended
- haven (fast, preserves labelled variables).
- foreign is older; haven is preferred.
install.packages("haven")
library(haven)
df_sas <- read_sas("data/data.sas7bdat")
df_spss <- read_sav("data/data.sav")
df_stata <- read_dta("data/data.dta")
Tip: haven keeps value labels (labelled class). Convert to factors with labelled::to_factor().
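For example, assuming df_spss has a labelled column called gender (the column name is an assumption):
install.packages("labelled")
library(labelled)

# Convert a single labelled column (column name is hypothetical)
df_spss$gender <- to_factor(df_spss$gender)

# Or convert every labelled column in the data frame at once
df_spss <- to_factor(df_spss)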
MATLAB / Octave
install.packages("R.matlab")
library(R.matlab)
m <- readMat("data/matrix.mat")
For Octave text data:
install.packages("foreign")
library(foreign)
oct_data <- read.octave("data/file.oct")
Parquet, Feather, Arrow — modern columnar formats
install.packages("arrow")
library(arrow)
# Read Parquet
tbl <- read_parquet("data/data.parquet")
# Write Parquet
write_parquet(df, "out/data.parquet")
Benefit: Arrow enables out‑of‑memory, zero‑copy reads and is ideal for large‑scale pipelines.
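A sketch of the out-of-memory pattern using arrow::open_dataset() with dplyr verbs (the directory path and column names are placeholders):
library(arrow)
library(dplyr)

# Point at a directory of Parquet files without loading them into RAM
ds <- open_dataset("data/parquet_dir")

# Filters and column selection are pushed down to Arrow;
# only the final result is materialised by collect()
result <- ds |>
  filter(year == 2023) |>          # assumed column
  select(year, region, amount) |>  # assumed columns
  collect()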
ODBC / Relational DBs (SQL Server, Postgres, MySQL, etc.)
Recommended approach
- DBI + a driver package (RPostgres, RMySQL, odbc) – standard interface.
- pool for connection pooling in applications.
Example: PostgreSQL via DBI
install.packages(c("DBI", "RPostgres"))
library(DBI)
con <- dbConnect(RPostgres::Postgres(),
                 dbname   = "mydb",
                 host     = "db.example.com",
                 port     = 5432,
                 user     = "me",
                 password = "pw")
# Run a query
df <- dbGetQuery(con, "SELECT * FROM schema.table LIMIT 1000")
# Fetch a big table in chunks
res <- dbSendQuery(con, "SELECT * FROM big_table")
chunk <- dbFetch(res, n = 10000)
# Clean up
dbClearResult(res)
dbDisconnect(con)
ODBC (Windows / DSN)
install.packages("odbc")
library(DBI)
con <- dbConnect(odbc::odbc(), DSN = "my_dsn", UID = "user", PWD = "pw")
dbListTables(con)
dbDisconnect(con)
Tip: Avoid SELECT * on huge tables—select only needed columns and filter on the SQL side.
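For example, instead of SELECT *, pull only the needed columns and let the database do the filtering (table, columns, and the region value are placeholders; the $1 placeholder syntax is RPostgres-style, odbc uses ?):
library(DBI)

# Only the needed columns, filtered server-side, with a bound parameter
res <- dbSendQuery(con, "SELECT id, region, amount FROM schema.table WHERE region = $1")
dbBind(res, list("EMEA"))
df <- dbFetch(res)
dbClearResult(res)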
Cloud storage (S3, GCS, Azure)
- S3: aws.s3; Arrow can also read directly from S3 URIs.
- Google Cloud Storage: googleCloudStorageR, or Arrow's GCS integration.
- Azure: AzureStor.
Example with aws.s3
install.packages("aws.s3")
library(aws.s3)
# Configure AWS credentials (env vars or ~/.aws/credentials)
s3read_using(FUN = read.csv, object = "s3://my-bucket/data.csv")
Arrow can also read Parquet from S3:
arrow::read_parquet("s3://bucket/path/file.parquet")
Reading in chunks / streaming large files
- Use database imports and query subsets.
- data.table::fread(select = ...) to read only specific columns.
- vroom + dplyr for delayed reads.
- For complex chunking: readr::read_lines_chunked() or the LaF and iotools packages (see the sketch after this list).
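A minimal chunked-read sketch using readr::read_csv_chunked(), which parses each block into a data frame and applies a callback (the filter condition and column name are made up):
library(readr)
library(dplyr)

# Keep only rows matching a condition, processing 100,000 rows at a time
keep_big <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, amount > 1000)   # assumed column
})

df <- read_csv_chunked("big.csv", callback = keep_big, chunk_size = 100000)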
Example with LaF
install.packages("LaF")
library(LaF)
laf <- laf_open_csv("big.csv",
                    column_types = c("integer", "double", "string"),
                    nrows = 1e8)
block <- laf[1:100000, ] # first 100k rows
Parsing dates, times, and types
Proper parsing avoids downstream surprises.
install.packages("lubridate")
library(lubridate)
df$date <- ymd(df$date_string) # "2023-07-01"
df$datetime <- ymd_hms(df$ts) # "2023-07-01 08:45:00"
Use col_types in readr or colClasses in fread to enforce column types.
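For instance (paths and column names are placeholders):
library(readr)
library(data.table)

# readr: declare types up front, including the date format
df <- read_csv("data/sales.csv",
               col_types = cols(date   = col_date(format = "%Y-%m-%d"),
                                amount = col_double()))

# data.table: colClasses by column name
dt <- fread("data/sales.csv",
            colClasses = c(date = "character", amount = "numeric"))
dt[, date := as.IDate(date)]   # convert the date column after reading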
Encoding & locale issues
If you see garbled characters:
# readr
df <- read_csv("file.csv", locale = locale(encoding = "latin1"))
# base read.table
df <- read.table("file.txt", fileEncoding = "UTF-8")
Check with Encoding() and convert using iconv() if needed.
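For example (assuming a character column called name that was read as latin1):
# Inspect how the strings are currently marked
Encoding(df$name)

# Re-encode from latin1 to UTF-8 (the column name is an assumption)
df$name <- iconv(df$name, from = "latin1", to = "UTF-8")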
Common import pitfalls and troubleshooting
- Headers misaligned / comments: use skip = n, comment = "#", or skip_empty_rows = TRUE.
- Inconsistent column types: pre‑specify column types (col_types or colClasses).
- NA tokens: set na = c("", "NA", "n/a").
- Thousands separators / decimal marks: locale = locale(decimal_mark = ",", grouping_mark = ".").
- Memory errors: filter on the SQL side, load a subset, or use Arrow/Parquet to read only the columns you need.
Several of these arguments appear together in the sketch below.
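As a concrete sketch, a European-style, semicolon-separated file with a two-line preamble, comment lines, and odd NA tokens might be read like this (the path and layout are assumed):
library(readr)

df <- read_delim("data/european.csv", delim = ";",   # placeholder path
                 skip    = 2,                        # skip the preamble
                 comment = "#",                      # drop comment lines
                 na      = c("", "NA", "n/a"),
                 locale  = locale(decimal_mark = ",", grouping_mark = "."))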
Quick hacks & productivity tips
- Clipboard import: df <- read.table("clipboard", header = TRUE, sep = "\t") on Windows; on macOS use read.table(pipe("pbpaste"), header = TRUE, sep = "\t").
- Peek at the first rows before a full import: readr::read_lines("file.csv", n_max = 10)
- Inspect parsing problems after a readr import: problems() (see the example below)
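For example:
library(readr)
df <- read_csv("file.csv")
problems(df)   # rows, columns, and expected vs actual values that failed to parse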