How to make LLMs work on large amounts of data
Source: Dev.to
Text‑to‑SQL vs. LLM‑Based Approaches
Text‑to‑SQL tools have long dominated the market for applying intelligence over massive datasets.
With the rise of large language models (LLMs), the landscape has shifted to include a variety of new techniques, such as:
- Retrieval‑Augmented Generation (RAG)
- Coding/SQL agents
- …and other hybrid solutions
The Core Challenge
LLMs cannot directly see the raw data. Instead, they receive only an abstracted view, for example:
- Summaries
- Sample rows
- Schema descriptions
- Partial slices generated by another system
When you need to process large numbers of rows, feeding them all to an LLM becomes impractical.
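To make the constraint concrete, here is a minimal sketch (plain pandas plus a hand-built prompt, not part of Datatune; the file name is hypothetical) of how such an abstracted view is usually assembled: only the schema and a few sample rows go into the prompt, never the full table.
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical table

# Build the abstracted view the LLM will actually see:
# column names/dtypes plus a handful of sample rows.
schema = "\n".join(f"- {col}: {dtype}" for col, dtype in df.dtypes.items())
sample = df.head(5).to_csv(index=False)

prompt = (
    "You are given a table with this schema:\n"
    f"{schema}\n\n"
    "Sample rows:\n"
    f"{sample}\n\n"
    "Write SQL to find the top 5 customers by total purchase amount."
)
# The prompt carries only a summary; the underlying rows never reach the model.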
How to Tackle This with Datatune
Datatune provides a scalable way to bridge the gap between massive data tables and LLMs:
- Chunking & Sampling – Break the dataset into manageable pieces or select representative samples.
- Schema‑aware Prompts – Include concise schema information so the LLM understands column meanings.
- Iterative Retrieval – Use RAG‑style loops to fetch additional rows only when the model requests more context.
- Result Aggregation – Combine the LLM’s partial outputs into a final, coherent answer or SQL query.
Quick Start
# Install Datatune
pip install datatune
# Example: Generate a query for a large table
datatune generate \
  --table my_large_table.csv \
  --prompt "Find the top 5 customers by total purchase amount" \
  --max-chunk-size 5000
The command above:
- Loads the table in chunks of 5 000 rows.
- Sends schema + sampled data to the LLM.
- Returns a ready‑to‑run SQL statement (or Python code) that respects the full dataset.
🎵 Datatune
Scalable data transformations with row‑level intelligence.
Datatune isn’t just another Text‑to‑SQL tool. With Datatune, LLMs and agents get full, programmatic access to your data and can apply semantic intelligence to every record.
How It Works
Installation
pip install datatune
Quick Start (Python API)
import datatune as dt
from datatune.llm.llm import OpenAI
import dask.dataframe as dd
llm = OpenAI(model_name="gpt-3.5-turbo")
df = dd.read_csv("products.csv")
# Extract categories using natural language
mapped = dt.map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)
# Filter with simple criteria
filtered = dt.filter(
    prompt="Keep only electronics products",
    input_fields=["Name"]
)(llm, mapped)
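Because the input here is a Dask DataFrame, the operations are presumably lazy; the following is a hedged sketch of materializing the output (it assumes standard Dask semantics and that the filtered result is itself a Dask DataFrame; check the repository docs for the exact finalization step Datatune expects).
# Materialize the lazy result and write it out (assumes a Dask DataFrame is returned).
result = filtered.compute()
result.to_csv("electronics_products.csv", index=False)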
The Context Length Problem
LLM context windows keep growing, but even an optimistic 100 M‑token window is quickly outgrown by a typical enterprise dataset that needs to be processed in a single request.
Example: a mid‑sized enterprise
| Item | Quantity |
|---|---|
| Rows in a transactional table | 10 000 000 |
| Columns per row | 20 |
| Average characters per column | 50 |
10 000 000 rows × 20 columns × 50 characters
= 10 000 000 000 characters
≈ 2.5 billion tokens (≈ 4 characters per token)
A 100 M‑token context window can only hold 1/25 of that data.
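The arithmetic can be checked directly (using the usual rough heuristic of about 4 characters per token):
rows, cols, chars_per_col = 10_000_000, 20, 50
chars = rows * cols * chars_per_col       # 10,000,000,000 characters
tokens = chars / 4                        # ≈ 2.5 billion tokens
context_window = 100_000_000              # optimistic 100M-token window
print(tokens / context_window)            # 25.0, i.e. the window holds 1/25 of the data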
Solving Large‑Scale Data Processing with Datatune
Datatune gives LLMs full access to massive datasets by processing rows in batches:
- Each row is transformed using a natural‑language prompt.
- Rows are grouped into batches and sent to the LLM.
- Dask’s parallel execution splits the data into partitions, allowing many batches to be processed simultaneously.
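The following sketch illustrates that general pattern with plain Dask and pandas (an illustration only, not Datatune's internal implementation; the batch size and the call_llm stub are hypothetical placeholders for real model calls).
import dask.dataframe as dd
import pandas as pd

BATCH_SIZE = 50  # hypothetical number of rows sent per LLM request

def call_llm(rows):
    # Stand-in for one prompt-plus-batch request to a real model.
    return ["<llm output>"] * len(rows)

def process_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each Dask partition is processed independently, so batches run in parallel.
    outputs = []
    for start in range(0, len(pdf), BATCH_SIZE):
        batch = pdf.iloc[start:start + BATCH_SIZE]
        outputs.extend(call_llm(batch.to_dict(orient="records")))
    return pdf.assign(Category=outputs)

df = dd.read_csv("products.csv")
result = df.map_partitions(process_partition).compute()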

Understanding Data‑Transformation Operations
Datatune provides four core data‑transformation primitives:
| Primitive | Description |
|---|---|
| MAP | Transform each row to new fields. |
| FILTER | Keep rows that satisfy a condition. |
| EXPAND | Add new rows derived from existing ones. |
| REDUCE | Aggregate rows into summary statistics. |
All primitives can be driven by natural‑language prompts.
MAP Example
mapped = dt.map(
    prompt="Extract categories from the description and name of the product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)
Chaining MAP and FILTER
# 1️⃣ Extract sentiment and topics from each review (MAP)
mapped = dt.map(
    prompt="Classify the sentiment and extract key topics from the review text.",
    input_fields=["review_text"],
    output_fields=["sentiment", "topics"]
)(llm, df)
# 2️⃣ Keep only negative reviews (FILTER)
filtered = dt.filter(
    prompt="Keep only rows where sentiment is negative."
)(llm, mapped)
Datatune Agents
Agents let users describe what they want in plain language; the agent decides how to chain primitives (MAP, FILTER, etc.) and can even generate Python code when row‑level intelligence isn’t required.

Agent Example
# `agent` is assumed to have been created with Datatune's agent API beforehand
# (see the repository for the exact constructor).
df = agent.do(
    """
    From product name and description, extract Category and Subcategory.
    Then keep only products that belong to the Electronics category
    and have a price greater than 100.
    """,
    df
)
The agent automatically:
- Maps product name/description → Category, Subcategory.
- Filters rows where Category == "Electronics" and price > 100.
Data Sources
Datatune works with many data back‑ends:
- DataFrames – Pandas, Dask, Polars, etc.
- Databases – via Ibis integration (DuckDB, PostgreSQL, MySQL, …).
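As a rough illustration of the database path, here is a hedged sketch that pulls a table out of DuckDB through Ibis and hands it to the same dt.map call used earlier (the database file, table name, and the pandas/Dask hand-off are assumptions for illustration; consult the repository for the exact integration Datatune supports).
import ibis
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import OpenAI

# Hypothetical DuckDB file and table.
con = ibis.duckdb.connect("warehouse.duckdb")
products = con.table("products").to_pandas()

# Convert to a Dask DataFrame so rows can be processed in parallel batches.
df = dd.from_pandas(products, npartitions=8)

llm = OpenAI(model_name="gpt-3.5-turbo")
mapped = dt.map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]
)(llm, df)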
Contributing
Datatune is open‑source and we welcome contributions!
🔗 Repository: https://github.com/vitalops/datatune
