Why Do Big Data Platforms Return to SQL?

Published: December 12, 2025 at 02:37 AM EST
4 min read
Source: Dev.to

Background

Structured data is the staple of big data analytics

Big data platforms focus on addressing the demand for big data storage and analytics. Among the vast amount of data to be stored there is, besides the structured data produced by business activities, a huge amount of unstructured data (audio, video, and the like) that can account for over 80% of all data on a platform. Yet storing data is only one part of a big data platform's goal; the more important part is to analyze the data and use it to generate business value.

Big data analytics covers two types of data: structured and unstructured.

  • Structured data is generated during daily business operations and constitutes the core of an enterprise's data. Before big data platforms emerged, structured data made up the majority, or even the entirety, of an enterprise's information base. As the business expands, data accumulates and strains conventional RDB‑based processing systems. Big data platforms rose to this challenge, supplying solutions for structured‑data analysis problems.
  • Unstructured data (logs, images, audio, video) is often processed to extract relevant structured attributes, such as producer, date, category, and duration for videos, or user IPs, access times, and keywords for logs. Thus, unstructured‑data analytics largely ends up working on this derived structured data.

Consequently, structured‑data analytics still dominates business analytics on big data platforms. Mature technologies for processing structured data—chiefly SQL‑driven, relational‑model RDBMSs—are widely used.

SQL dominates the field of structured‑data processing

Returning to SQL syntax is a clear trend in big data analytics. In Hadoop, the early Pig Latin language was largely abandoned while Hive persisted; on Spark, Spark SQL is used far more widely than the Scala‑based APIs. New big‑data computing systems also prefer SQL for computations. The decades‑old language reaffirms its position after years of rivalry with various challengers.

Two reasons drive this renewal:

  1. A better choice is not yet available
    Relational databases have been popular for so long that SQL has become an "old friend" of programmers. SQL is simple enough for regular queries, but it is far less convenient for complex procedural or order‑based computations (see the sketch after this list), and the alternatives are often equally cumbersome, requiring complex UDFs or custom code. Faced with two nuisances, many choose the familiar one.

  2. SQL has the endorsement of big‑data vendors
    Vendors seek higher performance, and SQL provides a well‑known benchmark (e.g., TPC‑H) that is easy to understand and assess. This encourages vendors to focus optimization efforts on SQL.
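
As a rough illustration of the order‑based case, consider finding the longest run of consecutive increases in a value column: in SQL this typically takes window functions and nested subqueries, while in procedural code it is a single ordered pass. The sketch below is a minimal Python illustration with hypothetical names, not tied to any particular platform.

def longest_rising_streak(values):
    # One ordered pass: compare each value with the previous one and
    # track the longest run of consecutive increases.
    best = current = 0
    previous = None
    for v in values:
        current = current + 1 if previous is not None and v > previous else 0
        best = max(best, current)
        previous = v
    return best

# e.g., daily closing prices; the longest rising streak here is 3
print(longest_rising_streak([10, 11, 12, 13, 9, 10]))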

A SQL‑compatible platform is more migration‑friendly

The benefits of a SQL‑compatible platform are clear:

  • SQL is widely known, reducing training costs.
  • Numerous front‑end tools already support SQL, making integration straightforward.
  • Compatibility with existing SQL‑based databases eases migration and lowers costs.

However, these benefits come at the price of enduring SQL's shortcomings.

Troubles

Low performance

The biggest problem of using SQL is that it is not the most effective means of achieving the high performance required for big‑data computing.

  • SQL lacks data types and definitions for certain high‑performance calculations, relying heavily on engineering optimizations of the execution engine.
  • While decades of experience have produced rich optimization techniques for commercial databases, many big‑data scenarios remain hard to optimize.
  • SQL is not well‑suited for expressing procedural computations or specifying optimal execution paths, which are crucial for high‑performance algorithms. Achieving optimal paths often requires many special modifiers; procedural syntax can be more straightforward.

Historically, SQL was designed for limited hardware. Its design constraints make it difficult to fully exploit modern hardware features such as large memory, massive parallelism, and cluster mechanisms. Examples of performance‑related limitations include:

  • JOIN operations – Traditional SQL JOINs match records by key values, typically via hash calculations; in a big‑memory environment, address‑based matching could be far faster (see the sketch after this list).
  • Unordered tables – While single‑table computations can be parallelized by partitioning, dynamic division across multiple tables (e.g., multi‑table joins) is difficult, requiring static segmentation that hampers flexible thread allocation.
  • Cluster computing – SQL does not distinguish dimension from fact tables and defines JOIN as a filtered Cartesian product. Large‑table joins trigger costly hash shuffles, consuming network bandwidth and diminishing the benefits of cluster scaling.
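
As a minimal sketch of the address‑based matching idea from the first point above (Python, with made‑up table names), suppose the dimension table is held in memory and every foreign key has been pre‑resolved to a row position once; each subsequent join lookup is then a direct index dereference rather than a per‑record hash probe. This illustrates the concept only, not any platform's actual implementation.

# Dimension table held in memory; a row's position acts as its "address".
products = [
    {"id": 101, "name": "Widget"},
    {"id": 102, "name": "Gadget"},
]
id_to_pos = {p["id"]: i for i, p in enumerate(products)}  # hashing happens once, at load time

# Fact table: resolve each foreign key to a row position up front.
orders = [
    {"product_id": 102, "qty": 3},
    {"product_id": 101, "qty": 5},
]
for order in orders:
    order["product_pos"] = id_to_pos[order["product_id"]]

# The join itself is now a plain array lookup per fact row, with no hashing.
joined = [(order["qty"], products[order["product_pos"]]["name"]) for order in orders]
print(joined)  # [(3, 'Gadget'), (5, 'Widget')]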

Example

To retrieve the 10 records with the largest x values from a table of one billion rows:

-- TOP is SQL Server syntax; other dialects use LIMIT 10 or FETCH FIRST 10 ROWS ONLY
SELECT TOP 10 x, y
FROM T
ORDER BY x DESC;

The ORDER BY clause forces a full‑table sort, which is extremely slow at this scale. An algorithm that avoids full sorting could be devised, but SQL cannot directly express it; it must rely on the database optimizer to find an efficient plan.
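
One such algorithm keeps only the ten largest rows seen so far in a small min‑heap, so the billion rows are scanned once and never fully sorted. The sketch below is a Python illustration of that idea with hypothetical names; SQL offers no direct way to state it and instead leaves the choice of plan to the optimizer.

import heapq

def top10_by_x(rows):
    # rows yields (x, y) pairs; keep at most 10 of them in a min-heap,
    # so memory stays tiny and the table is scanned exactly once.
    heap = []
    for x, y in rows:
        if len(heap) < 10:
            heapq.heappush(heap, (x, y))
        elif x > heap[0][0]:            # larger than the smallest of the current top 10
            heapq.heapreplace(heap, (x, y))
    return sorted(heap, reverse=True)   # descending by x, like ORDER BY x DESC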
