big data

1 day ago · software

Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)

Article URL: https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html Comments URL: https://news.ycombinator.com/item?id=466660...

#command-line #Hadoop #performance #big-data #benchmarking
1 week ago · ai

The Death of the Loop: Why Senior Data Scientists Think in Vectors

In traditional software development, iteration is king. We are taught to think sequentially: take an item, process it, store the result, and move to the next. H...

#data science #vectors #machine learning #big data #linear algebra #Python
3 weeks ago · software

Day 24: Spark Structured Streaming

#apache spark #structured streaming #real-time data pipelines #big data #stream processing #spark master series
Less than a month ago · software

WTF is Distributed Time-Series Databases?

What is a Distributed Time‑Series Database? A Distributed Time‑Series Database (TSDB) is a database designed to handle large volumes of data that are associated...

#time-series #distributed-database #scalability #data-storage #monitoring #big-data
1 month ago · software

Day 16: Delta Lake Explained - How Spark Finally Became Reliable for Production ETL

Welcome to Day 16 of the Spark Mastery Series. If you remember only one thing today, remember this: Delta Lake = ACID transactions for your Data Lake. Why Tradit...

#delta lake #apache spark #etl #data lake #acidity #time travel #big data
1 month ago · software

The Myth of Distributed Computing as a Silver Bullet for Big Data

Introduction: Is distributed technology the panacea for big‑data processing? Using a distributed cluster to process big data is mainstream today. Splitting a la...

#distributed computing #big data #cluster architecture #scalability #performance optimization #data processing
1 month ago · software

Day 10: Partitioning vs Bucketing - The Spark Optimization Guide Every Data Engineer Needs

Why Partitioning Matters in Spark. Example (Python): df.write.partitionBy('year', 'month').parquet('/sales'). This creates folders such as: year=2024/month=01/ Benefits...

#spark #partitioning #bucketing #data-engineering #big-data #optimization #parquet #lakehouse
1 month ago · software

Data warehouse without using SQL

Currently the vast majority of data warehouses employ SQL to process data. After decades of development, SQL has become the standard language in the database wo...

#data warehouse #SQL alternatives #esProc #SPL language #big data #non‑SQL query #Python integration
1 month ago · software

🔥 Day 5: Introduction to DataFrames - The Most Important Spark API

What is a DataFrame? A DataFrame in Spark is a distributed, column‑based, optimized table‑like structure used for efficient data processing. - Feels like SQL -...

#Apache Spark #DataFrames #big data #ETL #data engineering #Python
1 month ago · software

🔥 Day 3: RDDs - The Foundation of Spark

#apache spark #rdd #big data #distributed computing #data engineering #scala #dataframes
1 month ago · software

WTF is Distributed Data Warehousing?

What is Distributed Data Warehousing? A data warehouse is a centralized repository where an organization stores, organizes, and makes data readily available fo...

#distributed data warehousing #data warehouse #big data #data analytics #cloud data storage #data engineering