What is NumPy? The Backbone of Data Science in Python

Published: (June 11, 2026 at 09:02 AM EDT)
3 min read
Source: Dev.to

If you want to do data science, machine learning, or AI in Python, you will use NumPy constantly. Major libraries from pandas to TensorFlow build on it or work directly with its arrays. Here is what it is and why it matters.

NumPy stands for Numerical Python. It is a library that gives Python the ability to work with large arrays of numbers extremely fast.

Python lists can store numbers, but they are slow for mathematical operations. NumPy arrays can do the same work 10 to 100 times faster, depending on the operation and the size of the data. This matters when you are working with thousands of rows of data or training a machine learning model on millions of examples.

import numpy as np

sales = np.array([45000, 52000, 38000, 61000, 55000])

print(np.mean(sales))  # average
print(np.sum(sales))   # total
print(np.max(sales))   # highest
print(np.std(sales))   # how spread out the numbers are

A few lines. You get the average, total, maximum, and standard deviation of any list of numbers instantly. No loops. No manual calculation.

To double every item in a Python list, Python has to loop through the items one by one (multiplying the list itself by 2 just repeats it). When you multiply a NumPy array by 2, the whole array is processed in one step by optimised C code under the hood.

import numpy as np
import time

data = list(range(1_000_000))
np_data = np.array(data)

# Python loop
start = time.time()
result = [x * 2 for x in data]
print(f"Python loop: {time.time() - start:.4f} seconds")

# NumPy
start = time.time()
result = np_data * 2
print(f"NumPy: {time.time() - start:.4f} seconds")
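
A single timed run can be noisy, so the comparison above can also be sketched with the standard-library timeit module, taking the best of several runs. This is only a sketch: the helper names double_with_list and double_with_numpy are illustrative, and the timings will differ from machine to machine. It also shows why the list version needs a loop at all: the * operator repeats a plain list rather than doubling its values.

```python
import timeit

import numpy as np

data = list(range(1_000_000))
np_data = np.array(data)


def double_with_list():
    # Pure-Python list comprehension: one interpreted multiplication per item
    return [x * 2 for x in data]


def double_with_numpy():
    # Vectorised multiplication: the loop runs inside NumPy's compiled code
    return np_data * 2


# Best of five runs reduces noise from other processes
list_time = min(timeit.repeat(double_with_list, number=1, repeat=5))
numpy_time = min(timeit.repeat(double_with_numpy, number=1, repeat=5))

print(f"Python list: {list_time:.4f} seconds")
print(f"NumPy:       {numpy_time:.4f} seconds")
print(f"Speed-up:    {list_time / numpy_time:.1f}x")

# Both approaches produce the same values
same_values = double_with_list() == double_with_numpy().tolist()

# Note: * on a plain list repeats it instead of doubling the values
repeated = [1, 2, 3] * 2            # [1, 2, 3, 1, 2, 3]
doubled = np.array([1, 2, 3]) * 2   # array([2, 4, 6])
```

The exact speed-up depends on hardware, NumPy version, and array size, so treat the printed ratio as a rough indication only.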

On my machine, NumPy is about 50 times faster for this operation at a million items. The absolute time saved only grows as the data gets bigger.

sales = np.array([45000, 52000, 38000, 61000, 55000])

# Get only months above 50,000
high_months = sales[sales > 50000]
print(high_months)  # [52000 61000 55000]

avg = np.mean(sales)
performance = np.where(sales >= avg, "Good", "Below average")
print(performance)
# ['Below average' 'Good' 'Below average' 'Good' 'Good']
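
The filter above works because the comparison sales > 50000 produces a boolean array (a mask), and indexing with that mask keeps only the positions that are True. A minimal sketch that makes the intermediate step visible, using the same sales figures; the names mask and mid_range are illustrative:

```python
import numpy as np

sales = np.array([45000, 52000, 38000, 61000, 55000])

# The comparison returns one True/False per element
mask = sales > 50000
print(mask)          # [False  True False  True  True]

# Indexing with the mask keeps only the True positions
print(sales[mask])   # [52000 61000 55000]

# True counts as 1, so summing the mask counts the matches
print(mask.sum())    # 3

# Conditions combine with & (and) or | (or); each needs its own parentheses
mid_range = sales[(sales > 40000) & (sales < 60000)]
print(mid_range)     # [45000 52000 55000]
```

np.where uses the same kind of mask, choosing between two values per element instead of dropping elements.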

# 3 months of data: revenue, orders, avg order value
monthly = np.array([
    [45000, 120, 375],
    [52000, 138, 377],
    [38000, 102, 373],
])

print(monthly[:, 0].sum())   # total revenue across all months
print(monthly[:, 1].mean())  # average orders per month
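
The same per-column results can also be computed for every column at once with the axis argument. A minimal sketch, reusing the three rows above; the names column_totals and column_means are illustrative:

```python
import numpy as np

# Same three rows as above: revenue, orders, avg order value
monthly = np.array([
    [45000, 120, 375],
    [52000, 138, 377],
    [38000, 102, 373],
])

print(monthly.shape)  # (3, 3): 3 rows, 3 columns

# axis=0 collapses the rows, giving one result per column
column_totals = monthly.sum(axis=0)   # totals: 135000, 360, 1125
column_means = monthly.mean(axis=0)   # means: 45000.0, 120.0, 375.0
print(column_totals)
print(column_means)

# The first entry matches the single-column slice monthly[:, 0]
revenue_total = monthly[:, 0].sum()
print(column_totals[0] == revenue_total)  # True
```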

The [:, 0] means “all rows, column 0”. This is how you slice 2D data without writing nested loops.

pandas DataFrames are built on NumPy arrays. When you call df['Revenue'].mean() in pandas, pandas calls NumPy internally. When you train a machine learning model in scikit-learn, it converts your data into NumPy arrays before processing. Understanding NumPy means you understand what is happening under the hood in most of the data science tools you will use.

pip install numpy

Then open Google Colab and try this right now:

import numpy as np

data = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(data))
print("Doubled:", data * 2)
print("Above 25:", data[data > 25])

Three lines and you have filtered, transformed, and analysed data without writing a single loop.

NumPy makes Python fast enough for real data science. It is the foundation everything else is built on.

Written by Raaga Priya Madhan, CSE student, Bangalore. I write about Python, data science, and ML. See my code on GitHub and connect on LinkedIn.
