What Actually Determines a File's Type
Source: Dev.to
Overview
Every file format has a specification—an agreed‑upon structure that defines how the bytes in that file are organized. Just like we have standards for internet protocols, we have standards for file types. When an application opens a PDF or parses a PNG, it reads bytes according to that format’s predefined rules.
We usually identify files by their extension (.zip, .txt, .jpg). Extensions are merely hints for humans and the operating system; they are not the source of truth. Renaming photo.jpg to photo.png does not convert the image.
In practice, software identifies most binary formats by their magic numbers.
Magic Numbers
A magic number is a sequence of bytes, located at the beginning or at specific offsets of a file, that serves as a unique signature to identify the file format or type.
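Many formats, binary ones in particular, define an agreed‑upon magic number, which applications check when they need to determine the file format regardless of the file’s extension. Plain‑text formats such as CSV usually have none. Some common signatures: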
- PNG: 89 50 4E 47
- ZIP: 50 4B 03 04
- BMP: 42 4D (the ASCII characters “BM”)
Below is a Go example that reads a file with a .bmp extension and verifies the magic number to confirm whether it is actually a bitmap image:
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("sample.bmp")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Read the first 2 bytes (the magic number)
	header := make([]byte, 2)
	if _, err := io.ReadFull(f, header); err != nil {
		log.Fatal(err)
	}

	// BMP signature is 0x42, 0x4D
	bmpSig := []byte{0x42, 0x4D}
	if bytes.Equal(header, bmpSig) {
		fmt.Println("Valid BMP detected")
	} else {
		fmt.Println("Invalid file format")
	}
}
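The same check extends naturally to several formats at once. The sketch below is one possible way to do it, using only the three signatures listed earlier; the `signature` type and the `detectType` function are names invented for this illustration, not part of any library.

```go
package main

import (
	"bytes"
	"fmt"
)

// signature pairs a format name with the leading bytes that identify it.
// (Hypothetical type, introduced only for this example.)
type signature struct {
	name  string
	magic []byte
}

// signatures holds only the three formats mentioned in this article;
// real tools consult a much larger database.
var signatures = []signature{
	{"PNG", []byte{0x89, 0x50, 0x4E, 0x47}},
	{"ZIP", []byte{0x50, 0x4B, 0x03, 0x04}},
	{"BMP", []byte{0x42, 0x4D}},
}

// detectType returns the name of the first signature that prefixes data,
// or "unknown" if none match.
func detectType(data []byte) string {
	for _, sig := range signatures {
		if bytes.HasPrefix(data, sig.magic) {
			return sig.name
		}
	}
	return "unknown"
}

func main() {
	// A buffer that starts with the ZIP signature, whatever its file name says.
	sample := []byte{0x50, 0x4B, 0x03, 0x04, 0x14, 0x00}
	fmt.Println(detectType(sample)) // prints "ZIP"
}
```

Two-byte signatures like BMP's are short enough to match unrelated data by accident, so a prefix match is best treated as a strong hint and not as proof.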
The Unix file command uses these signatures to identify files regardless of their extensions. After the signature, most formats include metadata describing the content (dimensions for images, sample rate for audio, author information for documents, etc.).
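To make the "metadata after the signature" idea concrete, here is a minimal sketch that reads the width and height from a PNG header. It relies on the PNG layout: an 8-byte signature, then the IHDR chunk (4-byte length, 4-byte type, then width and height as big-endian 32-bit integers). The function name `pngDimensions` and the hand-built header are assumptions made for this example; real code would read the bytes from a file.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// pngSignature is the full 8-byte PNG signature; its first four bytes
// are the ones usually quoted as the magic number.
var pngSignature = []byte{0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A}

// pngDimensions reads width and height from the IHDR chunk that follows
// the signature. Offsets: 0-7 signature, 8-11 chunk length,
// 12-15 chunk type "IHDR", 16-19 width, 20-23 height.
func pngDimensions(data []byte) (width, height uint32, err error) {
	if len(data) < 24 {
		return 0, 0, errors.New("too short to hold a PNG header")
	}
	if !bytes.Equal(data[:8], pngSignature) {
		return 0, 0, errors.New("missing PNG signature")
	}
	if string(data[12:16]) != "IHDR" {
		return 0, 0, errors.New("first chunk is not IHDR")
	}
	width = binary.BigEndian.Uint32(data[16:20])
	height = binary.BigEndian.Uint32(data[20:24])
	return width, height, nil
}

func main() {
	// Hand-built header for illustration: signature, IHDR length (13),
	// chunk type, then width=640 and height=480.
	header := append([]byte{}, pngSignature...)
	header = append(header, 0x00, 0x00, 0x00, 0x0D)
	header = append(header, 'I', 'H', 'D', 'R')
	header = append(header, 0x00, 0x00, 0x02, 0x80) // 640
	header = append(header, 0x00, 0x00, 0x01, 0xE0) // 480

	w, h, err := pngDimensions(header)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%dx%d\n", w, h) // prints "640x480"
}
```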
Categories of File Structures
File formats generally fall into a few structural categories:
- Binary formats with rigid structure (PNG, JPEG, MP3): the spec assigns a meaning to every field. Programs parse these by reading fixed offsets or length‑prefixed chunks and segments.
- Text‑based structured formats (JSON, XML, HTML, CSV): human‑readable text following grammar rules. They are easier to debug but usually produce larger files.
- Container formats (ZIP, MP4, PDF): act like filesystems within a file, containing multiple embedded files or streams. An MP4 might contain separate video, audio, and subtitle tracks. A DOCX is actually a ZIP containing XML files.
Understanding how to navigate bytes and consulting specifications allows you to write your own parsers for many file types. In a future article, I will demonstrate a simple script that converts images to greyscale by directly modifying the image bytes.