Linux Fundamentals for Data Engineering
Source: Dev.to
Introduction
As a data engineer, most of your work will happen
In this article, I will walk you through the
Most data engineers start their journey on Windows.
To install WSL, open Windows CMD as Administrator and run:
wsl --install -d Ubuntu-22.04
After installation, restart your PC.
You can now wsl in CMD.
One important lesson I learned during setup: WSL
-sh instead of bash, you are running a minimal
SSH (Secure Shell) is the standard way to connect
The basic SSH command syntax is:
ssh username@server_ip -p port_number
For example, to connect to our assignment server: ssh root@159.65.222.96 -p 22
Port 22 is the default SSH port.
When connecting
Are you sure you want to continue connecting?
Always type yes and press Enter.
One important thing to understand about your
# at the end means you are root (full admin)
$ at the end means you are a normal user
Always run whoami to confirm which user you are
On a shared server, every person should have their
To create a new user:
adduser briank
An important lesson I learned: Linux usernames must
BrianK, I got this error:
Please enter a username matching the regular
The fix was simply to use lowercase:
adduser briank
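The lowercase rule can be checked before calling adduser. The sketch below is only an illustration: is_valid_username is a hypothetical helper, and the pattern is a simplified version of what I assume to be the usual Debian/Ubuntu NAME_REGEX default in /etc/adduser.conf, so your server may accept or reject slightly different names.

```shell
# Assumed pattern: a lowercase letter first, then lowercase letters,
# digits, hyphens or underscores. Check /etc/adduser.conf on your server.
NAME_REGEX='^[a-z][-a-z0-9_]*$'

# Hypothetical helper (not part of adduser): succeeds if the name matches.
is_valid_username() {
    printf '%s' "$1" | LC_ALL=C grep -Eq "$NAME_REGEX"
}

for name in BrianK briank; do
    if is_valid_username "$name"; then
        echo "$name: ok"
    else
        echo "$name: rejected, try lowercase"
    fi
done
```

Running the check first avoids a half-typed adduser session when the name is going to be refused anyway.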
To give the user sudo (admin) privileges: usermod -aG sudo briank
To verify the user was created successfully: id briank
Output:
uid=1088(briank) gid=1088(briank)
Here are the most important Linux commands every
pwd             # print current directory
ls              # list files
ls -la          # detailed list including hidden files
cd Documents    # go into a folder
cd ..           # go up one level
cd ~            # go to home directory
touch notes.txt           # create empty file
mkdir linux_assignment    # create folder
cp notes.txt backup.txt   # copy file
mv backup.txt old.txt     # rename/move file
rm old.txt                # delete file
cat notes.txt             # view file contents
head -10 notes.txt        # view first 10 lines
tail -10 notes.txt        # view last 10 lines
grep "error" log.txt      # search inside file
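To see the file commands above working together, here is a small practice session. It is only a sketch: it runs in a throwaway directory, and the contents of log.txt are made up for the illustration.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir linux_assignment
cd linux_assignment || exit 1

# Invented sample log, four lines, two of them containing "error".
printf '%s\n' \
    'INFO job started' \
    'error: connection refused' \
    'INFO retrying' \
    'error: timeout' > log.txt

cp log.txt backup.txt      # copy the file
mv backup.txt old.txt      # rename the copy
head -1 log.txt            # show the first line
tail -1 log.txt            # show the last line

# grep -c counts matching lines instead of printing them.
error_count=$(grep -c "error" log.txt)
echo "lines containing error: $error_count"

rm old.txt                 # delete the copy
```

Counting matches with grep -c is a quick first check on a pipeline log before opening the whole file.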
whoami      # current username
uname -a    # system and kernel info
hostname    # server name
uptime      # how long server has been running
df -h       # disk space usage
free -h     # memory usage
top         # running processes (q to quit)
ps aux      # list all processes
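The output of df can also be used in a script, for example to warn before a disk fills up. This is a minimal sketch: disk_usage_percent is a hypothetical helper, and the 80 percent threshold is an assumption you should adjust.

```shell
# Hypothetical helper: prints the used percentage (without the % sign)
# of the filesystem that holds the given path, defaulting to /.
# df -P forces the portable one-line-per-filesystem format.
disk_usage_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

threshold=80   # assumed limit, pick what suits your server
used=$(disk_usage_percent /)

if [ "$used" -ge "$threshold" ]; then
    echo "WARNING: / is ${used}% full"
else
    echo "/ is ${used}% full"
fi
```

On a shared server a check like this is worth running before loading a large file.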
Linux file permissions control who can read,
-rwxr-xr-x
Breaking this down:
r = read (4)
w = write (2)
x = execute (1)
Three groups: owner, group, others.
To change permissions:
chmod 755 script.sh   # owner: rwx, group and others: r-x
chmod 644 notes.txt   # owner: rw-, group and others: r--
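The numbers passed to chmod come from adding 4, 2 and 1 for each of the three groups. The sketch below shows that arithmetic; symbolic_to_octal is a hypothetical helper written for this article, not a standard command.

```shell
# Hypothetical helper: converts the nine permission characters
# (for example rwxr-xr-x, without the leading file-type character)
# into the three-digit octal form used by chmod.
symbolic_to_octal() {
    perms=$1
    result=""
    for start in 1 4 7; do
        # Take three characters at a time: owner, group, others.
        triplet=$(printf '%s' "$perms" | cut -c"$start"-"$((start + 2))")
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac   # read
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write
        case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute
        result="$result$digit"
    done
    echo "$result"
}

symbolic_to_octal rwxr-xr-x   # prints 755
symbolic_to_octal rw-r--r--   # prints 644
```

For rwxr-xr-x the owner gets 4+2+1 = 7, while group and others each get 4+0+1 = 5, which is where 755 comes from.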
To change file ownership: chown briank notes.txt
ip a                    # show network interfaces
ping google.com -c 4    # test connectivity
netstat -tulnp          # show open ports
ss -tlnp                # modern version of netstat
curl ifconfig.me        # show your public IP
PostgreSQL is the most popular open source database
apt update
apt install postgresql postgresql-contrib -y
systemctl start postgresql
systemctl enable postgresql
systemctl status postgresql
su -s /bin/bash postgres
psql
CREATE DATABASE briank;
\c briank
CREATE SCHEMA staging;
CREATE TABLE staging.farmers (
    id SERIAL PRIMARY KEY,
    farmer_name VARCHAR(100),
    county VARCHAR(50),
    subcounty VARCHAR(50),
    acreage DECIMAL(5,2),
    crop VARCHAR(50),
    loan_amount DECIMAL(10,2),
    loan_status VARCHAR(20),
    season VARCHAR(20)
);
INSERT INTO staging.farmers
    (farmer_name, county, subcounty, acreage, crop, loan_amount, loan_status, season)
VALUES
    ('John Kipchumba', 'Uasin Gishu', 'Turbo', 2.5, 'Maize', 15000.00, 'Paid', '2023A'),
    ('Mary Jelimo', 'Uasin Gishu', 'Soy', 1.8, 'Maize', 12000.00, 'Defaulted', '2023A'),
    ('Peter Rotich', 'Uasin Gishu', 'Eldoret East', 3.2, 'Maize', 20000.00, 'Paid', '2023B');
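Once the rows are in, a quick summary query confirms the load. The sketch below keeps the query in a shell function so it can be run without opening an interactive psql session. farmers_summary_sql is a hypothetical helper, and the psql call assumes you are the postgres OS user and that the briank database exists.

```shell
# Hypothetical helper: prints a summary query for staging.farmers.
farmers_summary_sql() {
    cat <<'SQL'
SELECT loan_status,
       COUNT(*)         AS farmers,
       SUM(loan_amount) AS total_loaned
FROM staging.farmers
GROUP BY loan_status
ORDER BY loan_status;
SQL
}

# Send the query to psql when it is installed; otherwise just print it.
if command -v psql >/dev/null 2>&1; then
    farmers_summary_sql | psql -d briank || echo "query failed; check the connection"
else
    farmers_summary_sql
fi
```

If the table holds only the three rows inserted above, the result should show one Defaulted loan of 12000.00 and two Paid loans totalling 35000.00.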
\l: list all databases
\c dbname: connect to database
\dt: list all tables
\du: list all users
\q: quit psql
To allow tools like DBeaver or pgAdmin to connect
postgresql.conf: set listen_addresses = '*'
pg_hba.conf: add this line at the bottom:
Then restart PostgreSQL:
systemctl restart postgresql
SCP (Secure Copy Protocol) uses SSH to transfer
scp C:\Users\Brian\notes.txt root@159.65.222.96:/root/
scp root@159.65.222.96:/root/notes.txt C:\Users\Brian\Downloads\
scp -r myfolder/ root@159.65.222.96:/root/
scp -i ~/.ssh/mykey.pem file.txt root@server:/path/
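After a transfer it can be reassuring to confirm that the copy matches the original. The sketch below is one way to do that and is not something scp requires: same_checksum is a hypothetical helper, and it assumes sha256sum is installed, as it normally is on Ubuntu.

```shell
# Hypothetical helper: succeeds when two local files have the same
# SHA-256 digest. sha256sum prints "digest  filename", so keep field 1.
same_checksum() {
    a=$(sha256sum "$1" | awk '{ print $1 }')
    b=$(sha256sum "$2" | awk '{ print $1 }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Local demonstration with throwaway files standing in for a real transfer.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/notes.txt"
cp "$tmp/notes.txt" "$tmp/roundtrip.txt"

if same_checksum "$tmp/notes.txt" "$tmp/roundtrip.txt"; then
    echo "files match"
else
    echo "files differ"
fi
```

In practice you would upload notes.txt with scp, download it again under a different name, and pass both files to same_checksum.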
During this assignment I encountered several
Lesson 1: Always check who you are
whoami before every session saved me
Lesson 2: Usernames must be lowercase
BrianK failed; briank worked perfectly.
Lesson 3: The prompt tells you everything
# means root, $ means normal user.
=# in psql means ready, (# means incomplete
Lesson 4: WSL is not always Ubuntu
apt, sudo, and ssh taught me cat /etc/os-release.
Lesson 5: Shared servers have history
grep and tail to verify
Linux is the backbone of modern data engineering.
The best way to learn Linux is by doing. Set up WSL
As I continue my journey in data engineering at
Brian Kiplangat - LuxDevHQ Data Engineering