
Bash Shell Commands for Data Science: An Essential Toolkit

A practical, example-driven guide to the Bash and Unix command-line tools that every data scientist should know for fast, repeatable dataset inspection, cleaning, transformation and merging - including tips for handling large files and messy real-world data.

Why the shell matters for data scientists

As data volumes grow, doing quick, repeatable data inspection and cleaning at the command line becomes invaluable. Bash and Unix tools let you: stream large files without loading them into memory, compose simple transformations into reliable pipelines, and automate repetitive tasks.

This guide collects practical commands and patterns tailored to data scientists (CSV/JSON/YAML workflows, encoding fixes, sampling, deduplication, joins and more). Whenever possible, prefer tools that understand structured formats (CSV-aware or JSON-aware) over blind text processing.

Setup & portability notes

  • Many commands are available out-of-the-box on Linux. On macOS, install GNU variants (coreutils, gawk, gnu-sed) via Homebrew for consistent behavior.
  • Be careful with sed -i: GNU sed takes -i directly, while BSD/macOS sed requires -i '' (see the example after this list).
  • Always work on a copy or use version control for datasets: cp data.csv data.csv.bak.
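
A minimal illustration of the sed -i difference, on a throwaway copy of a file:

# GNU sed (Linux): edit in place, no backup suffix
sed -i 's/old/new/g' data.csv

# BSD/macOS sed: an explicit (here empty) backup-suffix argument is required
sed -i '' 's/old/new/g' data.csv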

Quick file inspection (the first steps)

Commands to get a feel for a dataset:

# Show first/last lines (header + samples)
head -n 5 data.csv
tail -n 5 data.csv

# Count rows (fast)
wc -l data.csv  # includes header if present

# Show file size and type
ls -lh data.csv
file data.csv

# Show column-separated preview (treat as CSV/TSV visually)
head -n 10 data.csv | column -s, -t | less -#2 -S

If the file is huge, pipe it through pv to watch streaming throughput: pv data.csv | head -n 20.


Sampling and shuffling

Random sampling is useful for exploratory analysis without loading full data.

# Simple random sample of 100 lines (preserves header)
( head -n 1 data.csv && tail -n +2 data.csv | shuf -n 100 ) > sample.csv

# For reproducible sampling (GNU shuf --random-source)
( head -n 1 data.csv && tail -n +2 data.csv | shuf --random-source=<(yes 42) -n 100 ) > sample.csv

# Approximate sampling using awk (probability p)
# Sample ~1% of rows (excluding header)
awk 'BEGIN{srand(123)} NR==1{print; next} rand()<0.01{print}' data.csv > sample_1pct.csv

Note: shuf requires reading the whole input into memory (not ideal for extremely large files). For large files, use the streaming awk approach.
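
If you need a fixed-size sample in a single streaming pass, reservoir sampling in awk is one option - a sketch, assuming a header row and a fixed seed for reproducibility:

# Reservoir sample of exactly 100 data rows in one pass
awk -v k=100 '
  BEGIN { srand(42) }
  NR == 1 { print; next }            # pass the header through unchanged
  {
    n++
    if (n <= k) { r[n] = $0 }        # fill the reservoir first
    else {                           # then replace entries with probability k/n
      i = int(rand() * n) + 1
      if (i <= k) r[i] = $0
    }
  }
  END { for (j = 1; j <= k; j++) if (j in r) print r[j] }
' data.csv > sample_100.csv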


Column selection and CSV-aware operations

Avoid parsing CSV with naive text tools if fields contain commas or quotes. Use CSV-aware tools:

  • csvkit: csvcut, csvgrep, csvstat, csvsql - good for SQL-like queries on CSV.
  • Miller (mlr): fast, flexible, great for TSV/CSV (column operations, joins, stats).
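
A quick demonstration of why embedded commas break naive parsing, using a hypothetical two-line record:

# Naive cut splits inside the quoted field
printf 'id,name\n1,"Doe, Jane"\n' | cut -d, -f2
# -> name
# -> "Doe              (wrong: the quoted field was split)

# A CSV-aware tool keeps the field intact
printf 'id,name\n1,"Doe, Jane"\n' | mlr --csv cut -f name
# -> name
# -> "Doe, Jane"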

Examples with csvkit:

# Show column names
csvcut -n data.csv

# Select columns by name
csvcut -c id,name,age data.csv > subset.csv

# Filter rows where age > 30 (csvsql approach)
csvsql --query "SELECT * FROM data WHERE CAST(age AS INTEGER) > 30" data.csv > older_than_30.csv

Examples with Miller (mlr):

# Select columns, sort by age (numeric, descending) and pretty-print as an aligned table
mlr --icsv --opprint cut -f id,name,age then sort -nr age data.csv | head -n 20

# Filter rows (age > 30)
mlr --csv filter '$age > 30' data.csv > older_than_30.csv

# Join two CSV files on column 'id' (the -f file is held in memory, the other is streamed)
mlr --csv join -f left.csv -j id right.csv > joined.csv

See the csvkit and Miller documentation for the full set of commands and verbs.


Text processing: grep, awk, sed, cut, tr

When data is plain (TSV or well-behaved CSV), classic tools shine.

  • grep/ripgrep for fast searching (use ripgrep for speed).
  • cut to extract simple column-based fields (works best with simple separators).
  • tr for simple character translations (e.g., line endings).
  • sed for substitutions and quick in-place edits.
  • awk (gawk) for column-aware streaming transforms and aggregations.

Examples:

# Count unique values in column 3 (tab-separated example)
cut -f3 data.tsv | sort | uniq -c | sort -nr | head -n 20

# Replace all occurrences of 'NA' with empty string in field 5 (tab-separated)
awk -F"\t" 'BEGIN{OFS="\t"} {if($5=="NA") $5=""; print}' data.tsv > cleaned.tsv

# Remove carriage returns (Windows CRLF -> LF)
tr -d '\r' < windows_file.csv > unix_file.csv

# Remove UTF-8 BOM if present
sed '1s/^\xEF\xBB\xBF//' < file.csv > no_bom.csv

# Inline sed substitution (make backup first)
cp file.csv file.csv.bak
sed -i 's/old_text/new_text/g' file.csv

Awk example for simple aggregation (group-by count):

# Count rows per category (CSV with comma separator, no quoted commas)
awk -F',' 'NR>1{counts[$3]++} END{for(k in counts) print counts[k], k}' data.csv | sort -nr
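
A similar pattern gives a per-group mean - a sketch, assuming column 4 is numeric and there are no quoted commas:

# Mean of column 4 per category in column 3 (header skipped)
awk -F',' 'NR>1 {sum[$3]+=$4; n[$3]++} END{for(k in n) printf "%s\t%.2f\n", k, sum[k]/n[k]}' data.csv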

Handling encodings & line endings

Real-world data can have mixed encodings or Windows line endings.

# Detect encoding
file -i data.csv

# Convert to UTF-8 (iconv)
iconv -f ISO-8859-1 -t UTF-8 data.csv -o data_utf8.csv

# Convert DOS to Unix line endings
dos2unix data.csv

De-duplication and sorting

# Sort by key (column 1) and remove consecutive duplicates
# (note: a header line, if present, gets sorted in with the data)
sort -t, -k1,1 data.csv | uniq > deduped.csv

# Count duplicate keys (number of occurrences of key in column 1)
cut -d, -f1 data.csv | sort | uniq -c | sort -nr

# For big files: use external sort that can handle large data (GNU sort uses temp files)
sort --parallel=4 --buffer-size=2G -t, -k1,1 bigfile.csv > big_sorted.csv

If deduplication needs to consider multiple columns, use awk to build composite keys.
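
A sketch of the composite-key idiom, assuming columns 1 and 2 form the key and there are no quoted commas:

# Keep the first occurrence of each (column 1, column 2) combination; the header passes through
awk -F',' 'NR==1 || !seen[$1 FS $2]++' data.csv > deduped_by_key.csv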


Joining datasets

For structured joins use csv-aware tools (mlr, csvjoin from csvkit) rather than join unless you pre-sort properly.

# Using csvkit (csvjoin)
csvjoin -c id left.csv right.csv > joined.csv

# Using Miller (the -f file is held in memory, the other file is streamed)
mlr --csv join -f left.csv -j id right.csv > joined.csv

If you use the traditional join command, both files must be sorted on the join key first (and header rows stripped, e.g. with tail -n +2, or they will be treated as data):

sort -t, -k1,1 left.csv > left_sorted
sort -t, -k1,1 right.csv > right_sorted
join -t, -1 1 -2 1 left_sorted right_sorted > joined.csv

Working with JSON & YAML

For JSON, use jq - a powerful JSON query and transformation tool:

# Pretty print
jq . data.json | less -R

# Extract selected fields from an array of objects
jq '.[] | {id: .id, name: .name}' data.json

# Convert CSV to JSON (using Miller)
mlr --icsv --ojson cat data.csv > data.json

For YAML, convert to JSON first (with one of the yq implementations or a small Python snippet), then query it with jq as usual.
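
A minimal sketch of the Python route (assumes PyYAML is installed; the file names are illustrative):

# YAML -> JSON via Python, then query with jq
python3 -c 'import sys, yaml, json; json.dump(yaml.safe_load(sys.stdin), sys.stdout, indent=2)' \
  < config.yaml > config.json
jq . config.json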


Compression and splitting large files

Store and stream compressed datasets:

# Compress with gzip (fast) or zstd (better compression & speed)
gzip -k data.csv      # produces data.csv.gz
zstd -q data.csv      # produces data.csv.zst

# Stream a gzipped file without fully extracting
zcat data.csv.gz | head -n 20
# or
pv data.csv.gz | gunzip -c | head -n 20

# Split a huge CSV into 1M-line chunks, keeping header
( head -n 1 big.csv > header && tail -n +2 big.csv | split -l 1000000 - chunk_ )
for f in chunk_*; do cat header "$f" > "${f}.csv"; rm "$f"; done

Split with split or csplit depending on needs; ensure header preservation.


Parallel processing

Use xargs -P or GNU parallel to parallelize independent tasks (e.g., per-file transformations):

# Run a transform on many CSVs in parallel using GNU parallel
# (escape the $ so that mlr, not the shell, sees $age)
parallel 'mlr --csv filter "\$age > 30" {} > {.}_older.csv' ::: data-*.csv

# Parallel grep across files (note: ripgrep already parallelizes across files by itself)
parallel grep -Hn "pattern" {} ::: data-*.csv
# or simply
rg -n "pattern" --glob 'data-*.csv'
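
If GNU parallel isn't installed, xargs -P gives basic parallelism - a sketch, using gzip as the per-file task:

# Compress many files with up to 4 concurrent jobs
printf '%s\0' data-*.csv | xargs -0 -n 1 -P 4 gzip -k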

Be mindful of I/O throughput: parallelization helps CPU-bound tasks, but if disk is the bottleneck, too many jobs will slow everything.


Practical cleaning workflow (example)

Scenario: You have a large CSV with messy encoding, a BOM, duplicated rows and inconsistent NA tokens. You want a clean, compressed file with only columns id,name,age where age is numeric and NA converted to empty.

# 1) Make a backup
cp big.csv big.csv.bak

# 2) Remove BOM and convert encoding to UTF-8
iconv -f ISO-8859-1 -t UTF-8 big.csv.bak | sed '1s/^\xEF\xBB\xBF//' > big.utf8.csv

# 3) Normalize line endings
tr -d '\r' < big.utf8.csv > big.norm.csv

# 4) Select required columns with Miller, handle NA and ensure numeric age
mlr --csv cut -f id,name,age then put 'if($age=="NA"||$age=="na"){$age=""} else {$age=int($age)}' big.norm.csv > big.cleaned.csv

# 5) Remove duplicate rows based on composite key (id + name), keeping the first record per key
mlr --csv head -n 1 -g id,name big.cleaned.csv > big.unique.csv

# 6) Compress the final file
zstd -q big.unique.csv -o big.final.csv.zst

This pipeline uses streaming tools, preserves structure and is repeatable.


Debugging tips & gotchas

  • Quoted separators in CSV: naive cut/awk -F, will break on embedded commas. Prefer csvkit or Miller.
  • sed -i portability between GNU and BSD/macOS: test before running on critical data.
  • Always keep a backup: transformations are easy to script but can accidentally clobber data.
  • For reproducibility, put your commands in scripts with a small README and, if appropriate, store checksums of intermediate files (md5sum/sha256sum) to detect unnoticed changes - see the snippet after this list.
  • For very large datasets, watch memory usage: prefer streaming tools and sort with appropriate buffer size and temp directory.
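
A minimal checksum round-trip (file names are illustrative; on macOS use shasum -a 256):

# Record checksums after a transformation, then verify later that nothing has changed
sha256sum big.csv big.cleaned.csv > checksums.sha256
sha256sum -c checksums.sha256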

A compact cheat-sheet

  • View: head, tail, less, pv
  • Count: wc -l, awk 'END{print NR}'
  • Select columns: cut (simple), csvcut / mlr (CSV aware)
  • Filter rows: grep / rg, awk, mlr, csvsql
  • Transform fields: sed, awk, mlr
  • Deduplicate: sort | uniq, mlr uniq
  • Join: mlr join, csvjoin (csvkit), join (requires sorted files)
  • JSON: jq
  • Encoding: iconv, dos2unix
  • Split/compress: split, gzip/zstd, tar
  • Parallelize: xargs -P, parallel

Tools worth installing

  • csvkit
  • Miller (mlr)
  • jq
  • ripgrep (rg)
  • GNU parallel
  • pv
  • zstd (for modern compression)



Command-line skills accelerate exploratory work and operational data cleaning. Compose small, well-documented pipelines and prefer structured-format-aware tools when possible to avoid subtle parsing bugs. Use backups, version control and checksums for safety when modifying important datasets.
