🐧 Linux for Data Engineers: `grep`, `sed`, and `awk`

Posted on Mar 28, 2025

🔎 grep - Searching in Text Files

grep (short for "global regular expression print", after the ed command g/re/p) prints every line of its input that matches a pattern.

📌 Basic Usage

# Find lines containing 'error' in log.txt
grep 'error' log.txt

🎯 Common Options

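| Option | Description |
| --- | --- |
| `-i` | Ignore case |
| `-v` | Invert match (show lines NOT matching) |
| `-c` | Count matching lines (not individual matches) |
| `-n` | Show line numbers |
| `-r` | Search directories recursively |
| `--color=auto` | Highlight matches when printing to a terminal |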

🏆 Examples

# Find all occurrences of 'warning' (case-insensitive) in logs
grep -i 'warning' server.log

# Show lines NOT containing 'failed'
grep -v 'failed' report.txt

# Count the lines that contain 'success'
grep -c 'success' results.csv
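
Note that `-c` reports the number of matching lines, not the number of individual matches. The following is a minimal sketch of the difference, using a throwaway sample file whose name and contents are made up for illustration. It relies on `-o`, which GNU and BSD grep both support:

```shell
# Create a small sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/results.csv" <<'EOF'
job1,success,success
job2,failed,success
job3,failed,failed
EOF

# -c counts matching LINES: two lines contain 'success'.
line_count=$(grep -c 'success' "$tmpdir/results.csv")

# -o prints each match on its own line, so wc -l counts every occurrence.
match_count=$(grep -o 'success' "$tmpdir/results.csv" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"
rm -r "$tmpdir"
```

When a line can contain the pattern more than once, the `grep -o ... | wc -l` form is the one that counts occurrences.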

✂️ sed - Stream Editor for Modifying Text

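sed (stream editor) applies editing commands to each line of its input: substituting text, deleting lines, or printing selected lines. By default it writes the result to standard output and leaves the original file unchanged.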

📌 Basic Usage

# Replace 'foo' with 'bar' in a file
sed 's/foo/bar/g' file.txt

🎯 Common Options

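| Option / command | Description |
| --- | --- |
| `-i` | Option: edit the file in place |
| `-n` | Option: suppress automatic printing (used together with `p`) |
| `s` | Command: substitute text |
| `g` | Flag for `s`: replace all occurrences on a line, not just the first |
| `d` | Command: delete lines |
| `p` | Command: print lines |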

🏆 Examples

# Replace all instances of '2023' with '2024' in data.csv, modifying the file in place
# (GNU sed syntax; BSD/macOS sed requires a suffix argument: sed -i '' ...)
sed -i 's/2023/2024/g' data.csv

# Print logs.txt without the lines containing 'error' (the file itself is not modified)
sed '/error/d' logs.txt

# Print lines 1 to 5
sed -n '1,5p' file.txt
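
Because `-i` overwrites the file, it can be safer to preview the substitution first and keep a backup copy. The following is a minimal sketch, assuming a sed that accepts a backup suffix attached directly to `-i` (GNU and BSD sed both do). The file name and its contents are made up for illustration:

```shell
# Create a small sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/data.csv" <<'EOF'
id,year
1,2023
2,2022
3,2023
EOF

# Preview: without -i, sed writes the result to stdout and leaves the file alone.
preview=$(sed 's/2023/2024/g' "$tmpdir/data.csv")

# Apply in place, keeping the original as data.csv.bak.
sed -i.bak 's/2023/2024/g' "$tmpdir/data.csv"
result=$(cat "$tmpdir/data.csv")

# The edited file has the new value; the backup still has the old one.
changed=$(grep -c '2024' "$tmpdir/data.csv")
backup_untouched=$(grep -c '2023' "$tmpdir/data.csv.bak")

echo "updated lines: $changed, 2023 lines kept in backup: $backup_untouched"
rm -r "$tmpdir"
```

If the preview looks wrong, nothing has been lost; if the in-place edit looks wrong afterwards, the `.bak` file can be copied back.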

📊 awk - Pattern Scanning and Processing Language

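awk splits each input line into fields and runs pattern-action rules against it, which makes it well suited to filtering, reshaping, and summarizing columnar text.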

📌 Basic Usage

# Print the first column from a CSV file (assumes no quoted fields that contain commas)
awk -F, '{print $1}' data.csv

🎯 Common Options

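| Option / variable | Description |
| --- | --- |
| `-F` | Option: set the field delimiter |
| `$1`, `$2` | Refer to specific columns (fields) |
| `NR` | Built-in variable: number of the current line (record) |
| `NF` | Built-in variable: number of fields in the current line |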

🏆 Examples

# Print the second column from a space-separated file
awk '{print $2}' records.txt

# Print lines where the third column is greater than 100 (-F, sets the comma delimiter)
awk -F, '$3 > 100' sales.csv

# Sum values in the second column
awk '{sum += $2} END {print sum}' data.txt
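
Real CSV files usually start with a header row and sometimes contain incomplete rows. The following sketch combines `-F`, `NR`, and `NF` to skip the header, ignore rows with the wrong number of fields, and then filter and sum a column. The file name, column layout, and data are assumptions for illustration, and the plain comma split does not handle quoted fields that contain commas:

```shell
# Create a small sample file in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sales.csv" <<'EOF'
region,product,amount
north,widget,120
south,gadget,80
east,widget
west,gadget,150
EOF

# NR > 1 skips the header; NF == 3 keeps only complete rows;
# $3 > 100 filters on the numeric third column.
big_sales=$(awk -F, 'NR > 1 && NF == 3 && $3 > 100 {print $1}' "$tmpdir/sales.csv")

# Sum the third column over the complete data rows only.
total=$(awk -F, 'NR > 1 && NF == 3 {sum += $3} END {print sum}' "$tmpdir/sales.csv")

printf 'regions over 100:\n%s\n' "$big_sales"
echo "total: $total"
rm -r "$tmpdir"
```

Without the `NR > 1` guard, a text header such as `amount` is compared as a string rather than a number, so it can slip through a numeric filter.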

🎯 Combining grep, sed, and awk

# Extract error lines, replace 'fail' with 'error', and print first column
grep 'error' logs.txt | sed 's/fail/error/g' | awk '{print $1}'
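
To see what each stage contributes, here is a sketch that runs the same pipeline on a tiny sample log. The log format and contents are made up for illustration. For comparison, it also runs a single awk command that performs all three steps:

```shell
# Create a small sample log in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/logs.txt" <<'EOF'
disk_fail error on node1
backup_ok info nightly run
net_fail error on node2
EOF

# grep keeps the lines containing 'error', sed rewrites 'fail' to 'error',
# and awk prints the first whitespace-separated column.
piped=$(grep 'error' "$tmpdir/logs.txt" | sed 's/fail/error/g' | awk '{print $1}')

# The same result in one awk program: /error/ selects the lines,
# gsub() does the replacement on the whole line, then $1 is printed.
single=$(awk '/error/ {gsub(/fail/, "error"); print $1}' "$tmpdir/logs.txt")

printf '%s\n' "$piped"
rm -r "$tmpdir"
```

The three-tool pipeline is often easier to build up and debug one stage at a time, while the single awk version avoids starting two extra processes.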

🚀 These tools are essential for processing large text-based datasets efficiently! Happy coding! 🎉