sed
is a stream editor. A stream editor is different from a usual editor like, say, SublimeText. In a stream editor, you can only edit things within a narrow examination window, which can be only a line or so. It is used as
sed options script input-files
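The script specifies what is to be done. Until the end of the input is reached, sed repeats the following cycle:
sed reads a line of text from the input file into a (very limited) buffer, known as the pattern space.
sed runs the commands in the script whose addresses match that line.
Unless told otherwise, sed prints the pattern space and moves on to the next line.
There is also a second buffer, the hold space, which commands can use to hold and accumulate text between different iterations.
The script can either be inlined, in which case it is enclosed in quotes (single quotes keep the shell from touching the script), like
sed 'inline_script' input-files
or it can be given as a file, in which case it is used as
sed -f script_filename input-files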
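To make the two ways of supplying a script concrete, here is a minimal sketch. The file names, the sample words, and the substitution s/at/ot/ are made up for illustration; the substitute command itself is covered further down.

```shell
# Scratch directory with a two-line sample input (contents are made up).
tmpdir=$(mktemp -d)
printf 'cat\nhat\n' > "$tmpdir/in.txt"

# 1) Inline script: the quoted script is the first non-option argument.
inline_out=$(sed 's/at/ot/' "$tmpdir/in.txt")

# 2) Script file: the same script stored in a file and loaded with -f.
printf 's/at/ot/\n' > "$tmpdir/script.sed"
file_out=$(sed -f "$tmpdir/script.sed" "$tmpdir/in.txt")

# Both runs apply the script to every line and print the result.
printf '%s\n' "$inline_out"
printf '%s\n' "$file_out"

rm -r "$tmpdir"
```

Both invocations should produce the same output, since only the way the script reaches sed differs.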
Most sed
-related stuff can be found here.
Some of the important sed
options are:
-n
: Don’t print lines by default. Normally, the content of the pattern space is automatically written to the output at the end of each cycle. For example, if we only want to print (p) lines that contain the word “Deceased”, we can do
sed -n '/Deceased/p' covid_data.csv
-f scriptfile
: We’ve already seen this; it’s used to take the script from a file.
-s
: Treat each file separately. For example, if we want to print the first 10 lines of each of the files, then we should use this option (if we don’t, it’ll only print the first 10 lines of the first file).
-r
: This enables the use of extended regular expressions.
A script is written as
[address] [!] command
...
[address] [!] command
A missing address means all lines and a !
means that we look at all lines except the lines specified by address. The address
can be
a single line address:
If we only want to show line 3, we can do
sed -n '3 p' inpFile.csv
If we don’t include -n
, it’ll print every line of the file, with line 3 appearing twice.
If we only want to show the last line, we can do
sed -n '$ p' inpFile.csv
Suppose we want to substitute all _
on line 10 with a space (and print the line). The g flag is needed to replace every occurrence rather than just the first. We can do
sed -n -e '10 s/_/ /g' -e '10 p' inpFile.csv
The -e
option is used to chain/join the two commands.
a set of lines:
Say we want to show all directories in the current directory. We can then do (note the use of regular expressions)
ls -al | sed -n '/^d/ p'
a range of lines:
If we want to show all the lines between 1 and 10, we can do
sed -n '1,10 p' inpFile
If we want to show all the lines between an if
statement and the closest else
, we can do
sed -n '/^if/,/^else/ p' inpFile.cpp
a nested address:
If we want to show all the print
statements between lines 20 and 30, we can group the inner address in braces:
sed -n '20,30{/print/ p}' inpFile.py
a complement of a particular address (!
):
If we want to show all the non-print
statements between lines 20 and 30, we can do
sed -n '20,30{/print/! p}' inpFile.py
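The address types above can be tried side by side on a small input. This is only a sketch: the six-line sample and the variable names are made up, and the nested forms use braces with a ; before the closing brace, which is intended to work in both GNU and BSD sed.

```shell
# Six-line sample input (made up for this sketch).
sample=$(printf 'x = 1\nprint(x)\ny = 2\nprint(y)\nz = 3\nprint(z)\n')

third=$(printf '%s\n' "$sample" | sed -n '3p')                 # single line
last=$(printf '%s\n' "$sample" | sed -n '$p')                  # last line
range=$(printf '%s\n' "$sample" | sed -n '2,3p')               # range of lines
nested=$(printf '%s\n' "$sample" | sed -n '2,5{/print/p;}')    # nested address
negated=$(printf '%s\n' "$sample" | sed -n '2,5{/print/!p;}')  # complement

printf 'third:   %s\n' "$third"
printf 'last:    %s\n' "$last"
printf 'range:\n%s\n' "$range"
printf 'nested:\n%s\n' "$nested"
printf 'negated:\n%s\n' "$negated"
```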
Here we examine some basic sed
commands.
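Line number command (=):
this prints the line numbers of the addressed lines. For example, to print the line numbers of the non-print statements between lines 20 and 30, we can do
sed -n '20,30{/print/! =}' inpFile.py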
Modify commands:
insert
: It can be used to insert text before some address. If we want to insert 2019 Batch
(with some spaces beforehand) at the beginning of a file, we can do
sed '1 i\
2019 Batch' outpFile.csv
append
: It appends text after some address. In POSIX sed it accepts at most one address (a single line, or a regex that selects a set of lines); GNU sed also accepts a range.
change
: We can use it to replace an entire matched line with some other text.
delete
: It is used to delete an entire matched line.
s
(substitute): The syntax is
[addr1] [,addr2] s/searchstr/replacestr/[flags]
It replaces the text selected by the search string with the replacement string. The search string can be a regex. One commonly used flag is g
, which replaces all occurrences on the line (otherwise only the first occurrence is replaced).
We can also use a substitution back reference &
to use the matched string as
sed -r 's/19[0-9A-Z]{7}/&@iitb.ac.in/' inpFile.csv
This converts 190050015
to 190050015@iitb.ac.in
for example.
If there are multiple such search patterns we want to back reference, we can do
sed -r 's/st1(r2[A-Z]{2})st3(r3[0-9]{2})/\1\2/' inpFile.csv
This would convert st1r2KZst3r343
to r2KZr343
y
(transform): The syntax is
[addr1][,addr2]y/listOfChars1/listOfChars2/
It replaces characters in the first list with the corresponding characters in the second list.
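As a quick check of the substitute and transform commands, the following sketch runs them on one-line inputs. The inputs reuse the strings from the text plus two made-up ones (a_b_c and abc-cab); -E is used here in place of -r because it is accepted by both GNU and BSD sed.

```shell
# Without g only the first match is replaced; with g every match is.
first_only=$(echo 'a_b_c' | sed 's/_/ /')
all=$(echo 'a_b_c' | sed 's/_/ /g')

# & stands for the whole matched string.
email=$(echo '190050015' | sed -E 's/19[0-9A-Z]{7}/&@iitb.ac.in/')

# \1 and \2 stand for the first and second parenthesised groups.
joined=$(echo 'st1r2KZst3r343' | sed -E 's/st1(r2[A-Z]{2})st3(r3[0-9]{2})/\1\2/')

# y maps characters one-to-one: a->A, b->B, c->C.
upper=$(echo 'abc-cab' | sed 'y/abc/ABC/')

printf '%s\n' "$first_only" "$all" "$email" "$joined" "$upper"
```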
IO commands:
n
(next line): Unless -n
is used, it prints the pattern space; it then replaces the pattern space with the next line of input (without its trailing newline) and continues with the rest of the script. This is the same read step that sed
performs by default at the start of every cycle.
N
: After reading a line to the pattern space, it adds a newline and appends the next line of input to the pattern space. It then executes the commands and prints to the output stream. This is useful for processing multiple lines at the same time.
p
: We’ve already seen this, it is used to print the entire pattern space to the output. Unless -n
is used, it will print the same line twice.
P
: This prints (only) the first line of the pattern space (the content up to and including a newline)
r filename
: This reads the filename and writes it directly to the output stream, it does not copy this to the pattern space.
w filename
: This writes the content of the pattern space to the file filename.
b
(branch unconditionally): This is like a goto
statement. For example,
sed -n -r '
/19D[0-9]{6}/ b save
w others.csv
b
:save
w dd-students.csv
' inpFile.csv
writes the names of the dual-degree students to dd-students.csv and the remaining names to others.csv.
q
: This quits sed
and does not process the rest of the file.
sed -e '50q' datafile
quits after printing the first 50 lines
h
: This is used to copy whatever is there in the pattern space to the hold space (overwriting the current content).
H
: This is used to append whatever is there in the pattern space to the hold space.
g
: This is used to copy whatever is there in the hold space to the pattern space (overwriting the current content).
G
: This is used to append whatever is there in the hold space to the pattern space.
For example, if you want to transfer all the names of the dual-degree students to the end, you can do
sed -r -e '/19D[0-9]{6}/ H' -e '/19D[0-9]{6}/ d' -e '$ G' inpFile.csv
Writing sed
scripts in a scriptfile usually hugely improves readability. The above
sed command is equivalent to
#!/bin/sh
sed -r '
/19D[0-9]{6}/ {
H
d
}
$ {
G
}' inpFile.csv
In portable scripts, the closing brace should appear on its own line, separate from the sed
commands; GNU sed is more lenient about this.
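The hold-space idiom can be traced on a tiny roster. This is a sketch under assumptions: the roll numbers and names are invented, and -E replaces -r for portability. Note that H puts a newline in front of each line it appends and G adds another, so a blank line separates the two groups in the output.

```shell
# Five-record sample roster (roll numbers and names are made up).
roster=$(printf '190050001,Asha\n19D070002,Ravi\n190050003,Meera\n19D070004,Kiran\n190050005,Zoya\n')

reordered=$(printf '%s\n' "$roster" | sed -E '
/19D[0-9]{6}/ {
    H
    d
}
$ {
    G
}')

printf '%s\n' "$reordered"
```

One caveat of this script: if the last line itself matches the pattern, d ends the cycle before G runs, so the held lines are never printed.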
awk
is a scripting language that is used for manipulating data and generating reports.
Similar to sed
, it scans a file line by line, splits each input line into fields, compares input line/fields to a pattern, and performs the given action(s) on matched lines.
The basic awk
syntax is
awk [options] 'script' file(s)
awk [options] -f scriptfile file(s)
If a pattern is missing, the action is applied to all lines. If the action is missing, the matched line is printed. You must have either a pattern or an action.
A field is a unit of data in a line. Each field is separated from the other fields by a field separator (by default whitespace). A record is a collection of the fields in a line, and a data file is made up of records.
We can address each field buffer by $1
, $2
, …, $n
, where n is the number of fields in the given record. We can also refer to the entire record buffer by $0
.
awk
has several pre-defined variables:
FS is the field separator (whitespace by default)
RS is the record separator (\n by default)
NF is the number of fields in the current record
NR is the number of the current record
OFS is the output field separator (space by default)
ORS is the output record separator (\n by default)
FILENAME is the current filename.
For example,
ls -al | awk '{print NR, $9}'
will number and print the files in the current directory
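To see several of these variables at work together, here is a small sketch. The two comma-separated records and the " | " output separator are arbitrary choices for illustration; $NF is the usual way to refer to the last field of a record.

```shell
# Two sample records with different field counts (made up).
records=$(printf 'a,b,c\nd,e\n')

# FS splits the input on commas; OFS is placed between the items
# of each print statement.
summary=$(printf '%s\n' "$records" |
    awk 'BEGIN { FS = ","; OFS = " | " } { print NR, NF, $1, $NF }')

printf '%s\n' "$summary"
```

Each output line shows the record number, the number of fields, the first field, and the last field.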
awk
scripts are divided into three major parts.
BEGIN {pre-processing statements}
pattern {action}
pattern {action}
END {post-processing statements}
The part following BEGIN
is run once at the beginning, before awk
starts reading any records from the input file. It is helpful for initializing variables, creating report headings, etc.
The body contains the logic to be applied to the input file, one record at a time.
The part following END
is done after all the records in the file have been processed. It is useful for reporting aggregates such as, say, the mean of the data.
Each part of the body can be written well-formatted as
pattern {
statement
statement
...
statement
}
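A minimal sketch of the three parts working together, computing the mean mentioned above. The three scores and the heading text are made up; the NR > 0 guard is an added precaution against dividing by zero on empty input.

```shell
# Three sample scores, one per record (made up).
scores=$(printf '10\n20\n60\n')

report=$(printf '%s\n' "$scores" | awk '
BEGIN { print "score report" }    # runs once, before any input is read
{ total += $1 }                   # body: runs for every record
END {                             # runs once, after the last record
    if (NR > 0) print "mean:", total / NR
}')

printf '%s\n' "$report"
```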
We’ve already seen two patterns, namely BEGIN
and END
. Some other patterns are
Regular expressions: enclosed by /
s like in sed
. This matches any record that contains the given regex. For example, to print all records with the text special
anywhere,
awk '/special/ {print}'
Explicit pattern-matching expressions: ~
for matching and !~
for not matching. For example, to print the records whose first field matches 19D[A-Z0-9]+
,
awk -F, '($1~/19D[A-Z0-9]+/){print NR, $0}' inpFile.csv
Note that /special/ {print}
is the same as ($0~/special/) {print}
.
We can have expressions that have arithmetic operators in them. For example,
awk '$3 * $4 > 500 {print $0}' file
prints the records that have the product of the third and fourth fields greater than 500.
Range patterns: These are used as
pattern1, pattern2 {action}
For example,
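awk '/190050009/,/190050015/ {print}' inpFile.csv
prints every record from the first one that matches 190050009 up to and including the next one that matches 190050015.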
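The four kinds of pattern can be compared on one small input. The records below are invented (a roll number followed by two numeric fields), and the variable names are only for this sketch.

```shell
# Four sample records: roll number, then two numeric fields (made up).
data=$(printf '190050009,10,60\n19D070011,5,20\n190050015,30,20\n190050020,1,1\n')

# Regular expression pattern: matches anywhere in the record.
regex_hits=$(printf '%s\n' "$data" | awk -F, '/19D/ {print $1}')

# Explicit match against a single field.
field_hits=$(printf '%s\n' "$data" | awk -F, '$1 ~ /^19D[0-9]+$/ {print NR}')

# Arithmetic expression used as a pattern.
product_hits=$(printf '%s\n' "$data" | awk -F, '$2 * $3 > 500 {print $1}')

# Range pattern: from the first match of the start pattern through
# the next match of the end pattern, inclusive.
range_hits=$(printf '%s\n' "$data" | awk -F, '/190050009/,/190050015/ {print $1}')

printf '%s\n' "$regex_hits" "$field_hits" "$product_hits" "$range_hits"
```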
awk
expressions are made up of variables, constants, and operators.
What if we want to pass shell variables to an awk
script? A variable defined in the shell is not visible inside an awk
script; we must instead pass it in with the -v
switch like
printf '%s\n' "$a"
awk -v v="$a" '
BEGIN {
    print "Printing v"; print v
}' inpFile.csv
Here, $a
is a variable outside the awk
script, which we pass to awk
by creating a new variable with name v
.
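The difference between passing a variable with -v and not passing it can be seen in a short sketch. The value of a is made up, and both scripts use only a BEGIN block, so awk reads no input.

```shell
a="hello from the shell"

# With -v, the shell value is copied into the awk variable v.
seen=$(awk -v v="$a" 'BEGIN { print "v is: " v }')

# Without -v, awk has no variable named a (the shell does not expand
# anything inside single quotes), so it evaluates to an empty string.
unseen=$(awk 'BEGIN { print "a is: [" a "]" }')

printf '%s\n' "$seen" "$unseen"
```

One thing to keep in mind is that awk interprets backslash escape sequences in a value assigned with -v.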
We can also make associative arrays, which are something like maps in C++, where the indices are defined by use. For example, in this data, we can run the following script to find the number of cases per state:
awk -F, '
# ($9~/[A-Z][A-Z]/)
{
state[$9]=state[$9]+$10
}
END {
for (i in state) {
print state[i], i;
}
}' covidCases.csv
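The same accumulation pattern can be sketched on a tiny made-up dataset. Here the state code is assumed to be in the first field and the count in the second, unlike the real file above; the output is piped through sort because the order in which for (s in count) visits the indices is unspecified.

```shell
# Sample records: state code, case count (made up).
cases=$(printf 'MH,10\nKA,5\nMH,7\nKA,1\nDL,3\n')

totals=$(printf '%s\n' "$cases" | awk -F, '
{ count[$1] += $2 }                           # index is created on first use
END { for (s in count) print s, count[s] }
' | sort)

printf '%s\n' "$totals"
```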
For output in awk
, we can use print
, printf
(like in C), and sprintf
(like in C).
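A brief sketch of the difference between the two formatted-output functions: sprintf returns the formatted string, while printf writes it and, unlike print, adds no newline of its own. The input records and format strings are made up for illustration.

```shell
formatted=$(printf 'MH 17\nKA 6\n' | awk '{
    label = sprintf("%s:%03d", $1, $2)   # build a string, print nothing
    printf "%-8s|%5.1f\n", label, $2     # write it, newline given explicitly
}')

printf '%s\n' "$formatted"
```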