Thursday, September 16, 2010

Bulk edits of documents with Sed

I recently needed to remove a section of HTML code from a number of HTML documents. After a bit of searching, it seemed that the best tool for the job was sed (Stream EDitor), which is used on Linux to filter or transform text.

The next thing I wanted to do was find out how I could use sed to:
  1. Find a string within an HTML file.
  2. Delete the string.
  3. Do this for more than one HTML file.
The first step was to create a bash script with a for loop that cycles through the HTML files one by one.

The script I wrote was based on the following (I got this idea from here: http://gabeanderson.com/2008/02/01/unixlinux-find-replace-in-multiple-files/).

for fl in *.php; do
  mv "$fl" "$fl.old"
  sed 's/FINDSTRING/REPLACESTRING/g' "$fl.old" > "$fl"
  #rm -f "$fl.old"
done


In my case I want to delete the string and I'm dealing with .html files, so the first line of my script becomes:

for fl in *.html; do

The second line remains the same:

#renames the .html files to .html.old

mv "$fl" "$fl.old"


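The string I wanted to delete spanned multiple lines of HTML, so I decided to specify where it starts and where it ends. Also, instead of replacing text I just wanted to delete it, so I swapped the substitute command (s/.../.../g) for a pair of address patterns followed by the delete command (d). Sed then deletes every line from the first one matching the start pattern through the next one matching the end pattern.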

My third line becomes:

#this deletes every line from the one containing the <h4> tag through the one containing the close table cell tag, and redirects the output to .html

sed '/<h4>/,/<\/td>/d' "$fl.old" > "$fl"
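
To see what that range address does before pointing it at real pages, here is a minimal sketch run against a throwaway file. The variable name and the sample HTML are invented purely for illustration. Note that sed deletes whole lines, so any other markup sharing a line with the start or end pattern goes too.

```shell
# Create a throwaway sample file (contents are made up for this demo)
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr><td>
<p>Keep this paragraph</p>
<h4>Remove me</h4>
<p>and me</p>
</td>
<td>Keep this cell</td></tr>
EOF

# Delete every line from the first one matching <h4>
# through the next one matching </td>
result=$(sed '/<h4>/,/<\/td>/d' "$sample")
printf '%s\n' "$result"

# Clean up the throwaway file
rm -f "$sample"
```

Only the first, second and last lines of the sample should be printed. If the end pattern is never found after a start match, sed keeps deleting through the end of the file, so it is worth checking the output like this first.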

I didn't want to keep the .old files so I uncommented the fourth line:

rm -f "$fl.old" #deletes the .html.old files

So, the script ends up looking like this:


#!/bin/bash

# script to delete multiple lines of text from multiple html files

for fl in *.html; do
  mv "$fl" "$fl.old"
  sed '/<h4>Are you/,/<\/td>/d' "$fl.old" > "$fl"
  rm -f "$fl.old"
done


To run it, I just had to make it executable and run it from within the directory containing the .html files.
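
For anyone who wants to try those last steps safely, here is a rough sketch that does everything in a scratch directory. The script name strip-section.sh and the sample page are hypothetical (the original file name isn't given above), and it assumes the scratch directory allows executing files.

```shell
# Work in a scratch directory so no real files are touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A sample page (invented for this demo)
printf '%s\n' '<p>Keep</p>' '<h4>Are you sure?</h4>' '<p>Drop</p>' '</td>' '<p>Also keep</p>' > page.html

# Save the script under a hypothetical name
cat > strip-section.sh <<'EOF'
#!/bin/bash
for fl in *.html; do
  mv "$fl" "$fl.old"
  sed '/<h4>Are you/,/<\/td>/d' "$fl.old" > "$fl"
  rm -f "$fl.old"
done
EOF

chmod +x strip-section.sh   # make the script executable
./strip-section.sh          # run it from the directory holding the .html files
cat page.html               # show what is left of the page
```

After the run, page.html should contain only the two "keep" paragraphs, and no .old file should remain.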

2 comments:

  1. Hey AC,

    Nice work! Sed also has an in-place option (-i), which will change the file directly, rather than print to standard out.

    This means that you don't need to do the moving of files. You can test your sed command without the in-place option, and when you're happy with what you see on stdout, add it in and you're done!

    So your script could be something simpler, like this.

    for x in *.html
    do sed -i 's/FIND/REPLACE/g' "$x"
    done

    But then sed can also work with multiple input files, so in the end your script could be a one-liner, like so.

    sed -i 's/FIND/REPLACE/g' *.html

    Isn't Linux awesome? :-)

    -c

  2. That's much easier! Thanks Chris.
