When using grep on html files, how can I display only the text and not the tags?

I have downloaded a large website with wget. Rather than setting up a local search engine, I use grep to search the site. Grep's output shows the HTML (of course), but I only want to see the text, not all the HTML tags.

How can I accomplish this?

2 Answers

One solution I have found is piping grep's output to html2text:

sudo apt-get install html2text
grep "som* interesting" *.html | html2text

This largely works, but it a) loses grep's color highlighting, b) does not handle Unicode, and c) leaves certain characters unreplaced. Here is a more complete alternative that avoids these disadvantages:

grep --color=always "test*" * | html2text -utf8 | sed 's/l&rsquo/\"/'

Of course you can edit the stream using sed to change other elements as well.
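As a rough sketch of what "other elements" could look like, here is a small filter that decodes a few common HTML entities, in case any survive in the stream. The function name decode_entities is made up for this example, and the list of entities is only a starting point, not a complete decoder:

```shell
# Hypothetical helper: decode a handful of common HTML entities on stdin.
# &amp; is handled last so that text like "&amp;lt;" is not decoded twice.
decode_entities() {
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'   # \& is a literal ampersand in the replacement
}
```

You would append it to the pipeline like any other filter, for example `... | html2text -utf8 | decode_entities`.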

Use the lynx command. Install it with:

sudo apt-get install lynx-cur

Example command and output:

$ lynx --dump infile.html | grep 'PATTERN'
HTML Tables HTML tables start with a table tag. Table rows start with a tr tag. Table data start with a td tag. __________________________________________________________________
1 Column: 100 __________________________________________________________________
1 Row and 3 Columns: 100 200 300 __________________________________________________________________
3 Rows and 3 Columns: 100 200 300 400 500 600 700 800 900 __________________________________________________________________

Using awk:

awk '{gsub(/<[^>]*>/,"")} /PATTERN/{print}' infile
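If you want this to behave more like grep over many files, the same idea can be wrapped in a small function. This is only a sketch: html_grep is a hypothetical name, the pattern is an awk regular expression rather than a grep one, and tags that span several lines are not handled because the stripping is done line by line.

```shell
# Hypothetical helper: strip tags from each line, then print the lines
# whose remaining text matches the pattern, prefixed with the file name.
html_grep() {
    pattern=$1
    shift
    awk -v pat="$pattern" '
        { gsub(/<[^>]*>/, "") }              # remove every <...> on this line
        $0 ~ pat { print FILENAME ": " $0 }  # match against the text only
    ' "$@"
}
```

Usage would be something like `html_grep 'PATTERN' *.html`. Because the match happens after the tags are removed, a pattern such as `class` will not hit attribute names inside tags.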