I have downloaded a large website with wget. Rather than setting up a local search engine, I use grep to search the site. Grep's output shows the HTML (of course), but I only want to see the text, without all the HTML tags.
How can I accomplish this?
12 Answers
One solution I have found is piping grep's output to html2text:
sudo apt-get install html2text
grep "som* interesting" | html2text
This largely works, but it fails to a) keep grep's color highlighting, b) handle Unicode, and c) replace certain characters. Here is a more complete alternative that does not have these disadvantages:
grep --color=always "test*" * | html2text -utf8 | sed 's/l&rsquo/\"/'
Of course, you can use sed to edit other elements of the stream as well.
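As a rough sketch of how the sed step could be extended, the filter below decodes a handful of common HTML entities in one pass. The function name `decode_entities` and the choice of entities are assumptions for illustration, not part of the answer above; mapping `&rsquo;` to a plain ASCII apostrophe is likewise just one possible choice.

```shell
# Hypothetical helper: decode a few common HTML entities read from stdin.
# The entity list is illustrative, not exhaustive.
decode_entities() {
    # &amp; is handled last so that input such as "&amp;lt;" becomes "&lt;"
    # rather than "<". In a sed replacement a bare & means "the matched
    # text", so the literal ampersand is written as \&.
    sed -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&rsquo;/'/g" \
        -e 's/&nbsp;/ /g' \
        -e 's/&amp;/\&/g'
}
```

It could be appended to the pipeline in place of the single sed call, for example `grep --color=always "test*" * | html2text -utf8 | decode_entities`.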
Use the lynx command. Install it with:
sudo apt-get install lynx-cur
Then dump the page as plain text and search it:
lynx --dump infile.html | grep 'PATTERN'
HTML Tables
HTML tables start with a table tag. Table rows start with a tr tag. Table data start with a td tag.
__________________________________________________________________
1 Column:
100
__________________________________________________________________
1 Row and 3 Columns:
100 200 300
__________________________________________________________________
3 Rows and 3 Columns:
100 200 300
400 500 600
700 800 900
__________________________________________________________________
Using awk:
awk '{ gsub(/<[^>]*>/, "") } /PATTERN/ { print }' infile
Here print stands in for whatever action you want to perform on the matching lines.
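To search many files at once, the awk approach can be wrapped in a small function. This is only a sketch: the name `grep_text` is made up for illustration, and like the one-liner it strips tags line by line, so a tag that spans several lines is not removed. Because matching happens after the tags are stripped, a pattern that occurs only inside a tag or attribute does not produce a hit.

```shell
# Hypothetical helper: grep_text PATTERN FILE...
# Strips HTML tags from each line, then prints lines whose remaining text
# matches PATTERN (an awk extended regular expression), prefixed with the
# file name in the style of grep.
grep_text() {
    pattern=$1
    shift
    # -v passes the pattern into awk; note that awk interprets backslash
    # escapes in values assigned this way.
    awk -v pat="$pattern" '
        { gsub(/<[^>]*>/, "") }          # remove every <...> tag on the line
        $0 ~ pat { print FILENAME ":" $0 }
    ' "$@"
}
```

For example, `grep_text 'interesting' *.html` would list the matching text lines from every HTML file in the current directory.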