sed really is a powerful utility. I once wrote a tool in sed that strips HTML ta...

saagarjha · on Feb 15, 2019

Given that sed is Turing complete, I'm sure if you tried harder you would have come up with a general solution ;)

mannykannot · on Feb 15, 2019

Absolutely. One approach would be to take this TM implementation and program it to parse HTML.

airstrike · on Feb 15, 2019

Any chance you could share that? Wouldn't mind giving it a whirl in this one side project I just started...

mirimir · on Feb 15, 2019

Isn't it more or less a cliche that using sed etc. for stripping HTML tags is unreliable and potentially hazardous?

I've tried it. I could get it to apparently "work". But then I'd get some input that hosed it. Now I just use "w3m -dump". I mean, it's a browser.

testudovictoria · on Feb 15, 2019

There's the classic Stack Overflow answer[0] about matching HTML tags. If you have a small subset of HTML, it can work out pretty well.

[0]: https://stackoverflow.com/questions/1732348/regex-match-open...

mirimir · on Feb 15, 2019

Thanks. That's what I was thinking of. Classic "this parrot is dead" riff.

mannykannot · on Feb 15, 2019

The unreliability follows from it being "basically unreadable", which leads to it probably not doing what you intend. Whether it is hazardous depends on what you use it for.

james_s_tayler · on Feb 15, 2019

I used to write web scrapers with curl/grep/sed back in the day. Fun times.

somacert · on Feb 15, 2019

sed -e 's/<[^>]*>//g'
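
For anyone wanting to try that one-liner, here is a rough sketch of where it holds up and where it gets hosed, as mentioned upthread. The `strip_tags` wrapper name and the sample inputs are made up for illustration; only the sed expression comes from the comment above.

```shell
# strip_tags: hypothetical wrapper around the one-liner above.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# Simple markup on a single line: works as intended.
printf '%s\n' '<p>Hello, <b>world</b>!</p>' | strip_tags
# prints: Hello, world!

# A ">" inside an attribute value ends the match early and leaks markup.
printf '%s\n' '<a title="a > b">link</a>' | strip_tags
# prints:  b">link   (with a leading space)

# sed processes one line at a time, so a tag split across two lines
# is never matched as a whole.
printf '%s\n' '<a' 'href="x">link</a>' | strip_tags
# prints: <a
#         href="x">link
```

So it is probably fine for a small, known subset of HTML that you control, and not something to point at arbitrary pages.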