Hacker News

sed really is a powerful utility. As a fun exercise, I once wrote a tool in sed that strips HTML tags, leaving just plain text. Naturally, it can't handle complicated cases, but for many simple use cases it works. The code, though, is basically unreadable.


Given that sed is Turing complete, I'm sure that if you had tried harder you would have come up with a general solution ;)


Absolutely. One approach would be to take this TM implementation and program it to parse HTML.


Any chance you could share that? Wouldn't mind giving it a whirl in this one side project I just started...


Isn't it more or less a cliche that using sed and the like to strip HTML tags is unreliable and potentially hazardous?

I've tried it. I could get it to apparently "work". But then I'd get some input that hosed it. Now I just use "w3m -dump". I mean, it's a browser.
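For anyone who wants to try the w3m route, here is a minimal sketch, assuming w3m is installed. The function name html_to_text is made up for illustration. It reads HTML on stdin and lets w3m do the parsing and rendering:

```shell
# html_to_text: hypothetical helper name, not part of w3m.
# Renders HTML read from stdin as plain text on stdout.
html_to_text() {
    # Fail early with a clear message if w3m is missing.
    if ! command -v w3m >/dev/null 2>&1; then
        echo "html_to_text: w3m is not installed" >&2
        return 127
    fi
    # -T text/html: treat stdin as HTML; -dump: print the rendered page and exit.
    w3m -dump -T text/html
}
```

Something like `curl -s https://example.com/ | html_to_text` would then dump a page as text (the URL is only a placeholder). Exact line wrapping and spacing depend on w3m's rendering, so don't rely on byte-for-byte output.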


There's the classic Stack Overflow answer[0] about matching HTML tags. If you have a small subset of HTML, it can work out pretty well.

[0]: https://stackoverflow.com/questions/1732348/regex-match-open...


Thanks. That's what I was thinking of. Classic "this parrot is dead" riff.


The unreliability follows from it being "basically unreadable", which leads to it probably not doing what you intend. Whether it is hazardous depends on what you use it for.


I used to write web scrapers with curl/grep/sed back in the day. Fun times.


sed -e 's/<[^>]*>//g'
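To make the limits of that one-liner concrete, here is a rough sketch. The function names strip_tags and strip_tags_multiline are mine, and the second function is one well-known variant for tags that span lines, not necessarily what anyone in this thread wrote. It assumes a sed where `[^>]` also matches an embedded newline in the pattern space, as GNU sed does.

```shell
# strip_tags: the one-liner above wrapped in a function (reads stdin).
# It works line by line, so a tag split across two lines is not removed.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# strip_tags_multiline: while the current text still contains a '<',
# append the next input line (N) and retry the substitution.
strip_tags_multiline() {
    sed -e ':a' -e 's/<[^>]*>//g' -e '/</{' -e 'N' -e 'ba' -e '}'
}

# Simple markup works:
printf '%s\n' '<p>Hello <b>world</b></p>' | strip_tags
# prints: Hello world

# A '>' inside an attribute value ends the match too early:
printf '%s\n' '<a title="a > b">x</a>' | strip_tags
# prints:  b">x   (with a leading space)

# A tag broken across lines survives the one-liner...
printf '<p\nclass="x">hello</p>\n' | strip_tags
# prints: <p
#         class="x">hello

# ...but the looping variant removes it:
printf '<p\nclass="x">hello</p>\n' | strip_tags_multiline
# prints: hello
```

Neither version decodes entities like `&amp;`, drops the contents of script or style elements, or copes with a stray `<` in text (the looping variant will keep swallowing lines until it finds a `>`), which is roughly the "input that hosed it" problem mentioned upthread.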



