
wc counts bytes; to make it count characters, use -m in the GNU version.


I think the point being made is that -m does not count characters; it counts multibyte sequences, or at least tries to. The same Unicode code point encoded in UTF-8, UTF-16, or UTF-32 can be a very different string of bytes, and there is no way to tell which unless you know beforehand whether you are dealing with UTF-8 or UTF-16. Hence the BOM, but no one likes that.

It's hard. And possibly we have to abandon tools like wc when we leave the Latin world.
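
To make the encoding point concrete, here is a rough Python sketch (not anything wc itself does). The sample character and the helper name byte_lengths are my own choices for illustration. It encodes one code point three ways and compares the byte counts with the code point count.

```python
# One code point, three encodings: the byte counts differ.
text = "\u20ac"  # EURO SIGN, a single Unicode code point (illustrative choice)

def byte_lengths(s):
    """Return the encoded length in bytes of s for each encoding.

    byte_lengths is a hypothetical helper for this example only.
    The -le variants are used so that no BOM is prepended.
    """
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(s.encode(enc)) for enc in encodings}

print(len(text))           # code points in the string
print(byte_lengths(text))  # bytes needed per encoding
```

A byte counter sees three different sizes for the same one-character string, and nothing in the bytes themselves says which encoding produced them.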



