Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

What does the tech do exactly?


The bulk of it was:

News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.


I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above and nothing works as well as I need it to. The best performer is readability, so I will probably be going with the python port of that.

If your code works well I also think you should put it up on github. You can see what I intend to use this technology for by reading this text snippet: https://github.com/sbuss/revisionews/blob/develop/web/index....


Currently in the middle of a re-architecture/re-write due to its flaws but something similar to this in Ruby I worked on last year: http://github.com/peterc/pismo


This looks pretty great, I'll definitely keep an eye on it. Thanks :)


Which python port are you using? Last time I looked all the python Readability code I could find was either incomplete, old, or buggy.


I've experimented with https://github.com/gfxmonk/python-readability but it's extremely slow. There's a decent fork called decruft that is a couple orders of magnitude faster http://www.minvolai.com/blog/decruft-arc90s-readability-in-p...

Decruft also has a couple bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: