I created Tuchu because I wanted to increase my reading efficiency. It is a tool that automatically highlights the important parts of any document. Most documents take about two to three seconds to process. It's directed at students, researchers, or anyone with a reading list and little time for that matter.
During my studies I had to go through a lot of literature, for example when I had to select relevant material for my thesis, or when I had to familiarize myself with a course's reading list. Tuchu helped me to get up to speed in these cases. What started off as a command-line Python script is now a web application that does its analysis without any back-end. I don't get to see your documents.
The underlying algorithm that selects what's relevant is called TextRank, an unsupervised summarization method [1]. It models a document (or a collection thereof) as a fully connected graph. Its nodes are parts of the text (I use sentences), and the edges between them are weighted by a similarity measure, in my case simple word overlap. The sentences with the highest PageRank scores are then highlighted. For good measure, I also highlight sentences that contain signal words that, in my academic experience, signify importance.
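To make the ranking step concrete, here is a minimal Python sketch of the idea. It is an illustration under assumptions, not Tuchu's actual code: the function names, the regex-based sentence splitter, the damping factor of 0.85, and the fixed 50 iterations are all made up for the example, and raw shared-word counts stand in for "simple word overlap".

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def words(sentence):
    """Lowercased set of word tokens in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def overlap_matrix(sentences):
    """Symmetric matrix of shared-word counts; the diagonal stays zero."""
    bags = [words(s) for s in sentences]
    n = len(bags)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = float(len(bags[i] & bags[j]))
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank computed by plain power iteration."""
    n = len(weights)
    if n == 0:
        return []
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def top_sentences(text, k=2):
    """Return the k best-ranked sentences in their original document order."""
    sentences = split_sentences(text)
    scores = pagerank(overlap_matrix(sentences))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `top_sentences(document_text, k=5)` would return the five sentences to highlight. A sentence that shares no words with any other only ever receives the baseline score, which is one reason a pure overlap measure tends to favour summarizing sentences that reuse the document's vocabulary.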
It's important to note that Tuchu is not a substitute for doing your own reading. It could make you a faster reader by directing your attention to the important parts, but you'll still have to ponder the true essence of a document yourself.
Very interesting. I'd love a way to download a highlighted version of the PDF that I fed to Tuchu.
I uploaded a PDF version of a Wikipedia article to see what was selected, and at a quick glance, it's not obvious to me that the most important parts of the article have been highlighted. On the other hand, it's not obvious that trivial parts have been selected either -- leaving me intrigued to look further.
I made a similar site to Tuchu a while ago that attaches the highlights to the pdf: https://anishthite.github.io/ailight/. It's a bit slow though, I've been trying to get it to run faster.
This kind of service would be amazing with a browser extension.
Nowadays a lot of bloggers/reporters keep running round in circles before coming to the point. I actually wrote a post about this a while back [0]. With a browser extension this could save readers a lot of time. Heck, I would pay for such a service!
At this point you could print the page to a PDF and then upload it. A browser extension for webpages (as suggested by others) is a good idea! I'll look into it.
I don't have a PDF lying around that I could test it with. It would be nice if you had an example link to a PDF, or screenshots showing an example of the highlighting that the service does.
The background image with the stylized images of highlighted documents is an ideal candidate for replacement with a screenshot of actual highlighted documents.
Dude! This is super helpful. I run a website with millions of huge text PDFs and thousands of users, and being able to implement this will literally save my users man years of time.
Thanks for your feedback. In theory all languages should be supported since TextRank should be language-agnostic. Having said that, I think that my current sentence similarity strategy (by word overlap) is not a good fit for "symbol rich" languages such as Mandarin.
I am not Chinese (or of Asian descent) myself. Tuchu comes from some rainy Sunday Google translating. ;)
A bold claim which I wanted to believe but regrettably can't - yet. Tried it on a variety of material that I already know well, from pedagogical texts to academic papers in multiple disciplines to legal briefs to news articles, including my own output.
I like it. It's got potential. With north of 5000 PDFs in my library I'm extremely open to tools like this, and the methodology seems pretty sensible to me. But at present it feels kind of random: good at picking out summary sentences of what a document section will cover, or emphatically stated conclusions, bad at highlighting necessary context. I wonder if it might work better on clauses rather than full sentences, although that's probably significantly more work.
I'm sorry I don't have a more positive review, but I think it's a bold attempt even if it falls short, and want to see where it goes. Even as is, I can see myself using it sometimes as its selections offer an interesting alternative to my own skimming/emphasis preferences. Very impressed with the clean no-BS user interface and fast performance.
Hi anigbrowl, thanks for your thorough review! This is really helpful. I am aware that Tuchu's current strength mostly (and in many cases solely) lies with identifying sentences that summarize, not sentences that explain a concept or convey crucial detail. In that sense the claim of "highlighting the important parts" may indeed be a bit bold.
In my opening post I said that doing your own reading is still a requirement, but it might be good to also mention this on the site — at least until I find a way to improve the algorithm. One way could be leveraging pre-trained word embeddings, but this would require a server or downloading a large blob to the user's device beforehand. In any case Tuchu wouldn't be as fast that way.
Is Firefox not supported? Tried submitting a paper and it says "File is not a PDF, too large, or corrupt" in Firefox. Seems to work fine in Chrome.
In any case, the paper I submitted is one I coauthored, so I like to think I'm a reasonably good judge of what's important. Maybe the tool just isn't a good fit for my field or my writing, but the highlights appear to be essentially random.
Thanks for reporting this! All feedback is really welcome. Firefox is supported. The error you encountered originates from the underlying PDF renderer which, depending on the browser you use, sometimes throws an uninterpretable error. It's on my list...
With respect to your highlighting results: I am aware that it can be hit or miss at this stage. I've had really mixed feedback so far and I do have a theory that the quality of results may depend on the kind of writing (which is odd, but I digress). If you could send your document to hello@tuchu.app that would be really useful!
This is a great idea! I tried it on a 200-page document [1], and it took over a minute to process in my browser, and the result seems to be about 75% highlighted. Not sure if it's the "math-ese" or length that is the issue.
One cool thing is that it highlighted the same sentences in both English and French. I presume you translate the text before analyzing.
There's no translation being done at all! It's completely language-agnostic (apart from some hardcoded signal words, which I only noted down in English).
Really large documents such as entire books or theses are supported, but as you've experienced it results in a far from ideal experience at this point. Thanks for reporting this and linking to the document, now I know that this is a problem worth addressing. :)
Like the idea! Tested on a couple of random research papers with mixed but decent results. Really look forward to leveraging this sort of tool as it improves.
Hi, heresjohnny. I'd really like to talk with you about how you're doing the PDF and highlighting in the browser. I have some feedback, too. I sent an email to you at hello@tuchu.app and I look forward to your reply. Thanks!
EDIT: I removed my personal email from this message.
I did it by hand. The hardest part was finding a fast way to create the sentence similarity matrix. I solved that problem by creating a Matrix class around JavaScript's typed arrays.
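For readers wondering what a matrix built on a typed array looks like, here is a rough Python analogue, offered as a sketch and not as the app's JavaScript code. Python's `array("d")` plays the role of a `Float64Array`: one flat, fixed-type buffer indexed in row-major order. The `Matrix` and `similarity_matrix` names and their methods are hypothetical.

```python
from array import array


class Matrix:
    """Dense matrix stored row-major in one flat buffer of C doubles."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        # One contiguous zero-filled buffer instead of a list of lists.
        self.data = array("d", [0.0]) * (rows * cols)

    def _index(self, r, c):
        if not (0 <= r < self.rows and 0 <= c < self.cols):
            raise IndexError("matrix index out of range")
        return r * self.cols + c

    def get(self, r, c):
        return self.data[self._index(r, c)]

    def set(self, r, c, value):
        self.data[self._index(r, c)] = float(value)


def similarity_matrix(token_sets):
    """Fill a symmetric matrix with shared-token counts (hypothetical helper)."""
    n = len(token_sets)
    m = Matrix(n, n)
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(token_sets[i] & token_sets[j])
            m.set(i, j, shared)
            m.set(j, i, shared)
    return m
```

The point of the flat buffer is that every cell is a fixed-size number at a predictable offset, so filling an n-by-n matrix avoids allocating n separate row objects.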
For the complete picture, the app itself is written in Angular because I'm familiar with it (probably overkill, really) and I'm using Mozilla's pdf.js [1] to render documents.
Hey dang, is there a moderately strict guideline that says that Show HN articles should have source code? This post is somewhat interesting, but you can imagine marketers abusing the format.
[1] https://www.aclweb.org/anthology/W04-3252/