Show HN: Tuchu – Automatically highlight the important parts of a document

heresjohnny · on Dec 28, 2020

Hey HN,

I created Tuchu because I wanted to increase my reading efficiency. It is a tool that automatically highlights the important parts of any document. Most documents take about two to three seconds to process. It's directed at students, researchers, or anyone with a reading list and little time for that matter.

During my studies I had to go through a lot of literature, for example when I had to select relevant material for my thesis, or when I had to familiarize myself with a course's reading list. Tuchu helped me to get up to speed in these cases. What started off as a command-line Python script is now a web application that does its analysis without any back-end. I don't get to see your documents.

The underlying algorithm that selects what's relevant is called TextRank, an unsupervised summarization method [1]. It models a document (or a collection thereof) as a fully connected graph. Its nodes are parts of the text (I use sentences), and the edges between them are weighted by a similarity measure, in my case simple word overlap. The sentences with the highest PageRank scores are then highlighted. For good measure, I also highlight sentences that contain signal words that, in my academic experience, signify importance.
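For readers who want to see the idea in code, here is a minimal sketch of TextRank-style sentence ranking. It is not Tuchu's actual implementation: the function names, the naive sentence splitter, the 0.85 damping factor, the fixed iteration count, and the `fraction` parameter are all assumptions made for illustration. The overlap score uses the log-length normalisation from the TextRank paper, which may differ from the "simple word overlap" used in the app.

```typescript
// Naive sentence splitter: cuts after ., ! or ? (assumption; real text needs more care).
function splitSentences(text: string): string[] {
  const parts = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Unique lowercased words of a sentence (Latin script only).
function uniqueWords(sentence: string): string[] {
  const words = sentence.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Shared-word count divided by the sum of log sentence lengths.
function overlap(a: string[], b: string[]): number {
  const denom = Math.log(a.length) + Math.log(b.length);
  if (!(denom > 0)) return 0; // empty or one-word sentences
  return a.filter((w) => b.indexOf(w) !== -1).length / denom;
}

// Weighted PageRank over the sentence graph, run for a fixed number of iterations.
function textRank(sentences: string[], damping = 0.85, iterations = 50): number[] {
  const n = sentences.length;
  const words = sentences.map(uniqueWords);
  const weights: number[][] = [];
  const outSum: number[] = [];
  for (let i = 0; i < n; i++) {
    weights.push([]);
    let sum = 0;
    for (let j = 0; j < n; j++) {
      const wij = i === j ? 0 : overlap(words[i], words[j]);
      weights[i].push(wij);
      sum += wij;
    }
    outSum.push(sum);
  }
  let scores: number[] = sentences.map(() => 1);
  for (let it = 0; it < iterations; it++) {
    scores = scores.map((_, i) => {
      let incoming = 0;
      for (let j = 0; j < n; j++) {
        if (outSum[j] > 0) incoming += (weights[j][i] / outSum[j]) * scores[j];
      }
      return 1 - damping + damping * incoming;
    });
  }
  return scores;
}

// Return the top-scoring fraction of sentences, in document order.
function highlight(text: string, fraction = 0.2): string[] {
  const sentences = splitSentences(text);
  const scores = textRank(sentences);
  const keep = Math.max(1, Math.round(sentences.length * fraction));
  const top = sentences
    .map((_, i) => i)
    .sort((a, b) => scores[b] - scores[a])
    .slice(0, keep);
  return top.sort((a, b) => a - b).map((i) => sentences[i]);
}
```

A sentence that shares no words with any other sentence receives no incoming score and ends up at the floor value of `1 - damping`, so it is never picked ahead of connected sentences.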

It's important to note that Tuchu is not a substitute for doing your own reading. It could make you a faster reader by directing your attention to the important parts, but you'll still have to ponder the true essence of a document yourself.

[1] https://www.aclweb.org/anthology/W04-3252/

rdhyee · on Dec 28, 2020

Very interesting. I'd love a way to download a highlighted version of the PDF that I fed to Tuchu.

I uploaded a PDF version of a Wikipedia article to see what was selected, and at a quick glance, it's not obvious to me that the most important parts of the article have been highlighted. On the other hand, it's not obvious that trivial parts have been selected either -- leaving me intrigued to look further.

aquajet · on Dec 28, 2020

I made a similar site to Tuchu a while ago that attaches the highlights to the PDF: https://anishthite.github.io/ailight/. It's a bit slow though; I've been trying to get it to run faster.

krat0sprakhar · on Dec 28, 2020

Looks interesting! Is there a way to try it out on a webpage instead of PDF?

quaintdev · on Dec 28, 2020

This kind of service would be amazing with a browser extension.

Nowadays a lot of bloggers/reporters keep running round in circles before coming to the point. I actually wrote a related post about this a while back [0]. With a browser extension this could save readers a lot of time. Heck, I would pay for such a service!

[0]: https://www.ankshilp.com/stop_beating_around_the_bush/

heresjohnny · on Dec 29, 2020

Hi krat0sprakhar,

At this point you could print the page to a PDF and then upload it. A browser extension for webpages (as suggested by others) is a good idea! I'll look into it.

donclark · on Dec 28, 2020

I agree as well. This may have a 2nd life as a browser extension to use on articles on the web.

sologuardsman2 · on Dec 28, 2020

+1 to the browser extension idea

donclark · on Dec 28, 2020

I don't have a PDF lying around that I could test it with. It would be nice if you had an example link to a PDF, or screenshots showing an example of the highlighting that the service does.

yorwba · on Dec 28, 2020

The background image with the stylized images of highlighted documents is an ideal candidate for replacement with a screenshot of actual highlighted documents.

iav · on Dec 29, 2020

Dude! This is super helpful. I run a website with millions of huge text PDFs and thousands of users, and being able to implement this will literally save my users man-years of time.

heresjohnny · on Dec 29, 2020

Hey iav,

Cool! Would love to hear more. Perhaps you could reach out at hello@tuchu.app? :)

ArubaJamaica · on Dec 29, 2020

Curious to know what website this is

iav · on Dec 29, 2020

bankrupt11.com

karthikb · on Dec 28, 2020

How did Tuchu's highlighting compare to the abstract?

giovannibonetti · on Dec 29, 2020

Does it support languages other than English?

jhvkjhk · on Dec 29, 2020

I tested with a 26-page Chinese PDF and the answer is no: Tuchu only highlighted three English sentences.

This is weird, as Tuchu means highlight in Chinese.

heresjohnny · on Dec 29, 2020

Hi jhvkjhk,

Thanks for your feedback. In theory all languages should be supported since TextRank should be language-agnostic. Having said that, I think that my current sentence similarity strategy (by word overlap) is not a good fit for "symbol rich" languages such as Mandarin.
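To illustrate why word overlap struggles here: Chinese text has no spaces between words, so a whitespace tokenizer returns the whole sentence as a single token and two related sentences share nothing. The sketch below contrasts that with character bigrams, a common dictionary-free workaround. This is an assumption about how one might address the problem, not something Tuchu is known to do, and all function names are hypothetical.

```typescript
// Whitespace tokenizer: only meaningful for space-delimited languages.
function whitespaceTokens(sentence: string): string[] {
  return sentence.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Character bigrams: a crude stand-in for real word segmentation.
function charBigrams(sentence: string): string[] {
  const chars = sentence.replace(/[\s，。！？、,.!?]/g, "");
  const grams: string[] = [];
  for (let i = 0; i + 1 < chars.length; i++) {
    grams.push(chars.slice(i, i + 2));
  }
  return grams;
}

// Number of distinct tokens of `a` that also occur in `b`.
function sharedCount(a: string[], b: string[]): number {
  return a.filter((t, i) => a.indexOf(t) === i && b.indexOf(t) !== -1).length;
}

const zhA = "我喜欢阅读学术论文"; // "I like reading academic papers"
const zhB = "学术论文很难阅读";   // "Academic papers are hard to read"

const byWhitespace = sharedCount(whitespaceTokens(zhA), whitespaceTokens(zhB)); // 0
const byBigrams = sharedCount(charBigrams(zhA), charBigrams(zhB)); // 4
console.log(byWhitespace, byBigrams);
```

With whitespace tokens the two sentences look unrelated, while bigrams recover the shared fragments for "reading" and "academic papers".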

I am not Chinese (or of Asian descent) myself. Tuchu comes from some rainy Sunday Google translating. ;)

anigbrowl · on Dec 29, 2020

A bold claim which I wanted to believe but regrettably can't - yet. Tried it on a variety of material that I already know well, from pedagogical texts to academic papers in multiple disciplines to legal briefs to news articles, including my own output.

I like it. It's got potential. With north of 5000 PDFs in my library I'm extremely open to tools like this, and the methodology seems pretty sensible to me. But at present it feels kind of random - good at picking out summary sentences of what a document section will cover, or emphatically stated conclusions, bad at highlighting necessary context. I wonder if it might be better trained on clauses rather than full sentences, although that's probably significantly more work.

I'm sorry I don't have a more positive review, but I think it's a bold attempt even if it falls short, and want to see where it goes. Even as is, I can see myself using it sometimes as its selections offer an interesting alternative to my own skimming/emphasis preferences. Very impressed with the clean no-BS user interface and fast performance.

heresjohnny · on Dec 29, 2020

Hi anigbrowl, thanks for your thorough review! This is really helpful. I am aware that Tuchu's current strength mostly (and in many cases solely) lies with identifying sentences that summarize, not sentences that explain a concept or convey crucial detail. In that sense the claim of "highlighting the important parts" may indeed be a bit bold.

In my opening post I said that doing your own reading is still a requirement, but it might be good to also mention this on the site — at least until I find a way to improve the algorithm. One way could be leveraging pre-trained word embeddings, but this would require a server or downloading a large blob to the user's device beforehand. In any case Tuchu wouldn't be as fast that way.
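As a rough illustration of the embedding idea, sentence similarity could be computed as the cosine between averaged word vectors instead of counting shared words. The sketch below is only that: the three-dimensional vectors are invented toy numbers, and the names `toyVectors`, `sentenceVector` and `cosine` are hypothetical. Real pre-trained embeddings have hundreds of dimensions per word, which is where the large download mentioned above comes from.

```typescript
// Toy "embeddings" with made-up values, for illustration only.
const toyVectors: Record<string, number[]> = {
  car: [0.9, 0.1, 0.0],
  automobile: [0.85, 0.15, 0.0],
  fast: [0.1, 0.9, 0.0],
  quick: [0.15, 0.85, 0.0],
  banana: [0.0, 0.1, 0.9],
};

// Average the vectors of all known words in a sentence.
function sentenceVector(sentence: string): number[] {
  const sum = [0, 0, 0];
  const words = sentence.toLowerCase().match(/[a-z]+/g) || [];
  let known = 0;
  for (const w of words) {
    if (!Object.prototype.hasOwnProperty.call(toyVectors, w)) continue;
    const vec = toyVectors[w];
    for (let i = 0; i < sum.length; i++) sum[i] += vec[i];
    known++;
  }
  return known === 0 ? sum : sum.map((x) => x / known);
}

// Cosine similarity; defined as 0 when either vector is all zeros.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// No shared words, yet the vectors point the same way.
const similar = cosine(sentenceVector("fast car"), sentenceVector("quick automobile"));
const unrelated = cosine(sentenceVector("fast car"), sentenceVector("banana"));
console.log(similar > unrelated);
```

The two paraphrased sentences share no words, so word overlap would score them at zero, while the vector-based score treats them as close.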

anigbrowl · on Dec 29, 2020

Well, your reach should exceed your grasp! It's already an interesting tool and one that I'm sure will improve as you work further on it.

monkeydust · on Dec 28, 2020

You need an example PDF on the page. Cycle through a few topical examples, e.g. the Brexit deal full text (it's over 1000 pages!)

https://ec.europa.eu/transparency/regdoc/rep/1/2020/EN/COM-2...

heresjohnny · on Dec 29, 2020

Hi monkeydust,

Good idea! I'll note it down. :)

0xffff2 · on Dec 28, 2020

Is Firefox not supported? Tried submitting a paper and it says "File is not a PDF, too large, or corrupt" in Firefox. Seems to work fine in Chrome.

In any case, the paper I submitted is one I coauthored, so I like to think I'm a reasonably good judge of what's important. Maybe the tool just isn't a good fit for my field or my writing, but the highlights appear to be essentially random.

heresjohnny · on Dec 29, 2020

Hi 0xffff2,

Thanks for reporting this! All feedback is really welcome. Firefox is supported. The error you encountered originates from the underlying PDF renderer which, depending on the browser you use, sometimes throws an uninterpretable error. It's on my list...

With respect to your highlighting results: I am aware that it can be hit or miss at this stage. I've had really mixed feedback so far and I do have a theory that the quality of results may depend on the kind of writing (which is odd, but I digress). If you could send your document to hello@tuchu.app that would be really useful!

mbroshi · on Dec 29, 2020

This is a great idea! I tried it on a 200-page document [1], and it took over a minute to process in my browser, and the result seems to be about 75% highlighted. Not sure if it's the "math-ese" or length that is the issue.

One cool thing is that it highlighted the same sentences in both English and French. I presume you translate the text before analyzing.

[1] https://www.math.mcgill.ca/darmon/theses/leahy/thesis.pdf

heresjohnny · on Dec 29, 2020

Hi mbroshi,

There's no translation being done at all! It's completely language-agnostic (apart from some hardcoded signal words, which I only noted down in English).

Really large documents such as entire books or theses are supported, but as you've experienced it results in a far from ideal experience at this point. Thanks for reporting this and linking to the document, now I know that this is a problem worth addressing. :)

sologuardsman2 · on Dec 28, 2020

Like the idea! Tested on a couple of random research papers with mixed but decent results. Really look forward to leveraging this sort of tool as it improves.

rman666 · on Dec 29, 2020

Hi, heresjohnny. I'd really like to talk with you about how you're doing the PDF and highlighting in the browser. I have some feedback, too. I sent an email to you at hello@tuchu.app and I look forward to your reply. Thanks!

EDIT: I removed my personal email from this message.

theaussiestew · on Dec 28, 2020

Looks useful. What framework did you use to implement TextRank?

heresjohnny · on Dec 29, 2020

Hi theaussiestew,

I did it by hand. The hardest part was finding a way to create the sentence similarity matrix quickly. I solved that problem by creating a Matrix class around JavaScript's typed arrays.
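For the curious, a matrix class of that kind might look roughly like the sketch below. This is a guess at the general shape, not the actual Tuchu code: the class layout, the method names and the `similarityMatrix` helper are assumptions. The idea is to keep all values in one flat `Float64Array` in row-major order, which avoids allocating an array per row, and to compute each sentence pair only once since similarity is symmetric.

```typescript
// Dense matrix stored row-major in a single Float64Array.
class Matrix {
  readonly rows: number;
  readonly cols: number;
  readonly data: Float64Array;

  constructor(rows: number, cols: number) {
    this.rows = rows;
    this.cols = cols;
    this.data = new Float64Array(rows * cols); // zero-initialised
  }

  get(r: number, c: number): number {
    return this.data[r * this.cols + c];
  }

  set(r: number, c: number, value: number): void {
    this.data[r * this.cols + c] = value;
  }

  // Sum of one row, e.g. the total edge weight leaving a node.
  rowSum(r: number): number {
    let sum = 0;
    const start = r * this.cols;
    for (let c = 0; c < this.cols; c++) sum += this.data[start + c];
    return sum;
  }
}

// Fill a symmetric similarity matrix, evaluating each pair once.
// The diagonal stays at zero (no self-similarity edges).
function similarityMatrix<T>(items: T[], sim: (a: T, b: T) => number): Matrix {
  const n = items.length;
  const m = new Matrix(n, n);
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      const s = sim(items[i], items[j]);
      m.set(i, j, s);
      m.set(j, i, s);
    }
  }
  return m;
}
```

Any pairwise function can be plugged in as `sim`, for instance a word-overlap score between two tokenized sentences.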

For the complete picture, the app itself is written in Angular because I'm familiar with it (probably overkill, really) and I'm using Mozilla's pdf.js [1] to render documents.

[1] https://github.com/mozilla/pdf.js

libeclipse · on Dec 29, 2020

This doesn't work well on any of the 4 documents I tested it on. Highlights barely anything with little reason for the parts that it does.

toufique · on Dec 28, 2020

Love this! Would be amazing as a Chrome Extension for web pages.

lowkeynthorough · on Dec 29, 2020

Hey dang, is there a moderately strict guideline that says that Show HN articles should have source code? This post is somewhat interesting, but you can imagine marketers abusing the format.

purplecats · on Dec 29, 2020

this is pretty good!

bzb6 · on Dec 28, 2020

[flagged]

dang · on Dec 28, 2020

Please really don't keep doing this here.

flamble · on Dec 28, 2020

This is clearly a joke, but "tuchu" (突出) is Mandarin for "emphasize" or "stick out", not random syllables.