LeMUR: LLMs for Audio and Speech (assemblyai.com)
129 points by ramie on July 27, 2023 | hide | past | favorite | 23 comments


I know this is kind of offtopic, but y'all _genuinely_ need to up your UX game. Lots of popups, strange blurred text in the background that makes me think the interface itself is a popup, a YouTube URL entered into the text box that... doesn't do anything (it looks like it should), strangely contrasting and miniature text.

Really not trying to be a jerk -- I think this is a neat project.


Congrats on the launch! The "answer format" parameter is a nice idea for advanced use cases.

If someone wants to compare, Universal Summarizer [1] can do real time summarization of audio/speech, with unlimited input token length and is free (for Kagi members or can be tried with a trial account). Just point it to the URL of the podcast/audio/speech file.

Paid API is also available [2].

[1] https://kagi.com/summarizer

[2] https://help.kagi.com/kagi/api/summarizer.html


Not downplaying this, but how is it any different than using any number of free audio transcription libraries (sphinx, google, etc.) and any LLM?


No, but non-technical users will think it is magic.


It's the seamless integration that counts. With the playground, for example, you can have it read and summarize a nearly 3-hour video (https://www.youtube.com/watch?v=Se91Pn3xxSs) with a few clicks and decide whether you really want to watch it:

https://www.assemblyai.com/playground/v2/transcript/6lu93wlw...


Hey HN, Matt from AssemblyAI here. If you want to test out LeMUR one of the fastest ways is with our Google Colab: https://colab.research.google.com/drive/1xX-YeAgW5aFQfoquJPX...

I'm happy to answer questions about the API as well


Do you use Whisper for the transcript (which version? base?) and GPT-3.5-turbo for the language model? Do you provide a self-hosted solution for companies that don't want their meetings going to the cloud? I do not mean to be dismissive of all your work; I know too well that the devil is in the details. But what are the key advantages of using your solution over having a Python dev (or GPT-4) write a similar tool using LangChain + Whisper + Llama 2, for example? Again, please do not take this as a cheap shot. I might not be the target audience, but if I were to use such a tool, I would want everything to run locally because of privacy/corporate-spying concerns. Thanks!

EDIT: It is also unclear whether you support languages other than English. Whisper does, so in theory you should. There are companies out there where English is not the working language.


They have their own ASR model, Conformer-2 [0], and support 9 languages (they count it as 12) [1].

It looks like their synchronous transcription is much slower than Whisper, but if you need it fast, you need their realtime ASR (or Amazon's or Google's).

[0] Conformer-2 is trained on 1.1M hours of English: https://www.assemblyai.com/blog/conformer-2/

[1] https://www.assemblyai.com/docs/Concepts/supported_languages


You can use Deepgram, which has its own model but also offers an option to use Whisper hosted by them.


Love using Google Colab as your onboarding doc.


Can I not do the same thing with Whisper to transcribe and then pipe the data into my LLM of choice?


Their ASR model is Conformer trained on 1.1M hours, so the results should be better than Whisper's. From their pricing page, for roughly the length of a meeting (input 15,000 tokens, i.e. a 60-minute audio file; output 2,000 tokens, about 1,500 words; LeMUR default), the price estimate is $0.353, which I think is a fairly good price. This tool could save a secretary a lot of time, maybe even replace them. But I think sending out your meeting data is still quite risky.


A comparison by a competitor, but it's believable IMO. Basically about the same performance as Whisper:

- https://deepgram.com/learn/nova-speech-to-text-whisper-api

Not surprising, though: at this level, all these options are starting to be leveled by inconsistencies in the manual ground truth. Conformer alone also isn't the most powerful architecture out there for speech. This is also slower than, say, running a large k2 Zipformer via ONNX on CPU.

Also, if you have a small shop, at this point you can do all of this yourself with Whisper large-v2 on a single 16 GB GPU via some tweaking of https://github.com/guillaumekln/faster-whisper and an OSS LLM.
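A minimal sketch of that DIY route, assuming the faster-whisper package and a CUDA GPU; the quantization setting is one way to fit large-v2 in 16 GB, not the only one:

```python
def join_segments(texts) -> str:
    """Stitch Whisper segment texts together with normalized spacing."""
    return " ".join(t.strip() for t in texts if t.strip())

def transcribe(path: str, model_size: str = "large-v2") -> str:
    # Lazy import so join_segments stays usable without a GPU setup.
    from faster_whisper import WhisperModel

    # int8_float16 quantization keeps large-v2 within a 16 GB card.
    model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
    segments, _info = model.transcribe(path)
    return join_segments(seg.text for seg in segments)
```

The resulting transcript can then be handed to whatever OSS LLM you run locally.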

Interesting stuff, but I think margins in this space are getting ready to simply vanish.


Deepgram will correlate the text in your transcription with the timestamp where that was uttered. This is really really impressive and useful.


I'd recommend just trying the Colab in my comment above to test how quickly you can do what you want with LeMUR versus building your own. Piping 100 hours of audio into an LLM can be a lot of work compared to an API call, but it'll depend on what you are building.
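For a sense of the "lot of work" part: even a naive DIY version needs something like word-window chunking before a long transcript fits an LLM context (the window size here is an arbitrary placeholder):

```python
def chunk_transcript(text: str, max_words: int = 3000) -> list[str]:
    """Split a long transcript into word windows that fit a context limit."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

And after chunking you still need per-chunk summaries plus a final reduce step, which a hosted API hides from you.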


Sounds really useful!

It seems like this is cheaper than a full transcript. Is that because it skips things like diarization and timestamp alignment?


Nice! You could also use two API calls to OpenAI: browser records audio → send audio to OpenAI for speech-to-text transcription → send transcribed text for completion → display results. https://attention1.gitlab.io/ai-interface (open code)
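A rough sketch of the server side of that pipeline with the openai Python package; the model names and the prompt wording are assumptions, swap in whatever you use:

```python
def build_prompt(transcript_text: str) -> str:
    """Wrap the transcribed text in a completion instruction."""
    return f"Summarize the following transcript:\n\n{transcript_text}"

def audio_to_answer(audio_path: str) -> str:
    # Lazy import; calling this requires OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    # Call 1: audio -> text.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    # Call 2: text -> completion.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": build_prompt(transcript.text)}],
    )
    return resp.choices[0].message.content
```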


UI nit:

The floating icon for cookie settings in the bottom left obscures the play/pause button for the audio track in playground.


This looks decent, but sadly you can get better results with OpenAI.


This is another great application of AI, well done


I know complaining about name clashes is discouraged, but Lemur was also a serious brand in the music-production space for over 20 years. First they developed hardware, then transitioned to apps, but now they are no more. For me, when you say "audio" and "Lemur", it's the first thing I think of.

https://en.wikipedia.org/wiki/Lemur_(input_device)


Maybe use it in Mycroft?


Bit of a misleading name. Given the local nature of Llama, Alpaca, and Orca, one might expect LeMUR to be something you can download for yourself too. But nope, this is closer to OpenAI's pay-as-you-go model, à la GPT-3/4.



