Crawlee is open source and free to use, and we don't have any plans to monetize it in the future. It will always be free to use.
We provide the Apify platform, where you can publish your scrapers as Actors for the developer community, and developers earn money through it. You can use Crawlee for Python as well :)
tl;dr: Crawlee is and always will be free to use and open source.
- Crawlee has out-of-the-box support for headless browser crawling (Playwright). You don't have to install any plugins or set up middleware.
- Crawlee has a minimalistic & elegant interface: set up your scraper in fewer than 10 lines of code (see the sketch after this list). You don't have to figure out which middleware or settings exist or need to be changed. On top of that, we also have templates, which flatten the learning curve further.
- Complete type hint coverage, which is something Scrapy hasn't achieved yet.
- Based on standard asyncio. Integrating Scrapy into a classic asyncio app requires bridging Twisted and asyncio, which is possible but not easy, and can cause trouble.
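A minimal sketch of what that looks like, based on the Crawlee for Python README at the time of writing (the start URL and the 10-request cap are just placeholders):

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(max_requests_per_crawl=10)

    # The default handler runs for every crawled page.
    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        await context.push_data({'url': context.request.url, 'title': await context.page.title()})
        await context.enqueue_links()  # follow links found on the page

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```

Note that the whole thing is a plain `asyncio.run(...)` entry point, which is the point of the last bullet: it drops into any existing asyncio app without a Twisted bridge.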
> You don't have to install any plugin or set up the middleware.
That cuts both ways, in true 80/20 fashion: it also means that anyone who isn't on the happy path Crawlee was designed for is going to have to edit your Python files (`pip install -e` type business) to achieve their goals.
I've been working on a crawler recently, and honestly you need the flexibility middleware gives you. You can only get so far with reasonable defaults; crawling isn't a one-size-fits-all kinda thing.
Crawlee isn’t any less configurable than Scrapy. It just uses different and, in my personal opinion, more approachable patterns. That makes it easier to start with, but you can still tweak whatever you want. Btw, you can add middleware-style behavior via the Crawlee Router (sketch below).
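One way to approximate middleware with the Router is to wrap handlers before registering them. The Router and crawler APIs below follow the Crawlee for Python docs; the `with_logging` wrapper itself is a hypothetical illustration, not part of Crawlee's API:

```python
import functools

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()


def with_logging(handler):
    # Hypothetical middleware-style wrapper: runs code around the real handler.
    @functools.wraps(handler)
    async def wrapper(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'-> {context.request.url}')
        await handler(context)
        context.log.info(f'<- {context.request.url}')

    return wrapper


@router.default_handler
@with_logging
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.push_data({'url': context.request.url})


crawler = BeautifulSoupCrawler(request_handler=router)
```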
> Crawlee isn’t any less configurable than Scrapy.
Oh, then I have obviously overlooked how one would be able to determine if a proxy has been blocked and evict it from the pool <https://github.com/rejoiceinhope/scrapy-proxy-pool/blob/b833...>. Or how to use an HTTP cache independent of the "browser" cache (e.g. to allow short-circuiting the actual request if I can prove it isn't stale for my needs, which enables recrawls to fix logic bugs, or even downloading the actual request-response payloads for making better tests) https://docs.scrapy.org/en/2.11/topics/downloader-middleware...
Unless you meant what I said about `pip install -e && echo glhf`, in which case, yes, it's a "simple matter of programming" into a framework that was not designed to be extended.
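For context on the cache point: in Scrapy, that recrawl-friendly cache is just the built-in downloader middleware flipped on via settings (the values here are illustrative):

```python
# settings.py -- Scrapy's built-in HTTP cache (a downloader middleware).
# With DummyPolicy, any request already in the cache is served from disk
# without hitting the network, which enables cheap recrawls and offline tests.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'    # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```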
Cache management is also what I had in mind. I've been using golang+colly, and the default caching behavior is just different enough from what I need. I haven't written a custom cache middleware yet, but I'm getting to that point.