Omnivore | Full Time | REMOTE or ONSITE (Tampa, FL) | USA or Canada
Omnivore is a universal API for restaurant point-of-sale systems. Our API integrates hundreds of apps directly into the brains of the restaurant, without app developers having to worry about the fragmented POS industry. Our recently launched menu management product makes it easy for restaurants to create their perfect (and integrated!) digital menu for delivery, online ordering, and more.
We're API obsessed. All of our products are served via carefully crafted REST APIs. All of our forward-looking technology is built in Go, and we have a substantial amount of Python and JavaScript/Node.js running in our cloud and deployed into restaurants.
You should give us a look if these problems sound interesting to you:
* Building a rapidly growing REST API in Go
* Deploying, updating, and managing an on-prem agent deployed to 10,000+ restaurants
* Creating an awesome developer ecosystem in an industry notorious for legacy technology
* Reverse engineering systems built 20+ years ago, and making them easy to use via modern APIs
Omnivore is just over 50 people today, and was built as a remote-first company from day one.
We're hiring:
* Full Stack / Backend Engineers - help build product!
* Senior Go/Golang Engineers - help support a rapidly evolving codebase!
* Site Reliability Engineers - help scale our systems and monitoring!
Apply at https://jobs.omnivore.io or email me directly at kevin.pfab [at] omnivore.io.
I'm a founder of Emergent One, a startup that uses agent-based technology similar to Chartio's to access production databases and build out RESTful APIs. Some of our APIs are write-enabled, which makes the risk even higher than Chartio's.
We've spent a lot of time thinking about security risks and writing code to reduce them. I thought I'd share a few things we do and that we've learned from our experience:
* The agent approach is the most popular because it allows a system administrator to easily sever the connection from the database server without having to worry about writing queries to revoke user access (see the sketch after this list).
* We never run unindexed queries without an explicit request from a customer and a manual entry from an Emergent One employee.
* We're currently looking into security consultants to continuously test our production environment.
* We're building an appliance version of our software, much like GitHub Enterprise, to accommodate the customers that aren't comfortable with their data hitting the cloud.
* We strive to offer very quick and personal customer service directly from engineers. The vast majority of responses come within the hour.
and last but certainly not least...
* The very best thing we can do is be honest and straightforward about the inherent risk behind our platform. Being able to build and maintain a pristine level of trust is the only thing that will keep us in business.
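Here's the sketch promised in the first bullet: a minimal illustration, with made-up names, of why the agent model gives a sysadmin such a clean kill switch. This is just the shape of the idea, not our actual agent code.

    # Hypothetical sketch: the agent dials OUT to the cloud, so stopping
    # this process (or blocking its outbound traffic) severs all access
    # instantly -- no REVOKE statements, no credential hunting.
    import socket
    import time

    CLOUD_HOST = "relay.example.com"   # made-up relay endpoint
    CLOUD_PORT = 443

    def serve_requests(conn):
        # In the real agent this would proxy vetted queries to the
        # local database; stubbed out for the sketch.
        conn.close()

    def run_agent():
        while True:
            try:
                conn = socket.create_connection((CLOUD_HOST, CLOUD_PORT))
                serve_requests(conn)
            except OSError:
                time.sleep(5)          # relay unreachable; back off and retry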
I'm sure Chartio does things very similarly. Direct-database access technology is not for everyone, but it's also proving to be extremely valuable for both Chartio's customers and ours. The cloud advantage that makes most SaaS software great is still there.
Oftentimes, particularly in enterprise software, part of the terms of a deal includes a "source code escrow" clause. This means that if you go out of business, the customer with this term gets your platform's source code released to them, so their investment in your technology doesn't implode completely.
The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.
If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.
This should be stressed: sites like Facebook do exactly this. Constant changes mean constantly updating your scraper. And when it comes to A/B testing, your scraper needs to intelligently find the data, which might not always be in the same place.
Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.
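To sketch what I have in mind (toy code, every name invented): derive class names from a secret plus a time bucket, and serve CSS generated the same way.

    import hashlib
    import time

    SECRET = b"rotate-me"   # made-up server-side secret
    WINDOW = 3600           # rotate hourly; shorter windows cache worse

    def class_name(logical_name):
        # Derive an opaque class name that changes every WINDOW seconds,
        # so scrapers keyed to class names break on each rotation.
        bucket = str(int(time.time() // WINDOW)).encode()
        digest = hashlib.sha256(SECRET + bucket + logical_name.encode())
        return "c" + digest.hexdigest()[:8]

    price = class_name("price")
    html = '<span class="%s">$9.99</span>' % price
    css = ".%s { font-weight: bold; }" % price   # served alongside the page

The window length is exactly the caching trade-off: rotate per request and nothing caches; rotate hourly and a determined scraper only has to re-learn the names once an hour.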
> I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping.
In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.
My first attempt used a short list of heuristics that proved to be eerily successful for what I was after: I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected roughly 70% of the time in my tests.
After that, I started introducing some machine learning in an attempt to replicate how I determine which blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel it could be quite powerful with some tweaking. Sadly, I've become busy with other things and haven't had time to revisit it.
With that, swapping variables and similar techniques to thwart crawlers seems like it would be easily circumvented.
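To give a flavor of the kind of heuristic I mean (a simplified sketch, not my actual list): score candidate blocks by text density versus link density, and keep the dense, link-poor ones.

    import re

    def extract_blocks(html, min_text=80, max_link_ratio=0.3):
        # Split on common block-level tags, then keep blocks with lots
        # of text and few links: real content (posts, comments) tends
        # to be text-dense, while chrome (nav bars, footers) tends to
        # be link-dense.
        blocks = re.split(r"</?(?:div|td|p|li|article|section)[^>]*>", html)
        results = []
        for block in blocks:
            links = len(re.findall(r"<a[\s>]", block))
            words = re.sub(r"<[^>]+>", " ", block).split()
            if len(" ".join(words)) < min_text:
                continue                              # too short to be content
            if links / max(len(words), 1) > max_link_ratio:
                continue                              # looks like navigation
            results.append(" ".join(words))
        return results

Because nothing here depends on specific IDs or class names, shuffling the markup around doesn't affect it at all.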
I would be really interested in knowing which heuristics or machine learning techniques produced decent results. That is, if I can't convince you to open source the code. I'm working on the same problem at the moment.
We're fine with scrapers and scraping infrastructure, although tubes.io is a very interesting idea.
I'm more interested in what I can do to write fewer scrapers, since the content is, at a high level, relatively similar. I've just started experimenting with "generic" scrapers that try to extract the data without depending on markup. I think it will eventually work well enough, but getting the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.
There are a few papers on this, but not much else out there. That's why I was interested in someone else working on the same problem in a different space.
> Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.
These guys do a stellar job on the IP addresses: http://www.hidemyass.com/proxy-list -- the good thing is the data is available for an amazing price.
Other sites I have come across will use large images and CSS sprites to mask price data.
I write a lot of scrapers for fun, rarely profit, just for the buzz.
I bet you would only need to shuffle randomly between a few alternatives for each of them. It would take a dedicated effort to work that out, and the cache implications could be managed. There's no getting around the trade-off between the number of possible page alternatives and the caching nightmare, though, and doing this to JSON APIs would get ugly fast.
At least it's easier to code these tricks than to patch a scraper to get around them.
>The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.
The OP addresses that point. His contention is that there's a lot more pressure on the typical enterprise to keep its public-facing website in tip-top shape than there is to make sure whatever API it has defined keeps delivering results properly.
Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.
> Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.
I once had to maintain a (legal) scraper, and I can tell you there is no fun in making your scraper robust when the website maintainers are doing their best to keep you from scraping their site.
I've seen random class names and identifiers, switching of DIVs and SPANs (block display), and adding and removing SPANs to nest and un-nest elements. And so on.
Of course the site wants to keep its SEO, but most of the time it's easy to keep parts out of context for a scraper.
In most cases the site doesn't have an API, so we scrape and take the risk that the structure will change. One thing that helps is using tools that give you jQuery-like selectors, because they offer a lot of freedom and are very easy to write and update.
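BeautifulSoup's CSS selectors are one example of the style I mean (a quick sketch, assuming beautifulsoup4 is installed):

    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    html = ('<div class="listing">'
            '<span class="title">Widget</span>'
            '<span class="price">$9.99</span></div>')
    soup = BeautifulSoup(html, "html.parser")

    # One short selector per field; when the layout changes, these
    # selectors are usually the only thing you have to touch.
    for item in soup.select("div.listing"):
        print(item.select_one("span.title").text,
              item.select_one("span.price").text)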
This is indeed painful. I was scraping the Pirate Bay last year for a school project, their HTML would occasionally change in subtle ways that would break my scraper for hours or days until I noticed it.
Yes, the numbers for TechStars Cloud are accurate. If you're interested in learning more about the themed program, feel free to contact me and I'll answer any questions you have!
Just chatted with Max over Skype. He had some great suggestions that I'm going to start implementing right away; I would have gladly paid a small consultation fee for the session.
We totally understand the hesitation there. We're launching with the hosted direct-connection form, but we have tons of plans to specifically address this concern.
First, we're actively working on a database agent (like chart.io's, using reverse-SSH tunneling) for increased security and control.
Second, we have longer term plans to offer this platform as an appliance, so everything stays within your infrastructure. That's obviously a much larger project.
In the meantime, we have the advantage of being able to make constant improvements to your API while it stays hosted with us.
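For the curious, the reverse-tunnel idea from the first point boils down to something like this (a hypothetical sketch; host names and ports are invented, not our actual setup):

    import subprocess

    # The agent runs inside the customer's network and dials OUT to a
    # bastion we control, forwarding a port on the bastion back to the
    # local database. No inbound firewall holes are required, and
    # killing this process severs the link.
    subprocess.run([
        "ssh", "-N",                      # tunnel only, no remote command
        "-o", "ExitOnForwardFailure=yes",
        "-R", "15432:localhost:5432",     # bastion:15432 -> local Postgres
        "agent@bastion.example.com",      # hypothetical bastion host
    ])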
We definitely built the product with existing applications (without an API) in mind. We found it much more difficult to add an API after the fact than to design for one from the beginning. In fact, the idea came to us after attempting to build an API for a 10-year-old PHP application at Rackspace.