Hacker News
Does Google crawl dynamic content? (centrical.com)
165 points by pstadler on Dec 8, 2015 | 59 comments


We built our site, https://appapp.io (a search engine for the App Store), as a one-page app. The server sends no content in the initial HTML, so we were unsure to what extent Google would spider/index it.

As far as we can tell, it makes no difference compared to server-side generation: https://www.google.com/search?q=site%3Aappapp.io

So yes, Google definitely does index dynamic content. I would love to know if it ranks it equivalently.

Also, Bing does not: http://www.bing.com/search?q=site%3aappapp.io

(apologies for the minor self-promotion)


If you do a search with 'site:https://appapp.io' and go to the last page of results, you'll see Google has officially indexed only approximately 120 results. For example, this query does not return any content: 'site:https://appapp.io "Release notes for version 6.6.0"'. It should return the page /<lang>/app/we-heart-it/539124565.


Yes, we have more work to do to get Google to index all our content. It's still a bit of a mystery to us (Google Webmaster Tools tells us about 2500 pages are in their index).

Our goal is not to have every app indexed (as that will by definition be non-original content), but to have our app category pages indexed, e.g. https://appapp.io/gb/genre=Games;has_iap=false;price=Paid/se...


Moreover, note that in the SERP you can find some pages with the right title but the wrong description; see the first two results here, for example: http://imgur.com/cOV85UJ.

BTW, if you don't want every app to be indexed, I would recommend the tag <meta name="robots" content="noindex"> on the app pages. Alternatively, you could define the canonical URL as the URL of the original content.
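
For example, either of these in the page <head> should do it (the canonical target below is only an illustration, assuming the App Store listing counts as the original content):

    <!-- keep the page out of the index entirely -->
    <meta name="robots" content="noindex">

    <!-- or point crawlers at the original content instead -->
    <link rel="canonical" href="https://itunes.apple.com/app/id539124565">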


We have a ReactJS application served from S3 with dynamic content at https://teletext.io. All content (e.g. texts) is loaded asynchronously from a CloudFront distribution, and I can confirm this is indexed properly by Google (see https://www.google.com/search?q=site%3Ateletext.io).

The only thing we cannot seem to get right is the meta title and meta description. If you set those asynchronously based on the React page you are rendering, Google only seems to pick them up on about 10% of the pages. So the SERPs don't look as pretty as you would like. I haven't found a solution for that yet. :-(
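
For reference, this is roughly the kind of asynchronous update we mean (a minimal sketch; setPageMeta is our own illustrative helper, not a React API):

    // update <title> and <meta name="description"> after a client-side route change
    function setPageMeta(title, description) {
      document.title = title;
      var meta = document.querySelector('meta[name="description"]');
      if (!meta) {
        meta = document.createElement('meta');
        meta.name = 'description';
        document.head.appendChild(meta);
      }
      meta.content = description;
    }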


Thanks. We have a lot of optimisation to do. When we launched we weren't sure to what extent we'd be indexed.

Now that we know we are (by Google at least), we can put some focus on optimising our SERPs.


Maybe this clears up some of the mystery:

I recently gave a talk at GDG about our efforts to get our SPA ranking in the Google SERPs; maybe there are some enlightening parts in there :)

TL;DR: All possible, same rankings, some caveats though.

GDG DevFest 2015 - We can't use Angular. It will hurt our SEO. [video]

https://www.youtube.com/watch?v=Sp1pfC1M7Dg


We found the same thing with Angular. Ironic and annoying.


I suppose this explains all the times I've seen a promising search result with the words I was searching for prominently highlighted, then visited the page to find what I was looking for is no longer there. Sometimes the cached, text-only version has it, and sometimes not. Alternatively, I'll see search results with none of the words I was searching for, yet perhaps they did sometime in the past. Rather annoying.


This is a problem with a lot of paginated sites, such as Tumblr, various forums, and comment pages. Anything that's ordered from newest to oldest won't have a constant correspondence between URL and content.


That's why pagination (when newest to oldest) should be designed as something like "after_id=x". Sure, there is some implementation complexity, but your users will love you if they can actually find the content they searched for.
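
A minimal sketch of the idea, in plain JavaScript over an in-memory array (all names are illustrative):

    // cursor-based pagination: return the `limit` posts older than afterId.
    // Since ids are monotonic and posts immutable, /posts?after_id=123
    // keeps returning the same slice even as new posts arrive.
    function postsAfter(posts, afterId, limit) {
      return posts
        .filter(function (p) { return p.id < (afterId || Infinity); })
        .sort(function (a, b) { return b.id - a.id; })
        .slice(0, limit || 10);
    }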


What about the frontpage which is usually page 1 of the pagination?

Google should give the actual article URL a higher score and pagination pages a lower score, so that in the search results I see the content first and the "dupe content" on pagination pages not at all (or way down), at least for common blog software.


What about a redirect from your home page to whatever page 1 is at the moment?


Imagine a blog. On the front page, example.com/ (= example.com/?page=1), it shows the newest 10 articles; on example.com/?page=2 it shows the next 10 articles, and so on. Every article has its actual URL in its headline hyperlink (e.g. example.com/?article=123).

Now imagine that Google links to example.com/?page=2 because it found the search phrase there too (at some moment only Google knows). When the user clicks the search result link that leads to example.com/?page=2 now, how should the blog software know what Google saw or what the user wants?

One thing that comes to mind is to use the referrer: if it's a common search engine, parse the search term out of the query string (e.g. s=SEARCHTERM) and use an internal article search to find the best-matching article.
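
Client-side, that could look something like this (a rough sketch; it assumes the referrer still exposes the query, which search engines increasingly strip, and findBestMatchingArticle is a hypothetical internal search):

    // if the visitor arrived from a search engine, send them to the
    // article that best matches the term they searched for
    var ref = document.referrer;
    if (/\b(google|bing|duckduckgo)\./.test(ref)) {
      var m = ref.match(/[?&](?:q|s)=([^&]+)/); // q= or s=, engine-dependent
      if (m) {
        var term = decodeURIComponent(m[1].replace(/\+/g, ' '));
        var article = findBestMatchingArticle(term); // hypothetical
        if (article) location.replace(article.url);
      }
    }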


I think you missed the part earlier in this thread about using ?after=id rather than ?page=n. See Reddit for an example.


You can see it here on HN as well. Click "new" at the top of this page, then "More", and see the "next=" parameter in the URL.


I am of the opinion that, for text, pagination should be avoided, except if the only aim is to increase page views for advertising purposes.


I have modified Wikipedia pages, then googled them, and seen the search results updated "instantly".

Also, sneaky websites often serve different content to the Googlebot user agent than to a non-Google Firefox user agent:

https://en.wikipedia.org/wiki/User_agent

https://addons.mozilla.org/en-GB/firefox/search/?q=user+agen...
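
Server-side, the trick looks roughly like this (an Express-style sketch; renderFullPage and spaShell are hypothetical, and serving crawlers different content like this is exactly what gets sites penalized):

    var express = require('express');
    var app = express();

    // naive cloaking: pre-rendered content for Googlebot,
    // the bare SPA shell for everyone else
    app.get('*', function (req, res) {
      var ua = req.headers['user-agent'] || '';
      res.send(/Googlebot/i.test(ua)
        ? renderFullPage(req.path) // hypothetical pre-renderer
        : spaShell());             // hypothetical empty shell
    });
    app.listen(8080);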


Google deliberately crawls in a non-Google-looking way to try to detect "masking".


Yeah, that expert sex change site used to bug the hell out of me for that reason... scroll forever and a day after loading.


I'd really love it if you repeated the same tests for Bing, just to get coverage. (Yahoo/Baidu would be the other big two.) Historically, Bing hasn't used fully functional headless browsers to crawl, which has limited its ability to index dynamic content like this.

Google has "only" 70% market share, so it seems irresponsible to make engineering decisions without testing the others. Google+Bing+Yahoo+Baidu get you to 98%.


I just tried this with the article's page. The page is indexed: when I search for the term "Update this was posted to Google on Friday the 17th of July, 2015. Monday, the 20th", the page is shown.

Trying to find any of the other search strings in the article, for the different loading variants, does not return any results. So no variant of JavaScript-injected content currently works on Bing.


"Originally, none of the actual web crawling and data housing was done by Yahoo! itself. In 2001, the searchable index was powered by Inktomi and later was powered by Google until 2004, when Yahoo! Search became independent. On July 29, 2009, Microsoft and Yahoo! announced a deal in which Bing would henceforth power Yahoo! Search."

https://en.wikipedia.org/wiki/Yahoo!_Search


The post author writes:

> So, very soon, the days of pre-rendering PhantomJs snapshots and serving shadow content to spiders will be over.

To be clear: webmasters of sites with dynamic content should not celebrate yet. There are still influential spiders other than Google's that do not parse JavaScript (for example, Facebook[1] and Twitter[2]).

[1] https://developers.facebook.com/docs/sharing/webmasters/craw...

[2] Can't find an official statement on this, but https://twittercommunity.com/search?q=javascript%20crawl


And, of course, there are still all the other reasons why you shouldn't be serving static text via JavaScript; I wish the article had included such a caveat.


I'm curious whether Google strongly penalizes SPAs for being slow to load.

The content may be indexed, but if your visitors are on a mobile network, that initial visit (or a visit with a stale cache) is going to be crappy. It's great that Google can read the content (though Bing cannot), but if it's buried on page two, does it even matter?

As a proponent of web perf, these kinds of articles make me worried that server-side rendering will be ignored because "SEO works now for JavaScript", even though it's slow and Google is only 70% of desktop and 80% of mobile search.


SPAs should not be slow. If they are, they haven't been designed properly, I think.


My theory is that the Google crawler is a modified, headless version of Chrome. These results seem consistent with that hypothesis.


Are they also using people's browsing history to 'find' content? E.g. from their safety filter?

Though I don't think it's happening, I've thought it would be very clever if users became the search spider for Google, telling them when content had gone stale and/or doing the spidering on Google's behalf, just by using Google's browser.


I can confirm that is not the case. I've had a canary page for that purpose set up for years and it has never fired. If it ever does, you can expect a blog post. I have another one that is set up to fire if Google ever uses Gmailed links to crawl; that one has never fired either.

Now, that's only one bit of data but if you want to be sure you can set up a trigger page of your own.


Have you tried accessing the site through Google DNS? 8.8.8.8 and 8.8.4.4.


No, I haven't. Just Chrome for the one and Gmail for the other.

The way it works is simple: I made a random URL and stuck a script in there that sends an email with the URL as the subject header. The first one I visited using Chrome; the second one I mailed myself a link to, from another email account to a Gmail account. Both scripts fired when Chrome activated the links, so I know they work; then I simply let it rest.
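
Roughly like this, as a Node sketch (the original is just any server-side script that emails on a hit; the path and the logging here are placeholders):

    // canary: this URL is never published anywhere except via Chrome or
    // a Gmail link, so any request to it means the URL leaked to a crawler
    var http = require('http');
    http.createServer(function (req, res) {
      if (req.url === '/some-long-random-secret-token') {
        // in practice: email yourself, with req.url as the subject
        console.log('canary hit:', new Date(), req.headers['user-agent']);
      }
      res.end('ok');
    }).listen(8080);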

I guess the Gmail one would require re-crawling of all Gmail for it to fire, and the Chrome-only one would depend on the version of Chrome that I ran the test on phoning home with the link; I did not re-try this for every version of Chrome (or on every platform). So it's not a perfect method, but it definitely gives the lie to Chrome or Gmail data being directly used to power Google search results that people would expect to remain confidential because only they have the URLs. Score one for obscurity, and nice of Google to ignore these paths.

I could set up another URL for a test using the DNS, but you're free to do so yourself as well, of course; it is definitely an interesting idea.

And if either of those scripts ever does fire, I'll rip Google a new one; that would be the sort of abuse of trust that gets my temperature up. But for now I'll take the fact that it hasn't happened as proof that Google can be trusted with data to some extent.


DNS only sees the host name, so it can't be used to see what URLs are being accessed.


I think that was the test case he intended: set up a domain that is otherwise unknown, then use Google's DNS and see if the domain is hit by the search engine.


We (kingandmcgaw.com) have seen plenty of examples suggesting Googlebot tries to discover pages that might exist (as distinct from crawling interlinked pages).

This includes navigating the URL structure (on finding blah.com/one/two/three.html there may be attempts on /one/two and /one) and hitting password-protected admin pages which are not linked from anywhere (we suspect Analytics or toolbars are telling Google these pages exist). As a result, Googlebot generates a bunch of false positives in our error logs.


I think they use their DNS service to learn about new domains. It could be interesting if it were used to discover new content too, but I don't think it is.


They don't really need to; Verisign (for .com) and ICANN (for gTLDs) provide access to the zone files.


Depends on why they'd want to know about "new" domains. For indexing content, it would certainly make sense to index stuff people (appear to be) read(ing). Just because someone registers a domain, doesn't mean it has any (interesting) content.


+1

Also, maybe it's complementary to the headed version that lots of people use, which reports back: millions of people requesting and rendering HTML would be a kind of distributed indexer too.


Probably also one of the reasons to start the Chrome project, though their headless variant isn't open source.

Are there other headless browsers besides PhantomJS? PhantomJS is based on WebKit (Safari's engine).


Chromium is Open Source and embeddable, so my guess is a big part of why their headless variant isn't open source is simply that there isn't really much to it.


From a naive point of view it's easy. But it has to scale, and launching a new process for every request only works for a pet project.

So a crawler based on a headless browser that consumes little memory and runs for weeks is a major achievement.


Why doesn't that scale? You might be underestimating Google's resources.



Crawling JS content, yes. But does it rank for that content in the same way it would if the whole document was generated on the server?


I wonder if you could use this to find information about the Google crawler. Inject system and browser info into the page; then you can find out what kind of browser engine it runs, with which settings, etc. If you wanted, you could use this information to do undetectable masking (I don't think it would work in the long run, though).
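
For instance, something along these lines (a minimal sketch; the probe string is made up, and the idea is to search for it later and read back the crawler's environment):

    // write the crawler's own environment into the page so it gets
    // indexed, then recover it with a site: search for "crawler-probe"
    document.body.appendChild(document.createTextNode(
      'crawler-probe ' + navigator.userAgent +
      ' ' + screen.width + 'x' + screen.height +
      ' ' + navigator.language
    ));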

It would also be interesting to see what timeouts it allows. I wouldn't be surprised if the modified browser "virtualizes" time and runs window.setTimeout immediately. Maybe you could make a busy loop and find out what the real timeouts are. It seems there have got to be some; otherwise this would open a way to DoS the crawler (not that I'd do that).


Google may be indexing dynamic content now, but the question I'm curious about is how it affects crawl efficiency. I can't imagine indexing JS content is as efficient as indexing content returned from the original HTTP request.


They just wait for the page to settle, and once it's done (or after some timeout) they take whatever is currently in the DOM as the actual page. This is then sent back, analysed, and parsed as the content, similar to how PhantomJS does page rendering with a timeout.
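
The PhantomJS pattern being referred to looks roughly like this (a minimal sketch; the two-second settle delay is arbitrary):

    // render-after-settle.js -- run with: phantomjs render-after-settle.js
    var page = require('webpage').create();
    page.open('https://example.com/spa', function (status) {
      // give asynchronously loaded content some time to land in the DOM
      setTimeout(function () {
        var html = page.evaluate(function () {
          return document.documentElement.outerHTML;
        });
        console.log(html); // the "actual page" a crawler would index
        phantom.exit();
      }, 2000);
    });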


Related comment from an HNer who worked on this at Google (from 2006 to 2010): https://news.ycombinator.com/item?id=9531344


Regarding SPA-based websites: as long as your site has only a few pages, these results are relevant. I would like to see the same kind of test on a site with 1000+ pages, for example. I did this kind of test in the past and it failed miserably (i.e. only a dozen pages were correctly indexed).


The next test could be: does Google crawl hidden text (display:none, very small, or transparently colored text)? My guess is they do crawl it, because it can have legitimate uses, but if there is too much of it on a page then they give the page a lower ranking.
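
A test page for that could inject probes like these (a sketch; each variant carries a unique marker string to hunt for later with a site: query):

    // three classic hidden-text variants, each with its own marker
    [['display:none', 'probe-displaynone'],
     ['font-size:1px', 'probe-tinytext'],
     ['color:#fff;background:#fff', 'probe-invisible']]
      .forEach(function (pair) {
        var p = document.createElement('p');
        p.setAttribute('style', pair[0]);
        p.textContent = pair[1];
        document.body.appendChild(p);
      });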


Hidden text can be problematic, as it's often a form of cloaking; the same goes for hiding text off the page using CSS positioning.



I am sure that Google "discovers" JavaScript/AJAX content. They also mention this in their guides several times.

But are there any experiments/results related to SEO impact, crawl frequency, etc.?


Off-topic: "Google search results on tablet"

Recently Google changed their search result page for tablets. At first it looked fine, and useful.

But often the first result page is now completely full of advertisements; only the second page shows the usual links to websites like GitHub, Wikipedia, YouTube, etc. for a common search term. Very annoying! And the YouTube link is broken on iPad (it tries to link to a non-HTTP address). Am I just unlucky to be part of an A/B test?

A news article about the changes: http://searchengineland.com/google-launches-new-search-resul...


centrical.com is blocked where I work by McAfee Web Gateway, due to GTI reputation identifying it as malicious and high risk.


Ha! Interesting. It has been my personal website since 2001 or so, and has never been compromised. It is a flat-file website in S3 with Cloudflare on top - not much to hack. I think McAfee is a tiny bit too strict ;-)


IIRC this is why they originally started development on what is now Chrome.


TL;DR: Yes, most of the time.



