Search Engine Optimisation – Google

At the DSpace user group in Gothenburg, Rob Tansley (Google) gave a talk about search engines and their application to DSpace.

Background on Search Engines.

Search engines crawl pages by following links, and index the pages they crawl.

A search engine needs to do 3 things:

1. Discover the site (DSpace instance)

2. Index all the pages (items and bitstreams)

3. Retrieve enough information to judge relevance and display effective snippets.

Google Scholar differs only in step 3.

The first thing to do is sign up for each engine's webmaster tools:

Bing http://www.bing.com/webmaster
Google https://www.google.com/webmasters/tools
Yahoo https://siteexplorer.search.yahoo.com/mysites

Usually you have to leave a file on the DSpace site to prove you own the domain. It's different for DSpace than for simple web sites, and requires updating the install after dropping your file into the correct place:

JSPUI: Drop it in the webapp directory (alongside robots.txt) and update the install
XMLUI: Drop it in webapp/static, add a line to sitemap.xmap and update the install

To add analytics:

XMLUI: Simple config parameter
JSPUI: Add analytics code to your footer-default.jsp

Upgrade! Later versions of DSpace have significant improvements for indexing by search engines.

Site discovery issue: DSpace may have multiple URLs.

Choose one preferred access URL, including a choice of http or https
Ensure other URLs respond with a 301 Moved Permanently redirect
Ensure Handles redirect to the preferred URL

This can be done with Apache/Tomcat config.

(Login can still be https; it is not as important to have these on a single URL.)
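The redirect rules above can be spot-checked once you have collected the status code and Location header for each alternate URL. The following is only a minimal sketch: the function name, the sample URLs and the sample responses are all made up for illustration, and fetching the responses (for example with curl or a browser's developer tools) is left out.

```python
from urllib.parse import urlsplit


def is_permanent_redirect_to(status, location, preferred_base):
    """Return True if a response is a 301 pointing at the preferred scheme and host."""
    if status != 301 or not location:
        return False
    target = urlsplit(location)
    preferred = urlsplit(preferred_base)
    return (target.scheme, target.netloc.lower()) == (
        preferred.scheme,
        preferred.netloc.lower(),
    )


# Hypothetical (status, Location) pairs observed for alternate URLs of one repository.
responses = {
    "http://www.dspace.foo.edu/handle/123/456": (301, "http://dspace.foo.edu/handle/123/456"),
    "https://dspace.foo.edu/handle/123/456": (302, "http://dspace.foo.edu/handle/123/456"),
    "http://library.foo.edu/dspace/": (200, None),
}

results = {
    url: is_permanent_redirect_to(status, location, "http://dspace.foo.edu")
    for url, (status, location) in responses.items()
}
for url, ok in results.items():
    print(("OK   " if ok else "FIX  ") + url)
```

Here only the first URL passes: the second uses a temporary 302 rather than a 301, and the third serves content directly instead of redirecting.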

It is easier to manage if DSpace has its own domain, with its own robots.txt and configuration. This doesn't impact discoverability, but it does let you do things such as a Google Custom Search Engine for a single domain.

How to verify it's being discovered?

Search for site:url (e.g. site:dspace.mit.edu)
This works on Google, Bing, and Yahoo
Google Scholar does not support this; it's best to search for a complete title.

Indexing

In terms of indexing items, engines use standard link-following web crawlers.

They don't use OAI-PMH, for some key reasons:

Usually minimal metadata
Usually no access to full text
No predictable relationship between an OAI URL and the DSpace URL
Often no link to the item itself
Only a very small minority of sites use OAI-PMH

The most important thing is robots.txt

If in doubt, don’t block! – 1.5 and 1.5.1 ship with a bad robots.txt file

Look for this and remove it!

disallow: /browse

Note: robots.txt has to be at the top level of a domain, e.g. http://dspace.foo.edu/robots.txt
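One way to see the effect of that line is to run a robots.txt through Python's standard urllib.robotparser and ask which paths a crawler would be refused. This is a rough sketch only: the two robots.txt bodies and the sample paths are invented for illustration and are not the exact files shipped with DSpace.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, user_agent="*"):
    """Return the paths that the given robots.txt text forbids for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


# Illustrative file containing the problematic rule.
bad_robots = """User-agent: *
Disallow: /browse
"""

# Illustrative file that blocks nothing (an empty Disallow allows everything).
good_robots = """User-agent: *
Disallow:
"""

# Hypothetical DSpace-style paths a crawler might request.
paths = [
    "/browse?type=title",
    "/handle/123/456",
    "/bitstream/123/456/1/thesis.pdf",
]

print("bad: ", blocked_paths(bad_robots, paths))
print("good:", blocked_paths(good_robots, paths))
```

With the bad file, the browse page is refused, which matters if the browse pages are the crawler's only route to the items.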

A good way to check your site is via a text-only browser. View your DSpace site with the Lynx text-only browser from outside your network. This helps check that a search engine can effectively navigate the site and its bitstreams.

Sitemaps

As of DSpace 1.5, sitemaps are supported. They present pages purely for search engine consumption: a "browse UI" for web crawlers. A sitemap is a static file that makes it easy to find new content, and it is a very cheap and good way to keep server load down. DSpace supports both types of sitemap: a simple HTML sitemap and the sitemaps.org protocol.

The HTML sitemap works for all search engines. It is usually generated by a cron job once a day. The front page must link to the htmlmap, and you can also submit this htmlmap using webmaster tools. This is optional. To verify: search for site:url sitemap

The sitemaps.org protocol is an XML-based format. Instead of an HTML link, you add a sitemap link to your robots.txt file, then submit it to the search engines. Add each engine's update URL to dspace.cfg to prompt search engines to re-crawl. You need to check the Apache logs to see whether the sitemap URL has been read. (Thoughts: maybe it would be a nice idea to have this logged in the DSpace admin area.)
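For readers who have not seen the sitemaps.org format, here is a small sketch of what such a file contains and the robots.txt line that points to it. DSpace generates its own sitemaps, so this is only to show the shape of the data; the item URLs, dates and sitemap address are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemaps.org XML from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical items and modification dates.
entries = [
    ("http://dspace.foo.edu/handle/123/456", "2010-05-01"),
    ("http://dspace.foo.edu/handle/123/789", "2010-05-03"),
]
sitemap_xml = build_sitemap(entries)

# The robots.txt line that tells crawlers where the sitemap lives
# (the address itself is an assumption for this example).
robots_line = "Sitemap: " + "http://dspace.foo.edu/sitemap"

print(sitemap_xml)
print(robots_line)
```

Because each entry carries a last-modified date, a crawler can find new and changed items from one cheap static file instead of walking every browse page.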

Returning useful data and your ranking.

Search engines need access to the full text, which matters much more to them than metadata. They use it to judge relevance and ranking and to create useful result snippets. Google Scholar also uses it for citation analysis.

For restricted content, consider allowing search engine IPs to access the items, using a search engine group with IP authentication. Add <meta name="robots" content="noarchive"> so that engines do not serve a cached copy.
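The idea behind an IP-authenticated search engine group can be sketched in a few lines: a request is treated as a group member if its address falls inside one of the configured ranges. This is not how DSpace implements it; the function name is made up, and the ranges below are documentation-only address blocks, not real search engine IPs.

```python
import ipaddress

# Hypothetical crawler ranges (reserved documentation blocks, NOT real crawler addresses).
CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def in_crawler_group(remote_addr):
    """Return True if remote_addr belongs to one of the configured crawler ranges."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in CRAWLER_NETWORKS)


for candidate in ["192.0.2.17", "198.51.100.5", "2001:db8::1"]:
    print(candidate, in_crawler_group(candidate))
```

Real ranges would need to come from each search engine and be kept up to date, since crawler addresses change.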

Content of Bitstreams

Make sure search engines can read and interpret contents of bitstreams.

For documents, ensure PDFs contain text and not just images
The fewer files the better: avoid one file per chapter, which makes citation analysis harder and splits ranking across chapters
Word is OK, text is best.

Ensure the abstract is descriptive for non-document items; this is far more valuable than other metadata fields.

Some truths

Often users who reach your site by hitting the full text directly won't be able to navigate to your DSpace instance; they just see the file contents. There is no easy way to stop this. Thoughts: perhaps look at an approach like Moodle's file.php for delivering files, to keep the framework around them.

Handles don't fit into the search engines' approach to the web. Links to handles may not improve ranking as much as links to the item's splash page. There is no real way to handle this.

Metadata in HTML Headers

DSpace 1.5+ supports including the metadata in the HTML <head> of each item, along with links to the full text. This lets search engines (especially Google Scholar) parse metadata despite layout changes. If you have customised the registry, ensure you update the mappings in configs/xhtml-head-item.properties. Make sure any customisations you use don't leave these headers out.
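To check that a customised theme still emits these headers, you can list the meta tags found inside the <head> of a saved item page. A minimal sketch using Python's standard html.parser follows; the sample page and the field names in it (DC.title, DC.creator) are illustrative assumptions, so compare against what your own item pages actually contain.

```python
from html.parser import HTMLParser


class HeadMetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.meta.append((attributes["name"], attributes["content"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# Hypothetical item page; a real one would be saved from your own repository.
sample_page = """<html><head>
<title>An Example Thesis</title>
<meta name="DC.title" content="An Example Thesis" />
<meta name="DC.creator" content="Doe, Jane" />
</head><body><meta name="ignored" content="x" /></body></html>"""

collector = HeadMetaCollector()
collector.feed(sample_page)
for name, content in collector.meta:
    print(name, "=", content)
```

An empty result for a real item page would suggest the customisation has dropped the metadata headers.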

Lightening Server Load

The If-Modified-Since header is supported in DSpace 1.4 or later, so crawlers do not have to retrieve unchanged content.
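The general HTTP logic behind this saving is simple: if the item has not changed since the date the crawler sends, the server answers 304 Not Modified with no body. The sketch below shows that comparison only; it is not DSpace's own code, and the function name and dates are made up.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def status_for(if_modified_since, last_modified):
    """Return 304 if the item is unchanged since the crawler's copy, else 200."""
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to sending the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # treat zone-less dates as UTC
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    return 304 if last_modified.replace(microsecond=0) <= since else 200


# Hypothetical item last changed on 1 May 2010 (timezone-aware, UTC).
item_changed = datetime(2010, 5, 1, 12, 0, tzinfo=timezone.utc)
newer_copy = format_datetime(datetime(2010, 6, 1, tzinfo=timezone.utc), usegmt=True)
older_copy = format_datetime(datetime(2010, 4, 1, tzinfo=timezone.utc), usegmt=True)

print(status_for(newer_copy, item_changed))  # crawler's copy is newer than the change
print(status_for(older_copy, item_changed))  # item changed after the crawler's copy
print(status_for(None, item_changed))        # no header sent
```

Only the first request can be answered without resending the page, which is where the reduction in server load comes from.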

Use sitemaps, which prevent crawlers from hitting all the browse pages.

Careful crafting of robots.txt:

Block /browse in robots.txt once sitemaps are working
Check back 1 week and 1 month later to ensure updated items have been indexed
