Search Appliance Replacement

The Google Search Appliance (GSA) currently performs two main functions: web search, and data integration or transformation. The GSA will be decommissioned before March 2019, so a replacement solution must be implemented before the end of 2018 to allow time for testing and quality assurance.

The GSA is currently central to many areas of our website (course finder, news, events, staff profiles, vacancies, etc.). When the GSA stops, we will lose not only the search function; these other areas will also cease to function.

The replacement service must do web search well and provide data integration options. While the University has other solutions in the data integration space, our preference is for the search replacement to do what the GSA is currently doing, if only to reduce dependencies and implementation issues.

Functional Requirements

Web search

Must

Index all content on multiple websites (on one or more domains, on different CMSs) and allow us to control the index schedule

  • We currently have approximately 140 sites, 30 domains and 10 CMSs

Perform well at our size, meaning results are returned to the user quickly

  • We currently have approximately 200,000 URLs and 110,000 documents

Search results default to the widest setting (i.e. all sites) but can be restricted to one or more sites as required

Relevancy calculation must be sound and should be transparent, and we must have control over it

Results output available for non-web use, i.e. in different formats (e.g. XML, JSON)

Security: Results must reflect the user's role/permissions as stored in Active Directory (e.g. results from the intranet are shown only to staff)

Analytics on search usage, search terms, etc.

Reporting designed to help us improve web content and search results

Technology/architecture fit with our existing infrastructure and capabilities

  • We are most likely to move our primary CMS to a hosted model within the next 12 months, but there will always be other CMSs and searchable resources that remain on-site
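The site-restriction and security requirements above can be illustrated with a minimal sketch. It is not tied to any particular search product: the Result class, the visible_results function, the example.ac.nz URLs and the "staff" group name are all hypothetical, and a real deployment would resolve group membership from Active Directory rather than pass it in as a list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One search hit (hypothetical structure, for illustration only)."""
    url: str
    site: str                   # site the page belongs to
    allowed_groups: frozenset   # groups allowed to see it; empty means public


def visible_results(results, user_groups, sites=None):
    """Return the hits a user may see, optionally restricted to some sites.

    sites=None means the widest setting (all sites).
    """
    user_groups = set(user_groups)
    visible = []
    for r in results:
        # Site restriction: skip hits outside the requested sites.
        if sites is not None and r.site not in sites:
            continue
        # Security trimming: restricted hits need a shared group.
        if r.allowed_groups and not (r.allowed_groups & user_groups):
            continue
        visible.append(r)
    return visible


# Made-up sample data.
RESULTS = [
    Result("https://www.example.ac.nz/courses", "www", frozenset()),
    Result("https://intranet.example.ac.nz/hr", "intranet", frozenset({"staff"})),
    Result("https://library.example.ac.nz/hours", "library", frozenset()),
]
```

For example, visible_results(RESULTS, []) omits the intranet hit for an anonymous user, while visible_results(RESULTS, ["staff"], sites={"intranet"}) returns only the intranet hit for a staff member.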

Should

Index file systems (e.g. via Samba, NFS)

Index repositories (e.g. SharePoint)

Index content in other languages

Synonyms, both built-in and the ability to load our own

Autocompletion and wildcard support

Spelling suggestions

Spelling corrections

Personalisation of search results and self-learning

Result snippets that are dynamic and relevant

Accessibility reporting

Could

Enterprise search (not unpacked here due to the low priority placed on this at this time)

Data integration

Must

Connect to databases - including SQL databases

Index database content as 'documents'

Provide ability to create and manage 'collections' of indexed documents 

Present outputs in XML and JSON formats
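As a rough illustration of the data integration requirements above, the sketch below reads rows from a SQL database, treats each row as a 'document' in a named collection, and presents the result as JSON and XML. It is an assumption-laden example, not a description of how the GSA or any candidate product works: an in-memory SQLite database stands in for the real database server, and the table, column names, course data and function names are hypothetical.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_documents(conn, query, collection):
    """Turn each row returned by the query into a 'document' in a collection."""
    cur = conn.execute(query)
    columns = [col[0] for col in cur.description]
    return [
        {"collection": collection, "fields": dict(zip(columns, row))}
        for row in cur
    ]


def to_json(docs):
    """Serialise the documents as JSON."""
    return json.dumps(docs, indent=2)


def to_xml(docs):
    """Serialise the documents as XML, one element per field.

    Assumes column names are valid XML element names.
    """
    root = ET.Element("documents")
    for doc in docs:
        el = ET.SubElement(root, "document", collection=doc["collection"])
        for name, value in doc["fields"].items():
            ET.SubElement(el, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Stand-in database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (code TEXT, title TEXT)")
conn.execute("INSERT INTO courses VALUES ('ABCD101', 'Example course')")

docs = rows_to_documents(conn, "SELECT code, title FROM courses", "courses")
json_output = to_json(docs)
xml_output = to_xml(docs)
```

In a sketch like this, a scheduled job would rerun the query at each sync interval and rebuild the collection.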


Our current integration points

Data set    Database server              Sync schedule   Number of documents
Courses     VUWWINCOSQLHWB1.vuw.ac.nz    24 hrs          ~3200
Staff       vuwwincosqlhwb1.vuw.ac.nz    24 hrs          ~2800
Vacancies   192.168.244.40               hourly          ~30



References

http://www.searchtools.com/guide/index.html: An excellent 'concentrated' read on how search works and how to do it well.


https://blog.liip.ch/archive/2011/01/13/why-a-project-switched-from-google-search-appliance-to-zend_lucene.html: Honest, open sharing of the learnings

https://bigwisdom.quora.com/What%E2%80%99s-The-Best-Alternative-To-Replace-Google-Search-Appliance-GSA: A great first read to orient yourself in the problem space

http://www.yippyinc.com/insights/3-factors-to-choose-your-google-search-appliance-gsa-replacement

http://www.kmworld.com/Articles/Editorial/ViewPoints/Top-5-Criteria-for-Replacing-Your-Google-Search-Appliance-112299.aspx

http://blog.leeromero.org/2008/10/01/categories-of-search-requirements/


Database connectors

https://support.google.com/gsa/answer/4363201

