Search Appliance Replacement
The Google Search Appliance (GSA) currently performs two main functions: web search, and data integration or transformation. The GSA will be decommissioned before March 2019, so a replacement solution must be implemented before the end of 2018 to allow time for testing and quality assurance.
The GSA is currently central to many areas of our website (course finder, news, events, staff profiles, vacancies, etc.). When the GSA stops, we will lose not only the search function; these other areas will also cease to function.
The replacement service must do web search well and provide data integration options. While the University has other solutions in the data integration space, our preference is for the search replacement to do what the GSA is currently doing, if only to reduce dependencies and implementation issues.
Functional Requirements
Web search
Must
Index all content on multiple websites (on one or more domains, on different CMSs) and allow us to control the index schedule
- We currently have approximately 140 sites, 30 domains and 10 CMSs
Perform well at our size, meaning results are returned to the user quickly
- We currently have approximately 200,000 URLs and 110,000 documents
Search results default to the widest setting (i.e. all sites) but can be restricted to one or more sites as required
Relevancy calculation must be sound and should be transparent, and we must have 'control' over it
Results output available for non-web use, i.e. in different formats (e.g. XML, JSON)
Security: Results must be related to user's role/permissions (e.g. results from the intranet are only shown to staff), as stored in Active Directory
Analytics on search usage, search terms, etc.
Reporting designed to help us improve web content and search results
Technology/architecture fit with our existing infrastructure and capabilities
- We are most likely to move our primary CMS to a hosted model within the next 12 months, but there will always be other CMSs and searchable resources that remain on-site
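The site-restriction and security requirements above can be illustrated with a minimal sketch. The `Result` type, the group names and the sample URLs are hypothetical and chosen only for illustration; a real replacement would read group membership from Active Directory and would most likely apply this filtering inside the search engine rather than in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Result:
    url: str
    site: str                       # hypothetical site identifier
    allowed_groups: FrozenSet[str]  # empty set means the result is public


def visible_results(
    results: Iterable[Result],
    user_groups: Set[str],
    sites: Optional[Set[str]] = None,
) -> List[Result]:
    """Return results the user may see, optionally restricted to some sites.

    sites=None is the widest setting (all sites).
    """
    visible = []
    for result in results:
        if sites is not None and result.site not in sites:
            continue  # outside the requested site restriction
        if result.allowed_groups and not (result.allowed_groups & user_groups):
            continue  # protected result and the user is in none of its groups
        visible.append(result)
    return visible


# Hypothetical sample data: one public page and one staff-only intranet page.
SAMPLE = [
    Result("https://www.example.ac.nz/courses", "www", frozenset()),
    Result("https://intranet.example.ac.nz/hr", "intranet", frozenset({"staff"})),
]
```

With this sample, an anonymous user would see only the public page, while a user in the assumed `staff` group would see both, or only the intranet page if `sites={"intranet"}` is passed.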
Should
Index file systems (i.e. support for Samba, NFS, etc.)
Index repositories (e.g. SharePoint)
Index content in other languages
Synonyms, both built-in and the ability to load our own
Autocompletion and wildcard support
Spelling suggestions
Spelling corrections
Personalisation of search results and self-learning
Result snippets that are dynamic and relevant
Accessibility reporting
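As a rough illustration of what the synonym, autocompletion and spelling items in this list mean in practice, the sketch below implements toy versions of each using only the Python standard library. The synonym map and vocabulary are invented for the example, and the similarity cutoff is an arbitrary assumption; a real search product would derive these from the index and its own language models.

```python
import difflib
from typing import Dict, List, Optional

# Hypothetical custom synonym list of the kind we would want to load ourselves.
SYNONYMS: Dict[str, List[str]] = {"paper": ["course"], "job": ["vacancy"]}

# Hypothetical vocabulary; a real engine would build this from indexed content.
VOCABULARY: List[str] = ["course", "courses", "vacancy", "vacancies", "staff", "events"]


def expand_query(terms: List[str]) -> List[str]:
    """Add any configured synonyms after each query term."""
    expanded: List[str] = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def autocomplete(prefix: str, limit: int = 5) -> List[str]:
    """Return vocabulary words starting with the prefix, in alphabetical order."""
    prefix = prefix.lower()
    return sorted(word for word in VOCABULARY if word.startswith(prefix))[:limit]


def suggest_spelling(word: str) -> Optional[str]:
    """Suggest the closest vocabulary word, or None if the word is already known."""
    if word in VOCABULARY:
        return None
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else None
```

The difference between a spelling suggestion and a spelling correction is then a matter of what the interface does with the return value: show it as "did you mean", or silently search for it instead.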
Could
Enterprise search (not unpacked here due to the low priority placed on this at this time)
Data integration
Must
Connect to databases - including SQL databases
Index database content as 'documents'
Provide ability to create and manage 'collections' of indexed documents
Present outputs in XML and JSON formats
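The data integration requirements above amount to: read rows from a SQL database, treat each row as a 'document' in a named collection, and serve the collection as XML or JSON. The following is a minimal sketch of that flow, using an in-memory SQLite database as a stand-in for our real database servers. The table, column names, sample row and output element names are all assumptions made for the example, not a description of the GSA's actual feed format.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def fetch_documents(conn: sqlite3.Connection, query: str) -> List[Dict[str, Any]]:
    """Run a query and return each row as a dict, i.e. one 'document' per row."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query).fetchall()]


def to_json(collection: str, documents: List[Dict[str, Any]]) -> str:
    """Serialise a collection of documents as JSON."""
    return json.dumps({"collection": collection, "documents": documents}, sort_keys=True)


def to_xml(collection: str, documents: List[Dict[str, Any]]) -> str:
    """Serialise a collection of documents as XML, one element per field."""
    root = ET.Element("collection", name=collection)
    for document in documents:
        doc_el = ET.SubElement(root, "document")
        for field, value in document.items():
            ET.SubElement(doc_el, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Stand-in database with one hypothetical course record.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (code TEXT, title TEXT)")
conn.execute("INSERT INTO courses VALUES (?, ?)", ("COMP101", "Intro to Computing"))
docs = fetch_documents(conn, "SELECT code, title FROM courses")
```

Calling `to_json("courses", docs)` and `to_xml("courses", docs)` then yields the same collection in both formats, which is the behaviour the course finder and similar pages currently depend on.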
Our current integration points
Data set | Database server | Sync schedule | Number of documents |
---|---|---|---|
Courses | vuwwincosqlhwb1.vuw.ac.nz | 24 hrs | ~3200 |
Staff | vuwwincosqlhwb1.vuw.ac.nz | 24 hrs | ~2800 |
Vacancies | 192.168.244.40 | hourly | ~30 |
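To make the sync schedules in the table concrete, here is a small sketch of how a replacement might decide which data sets are due for re-indexing. The intervals are taken from the table; the function name, the `last_synced` bookkeeping and the example timestamps are hypothetical, and a real product would normally provide its own scheduler.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Sync intervals from the table of current integration points.
SYNC_INTERVALS: Dict[str, timedelta] = {
    "Courses": timedelta(hours=24),
    "Staff": timedelta(hours=24),
    "Vacancies": timedelta(hours=1),
}


def due_for_sync(last_synced: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the names of data sets whose sync interval has elapsed.

    A data set with no recorded sync time is always considered due.
    """
    due = []
    for name, interval in SYNC_INTERVALS.items():
        previous = last_synced.get(name)
        if previous is None or now - previous >= interval:
            due.append(name)
    return sorted(due)


# Hypothetical example: Courses synced 2 hours ago, Staff 25 hours ago,
# Vacancies 61 minutes ago.
now = datetime(2018, 6, 1, 12, 0)
last = {
    "Courses": now - timedelta(hours=2),
    "Staff": now - timedelta(hours=25),
    "Vacancies": now - timedelta(minutes=61),
}
```

In this example `due_for_sync(last, now)` would report Staff and Vacancies as due, while Courses would wait until its 24 hours have elapsed.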
References
http://www.searchtools.com/guide/index.html: An excellent 'concentrated' read on how search works and how to do it well.
- http://www.upenn.edu/computing/web/webteam/rnd/search_req.html
- http://www.upenn.edu/computing/web/webteam/rnd/search_compare.html
https://blog.liip.ch/archive/2011/01/13/why-a-project-switched-from-google-search-appliance-to-zend_lucene.html: Honest, open sharing of the learnings
https://bigwisdom.quora.com/What%E2%80%99s-The-Best-Alternative-To-Replace-Google-Search-Appliance-GSA: A great first read to orient yourself in the problem space
http://www.yippyinc.com/insights/3-factors-to-choose-your-google-search-appliance-gsa-replacement
http://blog.leeromero.org/2008/10/01/categories-of-search-requirements/
Database connectors
https://support.google.com/gsa/answer/4363201