Improving the performance of our Varnish instance

Recent observations indicate two main issues with our Varnish cache:

Unpredictable cache coverage: Why is it that some pages are served from cache and others aren't? What exactly is cached? Under what conditions are things that were cached dropped?
Different cache responses from different slaves: Refreshing the screen repeatedly can 'toggle' between two different versions of the truth
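
To make the 'toggle' symptom concrete, here is a minimal Python sketch, not a model of our actual setup. It assumes two independent caches behind a round-robin load balancer, and it assumes that no purge reaches the first cache after the content changes. The Cache class, the /home URL and the version strings are all hypothetical.

```python
from itertools import cycle


class Cache:
    """Toy stand-in for one Varnish instance: a dict of URL -> cached body."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def get(self, url, origin):
        # On a miss, fetch from the origin and keep the copy.
        if url not in self.store:
            self.store[url] = origin[url]
        return self.store[url]


origin = {"/home": "version 1"}          # hypothetical origin content
slave1, slave2 = Cache("slave1"), Cache("slave2")
balancer = cycle([slave1, slave2])       # assumption: round-robin balancing

# The first request lands on slave1, which caches version 1.
first = next(balancer).get("/home", origin)

# The page changes at the origin before slave2 has ever cached it,
# and (assumption) slave1 is not purged.
origin["/home"] = "version 2"

# Four refreshes alternate between slave2 and slave1.
responses = [next(balancer).get("/home", origin) for _ in range(4)]
print(responses)
# ['version 2', 'version 1', 'version 2', 'version 1']
```

Under those assumptions the responses alternate between the two versions, which matches what we see when refreshing repeatedly.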

A suggested process to improve the performance of Varnish includes:

Understand how Varnish is currently working
Identify improvements
Prioritise the improvements
Implement the highest priority
Remeasure performance
Return to step 4

Understand how Varnish is currently working

Answering the following questions will enrich our understanding of Varnish:

What version of Varnish are we running? Answer: 4.0.3. Do we care that 4.0.4 became unsupported in November 2016? Probably not, if we have a plan to move to a CDN within a few months. The challenge then becomes how to decommission Varnish.
What are the machine/server specifications/resources? Answer: Short of memory. Each slave is a virtual server with 6GB shared by all components. Slave 1 has 169MB free and Slave 2 has 172MB free. Can we please have more memory added?
How are the two instances syncing? Answer: They don't. In the long run each instance would have the same pages cached, but this does not hold for either cache at any given moment. However, given even a small number of requests per day, and a 24-hour period without changes to the asset/page, each slave would cache the same content.
What are the exact rules/settings that are enabled or in place? Answer: Jo has found the configuration file and is drafting suggested improvements to run past Murray and Nathan.
1. What is being cached? Images? Video? PDF? Search results? Answer: Page index files, JSON responses (usually on production but not on stage), and sometimes images, PDFs, etc. Jo to address this with Murray.
2. With what conditions? Answer: Very basic (and blunt) rules.
3. Once a piece of content has been cached in Varnish, under what conditions does it drop out of the cache or get requested again from Squiz? Answer: The 24-hour rule applies, and if content is updated in Matrix, triggers fire to clear the Varnish cache. Otherwise objects are evicted on a least-recently-used basis.
4. How long does a cached resource stay in the cache? Answer: From Murray: "The default cache period (as configured in Matrix) is 24hrs. After 24hrs an object will be considered 'stale' - however, Varnish will continue to serve the stale copy while it is refreshed in the background for up to 24hrs (i.e., if it goes stale, but someone requests it again within 24hrs, it will use the old cached copy while it goes and gets a fresh one in the background)."
How does Varnish treat Squiz's HTTP headers? For example, does anybody know why the standard headers I see in the console include "Cache-control:no-store, no-cache, must-revalidate, post-check=0, pre-check=0" and "Expires:Thu, 29 Apr 1982 00:00:00 GMT"? Answer: Varnish ignores or has nothing to say about these headers. Murray has explained why the headers are as they are, and agrees that we can easily make real improvements here. Jo to draft suggested header changes for Murray and Nathan to discuss.
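
Murray's description of the 24-hour cache period followed by a 24-hour stale window can be sketched as a small Python function. This is an illustration of the rule as quoted, not Varnish's own code. The function name, the state labels and the exact boundary handling are assumptions.

```python
TTL_SECONDS = 24 * 60 * 60     # cache period quoted from Matrix: 24 hours
GRACE_SECONDS = 24 * 60 * 60   # stale-but-still-served window: a further 24 hours


def cache_state(age_seconds, ttl=TTL_SECONDS, grace=GRACE_SECONDS):
    """Classify a cached object by its age (hypothetical helper).

    fresh   -> served straight from cache
    stale   -> old copy served while a fresh one is fetched in the background
    expired -> not served from cache; the next request goes to the backend
    """
    if age_seconds < 0:
        raise ValueError("age cannot be negative")
    if age_seconds < ttl:
        return "fresh"
    if age_seconds < ttl + grace:
        return "stale"
    return "expired"


for hours in (1, 30, 50):
    print(hours, cache_state(hours * 3600))
# 1 fresh
# 30 stale
# 50 expired
```

Read this way, a page that is requested at least once every 48 hours is never fetched synchronously after its first load, unless it is purged or evicted in the meantime.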

Identify improvements

Add more memory to the servers: ITS to action
Extend Varnish configuration to ignore more of the images and documents: Jo to draft and run past Nathan and Murray
Smarter HTTP headers: Jo to draft and run past Nathan and Murray
Move to one live Varnish cache (with a hot fail-over to a second) to improve cache performance: Discuss with ITS. Would require configuration changes. This work will become redundant once we either adopt a CDN or move to a hosted Squiz, whichever comes first.
Both slaves and master database need maintenance (vacuuming and re-indexing): Discuss with ITS and Nathan
Reactivate the stalled 'copy faster slave and remove the slower slave' task: Discuss with ITS
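
As background for the 'smarter HTTP headers' item, the following Python sketch shows why the Cache-Control value we currently send prevents browsers from reusing a response, and how a different value would change that. It is a simplified reading of the header, ignoring heuristic freshness and the Expires header. The function names are hypothetical, and "public, max-age=86400" is only an example value, not the change Jo will propose.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip().strip('"') or None
    return directives


def reusable_without_revalidation(value):
    """Simplified check: may a cache reuse the response without asking the server?"""
    d = parse_cache_control(value)
    # no-store forbids storing at all; no-cache forces revalidation on every use.
    if "no-store" in d or "no-cache" in d:
        return False
    max_age = d.get("max-age")
    return max_age is not None and max_age.isdigit() and int(max_age) > 0


current = "no-store, no-cache, must-revalidate, post-check=0, pre-check=0"
example = "public, max-age=86400"   # hypothetical alternative: 24 hours

print(reusable_without_revalidation(current))   # False
print(reusable_without_revalidation(example))   # True
```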

On image optimisation, Murray gave us a thorough answer and it is back to the drawing board. I will draft a user story and we can prioritise if and when it gets done.

Useful references

Change header to know if resource was cached in Varnish or not

HTTP headers to cache API responses

Simple way to leverage browser cache or a more detailed explanation

Achieving a high hit rate in Varnish

Getting started with VCL

10 Varnish cache mistakes: An interesting read

How web caches work, at different levels