- Sites we have optimized
- Out-of-the-box publishing sites
Analysis showed that the issue was related to image renditions in SharePoint. If you’re not familiar with this feature, it does something useful which, ironically, is intended to improve site performance. For each image uploaded, multiple resized versions are automatically created in the background – the idea is that end-users don’t download a large ‘original’ image when only a tiny thumbnail is needed. A classic example is large images added to content pages, which are then shown as a list of rolled-up links on a home page e.g. “most recent news articles”.
If it wasn’t for the performance issue, the principle works well – a 4MB image is typically shrunk to around 200k for a typical size, and that’s a lot less data being downloaded to users. I’m not sure if anything has changed recently in Office 365 (since most sites I’ve been involved in use image renditions), but a couple of clients noticed the issue around the same time we did. Specifically, pages with renditions are slow on “first-time” page loads i.e. whenever the images need to be downloaded because they are not served from the local browser cache. But unfortunately it’s not just that - rendition image files are served with expiry headers of 24 hours, meaning even regular users will have at least one very slow page load every 24 hours. And of course, we might not just be talking about their home page – rather, every page they hit could be slow once every 24 hours. That’s certainly enough to damage user perceptions about Office 365.
But why are image renditions slow?
When further analysis is performed, we see the delay happens on the server – the big surprise is that the delay is not for the actual image to be downloaded, which is what you’d normally expect. The following image shows a real home page being loaded, and we can see a delay of multiple seconds for many images (each one being a “rendition” image), as indicated by the long green bars:
When we dig deeper, we see the delay is not in the content download, but is in the “waiting” stage – this indicates the delay is with Office 365 itself, and not in the actual file being downloaded:
We believe this happens because there are typically “cache misses” on rendition images being served from the BLOB cache. When the image is not served from the blob cache, the SharePoint Online infrastructure is very slow to process and serve the image. It seems that cache hits are very rare for end-users – possibly due in part to BLOB cache settings in SPO (e.g. disk size allowed), but more likely due to sheer number of front-end servers in a typical SPO farm. I’m told that some of the larger farms in the service have between 100 and 200 front-end servers now – clearly this is a very different situation to an on-premises environment which SharePoint was originally designed for. So whilst the renditions architecture would be very effective in a typical on-prem farm of say, 3-6 front-end web servers, in the Office 365 world this is not the case. Of course, if you work closely with the product you sometimes see some examples of things like this that didn’t quite translate perfectly to the cloud world. That said, having worked on many deployments I’m always amazed at just how well SharePoint does work as a service (no doubt due to some hard work from talented Microsoft engineers, including some folks I know) - but there will always be “opportunities for improvement”, and the service often evolves to include these.
So what can we do about it? Well, we could just avoid using SharePoint image renditions completely, but then performance would still be poor due to large files being downloaded to users. So we definitely do want to use different image sizes – and since the core work of resizing images isn’t that hard, why not do it ourselves? We could then take advantage of other things, like some automatic use of a CDN such as Azure CDN (N.B. that link explains what a CDN does if you’re not familiar). This is the direction we took to work around the performance issue in SharePoint Online. Things work pretty well, and what we implemented has the following benefits:
- The Office 365 renditions delay does not occur
- There is no impact to intranet end-users (except the improved performance)
- Intranet authors do not need to do anything different or have additional training
- The image files are hosted in Azure CDN (Content Delivery Network), which places the files in various Azure datacentres around the world, to ensure they are close to the user. This can significantly boost performance further for some users, especially those far away from the Office 365 datacentres (e.g. non-European users in the case of most of our clients)
- All of the capabilities of the image renditions framework are supported (e.g. the ability for an author/administrator to crop an image rendition so that a certain portion of the image is used)
Technical architecture of our solution
I’ll go into more detail in the next post, but briefly our solution works as follows follows:
- An intranet author adds or changes an image in SharePoint
- A Remote Event Receiver executes, and adds an item to a queue in Azure (NOTE – RERs are not failsafe, so we supplement this approach with a fall-back mechanism which ensures broken images never happen. More on this next time).
- An Azure WebJob processes the queue item, taking the following steps:
- Fetches the image from SharePoint (using CSOM)
- Creates different rendition sizes (using the sizes defined in SharePoint)
- Stores all the resulting files in Azure BLOB storage
- The Azure CDN infrastructure then propagates the image file to different Azure datacentres around the world.
- When images are displayed in SharePoint, the link to the Azure CDN is used (courtesy of a small tweak to our display template code). Since CDNs work by supplying one URL, the routing automatically happens so that the nearest copy of the image to the user is fetched.
This is depicted below:
Clearly it’s critical that an intranet home page never displays a broken or missing image – that would be A Very Bad Thing for most organizations. So how can we guard against that? Also, we said that Remote Event Receivers cannot be 100% reliable (and also should not do “heavy” processing work), so…..what about that? And what about existing images in a site that was running before a solution like this is implemented? My colleague Paul Ryan and I wrestled with these challenges and more as we architected the solution and wrote the code - I’ll talk more about the fall-back mechanism (which takes care of these aspects) and go into more detail on the technical implementation in the next post.
It would be great if this issue didn’t exist in Office 365 in the first place of course. But I write this post to show that with the right building blocks, we can certainly supplement functionality around Office 365/SharePoint with some effort. This was a solution developed for clients of Content and Code so it’s not something I can provide the source code for, but hopefully this write-up helps awareness of the issue and potential ways of working around it. More next time..