Tuesday, 22 March 2016

Office 365 performance – image renditions causing slow page loads in SharePoint Online

Just like any website, there are many reasons why page load times might not be amazing in SharePoint Online. Perhaps it’s a page with too many ‘heavy’ controls (e.g. search web parts), a particularly slow custom control, the amount of data going over the wire (e.g. due to large images or JavaScript/CSS files), use of a known performance killer such as structural navigation, or maybe things are slow from the office due to network infrastructure such as reverse proxies. If users are far away from where the Office 365 tenant is located, that can certainly exacerbate things. As always, if the site has any kind of customization, some optimization steps need to be taken - good performance won’t always happen by default. Recently however, we’ve been noticing slow page loads even in:

  • Sites we have optimized
  • Out-of-the-box publishing sites

Analysis showed that the issue was related to image renditions in SharePoint. If you’re not familiar with this feature, it does something useful which, ironically, is intended to improve site performance. For each image uploaded, multiple resized versions are automatically created in the background – the idea is that end-users don’t download a large ‘original’ image when only a tiny thumbnail is needed. A classic example is large images added to content pages, which are then shown as a list of rolled-up links on a home page e.g. “most recent news articles”.

If it wasn’t for the performance issue, the principle works well – a 4MB image is typically shrunk to around 200KB at a commonly used rendition size, and that’s a lot less data being downloaded to users. I’m not sure if anything has changed recently in Office 365 (since most sites I’ve been involved in use image renditions), but a couple of clients noticed the issue around the same time we did. Specifically, pages with renditions are slow on “first-time” page loads, i.e. whenever the images need to be downloaded because they are not served from the local browser cache. But unfortunately it’s not just that - rendition image files are served with expiry headers of 24 hours, meaning even regular users will have at least one very slow page load every 24 hours. And of course, we might not just be talking about their home page – rather, every page they hit could be slow once every 24 hours. That’s certainly enough to damage user perceptions about Office 365.
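
To make the 24-hour effect concrete, here is a minimal sketch of the freshness check a browser cache performs. It assumes the expiry is expressed as a `Cache-Control: max-age=86400` header, which is an illustrative assumption (the post only says the expiry is 24 hours); the function names are hypothetical.

```python
import re
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the max-age value from a Cache-Control header, or None if absent."""
    match = re.search(r"\bmax-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def is_fresh(age_seconds: int, cache_control: str) -> bool:
    """True if a cached copy of this age can be reused without asking the server."""
    max_age = max_age_seconds(cache_control)
    return max_age is not None and age_seconds < max_age


# Assumed header value: 24 hours expressed in seconds.
header = "public, max-age=86400"

one_hour_old = is_fresh(60 * 60, header)       # copy reused from the browser cache
next_morning = is_fresh(25 * 60 * 60, header)  # copy expired, image requested again
```

A user who opens the intranet at roughly the same time each day sits right on this boundary, so the rendition images are requested from the server again on most mornings.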

Sidenote – first-time page loads vs. returning user page loads
So renditions are slow even outside of first-time page loads. But whilst we’re on the subject, how much do we need to care about first-time page loads anyway? I typically advise my clients not to worry too much – frankly, Office 365 will always be slow here since there are some pretty heavy JavaScript files that need to be downloaded (even if they do come from a CDN). It’s a rich, highly-functional platform after all. But this is very much part of the Office 365 design – I believe Microsoft take the view that users are forgiving, so long as their *subsequent* browsing experience is quick. Most users don’t have the same expectations of their corporate intranet/collaboration platform as they do of public consumer sites such as Facebook and Google – and since most intranet usage is *not* first-time page loads, things work out in the end. I agree with this viewpoint frankly.

But why are image renditions slow?

When further analysis is performed, we see the delay happens on the server – the big surprise is that the delay is not for the actual image to be downloaded, which is what you’d normally expect. The following image shows a real home page being loaded, and we can see a delay of multiple seconds for many images (each one being a “rendition” image), as indicated by the long green bars:

[Image: rendition image delays]

When we dig deeper, we see the delay is not in the content download, but is in the “waiting” stage – this indicates the delay is with Office 365 itself, and not in the actual file being downloaded:

[Image: rendition image delays - detail]
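
The same breakdown can be checked programmatically if the network trace is exported from the browser dev tools as a HAR file. The sketch below uses the standard HAR timing fields (`blocked`, `dns`, `connect`, `send`, `wait`, `receive`); the numbers in the sample entry are made up for illustration and are not taken from the trace above.

```python
from typing import Dict


def dominant_phase(timings: Dict[str, float]) -> str:
    """Return the phase that took longest; HAR uses -1 for 'not applicable'."""
    applicable = {phase: ms for phase, ms in timings.items() if ms >= 0}
    return max(applicable, key=applicable.get)


def wait_share(timings: Dict[str, float]) -> float:
    """Fraction of the request time spent waiting for the server's first byte."""
    total = sum(ms for ms in timings.values() if ms >= 0)
    wait = max(timings.get("wait", 0), 0)
    return wait / total if total else 0.0


# Hypothetical timings (milliseconds) for one rendition image request.
sample = {"blocked": 2, "dns": -1, "connect": -1, "send": 1, "wait": 3200, "receive": 40}

slowest = dominant_phase(sample)  # which phase dominates
share = wait_share(sample)        # how much of the request was server wait
```

When `wait` dominates and `receive` is small, the time is being spent before the server sends its first byte rather than in transferring the file.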

We believe this happens because there are typically “cache misses” on rendition images being served from the BLOB cache. When the image is not served from the BLOB cache, the SharePoint Online infrastructure is very slow to process and serve the image. It seems that cache hits are very rare for end-users – possibly due in part to BLOB cache settings in SPO (e.g. disk size allowed), but more likely due to the sheer number of front-end servers in a typical SPO farm. I’m told that some of the larger farms in the service have between 100 and 200 front-end servers now – clearly this is a very different situation to the on-premises environments which SharePoint was originally designed for. So whilst the renditions architecture would be very effective in a typical on-prem farm of say, 3-6 front-end web servers, in the Office 365 world this is not the case. Of course, if you work closely with the product you sometimes see things like this that didn’t quite translate perfectly to the cloud world. That said, having worked on many deployments I’m always amazed at just how well SharePoint does work as a service (no doubt due to some hard work from talented Microsoft engineers, including some folks I know) - but there will always be “opportunities for improvement”, and the service often evolves to include these.
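
A rough back-of-the-envelope model shows why farm size could matter so much. This is only a sketch under stated assumptions: each front-end server keeps its own independent BLOB cache, requests are spread uniformly at random across servers, and nothing is evicted. None of these are confirmed details of how SharePoint Online works, and the function name is hypothetical.

```python
def warm_hit_probability(servers: int, prior_requests: int) -> float:
    """Chance that the next request lands on a server that already cached the image.

    Assumes uniform random load balancing and one independent cache per server.
    A given server was missed by every prior request with probability
    (1 - 1/servers) ** prior_requests, so the hit chance is the complement.
    """
    return 1.0 - (1.0 - 1.0 / servers) ** prior_requests


# Same image, ten earlier requests, two very different farm sizes.
on_prem = warm_hit_probability(servers=4, prior_requests=10)    # small on-prem farm
cloud = warm_hit_probability(servers=150, prior_requests=10)    # large SPO-style farm
```

Under these assumptions, ten earlier requests warm almost every server in a four-server farm, but only a small fraction of a 150-server farm, so most users would still hit a cold cache.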

Our solution (based on Azure CDN)

So what can we do about it? Well, we could just avoid using SharePoint image renditions completely, but then performance would still be poor due to large files being downloaded to users. So we definitely do want to use different image sizes – and since the core work of resizing images isn’t that hard, why not do it ourselves? We could then take advantage of other things, such as automatic use of a CDN like Azure CDN. This is the direction we took to work around the performance issue in SharePoint Online. Things work pretty well, and what we implemented has the following benefits:

  • The Office 365 renditions delay does not occur
  • There is no impact to intranet end-users (except the improved performance)
  • Intranet authors do not need to do anything different or have additional training
  • The image files are hosted in Azure CDN (Content Delivery Network), which places the files in various Azure datacentres around the world, to ensure they are close to the user. This can significantly boost performance further for some users, especially those far away from the Office 365 datacentres (e.g. non-European users in the case of most of our clients)
  • All of the capabilities of the image renditions framework are supported (e.g. the ability for an author/administrator to crop an image rendition so that a certain portion of the image is used)

Technical architecture of our solution

I’ll go into more detail in the next post, but briefly our solution works as follows:

  1. An intranet author adds or changes an image in SharePoint
  2. A Remote Event Receiver executes, and adds an item to a queue in Azure (NOTE – RERs are not failsafe, so we supplement this approach with a fall-back mechanism which ensures broken images never happen. More on this next time).
  3. An Azure WebJob processes the queue item, taking the following steps:
    1. Fetches the image from SharePoint (using CSOM)
    2. Creates different rendition sizes (using the sizes defined in SharePoint)
    3. Stores all the resulting files in Azure BLOB storage
  4. The Azure CDN infrastructure then propagates the image files to different Azure datacentres around the world.
  5. When images are displayed in SharePoint, the link to the Azure CDN is used (courtesy of a small tweak to our display template code). Since CDNs work by supplying one URL, the routing automatically happens so that the nearest copy of the image to the user is fetched.

This is depicted below:

[Image: CDN image renditions for SPO]
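
The core of steps 3 and 5 can be sketched without any Azure or SharePoint libraries. This is not the code from the actual solution (which is not published); it is a minimal illustration, and the rendition size, blob naming scheme, container name and CDN host below are all hypothetical. It assumes a default centred crop where the author has not chosen one.

```python
import posixpath
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import quote


@dataclass(frozen=True)
class Rendition:
    name: str
    width: int
    height: int


def center_crop_box(src_w: int, src_h: int, target_w: int, target_h: int) -> Tuple[int, int, int, int]:
    """Largest centred (left, top, right, bottom) region matching the target aspect ratio."""
    target_ratio = target_w / target_h
    if src_w / src_h > target_ratio:
        # Source is wider than the target shape: trim the sides.
        crop_w = round(src_h * target_ratio)
        left = (src_w - crop_w) // 2
        return (left, 0, left + crop_w, src_h)
    # Source is taller than (or equal to) the target shape: trim top and bottom.
    crop_h = round(src_w / target_ratio)
    top = (src_h - crop_h) // 2
    return (0, top, src_w, top + crop_h)


def blob_name(site_relative_path: str, rendition: Rendition) -> str:
    """Hypothetical naming scheme: original path plus a size suffix."""
    stem, ext = posixpath.splitext(site_relative_path.strip("/"))
    return f"{stem}_{rendition.width}x{rendition.height}{ext}"


def cdn_url(cdn_host: str, container: str, name: str) -> str:
    """URL a display template would emit instead of the SharePoint rendition URL."""
    return f"https://{cdn_host}/{container}/{quote(name)}"


thumb = Rendition(name="News thumbnail", width=400, height=200)  # hypothetical size
box = center_crop_box(4000, 3000, thumb.width, thumb.height)
name = blob_name("/SiteAssets/news/hero image.jpg", thumb)
url = cdn_url("example.azureedge.net", "renditions", name)
```

An image library would then crop to `box`, resize to the rendition's width and height, and upload the result under `name`; the display template only needs to be able to derive `url` from the original image path and the rendition in use.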

Fall-back mechanism

Clearly it’s critical that an intranet home page never displays a broken or missing image – that would be A Very Bad Thing for most organizations. So how can we guard against that? Also, we said that Remote Event Receivers cannot be 100% reliable (and also should not do “heavy” processing work), so…..what about that? And what about existing images in a site that was running before a solution like this is implemented? My colleague Paul Ryan and I wrestled with these challenges and more as we architected the solution and wrote the code - I’ll talk more about the fall-back mechanism (which takes care of these aspects) and go into more detail on the technical implementation in the next post. 

Summary

It would be great if this issue didn’t exist in Office 365 in the first place of course. But I write this post to show that with the right building blocks, we can certainly supplement functionality around Office 365/SharePoint with some effort. This was a solution developed for clients of Content and Code so it’s not something I can provide the source code for, but hopefully this write-up helps awareness of the issue and potential ways of working around it. More next time..

8 comments:

Dave said...

I'd noticed the same thing myself - recently as well. Renditions were working as expected until a few months ago, now they can take several seconds to appear.

I wonder if there's a regression issue here, rather than the natural growth of their farms becoming apparent? It seems odd that there was an obvious step where the problem became apparent.

It would be a shame if this is going to become a "just how things are now" aspect of SharePoint Online. Renditions were very useful and very quick to implement.

Stefan Bauer said...

Hi,

I encountered the same problem recently in Office 365. The problem I've spotted doesn't come from the image rendition itself, because SharePoint is really quick to process the image renditions.
The value for a single image rendition was between 5 and 30 milliseconds (SPRequest Duration) to process, which is pretty quick and is the same time I got on an on-premises installation.
I think the problem comes more from ASP.NET and how SharePoint/IIS assembles the requests.

In your traces, as far as I can see in your screenshots, you have the cache disabled, which is appropriate for two scenarios. The first is when the user first hits the page, because otherwise the images will be loaded from the local cache.
The second situation where the cache will be ignored is that Internet Explorer/Edge is unable to cache locally on HTTPS pages, another really annoying fact. All the assets in this browser will always be loaded from the server, which results in really bad page performance.

Overall, the rendering performance is not the best at the moment. Let's hope this will be fixed and changed soon.

/Stefan

Chris O'Brien said...

Hey Stefan,

Good info, thanks. Yes, it's definitely first-time page loads that I'm focusing on here (as I tried to make clear), hence testing with cache disabled. I wasn't aware there was a difference in IE/Edge behaviour compared to the other browsers - I'll try and look into this too, but what made you think this? Do you have any additional info?

Thanks!

COB.

Stefan Bauer said...

Hi,

I found the post again, and it is well hidden by Microsoft. The information on how IE/Edge caches can be found in the "Back Navigation Caching" article.

In there it describes how caching works in general and what the requirements are. One condition is that:
Served using the HTTP: protocol (HTTPS pages are not cached for security reasons)

This matches my observation that Office 365 pages, as well as other pages on Edge/IE, won't be cached at all. This behaviour can also be reproduced with a network trace in IE.
While Chrome reports 'From Cache' for most SharePoint assets, Microsoft Edge downloads everything again and ignores the cache completely.

/Stefan

Chris O'Brien said...

Hey Stefan,

So I looked at this, but I don't see quite the same thing. As far as I can tell, you're 100% right that the *page itself* is not served from the cache, but external resources such as CSS and JS files *are* served from the cache. I verified this in IE/Edge dev tools and also HttpWatch.

You can see this at http://www.sharepointnutsandbolts.com/p/ieedge-network-trace.html

So, I don't think this behaviour is too different to other browsers. But anyway, I guess the main thing is that broadly, the vast majority of files *will* be served from the cache on a non first time page load (as expected).

Do you think I'm missing something, or do you also think that's the conclusion?

Cheers,

COB.

Stefan Bauer said...

Hi Chris,

I checked it again and took a closer look at the images. You are right that they come from the cache now. There must have been some changes applied on Office 365. A couple of weeks ago I struggled with the same problems as you described.
I checked it from Edge down to IE 10, and in every case the files were downloaded again on each request.
I also checked several other browsers, and Safari on Mac was the fastest at downloading and rendering the page: the time was nearly 50% quicker than the others, and TTFB was strangely better too.

I saw the same slow TTFB elsewhere, but that mostly depended on the tenant I tried it on.

Seems like they tweaked some stuff in the background, because I had some really strange troubles at that time.

I'll keep an eye on this, but what I can say so far is that your solution is worth considering on the next branding project or as a future improvement.

/Stefan




Chris O'Brien said...

Hi Stefan,

Cool, thanks for confirming!

Cheers,

COB.

DJ Prince said...

I spoke with someone at Microsoft: they tried to roll out a fix for the image rendition bug a few weeks ago, without success. They are still working on it. What we have done in some cases is to refer to the large thumbnail image stored in SharePoint (_w and _jpg prefix).