Chris O'Brien: SharePoint Syntex AI - tips for choosing between structured and unstructured document understanding and form processing models

SharePoint Syntex, the AI-powered document understanding capability of Microsoft 365 which was previously part of Project Cortex (we'll come to that in a second), provides two approaches to understanding your content in structured and unstructured document understanding. I talked about document understanding in two previous articles:

SharePoint Syntex - training Syntex to read your documents like a human - part 1 (classifiers)
SharePoint Syntex - training Syntex to read your documents like a human - part 2 (entity extractors)

However, document understanding is only 50% of the Syntex capability. Unstructured document understanding is more suited to very structured documents such as invoices, receipts, and order forms - but there's a lot more to the decision than that. On the surface, many of the document AI processing scenarios you might consider could use either approach - but whilst there's huge overlap, the two models have some significant differences in capability, licensing and how they are applied.

Choosing the right approach can be confusing at first, so that's the focus of this article. You'll find the decision between structured (previously known as "form processing" as in the image above) and unstructured document understanding comes up frequently as you work with Syntex - I also cover it in SharePoint Syntex AI - my top 5 real-world tips.

But first, did you know that the "Project Cortex" label is disappearing?

Microsoft's Viva - branding and naming changes

With the advent of Microsoft Viva, the "topics" part of Project Cortex has become "Viva Topics" and is part of the that product set. The retirement of the Cortex brand label also means that Syntex becomes it's own thing. Indeed the Syntex add-on for Office 365 didn't use the Cortex label once launched - so in the Microsoft 365 admin center you'll see:

So a quick summary of the original naming compared to the current naming is:

With that out of the way, let's get back to unstructured document understanding.

Structured vs. unstructured document understanding - what to use when?

So how do the structured and unstructured models differ? When would you use each one? Microsoft have a fairly useful article at Compare custom models in Microsoft Syntex, however some of the biggest differences lie in how the AI models are trained and what they're capable of, something the article doesn't really cover. I think things can be clearer still - here are the major differences as I see them:

Aspect	Unstructured document understanding	Structured document understanding
Best suited for	Unstructured or less structured content - documents can be written in different ways	Highly structured content - based on a specific format with high consistency
Capabilities	Classify your documents/identify content types (e.g. to drive compliance policies) Extract content	Extract content Extract table content
Summary of AI model training	Machine teaching based on flexible rules for classification and extraction (e.g. proximity rules such as "X should be within 50 characters of Y")	Machine teaching based on well-defined locations with the document
Underlying technology	Native to SharePoint Syntex	Power Apps AI Builder - form processing
Technical implementation	Model is applied to the SharePoint library through model settings	A Power Automate Flow is created and associated with the originating library - but currently cannot be applied to different libraries
How applied across your tenant	Create model once, apply to multiple SharePoint libraries (one by one)	Created model is tied to originating document library (today)
Licensing	Syntex license only: £3.40 or $5 per user per month (see "Who needs a Syntex license" later)	Syntex license + AI Builder credits: Each user needs a Syntex license, and in terms of AI Builder credits, orgs with 300+ Syntex licenses receive a bundled allowance of 1m credits (one off). See AI Builder calculator for examples of how many credits are consumed by different operations - but in short, 1m in credits is extremely generous even for large organizations. As a guideline, if you process 5000 documents per month that would be just 4 credits per month. To purchase, 1 unit of AI Builder credit costs £377 or $500
Supported file types	All Office file types, .eml, JPG, PNG, PDF, RTF, TIFF, txt	JPG, PNG or PDF
Key limitations	Office files are truncated at 64k characters OCR-scanned files (PDF, image or TIFF) are limited to 20 pages	Tables must be simple - no nested tables or merged cells Signatures, checkboxes and radio buttons cannot be extracted Max 500 pages

Let's dig into more detail on a couple of these points.

Differing approaches to machine teaching

I mention in the table above the differences in how the AI models are trained. Perhaps the images below explain it best.

In structured document understanding, I'm very precisely teaching Syntex where to find a previously-specified element of the document I want to extract - in this case the invoice reference:

In document understanding, I can do the same thing - however, I can also add some flexibility by defining "explanations" (essentially rules) which provide some extra context about variations in the document format that the AI might need to deal with:

I can also teach the machine how specific parts of the document relate to each other with a proximity rule. In the explanation below, I'm saying "first find the 'Fees and Payment' phrase, then find 'Total' which is more than 20 tokens away but less than 100. Once there, find the thing that looks like a GBP currency value which is VERY close, in fact less than 10 tokens away:

So, these differences can rule in or out one model depending on what your documents look like.

Considerations when applying across your tenant

The current limitation of an unstructured document understanding model being tied to the SharePoint library where it was created is an important factor. This means that, as of today, there's no real pathway to using one of these models across your Microsoft 365 environment - so if your invoices/order forms/receipts get stored in many different SharePoint sites or document libraries you have a lot of work on your hands to replicate the solution.

This is somewhat surprising since the implementation is a Power Automate Flow. Conceivably, you could copy and paste Flow actions to speed things up - however, I note that simply trying to repoint the Flow to another SharePoint document library (by updating references) currently fails, perhaps indicating that there are some internal references which become out of sync. Indeed, the Microsoft documentation states:

Unstructured document understanding models can currently only be applied to the SharePoint document library from which you created them. This allows licensed users with access to the site to create a structured document understanding model. Note that an admin needs to enable structured document understanding on a SharePoint document library for it to be available to licensed users.

Hopefully Syntex unstructured document understanding models become much more reusable and portable in the future.

Other benefits with SharePoint Syntex licensing

Notably, Microsoft provide some other capabilities as part of the Syntex license. These are badged as "premium" items and include:

Term store analytics

Insights on how tags are being applied to your content - term store operations, open and closed term sets, terms without synonyms and more
See Term store reports | Microsoft Docs

Content type push to hub

More control over where content types get applied to your environment. Being able to pushing from the central term store to a hub can become the first part in a chain, where the second part is the existing "push from hub to associated sites" capability. In the end, you get your content types to where they need to be without PowerShell or other roll your own approaches
See Push content type to hub

Import using SKOS format

The ability to import SharePoint taxonomy terms from a common format
See Import a term set using a SKOS-based format

Who needs a Syntex license?

At £3.40/$5 per user per month for the Syntex add-on, understanding exactly who in the organization requires a Syntex license becomes critical. The SharePoint Syntex FAQ states the following, though the highlighting is mine:

Anyone using, consuming, or otherwise benefiting from SharePoint Syntex capabilities requires a license. This includes the following scenarios:

Access a Content Center
Create a document understanding model in a Content Center
Upload content to a library where a document understanding model is associated (whether in a Content Center or elsewhere)
Manually execute a document understanding model
View a library where a document understanding model is associated
Create an unstructured document understanding model via the entry point in a SharePoint library
Upload content to a library where an unstructured document understanding model is associated

In summary the licensing requirements are certainly pervasive - even viewing a SharePoint library where a Syntex model is used requires a license.

Conclusions

An organization's ability to automate processes around documents, improve findability and extract important corporate knowledge from files will be important factors in agility and effectiveness over the next few years. You can find the AI that supports this in a few different places in the Microsoft stack (and the wider cloud market), but Syntex is the easy-to-consume technology that brings this directly to where your documents are in Office 365 - SharePoint and Teams. With a little experimentation and persistence, a non-technical business user can build powerful automations and effective search tools.

Syntex licensing means that it's likely to be used in specific use cases - perhaps for your organization it could be something around CVs/resumes, RFPs, project documents, invoices, order forms or partner details - but you may not feel it's appropriate to license every user in your organisation, at least not until the value provided is clear.

Making the right decisions between the two Syntex approaches of structured and unstructured document understanding is vital and this can be a confusing area. Some of the major differences include what the AI models are capable of, how easy they are to use in different places, supported file types and licensing. Hopefully this article is useful in helping you navigate this.

Chris O'Brien

Sunday, 28 February 2021

SharePoint Syntex AI - tips for choosing between structured and unstructured document understanding and form processing models

Microsoft's Viva - branding and naming changes

Structured vs. unstructured document understanding - what to use when?

Differing approaches to machine teaching

Considerations when applying across your tenant

Other benefits with SharePoint Syntex licensing

Who needs a Syntex license?

Conclusions

No comments:

MVP profile

Popular articles

RSS feed

Previous articles

About Me