SharePoint Syntex, the AI-powered document understanding capability of Microsoft 365 which was previously part of Project Cortex (we'll come to that in a second), provides two approaches to understanding your content in structured and unstructured document understanding. I talked about document understanding in two previous articles:
- SharePoint Syntex - training Syntex to read your documents like a human - part 1 (classifiers)
- SharePoint Syntex - training Syntex to read your documents like a human - part 2 (entity extractors)
However, document understanding is only 50% of the Syntex capability. Unstructured document understanding is more suited to very structured documents such as invoices, receipts, and order forms - but there's a lot more to the decision than that. On the surface, many of the document AI processing scenarios you might consider could use either approach - but whilst there's huge overlap, the two models have some significant differences in capability, licensing and how they are applied.
Choosing the right approach can be confusing at first, so that's the focus of this article. You'll find the decision between structured (previously known as "form processing" as in the image above) and unstructured document understanding comes up frequently as you work with Syntex - I also cover it in SharePoint Syntex AI - my top 5 real-world tips.Microsoft's Viva - branding and naming changes
With the advent of Microsoft Viva, the "topics" part of Project Cortex has become "Viva Topics" and is part of the that product set. The retirement of the Cortex brand label also means that Syntex becomes it's own thing. Indeed the Syntex add-on for Office 365 didn't use the Cortex label once launched - so in the Microsoft 365 admin center you'll see:
With that out of the way, let's get back to unstructured document understanding.
Structured vs. unstructured document understanding - what to use when?
So how do the structured and unstructured models differ? When would you use each one? Microsoft have a fairly useful article at Compare custom models in Microsoft Syntex, however some of the biggest differences lie in how the AI models are trained and what they're capable of, something the article doesn't really cover. I think things can be clearer still - here are the major differences as I see them:
Aspect | Unstructured document understanding | Structured document understanding |
---|---|---|
Best suited for | Unstructured or less structured content - documents can be written in different ways | Highly structured content - based on a specific format with high consistency |
Capabilities | Classify your documents/identify content types (e.g. to drive compliance policies) Extract content |
Extract content Extract table content |
Summary of AI model training | Machine teaching based on flexible rules for classification and extraction (e.g. proximity rules such as "X should be within 50 characters of Y") | Machine teaching based on well-defined locations with the document |
Underlying technology | Native to SharePoint Syntex | Power Apps AI Builder - form processing |
Technical implementation | Model is applied to the SharePoint library through model settings | A Power Automate Flow is created and associated with the originating library - but currently cannot be applied to different libraries |
How applied across your tenant | Create model once, apply to multiple SharePoint libraries (one by one) | Created model is tied to originating document library (today) |
Licensing | Syntex license only: £3.40 or $5 per user per month (see "Who needs a Syntex license" later) |
Syntex license + AI Builder credits: Each user needs a Syntex license, and in terms of AI Builder credits, orgs with 300+ Syntex licenses receive a bundled allowance of 1m credits (one off). See AI Builder calculator for examples of how many credits are consumed by different operations - but in short, 1m in credits is extremely generous even for large organizations. As a guideline, if you process 5000 documents per month that would be just 4 credits per month. To purchase, 1 unit of AI Builder credit costs £377 or $500 |
Supported file types | All Office file types, .eml, JPG, PNG, PDF, RTF, TIFF, txt | JPG, PNG or PDF |
Key limitations | Office files are truncated at 64k characters OCR-scanned files (PDF, image or TIFF) are limited to 20 pages |
Tables must be simple - no nested tables or merged cells Signatures, checkboxes and radio buttons cannot be extracted Max 500 pages |
Let's dig into more detail on a couple of these points.
Differing approaches to machine teaching
I mention in the table above the differences in how the AI models are trained. Perhaps the images below explain it best.
In structured document understanding, I'm very precisely teaching Syntex where to find a previously-specified element of the document I want to extract - in this case the invoice reference:I can also teach the machine how specific parts of the document relate to each other with a proximity rule. In the explanation below, I'm saying "first find the 'Fees and Payment' phrase, then find 'Total' which is more than 20 tokens away but less than 100. Once there, find the thing that looks like a GBP currency value which is VERY close, in fact less than 10 tokens away:
Considerations when applying across your tenant
The current limitation of an unstructured document understanding model being tied to the SharePoint library where it was created is an important factor. This means that, as of today, there's no real pathway to using one of these models across your Microsoft 365 environment - so if your invoices/order forms/receipts get stored in many different SharePoint sites or document libraries you have a lot of work on your hands to replicate the solution.
This is somewhat surprising since the implementation is a Power Automate Flow. Conceivably, you could copy and paste Flow actions to speed things up - however, I note that simply trying to repoint the Flow to another SharePoint document library (by updating references) currently fails, perhaps indicating that there are some internal references which become out of sync. Indeed, the Microsoft documentation states:
Unstructured document understanding models can currently only be applied to the SharePoint document library from which you created them. This allows licensed users with access to the site to create a structured document understanding model. Note that an admin needs to enable structured document understanding on a SharePoint document library for it to be available to licensed users.
Hopefully Syntex unstructured document understanding models become much more reusable and portable in the future.
Other benefits with SharePoint Syntex licensing
Notably, Microsoft provide some other capabilities as part of the Syntex license. These are badged as "premium" items and include:
- Term store analytics
- Insights on how tags are being applied to your content - term store operations, open and closed term sets, terms without synonyms and more
- See Term store reports | Microsoft Docs
- Content type push to hub
- More control over where content types get applied to your environment. Being able to pushing from the central term store to a hub can become the first part in a chain, where the second part is the existing "push from hub to associated sites" capability. In the end, you get your content types to where they need to be without PowerShell or other roll your own approaches
- See Push content type to hub
- Import using SKOS format
- The ability to import SharePoint taxonomy terms from a common format
- See Import a term set using a SKOS-based format
Who needs a Syntex license?
At £3.40/$5 per user per month for the Syntex add-on, understanding exactly who in the organization requires a Syntex license becomes critical. The SharePoint Syntex FAQ states the following, though the highlighting is mine:
Anyone using, consuming, or otherwise benefiting from SharePoint Syntex capabilities requires a license. This includes the following scenarios:
- Access a Content Center
- Create a document understanding model in a Content Center
- Upload content to a library where a document understanding model is associated (whether in a Content Center or elsewhere)
- Manually execute a document understanding model
- View a library where a document understanding model is associated
- Create an unstructured document understanding model via the entry point in a SharePoint library
- Upload content to a library where an unstructured document understanding model is associated
In summary the licensing requirements are certainly pervasive - even viewing a SharePoint library where a Syntex model is used requires a license.