Thursday, 13 May 2021

Speaking at ESPC21 - using Power Apps and AI for Incident Reporting

The European SharePoint, Office 365 and Azure Conference has always been one of my favorite events in Microsoft technology, and I'm looking forward to delivering a session at the event again this year. The event is running on June 1-2, and the good news of course is that you can attend with no travel required. I can guarantee you will have no problems with travel AC adapters! ESPC always has amazing content, and as usual there are keynotes and announcements from Microsoft execs such as Jeff Teper, Karuana Gatimu, Charles Lamanna, Adam Harmetz and others. 

I really like what the conference has done with pricing when in "virtual mode" - there's a free registration option, but also pay-for choices which get you many extras including on-demand session access, certain bonus sessions and a choice of pre-event tutorials. There's a link to pricing and the conference schedule at the end. 

Some details on my session:

Building an Incident Reporting Solution with Power Apps and AI

AI is no longer a high-end concept that only applies to organisations with large I.T. budgets. Instead, it is readily available in numerous ways in Microsoft cloud technologies, and when your content is already in Microsoft 365 it's easy to tap into. The scenario used here is incident reporting, but the approaches shown in this session can apply to *many* common applications.

Using a combination of Power Platform and Azure Cognitive Services, we'll show how to add image recognition and tagging to an app in a few easy steps. This session is aimed at developers, citizen developers, and anyone else building solutions in Microsoft 365.

Session link: Building an Incident Reporting Solution with Power Apps and AI

I will be taking any questions you may have live during my session, so come prepared! 

As a speaker at ESPC21 Online I can share with you a special 25% discount on Pro Access tickets, which includes a pre-event tutorial of your choice, all event sessions on demand and more. If interested, just use code ESPCPRO when booking here.

You can find the full event schedule here and check out some of ESPC’s reasons why you don’t want to miss this event here. I hope to see you at ESPC21 Online!

Thursday, 22 April 2021

SharePoint Syntex - teaching AI to extract contents of structured documents with Form Processing

In previous articles on SharePoint Syntex I've talked mainly about the document understanding approach - in this post I'll discuss its counterpart, form processing. For those following along, my overall set of articles on this theme so far are:

Syntex - document processing

Syntex - general
That last article in particular is designed to help you understand the difference between the two models and when to use each one. As you read about Syntex you might form the view that "form processing is for things like invoices and order forms and document understanding is for everything else" - certainly some of the guidance implies this. However, that position is far too simplistic - there are differences in licensing, capabilities, supported file types and more - and you'll want to get this decision right to avoid having to rework AI models. My "tips for choosing" article might be helpful since it has a table of differences and details of licensing aspects to look out for.

But today, we focus on form processing!

Syntex form processing - integrated AI Builder

As the briefest of recaps, Syntex form processing is typically better suited to highly structured, consistent document formats than document understanding is. Since it uses the AI Builder technology within Microsoft's Power Platform, there are a few implications to consider:
  • To use, AI Builder credits are needed in addition to Syntex licenses (see AI Builder calculator). However, if your org has 300+ Syntex licenses you receive a generous allowance of 1m credits - this more than gets you started
  • Supported file types include JPG, PNG or PDF - but not Office files
  • Entire tables can be extracted from the document (in contrast to document understanding)
  • The model is applied via a Power Automate Flow to the SharePoint document library where your documents reside (i.e. where you create the model from) - but there is no easy way to use this in other locations
In short, it's AI Builder conveniently built into SharePoint document libraries - so you don't have to do the integration or somehow pass each document to the model, it's taken care of for you.
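As a sketch of what happens under the hood: AI Builder's form processing shares its underpinnings with the Azure Form Recognizer service, and calling a trained custom model directly via REST looks roughly like the below. This is an illustration rather than what Syntex itself executes, and the endpoint, key and model ID are placeholders, not real values.

```python
import json
import time
import urllib.request

# Placeholders - substitute your own Azure resource details.
ENDPOINT = "https://contoso.cognitiveservices.azure.com"
API_KEY = "<your-form-recognizer-key>"
MODEL_ID = "<your-custom-model-id>"

def build_analyze_url(endpoint: str, model_id: str) -> str:
    """URL of the asynchronous analyze operation for a custom model."""
    return f"{endpoint}/formrecognizer/v2.1/custom/models/{model_id}/analyze"

def analyze_document(pdf_bytes: bytes) -> dict:
    """Submit a document to the trained model and poll for the result."""
    req = urllib.request.Request(
        build_analyze_url(ENDPOINT, MODEL_ID),
        data=pdf_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,
            "Content-Type": "application/pdf",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # The analyze call is asynchronous: it returns 202 Accepted with
        # an Operation-Location header pointing at the result to poll.
        result_url = resp.headers["Operation-Location"]
    while True:
        poll = urllib.request.Request(
            result_url, headers={"Ocp-Apim-Subscription-Key": API_KEY}
        )
        with urllib.request.urlopen(poll) as resp:
            result = json.load(resp)
        if result["status"] in ("succeeded", "failed"):
            return result
        time.sleep(2)
```

The value of the Syntex integration is precisely that you never write this plumbing yourself - the submit/poll/extract cycle is handled by the generated Flow.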

Our invoice format

Before we get started, it's worth seeing the format of the documents used. Like many classic examples of this type, they are invoices:


Implementing Syntex form processing

The approach followed here can be summarised as:
  1. Define the information to extract (i.e. teach Syntex what the fields are e.g. "Invoice Reference", "Invoice Date" etc.)
  2. Add documents for analysis
  3. Tag documents (i.e. teach Syntex where to find the relevant content in the document)
  4. Train the model
  5. Test
  6. Use in your document library
In your SharePoint document library, find "Automate" > "AI Builder" > "Create a model to process forms":



You'll see this message alerting you that AI Builder credits are needed:

Give your model a name - I'm using "COB invoice" for now. I want a new SharePoint content type to be created with this name so these documents are easily identified and classified amongst any others:

Syntex then begins to create your AI model:


Once the model has been created we define which information within the document we want to extract:

As the image shows, I start specifying some things I want to extract such as:
  • The invoice date
  • The invoice reference
  • The VAT number
Syntex now allows me to supply a collection of documents to train the model:

I create a new collection of documents for my invoice scenario:


I have some invoices ready to go, so I select those to upload:


Once uploaded I'm ready to analyze!


Once the analysis is complete, we move into the tagging phase.

The tagging phase

As you move your mouse, Syntex allows you to highlight portions of the document by drawing boxes around identified pieces of text. By doing this, you map them to the fields you defined at the beginning - these appear in a picker for selection, with a checkbox indicating whether you've already mapped each item. So I move through the document teaching Syntex which text is the invoice reference, which is the date, which is the supplier name and so on.
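To picture what tagging produces, here's a purely hypothetical representation of a labelled training document - this is NOT Syntex's actual storage format, just a way to visualise the mapping - plus a small helper mirroring the "already mapped" checkbox the field picker shows:

```python
# Hypothetical illustration only - not Syntex's internal format.
# Each label ties a defined field to the text found inside the drawn box,
# with the box given as (x, y, width, height) in page coordinates.
labelled_document = {
    "file": "invoice-001.pdf",  # example file name
    "labels": [
        {"field": "Invoice Reference", "text": "INV-2021-0042", "box": (410, 95, 120, 18)},
        {"field": "Invoice Date", "text": "13 May 2021", "box": (410, 120, 110, 18)},
        {"field": "VAT Number", "text": "GB 123 4567 89", "box": (60, 700, 140, 18)},
    ],
}

def fields_still_untagged(doc: dict, defined_fields: list) -> list:
    """Which defined fields have no tag yet - the picker's checkbox, in code."""
    tagged = {label["field"] for label in doc["labels"]}
    return [f for f in defined_fields if f not in tagged]
```

So if the model also defines a "Supplier Name" field, `fields_still_untagged` would report it as the one still left to tag in this document.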




As you can see, Syntex allows me to pick something as granular as an individual word or even character, or expand to pick a phrase or string of characters. Items with a green border are already tagged:

Tables can also be tagged in this way:
Once I'm done tagging I'm presented with a summary of the model, with a list of the fields I've defined:




We're now ready to move into the training phase.

The training phase

We start by hitting the Train button:



Once the model has been trained, you can either run a quick test against a new document (not one used for training) or go ahead and publish it to your SharePoint document library:



Let's go ahead and publish the model. Once I have a published version, any subsequent changes will create a draft - this allows me to test things out (and get them wrong) whilst not disrupting the extraction that's already in place.

Once a model is published, we can go ahead and use it:


This makes the model available for use in a Power Automate Flow, and the person using it will need to consent to the connections being used:


The resulting Flow looks like this:


If you're interested in the mechanics, the piece that does the extraction is this - the "Predict" action for AI Builder which links to the model we just created:
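In outline, the generated Flow does three things: it fires when a file arrives in the library, sends the file to the model via the Predict action, and writes the extracted values back to library columns. A rough sketch of that logic - every function here is a hypothetical stand-in for a Flow action, not a real API:

```python
# Hypothetical sketch of the generated Flow's logic - the function names
# are stand-ins for Flow actions, not a real library.

def extract_fields(predictions: dict, wanted: list) -> dict:
    """Keep only the fields the model was trained for, dropping any
    prediction below a minimum confidence (threshold chosen arbitrarily
    here for illustration)."""
    return {
        name: p["value"]
        for name, p in predictions.items()
        if name in wanted and p.get("confidence", 0) >= 0.5
    }

def on_file_created(file_bytes: bytes, update_columns, predict):
    """Mirror of the Flow: predict, then write values to the library."""
    predictions = predict(file_bytes)  # the AI Builder "Predict" step
    values = extract_fields(
        predictions, ["Invoice Reference", "Invoice Date", "VAT Number"]
    )
    update_columns(values)  # the SharePoint "update file properties" step
```

The interesting part is that none of this is authored by hand - Syntex generates the Flow, and the Predict action carries the link to the trained model.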


The results

So let's go back to the invoice format we are using:




When this file is uploaded to SharePoint, initially it's just any old document:


..but then after a couple of minutes the document is correctly identified and classified as a "COB Invoice" and the values I trained the model for are extracted:


Excellent. Now I can drag in many old invoices and have them properly classified and summarised:


..and after a couple of minutes:



Conclusion


Syntex is hugely powerful in automatically unlocking critical data from documents - that data no longer needs to stay buried inside them. At the beginning of this series, we discussed how the best research suggests knowledge workers spend 20-30% of their time just searching for information or expertise, and many of us would recognise that having to open many documents to check their contents can contribute to this. As above, I can build SharePoint document library views so that information is readily accessible, or the view is sorted, filtered or grouped according to extracted information.

These benefits go far beyond search and views though. Having my documents correctly identified means that I can apply security and compliance policies to them, for example a conditional access policy which means employees can't print or download sensitive contracts from an unmanaged device, or a retention policy that means a Master Services Agreement is retained for 6 years. Syntex can drive these approaches so that policies are applied by the AI recognising the document, and this can work across documents of wildly varying formats so long as there's some consistency that a document understanding rule can be applied to.

Being able to automatically extract information also means I can build process automation around my documents, for example if something comes in for a certain region or above a certain value, I can route approval processes or notifications accordingly. There are many possibilities here alone. 

Ultimately it comes down to classification and extraction, and there are so many possible use cases around CVs, proposals, statements of work, RFPs, employee contracts, invoices, sales/purchase orders,  service agreements, HR policies and just about any other document type you can think of. This is democratised AI in action, and it's great to have it so accessible in SharePoint.   

Thursday, 25 March 2021

Ignite 2021 announcements summary - Teams, Power Platform, Azure, Security and Compliance

Microsoft announced a lot of changes and enhancements to their cloud products this month at Ignite, their major conference. Keeping up with the firehose of constant change is difficult at the best of times, but the flurry of announcements at these events - not to mention minor updates masquerading as major announcements - only makes it more difficult. You'd be forgiven for missing some key developments frankly. Whilst the "Ignite Book of News" that Microsoft create is useful (link at the end), I find myself needing a more concise summary - something I can reference when talking to my team or with clients. All of which leads me to my "Ignite on a slide" summaries that I'm sharing here.

In this post you'll find slides as images, and a downloadable deck which combines all of them - I cover the following technologies:

  • Teams
  • Power Platform
  • Azure
  • Microsoft Security & Compliance

Feel free to re-use or share.

Microsoft Teams



Microsoft Power Platform



Azure




Security & Compliance



Of course, my summaries are somewhat subjective and you might feel there's something I've missed - but hopefully they're useful somehow.

Download the combined deck

Summary

When you need more detail on any of these announcements (and you will), I highly recommend using Microsoft's published "Book of News":

Link:  Ignite Book of News

Sunday, 28 February 2021

SharePoint Syntex AI - tips for choosing between document understanding and form processing models

SharePoint Syntex, the AI-powered document understanding capability of Office 365 which was previously part of Project Cortex (we'll come to that in a second), provides two approaches to understanding your content: document understanding and form processing. I talked about document understanding in two previous articles:

However, document understanding is only 50% of the Syntex capability. Form processing is more suited to very structured documents such as invoices, receipts and order forms - but there's a lot more to the decision than that. On the surface, many of the document AI processing scenarios you might consider could use either approach - but whilst there's huge overlap, the two models have some significant differences in capability, licensing and how they are applied.

Choosing the right approach can be confusing at first, so that's the focus of this article. But first, did you know that the "Project Cortex" label is disappearing?

Microsoft's Viva - branding and naming changes

With the advent of Microsoft Viva, the "topics" part of Project Cortex has become "Viva Topics" and is part of that product set. The retirement of the Cortex brand label also means that Syntex becomes its own thing. Indeed, the Syntex add-on for Office 365 didn't use the Cortex label once launched - so in the Microsoft 365 admin center you'll see:


So a quick summary of the original naming compared to the current naming is:

With that out of the way, let's get back to form processing. 

Form processing vs. document understanding - what to use when?

So how does form processing differ from document understanding? When would you use each one? Microsoft have a fairly useful article at Difference between document understanding and form processing models, however some of the biggest differences lie in how the AI models are trained and what they're capable of, something the article doesn't really cover. I think things can be clearer still - here are the major differences as I see them:


Best suited for
  • Document understanding - unstructured or less structured content; documents can be written in different ways
  • Form processing - highly structured content, based on a specific format with high consistency

Capabilities
  • Document understanding - classify your documents/identify content types (e.g. to drive compliance policies); extract content
  • Form processing - extract content; extract table content

Summary of AI model training
  • Document understanding - machine teaching based on flexible rules for classification and extraction (e.g. proximity rules such as "X should be within 50 characters of Y")
  • Form processing - machine teaching based on well-defined locations within the document

Underlying technology
  • Document understanding - native to SharePoint Syntex
  • Form processing - Power Apps AI Builder (form processing)

Technical implementation
  • Document understanding - model is applied to the SharePoint library through model settings
  • Form processing - a Power Automate Flow is created and associated with the originating library, but currently cannot be applied to different libraries

How applied across your tenant
  • Document understanding - create the model once, apply to multiple SharePoint libraries (one by one)
  • Form processing - the created model is tied to the originating document library (today)

Licensing
  • Document understanding - Syntex license only: £3.40 or $5 per user per month (see "Who needs a Syntex license?" later)
  • Form processing - Syntex license + AI Builder credits. Each user needs a Syntex license, and orgs with 300+ Syntex licenses receive a one-off bundled allowance of 1m AI Builder credits. See the AI Builder calculator for examples of how many credits different operations consume - in short, 1m credits is extremely generous even for large organizations. As a guideline, if you process 5000 documents per month that would be just 4 credits per month. To purchase, 1 unit of AI Builder credit costs £377 or $500

Supported file types
  • Document understanding - all Office file types, .eml, JPG, PNG, PDF, RTF, TIFF, txt
  • Form processing - JPG, PNG or PDF only

Key limitations
  • Document understanding - Office files are truncated at 64k characters; OCR-scanned files (PDF, image or TIFF) are limited to 20 pages
  • Form processing - tables must be simple (no nested tables or merged cells); signatures, checkboxes and radio buttons cannot be extracted; max 500 pages


Let's dig into more detail on a couple of these points.


Differing approaches to machine teaching 

I mention in the table above the differences in how the AI models are trained. Perhaps the images below explain it best.

In form processing, I'm very precisely teaching Syntex where to find a previously-specified element of the document I want to extract - in this case the invoice reference:


In document understanding, I can do the same thing - however, I can also add some flexibility by defining "explanations" (essentially rules) which provide some extra context about variations in the document format that the AI might need to deal with:

I can also teach the machine how specific parts of the document relate to each other with a proximity rule. In the explanation below, I'm saying "first find the 'Fees and Payment' phrase, then find 'Total' which is more than 20 tokens away but less than 100. Once there, find the thing that looks like a GBP currency value which is VERY close, in fact less than 10 tokens away":
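To make the token-distance idea concrete, here's my own small illustration of how a rule of that shape could be evaluated - the real Syntex engine's tokenisation and matching are of course more sophisticated than this sketch:

```python
import re
from typing import Optional

def token_distance(text: str, phrase_a: str, phrase_b: str) -> Optional[int]:
    """Count tokens between the end of phrase_a and the first occurrence
    of phrase_b after it, or None if either phrase is missing.
    Illustration only - a crude whitespace tokeniser."""
    lower = text.lower()
    pos_a = lower.find(phrase_a.lower())
    if pos_a == -1:
        return None
    pos_b = lower.find(phrase_b.lower(), pos_a + len(phrase_a))
    if pos_b == -1:
        return None
    between = text[pos_a + len(phrase_a):pos_b]
    return len(re.findall(r"\S+", between))

def proximity_rule_matches(text: str, first: str, second: str,
                           min_tokens: int, max_tokens: int) -> bool:
    """True if `second` appears between min_tokens and max_tokens after
    `first` - the shape of the rule described above."""
    d = token_distance(text, first, second)
    return d is not None and min_tokens <= d <= max_tokens
```

For example, in a snippet like "Fees and Payment. The fees for this engagement are set out here. Total: £25,000", a rule requiring 'Total' within 1-20 tokens of 'Fees and Payment' would match.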

So, these differences can rule in or out one model depending on what your documents look like.

Considerations when applying across your tenant

The current limitation of a form processing model being tied to the SharePoint library where it was created is an important factor. This means that, as of today, there's no real pathway to using one of these models across your Microsoft 365 environment - so if your invoices/order forms/receipts get stored in many different SharePoint sites or document libraries you have a lot of work on your hands to replicate the solution.

This is somewhat surprising since the implementation is a Power Automate Flow. Conceivably, you could copy and paste Flow actions to speed things up - however, I note that simply trying to repoint the Flow to another SharePoint document library (by updating references) currently fails, perhaps indicating that there are some internal references which become out of sync. Indeed, the Microsoft documentation states:

Form processing models can currently only be applied to the SharePoint document library from which you created them. This allows licensed users with access to the site to create a form processing model. Note that an admin needs to enable form processing on a SharePoint document library for it to be available to licensed users.

Hopefully Syntex form processing models become much more reusable and portable in the future. 

Other benefits with SharePoint Syntex licensing

Notably, Microsoft provide some other capabilities as part of the Syntex license. These are badged as "premium" items and include:

  • Term store analytics
    • Insights on how tags are being applied to your content - term store operations, open and closed term sets, terms without synonyms and more 
    • See Term store reports | Microsoft Docs
  • Content type push to hub
    • More control over where content types get applied in your environment. Being able to push from the central term store to a hub can become the first part in a chain, where the second part is the existing "push from hub to associated sites" capability. In the end, you get your content types to where they need to be without PowerShell or other roll-your-own approaches
    • See Push content type to hub
  • Import using SKOS format

Who needs a Syntex license?

At £3.40/$5 per user per month for the Syntex add-on, understanding exactly who in the organization requires a Syntex license becomes critical. The SharePoint Syntex FAQ states the following, though the highlighting is mine:

Anyone using, consuming, or otherwise benefiting from SharePoint Syntex capabilities requires a license. This includes the following scenarios:

  • Access a Content Center
  • Create a document understanding model in a Content Center
  • Upload content to a library where a document understanding model is associated (whether in a Content Center or elsewhere)
  • Manually execute a document understanding model
  • View a library where a document understanding model is associated
  • Create a form processing model via the entry point in a SharePoint library
  • Upload content to a library where a form processing model is associated

In summary, the licensing requirements are certainly pervasive - even viewing a SharePoint library where a Syntex model is used requires a license.

Conclusions

An organization's ability to automate processes around documents, improve findability and extract important corporate knowledge from files will be important factors in agility and effectiveness over the next few years. You can find the AI that supports this in a few different places in the Microsoft stack (and the wider cloud market), but Syntex is the easy-to-consume technology that brings this directly to where your documents are in Office 365 - SharePoint and Teams. With a little experimentation and persistence, a non-technical business user can build powerful automations and effective search tools. 

Syntex licensing means that it's likely to be used in specific use cases - perhaps for your organization it could be something around CVs/resumes, RFPs, project documents, invoices, order forms or partner details - but you may not feel it's appropriate to license every user in your organisation, at least not until the value provided is clear.  

Making the right decisions between the two Syntex approaches of document understanding and form processing is vital and this can be a confusing area. Some of the major differences include what the AI models are capable of, how easy they are to use in different places, supported file types and licensing. Hopefully this article is useful in helping you navigate this.

Sunday, 31 January 2021

Slide deck and videos - Building AI into Power Platform solutions

This is a quick post to publish some resources I created recently covering considerations for using AI in the Power Platform. They are from a talk I gave at ESPC 2020 (the European SharePoint, Microsoft 365 and Azure Conference).

For many people getting into building Power Apps and Power Automate solutions, the obvious first choice is the "AI Builder" capability which comes as part of the platform, but as I've discussed elsewhere there are certainly options beyond that - each with different costs and capabilities. 

Topics covered include:

  • What you can expect to pay for AI
  • Different implementation approaches, specifically:
    • Power Apps AI Builder
    • Use of Azure Cognitive Services from code
    • Use of Azure Cognitive Services in a Flow (Power Automate)
  • A real world scenario - building an Incident Reporting Power App which uses AI to alert a human when a serious incident is detected
One area of focus is how Microsoft 365 technologies can easily be strung together to create high value solutions with minimal effort. The slide below depicts how Power Apps, SharePoint, Power Automate, Azure Cognitive Services and Teams each play their part in my demo scenario:
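For the Cognitive Services piece specifically, the Power Automate connector is essentially wrapping the Computer Vision analyze REST call. As a minimal sketch of making that call directly - with placeholder endpoint and key, and shown only to demystify what the no-code step does for you:

```python
import json
import urllib.request

# Placeholders - substitute your own Cognitive Services resource details.
ENDPOINT = "https://contoso.cognitiveservices.azure.com"
API_KEY = "<your-computer-vision-key>"

def build_analyze_url(endpoint: str) -> str:
    """Computer Vision v3.2 analyze endpoint, requesting image tags."""
    return f"{endpoint}/vision/v3.2/analyze?visualFeatures=Tags"

def tag_image(image_bytes: bytes) -> list:
    """Return the tag names Computer Vision detects in a captured photo."""
    req = urllib.request.Request(
        build_analyze_url(ENDPOINT),
        data=image_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        analysis = json.load(resp)
    # Each tag in the response carries a name and a confidence score.
    return [t["name"] for t in analysis.get("tags", [])]
```

In the demo the equivalent call is made by the Flow connector, so no code is needed - but seeing the raw request helps explain what the "image tags" output actually is.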



We can expand on how each technology is used as follows:

The slide deck hopefully has a few more useful slides too - it can be browsed or downloaded below.

Slide deck - Building AI into Power Platform solutions:



Demo videos


Demo 1 - In this video we take a part-built app and integrate it with Power Automate to take the photo captured from the user's phone and store it in Microsoft 365/SharePoint:


Demo 2 - In this video we add AI to the application by integrating image recognition from Azure Cognitive Services. We do this using Power Automate in Microsoft 365, with no code required:



Sunday, 3 January 2021

Trends and predictions - I.T. project priorities for 2021

The holiday season and start of a new year is always a good time to pause, collect thoughts and reflect on what the upcoming 12 months are likely to bring. As usual for this time of year, there's no shortage of crystal ball-gazing and there are many forecasts and prediction lists from industry watchers and analysts. However, for our work at Content+Cloud I find that some specific thinking around the components of Microsoft 365, modern workplace, security and existing in the Microsoft partner ecosystem is helpful. Whilst these aren't our only service lines by any means, a significant slice of our projects revolve around these strands - and so my list has a sprinkling of these Microsoft-specifics mixed with my general expectations for the types of organisation we work with.


A backdrop of accelerated digital transformation

Before we get to my 2021-specific list, it goes without saying of course that the general theme of accelerated digital transformation will persist and the staple projects this brings in the Microsoft space will continue. Generalised examples for us include:

  • Adoption of Microsoft 365, including SharePoint and Teams migration projects, implementation of Teams Voice etc.
  • Public cloud adoption (Azure) - either led by datacenter migrations, Windows/SQL end of life, app modernisation or greenfield development
  • Desktop refresh/Managed Desktop implementation 
  • Digital Workplace implementation
  • Cybersecurity projects
  • Cloud operating model and service design

Given that many organisations are still on the cloud modernisation journey and the more mature see digital transformation as an ongoing process rather than an individual programme or project, all these are relative constants for our era that provide a bedrock of work to address. But moving beyond this segment, let's think about what might be more distinct for this year. 

Predictions and project types for 2021

  • Zero trust and information governance projects - representing continued security prioritisation, especially implementation of technologies which support zero trust (e.g. device and identity management), policy implementation, data classification and governance (especially around external sharing), security awareness training, phishing simulation, dark web monitoring and standards accreditation.
  • "Productivity governance" projects - for organisations invested in Microsoft 365, the need for more robust Teams, SharePoint and Power Platform controls and policies will become commonplace. Organisations failing to address this will suffer risk and complexity from the unmanaged proliferation of workspaces and apps and the lack of a coherent experience.
  • Teams apps and solutions growth - a move towards Teams becoming the "OS for business" with organisations having an increased appetite for bringing more of their tools inside Teams. Companies will reassess their posture for Teams store apps, custom apps and integrations - driven by user demand for enhanced meeting solutions, project and sales solutions, HR and team processes, remote collaboration apps and integrated calling (i.e. Teams Voice). New custom applications will be surfaced in Teams in many cases.
  • Hybrid working excellence - indeed, some companies will adopt a "remote first" approach when considering their tools and experiences. In both cases, focal points will include:
    • Digital Workplace - this will be a priority for companies who have not yet addressed their gaps, and ready-to-go solutions truly integrated with Microsoft 365 (like our Fresh product) will provide the best overall outcomes
    • Remote collaboration - co-authoring of whiteboards, mind maps, process flows etc. Notably, Microsoft need to evolve their Whiteboard offering to compete with Miro, Lucidchart/Lucidspark, XMind, draft.io and others here.
    • Employee experience - in particular employee onboarding, communications and engagement, knowledge sharing and communities of practice.

      For most this will be Digital Workplace-oriented (see above). However, others with specific needs may look to improve their communications ability in different ways (e.g. the rise of the "employee app" for retail and field-based organisations or those with a contingent workforce). 

      For others, the focus may be on modernisation of project or CRM tools.

    • A push on "working out loud" culture - regular narration of work on an internal social tool such as Yammer, publishing team news regularly (especially short form), early sharing of plans and work, lunch and learn sessions etc.
    • Virtual events and training - building the muscle and technology support for virtual town hall or leadership connection events (internally-facing) and webinars and mini-conferences (externally-facing). Teams Live Events may be the default for most organisations invested in Microsoft 365, but Microsoft will need to continue to innovate here too.
  • Return to office execution - for many organisations this starts with dusting off plans delayed by later lockdowns, with the implications of the vaccination schedule providing a new lens. The threat of future legal challenges (perhaps from employees suffering long-term Covid health issues), in addition to internal optics, drives the need to prove that the employer duty of care was properly discharged. This means increased record keeping and proper implementation of social distancing - with desk/room/shared space booking apps, health declarations and Covid test result tracking coming to the fore.
  • Process improvement and automation - deeper automation of customer and supplier operations, finance and HR processes, JML (joiners/movers/leavers). Expect to see integration of digital signatures and AI into more processes, and the auto-generation of Office and PDF documents (contracts, invoices and the like) to become more commonplace.

Other trends of note


In broad terms, we'll also see other impacts from movements such as the democratisation of AI and automation, data modernisation and the rise of low code technologies.

Other specific developments mean we can also expect big changes to how documents work in the future. We all use documents yet they have huge limitations (something I discussed recently in Project Cortex - training SharePoint Syntex to read your documents like a human) - however, the near future will bring "dynamic shared content" which can be surfaced and co-authored simultaneously in a document, e-mail message, Teams chat message and/or SharePoint page. This will be powered by Microsoft's Fluid Framework and will underpin a new generation of collaboration experiences from both Microsoft and 3rd party developers.

I believe other trends will also emerge, such as increased adoption of "portable huddle" technology - a BYOD approach to meeting room tech for equipping a shared space which is NOT a city-based corporate office or high-end working hub. This will support in-person local group collaboration in less formal locations for workshops, kick-offs, team meetings and so on, whilst providing a good audio visual experience and allowing integration with Teams and collaboration tools. 


Summary


These initiatives and project types represent a focusing of the lens for ambitious organisations looking to optimise organisational effectiveness in 2021. Aside from my list above and other emerging trends, other themes such as cost reduction are certainly not going to go away - projects around asset management and cloud governance/cost optimisation in particular will persist. 

2021 will be a fascinating year for I.T. as we emerge from a tumultuous period - but transformation opportunities abound for those able to seize them! 


P.S. If you think I missed something big in the Microsoft space or my views don't chime with yours, let's discuss on Twitter or LinkedIn


Tuesday, 29 December 2020

Project Cortex - training SharePoint Syntex to read your documents like a human - part 2 (entity extractors)

In the previous article we looked at how to get started with SharePoint Syntex, covering in particular the initial steps of creating a document understanding model. In this article we'll look at how Syntex can extract content from your documents - allowing you to unlock "golden" information so people don't have to open 10 documents to find what they're looking for. Before we get into things, remember that a document understanding model can have two elements:

  • A classifier - this allows Syntex to identify what type of document it is (e.g. a "C+C Statement of Work" in the example I'm using)
  • An entity extractor - unsurprisingly, this allows Syntex to extract information once trained

We'll focus on the entity extractor today, and this is the fun part. If you remember our scenario from the last article, I'm extracting the total value from each Statement of Work document I have in Office 365. Here's what that looks like - it's the third highlighted rectangle here:

If you remember, creating both a classifier and an extractor uses this process:
Syntex needs some training files to use while we're developing the AI model, but in my case I added these last time, when I created the model initially and defined the classifier. As you might imagine, these are some test Statement of Work documents with one or two others thrown in - the "others" are used to train Syntex on "negative" cases. These go into a special "Training Files" library within the Content Center, and I'll use those same files for the extractor.

Implementing an entity extractor in the AI model

The first step is to head back to the Content Center and find the model you're adding the extractor to:

Once in the model, choose the "Create and train extractors" action:

Next, name your extractor and specify if you want the data to be extracted to a new column on the SharePoint library (and the data type if so) - usually you do. Since I'm extracting the total value from each Statement of Work, the name I use is "Engagement value":

We're then taken into the "Label" tab, the first step of three when defining a classifier or extractor. 

Creating the extractor - labelling step

 
Accuracy requires labelling and "explanations"
When labelling your files for an extractor, you are teaching Syntex where the value is in your sample files. But as we'll see, simply showing Syntex where it is in a couple of files isn't enough. We need to create "explanations" too - the AI engine uses both pieces of info.

Here, we are dealing with the labelling step.

In the labelling tool (where all formatting is removed from the document), I find the costs table which is present in all of our Statements of Work and I highlight the value from the total row:

I then hit the "Next file" button and repeat for the next document in the training files library: 

Once I've labelled at least five files, I move to the "Train" tab.

Creating the extractor - explanations step

For the training part of the process, we create one or more explanations to help guide the AI further. When we created explanations for the classifier, we were providing Syntex with patterns to help identify and classify the document. For the extractor, we do something similar but here we are providing patterns to guide Syntex to the content we are trying to extract.

Explanations can be created from scratch or from a template:

Templates already exist in the system for common pieces of info you may want to pull out of documents - for example, dates, numbers, phone numbers, addresses and so on:

For the sake of learning I'll create my explanations from scratch, even though the first one is actually a currency value and a template exists for that. I give it a name, choose the Pattern list type and provide the variants to account for how the engagement value may be written in my documents (different number formats):
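To get a feel for what the variants in a pattern list need to cover, it can help to sketch them as a regular expression. This is purely illustrative - Syntex takes literal example strings rather than regexes, and the variant values below are hypothetical - but checking sample values against a pattern like this is a quick way to sanity-check your coverage of the different number formats:

```python
import re

# Hypothetical ways an engagement value might be written in a SOW document
variants = ["£10,000.00", "£10,000", "£10000.00", "£10000"]

# One regex covering comma-grouped and plain digit forms, with optional pence
gbp_value = re.compile(r"£\d{1,3}(?:,\d{3})*(?:\.\d{2})?|£\d+(?:\.\d{2})?")

for v in variants:
    print(v, bool(gbp_value.fullmatch(v)))
```

If any variant you expect in real documents fails the check, that's a format you'd want to add to the pattern list.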

I then save this explanation and create another one. This time I'm helping the AI find the overall section within our SOW documents which the costs table can be found in - I'm simply looking for the title of that section, "Fees and Payment":

I create one more to find the phrase "Total".

Now that I have all of those, I combine them so that I can essentially say "first, please find the phrase 'Fees and Payment', then 'Total', THEN the thing that looks like a GBP currency value". I do this by creating a new explanation of type "Proximity" - and specifying how far apart each element is. Syntex uses the concept of tokens to specify proximity, and my resulting explanation looks like this:

More accurately, I'm saying "first find the 'Fees and Payment' phrase, then find 'Total', which is more than 20 tokens away but less than 100. Once there, find the thing that looks like a GBP currency value, which is VERY close - in fact less than 10 tokens away."

As you can imagine, tuning the tokens in a proximity explanation helps the accuracy of the AI and reduces the chances of Syntex being unable to find your content. My final set of explanations looks like this - it's the three phrase or pattern explanations AND the proximity explanation which combines the others:
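Syntex's internals aren't public, but the proximity logic described above can be sketched in a few lines of Python to make the token distances concrete. The tokenizer, sample text and helper names here are all hypothetical - this is a conceptual illustration of "find 'Fees and Payment', then 'Total' 20-100 tokens later, then a GBP value within 10 tokens", not how Syntex actually works:

```python
import re

# Assumed GBP formats - see the pattern list discussion above
CURRENCY = re.compile(r"£[\d,]+(?:\.\d{2})?")

def find_phrase(tokens, phrase, start=0, end=None):
    """Return the index of the first token of `phrase` within [start, end)."""
    words = phrase.split()
    end = len(tokens) if end is None else min(end, len(tokens))
    for i in range(start, end - len(words) + 1):
        if tokens[i:i + len(words)] == words:
            return i
    return None

def extract_engagement_value(text):
    tokens = text.split()  # naive whitespace tokenizer
    heading = find_phrase(tokens, "Fees and Payment")
    if heading is None:
        return None
    # 'Total' must sit 20-100 tokens beyond the heading
    total = find_phrase(tokens, "Total", start=heading + 20, end=heading + 100)
    if total is None:
        return None
    # the currency value must be within 10 tokens of 'Total'
    for token in tokens[total + 1 : total + 11]:
        if CURRENCY.fullmatch(token):
            return token
    return None

sow = ("intro Fees and Payment section " + "filler " * 30 +
       "Total engagement value £12,500.00 end")
print(extract_engagement_value(sow))  # £12,500.00
```

Widening the windows makes a match more likely but also raises the chance of grabbing the wrong value; tightening them does the reverse - which is exactly the tuning trade-off you face with the real proximity explanation.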

Creating the extractor - training/testing step

I'm now ready to train and test. Similar to when I did this for the classifier, I select some training files which haven't been used in labelling (including one document that isn't a Statement of Work):



The "Prediction" column then tells me what Syntex predicts would be the extracted text for each document. Success! This looks good:


That's almost a 100% success rate - but you might notice that the model failed to extract content from one SOW document, and indeed Syntex tells me this:


 
Upon further inspection, this particular document seems to have a structure different to what I'm expecting - specifically, I find that the author has used a different heading for this section of the document!


So at least I understand why this is happening - I can now tweak my explanations if this is an expected case, or politely remind the project manager that they should be following our standard structure! Either way, there's a path to resolving this. 

I now finish the process by clicking on the "Exit training" button:


Seeing results - applying the model to document libraries

Our work is now done! We have a completed AI model and we can apply it to document libraries around the Microsoft 365 tenant:

A Syntex AI model does need to be applied to libraries individually, but in most cases your documents of a certain type may not be distributed that widely anyway. In the future, we can expect APIs and provisioning mechanisms to manage this at scale.

Once the model has been applied, Syntex extracts the content I trained it to - meaning I don't need to open each individual document:

Summary

We've now seen the process of creating a document understanding model in SharePoint Syntex - something that will allow us to recognise the document AND extract content from it. We can take this further too. Instead of just extracting a single piece of information (e.g. the value from a Statement of Work) we can, of course, extract multiple pieces in the same extractor.

Overall, these capabilities of Syntex provide a great leap forward in terms of how information can be found. High value information no longer needs to be buried inside documents, meaning that employees either do not see it or are forced to open many individual documents to find it. We can create mini-databases and tools from content that was previously locked away - including capabilities which provide sorting, filtering and powerful search experiences. To the future!