AWS Offers Machine Learning to Extract Text from Documents

The cloud service provider also rolled out two new managed services for IoT.

Jeffrey Burt

May 31, 2019


Amazon Web Services wants to make it easier for users to pull data and text from documents and to build applications for the internet of things (IoT). To that end, it has released several services that officials previewed at the company’s re:Invent show last fall.

The new services aim to give enterprises tools to leverage emerging technologies like machine learning to search and extract information and to take advantage of the fast-growing IoT.

AWS’ Textract service uses machine learning to help companies scan a broad array of layouts, including tables and forms, that traditional optical character recognition (OCR) technology struggles to recognize and often reads inaccurately. Textract can identify and extract data and text from columns, rows and forms and understand the context, such as a Social Security number on a tax form or the quantity of a product listed in a warehouse inventory report.

The data and text can then be used in a database and leveraged by software in such areas as accounting, auditing and compliance or used to build intelligent search on large archives of documents, according to AWS. The service can be used to search such formats as PDFs, photo files like JPEGs and PNGs, and scans, and can take the extracted data and content and use it with other AWS database and analytics services, including Elasticsearch, DynamoDB and Athena. It also can be integrated with such AWS machine learning services as Comprehend, Comprehend Medical, Translate and SageMaker.

Textract takes scanned files that are stored in an Amazon S3 bucket, reads them and returns the data as JSON, which is annotated with page numbers, sections, form labels and data types, the company said. Enterprises can then use the data in such applications as smart search indexes, text redaction and regulatory compliance.
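To make that flow concrete, here is a minimal Python sketch of reading such a JSON result and grouping the recognized lines of text by page. The helper names (lines_by_page, detect_text_in_s3) and the sample data are illustrative assumptions, not something AWS or the article supplies. The sample is hand-written to resemble the general shape of a Textract response rather than copied from real output. The S3 call assumes the boto3 SDK and configured AWS credentials, and the demo does not call it.

```python
from collections import defaultdict


def lines_by_page(response):
    """Group the text of LINE blocks by page number."""
    pages = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # Single-page results may omit "Page"; treat them as page 1.
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


def detect_text_in_s3(bucket, key):
    """Ask Textract to read a file already stored in S3.

    Assumes boto3 is installed and AWS credentials are configured;
    the demo at the bottom does not call this function.
    """
    import boto3  # imported lazily so the parsing code runs without it

    client = boto3.client("textract")
    return client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )


# Hand-written sample in the general shape of a response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Inventory Report"},
        {"BlockType": "LINE", "Page": 1, "Text": "Widget A  Quantity: 40"},
        {"BlockType": "WORD", "Page": 1, "Text": "Inventory"},
        {"BlockType": "LINE", "Page": 2, "Text": "Widget B  Quantity: 12"},
    ]
}

if __name__ == "__main__":
    for page, lines in sorted(lines_by_page(sample_response).items()):
        print(page, lines)
```

A search index or compliance check could then work from the per-page lines instead of the raw scan.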

Enterprises can use Textract without having to develop custom code or having experience in machine learning.

The new service is a big deal for a broad array of industries that still rely heavily on paper, including health care, government and legal, according to Patrick Moorhead, principal analyst with Moor Insights and Strategy.


“This is the first at-scale, turnkey solution that intelligently identifies and converts text to data,” Moorhead told Channel Futures. “It uses [machine learning], not OCR, which means increased precision and intelligence to know where to look on a form.” 

Until now, companies primarily had to rely on either manual searches or OCR, both of which are slow and can lead to inaccuracies.

“Almost all text-to-data applications use the age-old OCR algorithms that have been around for decades,” he said. “Textract uses ML to intelligently locate the right position on a document, let’s say on a specific checkbox, and more efficiently converts it into an intelligible output.”

On the IoT front, AWS made two services available. One is IoT Things Graph, a managed service that lets developers build IoT applications visually rather than by writing code.

“You can represent your business logic in a flow composed of devices and services,” AWS chief evangelist Jeff Barr wrote in a blog. “Each web service and each type of device (sensor, camera, display, and so forth) is represented in Things Graph as a model. The models hide the implementation details that are peculiar to a particular brand or model of device, and allow you to build flows that can evolve along with your hardware.”

IoT applications require a variety of devices and web services to automate tasks in smart homes, industrial automation and other areas, but a lack of standards means developers need to write a lot of code, according to AWS. IoT Things Graph gives developers a drag-and-drop interface for connecting and coordinating the interactions between devices and services, which means applications can be built more quickly.

“You represent devices and services using prebuilt reusable components, called models, that hide low-level details, such as protocols and interfaces, and are easy to integrate to create sophisticated workflows,” AWS said in a description of the service.
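The idea of a model that hides brand-specific details can be illustrated in a few lines of plain Python. This is only a conceptual sketch, not the Things Graph API or its model format. The Camera interface, the two vendor classes and the motion_flow function are hypothetical names invented for the example.

```python
from typing import List, Protocol


class Camera(Protocol):
    """The 'model': what a flow may assume about any camera."""

    def capture(self) -> str: ...


class VendorACamera:
    """Hypothetical device; its own protocol details stay inside the class."""

    def capture(self) -> str:
        return "vendor-a:image"


class VendorBCamera:
    """A second hypothetical brand exposing the same model."""

    def capture(self) -> str:
        return "vendor-b:image"


def motion_flow(motion_detected: bool, camera: Camera) -> List[str]:
    """A tiny flow: when motion is detected, capture an image and record it."""
    steps: List[str] = []
    if motion_detected:
        steps.append("captured " + camera.capture())
    return steps


if __name__ == "__main__":
    # The same flow runs unchanged when the hardware is swapped.
    print(motion_flow(True, VendorACamera()))
    print(motion_flow(True, VendorBCamera()))
```

Because the flow depends only on the model, replacing one brand of camera with another does not require rewriting the flow.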

IoT Events, the second service, is also fully managed and is designed to make it easier for companies to detect and respond to events from IoT sensors and applications, such as a motion detector triggering lights and security cameras. Companies traditionally had to build custom applications to collect data, apply decision logic to detect an event and then have another application react.

With IoT Events, “you can easily create detector models that represent your devices, their states, and the transitions (driven by sensors and events, both known as inputs) between the states,” Barr wrote. “The models can trigger actions when critical events are detected, allowing you to build robust, highly automated systems. Actions can, for example, send a text message to a service technician or invoke an AWS Lambda function.”
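The states, inputs and actions Barr describes amount to a small state machine. The sketch below shows that idea in plain Python, using the motion-detector scenario mentioned earlier. It is not the IoT Events API or its detector model format. The DetectorModel class, the state names and the action names are all hypothetical.

```python
class DetectorModel:
    """A minimal state machine: inputs drive transitions, transitions fire actions."""

    def __init__(self, initial, transitions):
        self.state = initial
        # Maps (current_state, input_name) -> (next_state, action_or_None).
        self.transitions = transitions
        self.actions_fired = []

    def handle(self, input_name):
        key = (self.state, input_name)
        if key not in self.transitions:
            return self.state  # input does not apply in this state; ignore it
        next_state, action = self.transitions[key]
        self.state = next_state
        if action is not None:
            # A real system might send a text message or invoke a function here.
            self.actions_fired.append(action)
        return self.state


def make_room_detector():
    """Hypothetical model of a room monitored by a motion detector."""
    return DetectorModel(
        "idle",
        {
            ("idle", "motion_detected"): ("occupied", "turn_on_lights_and_cameras"),
            ("occupied", "no_motion_timeout"): ("idle", "turn_off_lights"),
        },
    )


if __name__ == "__main__":
    detector = make_room_detector()
    for event in ["motion_detected", "motion_detected", "no_motion_timeout"]:
        print(event, "->", detector.handle(event))
    print(detector.actions_fired)
```

Repeated motion while the room is already occupied changes nothing, so the action fires once per transition rather than once per sensor reading.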

The IoT services are part of a larger multiyear effort by AWS to build out its cloud capabilities around IoT. In 2015, the company launched AWS IoT Core, a set of services for connected devices, and has been adding services since, including Greengrass and IoT Device Management in 2017 and IoT Device Defender last year. At the re:Invent show, the company introduced a number of planned IoT services, including Things Graph and IoT Events. 
