AWS Offers Machine Learning to Extract Text from Documents
Amazon Web Services wants to make it easier for users to pull data and text from documents and to build applications for the internet of things, and it has released several services that officials previewed at its re:Invent show last fall.
The new services aim to give enterprises tools to leverage emerging technologies like machine learning to search and extract information and to take advantage of the fast-growing IoT.
AWS’ Textract service uses machine learning to help companies scan a broad array of layouts, including tables and forms, that traditional optical character recognition (OCR) technology cannot recognize and often reads inaccurately. Textract can identify and extract data and text from columns, rows and forms and understand their context, such as a Social Security number on a tax form or the quantity of a product listed in a warehouse inventory report.
The data and text can then be used in a database and leveraged by software in such areas as accounting, auditing and compliance or used to build intelligent search on large archives of documents, according to AWS. The service can be used to search such formats as PDFs, photo files like JPEGs and PNGs, and scans, and can take the extracted data and content and use it with other AWS database and analytics services, including Elasticsearch, DynamoDB and Athena. It also can be integrated with such AWS machine learning services as Comprehend, Comprehend Medical, Translate and SageMaker.
Textract takes scanned files that are stored in an Amazon S3 bucket, reads them and returns the data as JSON, which is annotated with page numbers, sections, form labels and data types, the company said. Enterprises can then use the data in such applications as smart search indexes, text redaction and regulatory compliance.
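For readers who want to see what that flow might look like in practice, here is a minimal Python sketch. It assumes the boto3 SDK and its Textract client's analyze_document call; the bucket name, file name and region are placeholders, the helper names fetch_blocks and lines_by_page are invented for illustration, and the sample data is hand-written to resemble the annotated JSON described above rather than taken from real Textract output.

```python
from collections import defaultdict


def fetch_blocks(bucket, key, region="us-east-1"):
    """Ask Textract to analyze a file stored in S3 and return its blocks.

    Requires boto3 and valid AWS credentials. The bucket, key and region
    are placeholders supplied by the caller (assumptions, not from AWS).
    """
    import boto3  # imported here so the parsing helper below works offline

    client = boto3.client("textract", region_name=region)
    response = client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    return response["Blocks"]


def lines_by_page(blocks):
    """Group the text of LINE blocks by the page number they carry."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            # Fall back to page 1 when a block has no Page field.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return dict(pages)


# Hand-written sample shaped like a Textract response, not real output.
SAMPLE_BLOCKS = [
    {"BlockType": "PAGE", "Id": "p1", "Page": 1},
    {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Inventory Report"},
    {"BlockType": "LINE", "Id": "l2", "Page": 1, "Text": "Widgets: 40"},
    {"BlockType": "LINE", "Id": "l3", "Page": 2, "Text": "Total items: 40"},
]

if __name__ == "__main__":
    print(lines_by_page(SAMPLE_BLOCKS))
```

In a real deployment the list passed to lines_by_page would come from something like fetch_blocks("my-bucket", "scan.png"), and the grouped text could then feed a search index or a compliance check.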
Enterprises can use Textract without developing custom code or having machine learning experience.
The new service is a big deal for a broad array of industries that still rely heavily on paper, including health care, government and legal, according to Patrick Moorhead, principal analyst with Moor Insights and Strategy.

“This is the first at-scale, turnkey solution that intelligently identifies and converts text to data,” Moorhead told Channel Futures. “It uses [machine learning], not OCR, which means increased precision and intelligence to know where to look on a form.”
Until now, companies primarily had to rely on either manual searches or OCR, both of which are slow and can lead to inaccuracies.
“Almost all text-to-data applications use the age-old OCR algorithms that have been around for decades,” he said. “Textract uses ML to intelligently locate the right position on a document, let’s say on a specific checkbox, and more efficiently converts it into an intelligible output.”
On the IoT front, AWS made available two services, including its IoT Things Graph, a managed service that lets developers create IoT applications visually rather than having to code.
“You can represent your business logic in a flow composed of devices and services,” AWS chief evangelist Jeff Barr wrote in a blog post. “Each web service and each type of device (sensor, camera, display, and so forth) is represented in Things Graph as a model. The models hide …