Natural Language Processing for Business Development

Purpose

The purpose of this paper is to demonstrate the application of Natural Language Processing to support automated document review for business development activities. This example analyzes the Defense Budget Overview for the FY 2022 Budget Request. Initial analysis demonstrates that Natural Language Processing can quantify meaning and identify contextual topics. This provides a useful tool for consistent business development review of documents.

Natural Language Processing (NLP) is a data science process to analyze documents for measurable findings. This can include simple word frequency analysis as well as contextual meaning and sentiment. NLP continues to advance in support of liberal arts type work (e.g., books, speeches). In recent research on free-form maintenance logs for aircraft, we demonstrated some innovative methods applying NLP to gain operational-level insight.

For Business Development, NLP cannot replace human review entirely, but it can help identify opportunities by quantifying some of the information not available in budget tables. Narratives in DoD-staffed documents are carefully coordinated throughout the organization to concisely communicate purpose. It is a slow, methodical (often painful) process to reach DoD organizational agreement on the words used in these documents. Knowing this, we can use NLP to augment our budget tables and pie charts to get a more insightful picture of budgetary intent and program importance.

Data Science Approach

To apply NLP to the budget, we will follow a standard process:

  • Load the document(s) into a corpus.
  • Tokenize the corpus to store words in context maintaining position.
  • Create a document frequency matrix to analyze non-positional features (e.g., bag of words).
  • Apply topic modeling with machine learning.

This process converts the documents into a form that lets us ask questions such as: “In what context and with what prevalence is engineering part of this budget?” or “How important is artificial intelligence?”
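The first three steps can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the tooling used to produce the results in this paper. The corpus text, the `tokenize` helper and the variable names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in text; a real run would load the full budget document(s).
corpus = {
    "fy2022_excerpt": "The budget invests in artificial intelligence. "
                      "Engineering and artificial intelligence support modernization.",
}

def tokenize(text):
    """Lowercase the text and split it into word tokens, preserving order."""
    return re.findall(r"[a-z0-9]+(?:['’][a-z]+)?", text.lower())

# Tokens keep their position, so questions about context remain answerable.
tokens = {name: tokenize(text) for name, text in corpus.items()}

# The bag-of-words view drops position and keeps only counts per document.
doc_feature_counts = {name: Counter(toks) for name, toks in tokens.items()}

print(tokens["fy2022_excerpt"][:6])
print(doc_feature_counts["fy2022_excerpt"]["artificial"])
```

Keeping both the ordered tokens and the per-document counts mirrors the split in the list above: positional questions use the tokens, frequency questions use the counts.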

Contextual Key Words and Phrases

After we have loaded the DoD budget request into a corpus and created tokens, we can examine words in context. Below is the keyword “artificial” shown in context throughout the budget request.

Keyword in context
pre | keyword | post
advantages to our forces including | artificial | intelligence hypersonic technology cyber and quantum
budget also funds the joint | artificial | intelligence center’s jaic efforts for small
and emerging fields such as | artificial | intelligence machine learning quantum science neuroscience
advanced capability enablers microelectronics hypersonics | artificial | intelligence and 5g being the best
hypersonic cruise missile capability billion | artificial | intelligence ai reflecting the rapidly growing
and emerging technologies such as | artificial | intelligence ai to reduce sustainment costs
rotary wing aircraft invests in | artificial | intelligence to increase the speed of
that support lethality digital and | artificial | intelligence critical for essential modernization readiness
and russia new technologies like | artificial | intelligence autonomy and robotics will change
mellon university are aggressively pursuing | artificial | intelligence technologies the army software factory
technologies through targeted investments in | artificial | intelligence cyber weapons unmanned technologies directed
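A keyword-in-context listing like the one above can be approximated with a short function. This is a sketch only: the `kwic` helper and its `window` parameter are illustrative names, and the sample tokens are copied from one of the rows above.

```python
def kwic(tokens, keyword, window=5):
    """Return (pre, keyword, post) tuples for every occurrence of keyword."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            pre = " ".join(tokens[max(0, i - window):i])
            post = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((pre, tok, post))
    return rows

# Tokens copied from one of the keyword-in-context rows above.
sample = ("and emerging fields such as artificial intelligence "
          "machine learning quantum science neuroscience").split()

for pre, key, post in kwic(sample, "artificial"):
    print(f"{pre} | {key} | {post}")
```

The window size controls how much surrounding text is shown on each side of the keyword.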

Word and Phrase Frequency

We can examine the overall frequency of words across the budget. Below is one type of abstraction, called a word cloud, that provides an initial view of the relative frequency of words. This view is quantitative but abstract: word size reflects frequency without showing exact counts.

Looking more closely, below are the 50 most frequent words. Note: in this example we have removed stop words (“the”, “I”, …).
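Counting frequent words after stop word removal might look like the sketch below. The stop word list here is a tiny assumption for illustration (real lists are much longer), and `top_words` and the sample tokens are hypothetical.

```python
from collections import Counter

# A deliberately tiny stop word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "i", "a", "an", "and", "of", "to", "in", "for", "is"}

def top_words(tokens, n=50):
    """Count tokens that are not stop words and return the n most frequent."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical sample tokens, not actual budget text.
sample_tokens = ("the budget invests in artificial intelligence and the budget "
                 "funds the joint artificial intelligence center").split()

print(top_words(sample_tokens, n=3))
```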

Next we look at n-grams. These are sequences of n adjacent words that capture phrase patterns. Below we select two-word n-grams (bigrams).
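As a rough illustration, n-grams can be built by sliding a window of n tokens across the token list. The `ngrams` helper and the sample tokens below are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Slide a window of n tokens across the list and join each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample tokens, not actual budget text.
sample_tokens = ("artificial intelligence and machine learning "
                 "and artificial intelligence").split()

bigram_counts = Counter(ngrams(sample_tokens, 2))
print(bigram_counts.most_common(1))
```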

Statistical analysis

We may be interested in the distribution of words or phrases throughout the document. This can help us understand which themes run consistently through the entire document. Below we look at how some keywords are distributed across the document. The token index is the relative location in the budget.
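One simple way to compute such a distribution is to record where each keyword occurs and scale the position to the length of the document. The sketch below assumes a 0-to-1 scale for the relative token index; the function name and sample tokens are hypothetical.

```python
def relative_positions(tokens, keyword):
    """Positions of keyword scaled to 0..1 (0 = first token, 1 = last token)."""
    last = max(len(tokens) - 1, 1)
    return [i / last for i, tok in enumerate(tokens) if tok == keyword]

# Hypothetical sample tokens, not actual budget text.
sample_tokens = "cyber budget space budget cyber".split()

print(relative_positions(sample_tokens, "cyber"))   # → [0.0, 1.0]
print(relative_positions(sample_tokens, "space"))   # → [0.5]
```

Plotting these values on a single axis per keyword shows whether a term is concentrated in one section or spread throughout.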

Classification into Topics

Semantics and sentiments in documents provide insight beyond counting words and phrases. Topic modeling is a generative machine learning approach to find the most important topics in the budget. Below we apply the [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) method.

Reading the words in each topic gives a sense of the most important budget areas. There are many ways to expand topic modeling to gather more insight.
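To make the idea concrete, here is a toy Latent Dirichlet allocation fit using collapsed Gibbs sampling in pure Python. It is a teaching sketch, not the implementation behind the results in this paper: the function name, the hyperparameters `alpha` and `beta`, the iteration count and the four tiny documents are all assumptions. A production analysis would use an established library.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]            # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token, then resample its topic given all the others.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Hypothetical mini-corpus, not actual budget text.
docs = [
    "missile defense missile hypersonic defense".split(),
    "hypersonic missile defense missile".split(),
    "health care family housing health".split(),
    "family housing care health family".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"topic {k}: {top}")
```

The top words per topic are what an analyst reads to label each topic. Results depend on the random seed and the number of topics chosen.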

Topics can also be customized to find specific themes. For example, a semi-supervised learning approach could search for topics defined in a custom file. Below is an example dictionary, read from a YAML text file, that will seed our machine learning. The process then looks for relevant topics based on these keywords.

## Dictionary object with 4 key entries.
## - [modeling_sim]:
##   - modeling, simulation
## - [engineering]:
##   - engineering, build, construct
## - [mission_operations]:
##   - operations, mission
## - [AI]:
##   - artificial intelligence, science, ai, research

Below is the result of the custom semi-supervised topic search. The budget is segmented into paragraphs, and we examine the percentage of the entire budget’s text in which our topics of interest are found:
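A much simpler stand-in for this calculation is plain dictionary matching: count the share of paragraphs that contain at least one seed term for each topic. The sketch below does that, which is not the same as a seeded topic model and measures the share of paragraphs rather than of text. The seed terms are copied from the dictionary above, while the function name and sample paragraphs are hypothetical.

```python
import re

# Seed terms copied from the dictionary above; multi-word seeds are matched as phrases.
SEEDS = {
    "modeling_sim": ["modeling", "simulation"],
    "engineering": ["engineering", "build", "construct"],
    "mission_operations": ["operations", "mission"],
    "AI": ["artificial intelligence", "science", "ai", "research"],
}

def topic_share(paragraphs, seeds):
    """Percent of paragraphs containing at least one seed term, per topic."""
    shares = {}
    for topic, terms in seeds.items():
        hits = 0
        for para in paragraphs:
            # Normalize to space-padded lowercase words so only whole words match.
            padded = " " + " ".join(re.findall(r"[a-z0-9]+", para.lower())) + " "
            if any(f" {term} " in padded for term in terms):
                hits += 1
        shares[topic] = 100.0 * hits / len(paragraphs)
    return shares

# Hypothetical paragraphs, not actual budget text.
paragraphs = [
    "The budget funds artificial intelligence research.",
    "Engineering teams build new ships.",
    "Mission operations continue overseas.",
    "Military pay rises this year.",
]

for topic, pct in topic_share(paragraphs, SEEDS).items():
    print(f"{topic}: {pct:.1f}%")
```

Whole-word matching matters here: without it, the seed "ai" would match inside unrelated words.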

Comparing Budgets from Previous Year

Often it is helpful to compare documents. In our case we may be interested in relative changes that reflect new priorities. Below we compare the FY 2022 budget request to the FY 2021 budget request to measure changes in keywords. This provides quantified evidence of changes in priorities from year to year. Similar comparisons could be made between different customers’ budgets or plans.
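One way to make such a comparison fair across documents of different lengths is to normalize counts, for example to occurrences per 1,000 tokens, before taking the difference. The sketch below assumes that normalization; the function names and the token lists are hypothetical and do not come from either budget request.

```python
from collections import Counter

def rate_per_thousand(tokens, keyword):
    """Occurrences of keyword per 1,000 tokens."""
    return 1000.0 * Counter(tokens)[keyword] / len(tokens)

def keyword_change(old_tokens, new_tokens, keywords):
    """Change in rate per 1,000 tokens from the old document to the new one."""
    return {
        k: rate_per_thousand(new_tokens, k) - rate_per_thousand(old_tokens, k)
        for k in keywords
    }

# Hypothetical token lists standing in for the two budget requests.
fy2021 = "cyber space cyber readiness".split()
fy2022 = "cyber space space climate".split()

for word, delta in keyword_change(fy2021, fy2022, ["cyber", "space", "climate"]).items():
    print(f"{word}: {delta:+.1f} per 1,000 tokens")
```

A positive value indicates the keyword is relatively more prominent in the newer document.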

Summary

This example demonstrates a few of the NLP methods that can augment human review of budget documents. Consistent quantification of written text to help understand customers’ requirements and allocations is a valuable addition to current subjective processes derived from narrative perspectives.
