The Securities and Exchange Commission (SEC) requires publicly traded companies to file an annual report (10-K) that includes an overview of a company’s business and financial condition, as well as audited financial statements. These annual filings include management’s discussion and analysis (MD&A) sections that are primarily text. Based on the idea that fraud and deception contained within an MD&A can be detected, and that such detection eases prosecution for fraud, this paper reports on testing the following hypothesis: “Fraudulent MD&As display higher (a) quantity, (b) expressivity, (c) affect, (d) uncertainty, (e) nonimmediacy, (f) complexity, and less (g) diversity and (h) specificity of language than nonfraudulent MD&As.” The testing draws on “deception theory from [the] communication and psychology literature with linguistic analysis techniques derived from the field of computational linguistics,” specifically techniques previously applied in other domains.
The authors used the SEC’s accounting and auditing enforcement releases (AAERs) to obtain a sample collection of MD&As from fraud cases discovered between 1995 and 2004, and to derive a similar sample collection in which fraud had not been detected. Then, Agent 99, a part-of-speech tagging and text analysis tool, extracted a set of 24 cues. This set is a subset and modification of Zhou et al.’s constructs and variable definitions [1], originally used in other domains. For parsimony, the 24 cues were winnowed to ten. The authors applied several statistical and machine learning techniques to both cue sets to predict whether a particular MD&A was fraudulent or nonfraudulent.
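To make the cue-based approach concrete, the following is a minimal sketch of what extracting a few such linguistic cues from an MD&A text might look like. The specific cues here (word count as a proxy for quantity, type-token ratio for diversity, average word length for complexity) are illustrative assumptions, not the paper’s actual 24-cue set or the Agent 99 tool:

```python
import re

def extract_cues(text):
    """Compute a few illustrative linguistic cues from a text.

    Hypothetical subset for illustration only -- not the paper's
    actual cue definitions or extraction pipeline.
    """
    # Tokenize into lowercase word-like strings
    words = re.findall(r"[a-zA-Z']+", text.lower())
    n = len(words)
    return {
        # "quantity": total number of words
        "quantity": n,
        # "diversity": type-token ratio (unique words / total words)
        "diversity": len(set(words)) / n if n else 0.0,
        # "complexity" proxy: average word length in characters
        "avg_word_len": sum(len(w) for w in words) / n if n else 0.0,
    }

sample = "We believe results may possibly improve, although uncertainty remains."
cues = extract_cues(sample)
```

A vector of such cues per MD&A would then feed a standard classifier (e.g., logistic regression or a decision tree) trained on the fraudulent and nonfraudulent samples.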
This strategy correctly predicted deception or nondeception with an accuracy of about 67 percent, an improvement over the roughly 54 percent accuracy one could expect from human judgment.