Extraction and Evaluation of Statistical Information from Social and Behavioral Science Papers

With substantial and continuing increases in the number of published papers across the scientific literature, development of reliable approaches for automated discovery and assessment of published findings is increasingly urgent. Tools which can extract critical information from scientific papers and metadata can support representation and reasoning over existing findings, and offer insights into replicability, robustness and generalizability of specific claims. In this work, we present a pipeline for the extraction of statistical information (p-values, sample size, number of hypotheses tested) from full-text scientific documents. We validate our approach on 300 papers selected from the social and behavioral science literatures, and suggest directions for next steps.


INTRODUCTION
In parallel with major shifts toward transparency in scientific process and output, advances in information extraction and natural language processing have opened up new avenues for representation and reasoning over the large and growing scientific literature. Ideally, the community's confidence in a published finding is informed by a long and well-understood history of related work as greater scientific context. At present, this context is largely qualitative, gathered through keyword-based searches and investigator-led exploration of an ad hoc sample of similar papers. But looking forward toward the vision of a queryable scholarly record, critical information and metadata can be extracted and aggregated to inform greater context for individual claims.
One critical piece of this context, particularly in hypothesis-driven work, is statistical information reported alongside a claim and associated hypothesis test(s), e.g., t-test, F-test, or chi-squared test. The most commonly reported piece of statistical information is the p-value, or the probability of obtaining a result at least as extreme as the observed result of a statistical hypothesis test, assuming that the null hypothesis is correct [7,11]. In addition to p-values, studies typically report the test statistic and sample size, and may report other descriptive statistics of the dataset.
The work we present here builds on publicly available statistical extraction software (Statcheck [8]). We improve upon this tool, expanding the breadth of statistical tests considered and adding the extraction of sample size and number of hypotheses tested. The tool we have built ingests a scientific article in PDF, converts it to text, tokenizes sentences, and searches the text for specified regular expressions in order to output the p-values, sample sizes and number of hypotheses tested in the paper. We validate our approach on 300 papers selected from the social and behavioral science literature and offer a comparison of our tool to the Statcheck baseline.
This work falls into the broader category of mathematical formula extraction tools. For example, SymbolScraper uses heuristic methods to extract symbol labels and bounding boxes from born-digital PDF files. The output is an XML file containing all symbols and their positions on each page. Although the extracted information is comprehensive, the patterns of statistical expressions become much less explicit, and it is non-trivial to accurately restore those patterns at the symbol level. A learning-based method was proposed in [12], which trained a CRF model to extract in-line mathematical expressions. The method achieved F1 = 89% on a corpus of manually annotated ACL papers. However, the authors used an in-house PDF analysis tool for data preparation, which is not publicly available. Recently, deep learning was applied to develop a mathematical formula extraction tool called ScanSSD [15]. ScanSSD outputs bounding boxes of math equations and images cropped from the input PDF file. However, recognizing text and symbols from the images requires additional OCR tools. In our qualitative assessment, the current model also produces some false positives.

STATISTICAL FEATURES
In hypothesis-driven work, and sometimes in exploratory work, a statistical test is performed to offer evidence supporting or refuting a null hypothesis. Common statistical tests include various t-tests, chi-squared tests, binomial tests, and ANOVA. Each of these outputs a test statistic, which measures how closely observed data matches the distribution expected under the null hypothesis of that test, that is, the assumption that no statistical relationship exists between two sets of observed data and measured phenomena. We focus our attention in this work on three critical statistical features of an empirical study: p-values, sample size and number of hypotheses tested (see Table 1 for examples).

Table 1: Examples of the statistical features considered.
Feature                      Example Text
p-value w/o test statistic   p = 0.01, p < 0.03, p > 0.07
p-value w/ test statistic    t(10) = 1.3, p = 0.01
Sample Size                  N = 100, n = 50
Hypothesis Test              t, z, F

p-value. In testing a null hypothesis H_0 against an alternative hypothesis H_1 based on data x_obs, the p-value is defined as the probability, calculated under the null hypothesis, that a test statistic is as extreme or more extreme than its observed value. The null hypothesis is typically rejected, and the finding declared statistically significant, if the p-value falls below the (current) type I error threshold α = 0.05 [3]. More recently, concerns about process and purpose around p-values have highlighted the critical importance of context in interpreting statistical outcomes through this lens [1,3,19], further motivating the extraction of a richer set of statistical features from scientific documents.
Sample size. The sample size is the size of the observed dataset, or |x_obs|. In the social and behavioral science studies that were the focus of our tool during development, this was in many cases the number of participants in the study. The sample size of a study is critically important as an indicator of the power of a study and confidence in study outcomes (see, e.g., [9]).
Number of hypotheses tested. Understanding the number of hypotheses tested, whether or not they are explicitly described in a paper as such, is central to ongoing conversations about correct use of statistical methods and the direct attention being paid to p-hacking [10,17] and related bad practices. Of particular concern is the use, or lack thereof, of appropriate tools for correction for multiple comparisons, e.g., Bonferroni [5] or false discovery rate (FDR, [18]). Put simply, the greater the number of tests of the same hypothesis, the more likely it is that at least one of them will return a positive finding. Significance must be calibrated accordingly.
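The need for calibration can be illustrated with a short calculation. The following is a minimal sketch, assuming m independent tests each run at level α; independence is a simplifying assumption made here for illustration, and the function names are hypothetical rather than part of the tool described in this paper.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha (independence is assumed)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni correction: reject a null hypothesis only when its p-value
    is at most alpha divided by the number of tests performed."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # With ten tests at alpha = 0.05, a spurious "finding" is far from rare.
    print(round(familywise_error_rate(0.05, 10), 3))
    # Only the smallest p-value survives the corrected threshold of 0.05 / 3.
    print(bonferroni_reject([0.001, 0.02, 0.04]))
```

Under these assumptions, an uncorrected threshold that looks strict for a single test becomes lenient once many tests are run, which is why the number of hypotheses tested is worth extracting alongside the p-values themselves.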

FEATURE EXTRACTION PIPELINE
Our tool is a statistical feature extraction pipeline, which ingests PDF text, preprocesses that text, extracts statistical information using regular expressions, and synthesizes extracted information into meaningful statistical insights. Our pipeline integrates existing software for text extraction from PDF and sentence tokenization, and builds on initial extraction capabilities in Statcheck [8] to expand the statistical tests considered, add output of the sample size both through derivation and explicit extraction, and report the number of hypotheses tested.

Conversion to text
A necessary first step is to convert PDF to text, through encoding and decoding individual characters. Tools widely used to extract text from PDF include PDFBox, XpdfReader, PDFMiner, and PyPDF2. We use XpdfReader because it works well on bulk documents stored in a single folder. The conversion process is imperfect. On some occasions, the tool outputs missing characters, mismatches symbols or fails to extract text at all (see Figure 1). A particularly challenging task is the extraction of tables and figures, a problem of significant study in its own right, e.g., [6,16].
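Bulk conversion of a folder of PDFs can be sketched as follows. This is an illustration only, assuming the `pdftotext` command-line program distributed with Xpdf is installed and on the PATH. The `-layout` option, the folder arguments and the function names are choices made for this sketch and are not taken from the pipeline described here.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Assemble the argument list for one pdftotext call.

    -layout asks pdftotext to preserve the physical page layout as far as it
    can; whether that helps for a given corpus is an empirical question.
    """
    command = ["pdftotext"]
    if keep_layout:
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def convert_folder(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir and return the names of files that failed.

    A conversion counts as failed when pdftotext exits with an error or
    produces no text at all, one of the failure modes noted above.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        result = subprocess.run(
            build_pdftotext_command(pdf_path, txt_path),
            capture_output=True,
            text=True,
        )
        if result.returncode != 0 or not txt_path.exists() or txt_path.stat().st_size == 0:
            failed.append(pdf_path.name)
    return failed
```

Recording the failures rather than silently skipping them matters, since papers whose text cannot be extracted would otherwise look like papers that report no statistics.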

Sentence tokenization
Sentence tokenization is the process of splitting extracted text into individual sentences. For formal documents, the tokenization algorithms built into spaCy, NLTK, and similar libraries perform well, since the tokenizers are trained on corpora of formal English text. Many tokenizers perform less well on documents with extensive use of abbreviations, measurements, and other forms not found in standard written English [13,14]. We use NLTK [4].

Statistical feature extraction
As noted, we build on extraction capabilities initially deployed in the Statcheck tool [8]. Specifically, Statcheck uses regular expressions to find statistical results in the following forms: t(df) = value, p = value; F(df1,df2) = value, p = value; r(df) = value, p = value; χ²(df, N = value) = value, p = value (N is optional, delta G is also included); Z = value, p = value. All regular expressions take into account that test statistics and p-values may be exactly (=) or inexactly (< or >) reported.
We extend these representations as follows. We use regular expressions to extract similar information for the following additional statistical tests: Q-test, Logistic Regression, b-test, d-test and Hazard Ratio. In addition, we build a more extensive list of regular expressions to better capture reporting of the tests described above, which are part of the original Statcheck tool, namely, t-test, F-test, correlation, Chi-square, and Z-test (see Table 2). A listing of regular expressions used to extract p-values reported alongside test statistics, for each of these tests, is given in Table 3.
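The reporting forms listed above translate fairly directly into regular expressions. The following is a minimal sketch for two of them, the t-test and the F-test. The patterns are simplified illustrations written for this example; they are neither Statcheck's own expressions nor the expressions listed in Table 3, and the function name is hypothetical.

```python
import re

UNSIGNED = r"\d*\.?\d+"   # matches 0.05, .05 and 12
CMP = r"[<>=]"            # exact (=) or inexact (<, >) reporting

# t(df) = value, p = value
T_TEST = re.compile(
    rf"\bt\s*\(\s*(?P<df>\d+)\s*\)\s*{CMP}\s*(?P<stat>-?{UNSIGNED})"
    rf"\s*,\s*p\s*(?P<p_cmp>{CMP})\s*(?P<p>{UNSIGNED})"
)

# F(df1, df2) = value, p = value
F_TEST = re.compile(
    rf"\bF\s*\(\s*(?P<df1>\d+)\s*,\s*(?P<df2>\d+)\s*\)\s*{CMP}\s*(?P<stat>-?{UNSIGNED})"
    rf"\s*,\s*p\s*(?P<p_cmp>{CMP})\s*(?P<p>{UNSIGNED})"
)


def find_reported_tests(text):
    """Return the t- and F-test results found in text, in reading order."""
    found = []
    for match in T_TEST.finditer(text):
        found.append((match.start(), {
            "test": "t",
            "df": (int(match.group("df")),),
            "statistic": float(match.group("stat")),
            "p_cmp": match.group("p_cmp"),
            "p": float(match.group("p")),
        }))
    for match in F_TEST.finditer(text):
        found.append((match.start(), {
            "test": "F",
            "df": (int(match.group("df1")), int(match.group("df2"))),
            "statistic": float(match.group("stat")),
            "p_cmp": match.group("p_cmp"),
            "p": float(match.group("p")),
        }))
    return [record for _, record in sorted(found, key=lambda item: item[0])]


if __name__ == "__main__":
    sentence = "The effect was reliable, t(28) = 2.41, p = .023, and F(2, 57) = 4.10, p < 0.05."
    for record in find_reported_tests(sentence):
        print(record)
```

Keeping the comparison operator as its own captured group preserves the distinction between exactly and inexactly reported p-values, which a downstream consumer may want to treat differently.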

Critically, we also consider p-values reported without an accompanying test statistic. Differentiating between these two classes of reported p-values is important for downstream interpretation of extracted information. For example, the presence or absence of a test statistic alongside a p-value directly informs our evaluation of the number of hypotheses tested (see below). For the extraction of p-values reported without associated test statistics, we consider expressions of the form: p (>,<,=) float, P (>,<,=) float, ps (>,<,=) float, Ps (>,<,=) float, pvalue (>,<,=) float, p-value (>,<,=) float. Float here includes scientific notation, e.g., p = 1.7e-3.
Sample size. We identify sample sizes, in parallel, in two ways. First, we derive sample size from test statistics where possible, through back-calculation based on degrees of freedom (see Table 2). Second, we search for direct mention of sample size using regular expressions of the form 'n or N = int', following a similar approach to that taken for p-value extraction.
Number of hypotheses tested. We extract the number of hypotheses tested indirectly from the paper by making use of the extracted p-values and associated test statistics when reported (see Table 1 for an example of p-values reported with and without an associated test statistic). Specifically, we count the number of p-values reported alongside a test statistic as a proxy for the number of hypotheses (statistically) tested in the paper.
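The three extraction steps described in this section can be sketched together as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the regular expressions are simplified, the names are hypothetical, and the degrees-of-freedom offsets (N = df + 2 for a correlation and for an independent two-sample t-test) are textbook relationships standing in for the back-calculation rules of Table 2.

```python
import re

FLOAT = r"\d*\.?\d+(?:[eE][-+]?\d+)?"   # plain or scientific notation

# p, P, ps, Ps, pvalue or p-value followed by a comparison and a number.
P_VALUE = re.compile(rf"\b(?:p-value|pvalue|ps|p)\s*([<>=])\s*({FLOAT})", re.IGNORECASE)

# A test statistic, optionally with degrees of freedom, directly followed by a p-value.
WITH_STAT = re.compile(
    rf"\b(?P<test>[tFrzZ])\s*(?:\(\s*(?P<df>\d+)(?:\s*,\s*(?P<df2>\d+))?\s*\))?"
    rf"\s*[<>=]\s*-?{FLOAT}\s*,\s*(?P<pexpr>[pP]\s*[<>=]\s*{FLOAT})"
)

# Direct mention of sample size: 'n or N = int'.
SAMPLE_SIZE = re.compile(r"\b[nN]\s*=\s*(\d+)")

# Assumed offsets for back-calculating N from degrees of freedom.
DF_TO_N_OFFSET = {"r": 2, "t": 2}


def extract_features(text):
    """Extract p-values (with and without a test statistic), sample sizes,
    and the proxy count of hypotheses tested from a piece of text."""
    stat_matches = list(WITH_STAT.finditer(text))
    stat_spans = [(m.start("pexpr"), m.end("pexpr")) for m in stat_matches]

    with_stat, without_stat = [], []
    for m in P_VALUE.finditer(text):
        record = {"cmp": m.group(1), "p": float(m.group(2))}
        inside = any(s <= m.start() and m.end() <= e for s, e in stat_spans)
        (with_stat if inside else without_stat).append(record)

    n_derived = []
    for m in stat_matches:
        offset = DF_TO_N_OFFSET.get(m.group("test"))
        if offset is not None and m.group("df") and m.group("df2") is None:
            n_derived.append(int(m.group("df")) + offset)

    return {
        "p_with_stat": with_stat,
        "p_without_stat": without_stat,
        "n_explicit": [int(m.group(1)) for m in SAMPLE_SIZE.finditer(text)],
        "n_derived": n_derived,
        # Proxy: one hypothesis per p-value reported alongside a test statistic.
        "n_hypotheses": len(with_stat),
    }


if __name__ == "__main__":
    sample = ("We recruited N = 60 students. Scores differed, t(58) = 2.10, p = 0.04; "
              "the trend was weaker in wave two (p > 0.07, P = 1.7e-3).")
    print(extract_features(sample))
```

In the sample sentence the derived and explicit sample sizes agree, which hints at one use of running both routes in parallel: disagreement between them can flag either an extraction error or a reporting inconsistency.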

VALIDATION
To validate our pipeline, we consider a dataset of 300 papers, 30 each randomly selected from prominent journals in the following 10 social and behavioral science fields: Economics; Health; Education; Political Science; Marketing; Criminology; Psychology; Sociology; Management; and Public Administration. We manually label each p-value and sample size reported in each of the 300 papers, and track the number of p-values reported with and without accompanying test statistics. This manual labelling is done, for each paper, for both PDF and converted text documents to facilitate in-depth evaluation.
A report of extraction accuracy is provided in Table 4. We compare three approaches: manual extraction of p-values from converted text; our full pipeline model; and the Statcheck tool. Accuracy is calculated based on the total number of p-values extracted over the dataset using the given approach versus the total number of p-values present in the original PDFs. Our approach meaningfully improves on the Statcheck tool. Accuracy metrics reported on the text-extracted documents indicate the critical importance of the conversion to text process. In particular, we observe during our labelling that statistical information is often captured in tables, where conversion is particularly prone to error.
Table 5 gives the Precision, Recall and F1 performance metrics for our model for both the extraction of p-values (with and without test statistic) and the extraction of sample size from both an original PDF document and text (obtained after conversion). We note that the precision of our sample size extractor is comparatively low, indicating that our approach of looking for instances of 'n or N = int' is overly inclusive. We attribute the somewhat lower precision for p-values reported without test statistics to the same cause. Recall was high for all three information categories, but relatively lower for extraction from PDF than from text. These scores also suggest that the accuracy of our tool would be improved with more accurate text extraction.
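For reference, precision, recall and F1 for a single paper can be computed from the extracted and manually labelled items as sketched below. Matching items by exact string equality, with repeated mentions counted separately, is an assumption of this sketch, as are the function name and the sample data; the matching procedure behind Table 5 is not specified at this level of detail.

```python
from collections import Counter


def precision_recall_f1(extracted, labelled):
    """Compare extracted items against manual labels as multisets.

    A true positive is an extracted item that also appears in the labels;
    an item reported twice must be labelled twice to count twice.
    """
    overlap = Counter(extracted) & Counter(labelled)
    true_positives = sum(overlap.values())
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical paper: both true sample sizes are found (perfect recall), but
    # the 'n or N = int' pattern also picks up two unrelated counts.
    extracted = ["N = 100", "n = 50", "n = 3", "n = 12"]
    labelled = ["N = 100", "n = 50"]
    print(precision_recall_f1(extracted, labelled))
```

The example mirrors the pattern noted above: an overly inclusive expression leaves recall untouched while pulling precision down.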

CONCLUSION
We have presented a pipeline for the extraction of statistical information from full-text PDFs of scientific documents, and validated our tool on a set of 300 papers from the social and behavioral science literatures. Motivating this work is ongoing concern about the reproducibility and generalizability of published claims, which emerged in the social sciences but has since left nearly no empirical field untouched [2]. It is clear that meta-reasoning over a body of literature could provide critically important framing for the results of an individual study and move the community toward more efficient discovery. Yet manual search, extraction and assembly of statistical information across corpora will not scale. Rather, computational tools to support this process are needed.
Our work points to some specific next steps for extraction of statistical information from scholarly work. We have noted that tools which can better extract information from tables and figures will be particularly useful, as statistical information is often embedded in these formats. In addition, the section of a paper in which statistical information is reported may add relevant context, as may language around the statistical result. Mining text around extracted statistical information is proposed as a valuable future direction, both for the aim of refining statistical information extraction and for supplementing extracted statistics with investigator interpretations.