GitHub Repository

Introduction

Logical reasoning is crucial for tasks such as mathematical problem-solving, automated decision making, coding, and planning. Current LLMs often fail at complex logical reasoning tasks. A recent approach, LogicLM by Pan et al. (2023), mitigates this issue by first using LLMs to translate reasoning problems in natural language to formal logic statements and then using deterministic symbolic logic solvers to infer accurate conclusions. Despite these advances, LogicLM falls short of achieving perfect scores on reasoning benchmarks, primarily due to inaccuracies in the initial translation by the LLM. Additionally, their experiments are limited to large, closed-source LLMs. LogicLLaMa by Yang et al. (2023) addresses the translation of natural language statements to first-order logic statements using standard fine-tuning, achieving performance comparable to GPT-4. Improvement in NL-FOL translation is likely to be directly proportional to improvement of LogicLM’s performance on reasoning questions related to FOL. Consequently, we see that accurate NL-FOL translation can stand to improve LLM-aided state-of-the-art reasoning capabilities. In addition to accuracy, an important desiderata is accessible NL-FOL translation. Thus, we focus on investigating NL-FOL translation by CodeLLaMa-7B, a small, inexpensive, open-source model.

We adapt Google’s step-by-step distillation (SBSD) method by Hsieh et al. (2023), which involves generating rationales for NL-FOL pairs and fine-tuning CodeLLaMA-7B on the multi-task problem of simultaneously predicting the label (the FOL statement) and the rationale. Although our approach achieved performance close to LogicLLaMa, it did not surpass it. Nonetheless, our contributions include:

  1. Assessing the efficacy of SBSD for the NL-FOL task compared to standard fine-tuning
  2. Clustering-based Diverse Search of FOL Examples (CSFE), our original method to increase diversity in the logical structure of FOL examples for few-shotting LLMs for FOL-related tasks
  3. Conducting a granular analysis of specific errors made by our fine-tuned LLM on the NL-FOL task and consequently arguing the need for more nuanced NL-FOL evaluation metrics
  4. Performing ablation studies to better understand the impact of epoch size, dataset size, and accuracy on performance, where we find preliminary results showing the potential of this approach to maintain strong performance with significantly less labeled data

Related Work

LogicLM

Previous approaches to improving the reasoning abilities of LLMs include few-shot prompting, CoT prompting by Suzgun et al. (2023), least-to-most prompting by Zhou et al. (2023) and fine-tuning by Wang et al. (2023). Recently, the LogicLM framework by Pan et al. (2023) outperformed other approaches. It consists of the following steps:

LogicLM with GPT-4 outperforms GPT-4 with CoT prompting by 8.34% on the FOLIO dataset by Han et al. (2024). However, LogicLM only experiments with GPT-4 and GPT-3.5 as the LLM of choice for step (1). We are interested in investigating the performance of a smaller open-source model, specifically for the archetype of first-order-logic. Our motivation for this is the speed, affordability, and privacy afforded by smaller open-source language models.

LogicLLaMa

The approach closest to ours is LogicLLaMa by Yang et al. (2023). In the LogicLLaMa paper, LLaMa-7B was fine-tuned on 35K examples of NL-FOL translations for the task of NL-FOL translation. They use the MALLS dataset consisting of 34K synthetically generated examples of NL-FOL pairs and 1K pairs from the training set of LogicNLI by Tian et al. (2021). The authors also developed an evaluation metric, logical equivalence (LE), to evaluate their performance and found that their fine-tuned version of LLaMa-7B was shy of GPT-4 5-shot by just 3.7% on LE.

Google’s Step-by-Step Distillation

Recently, Hsieh et al. (2023) at Google proposed a framework named Step-by-Step Distillation (SBSD) to improve performance on reasoning tasks, including natural language inference, commonsense question answering, and arithmetic word problems.

SBSD involves the following steps:

Google’s results show that SBSD outperforms standard fine-tuning using less training data for the aforementioned reasoning tasks. In our research, we adapt Google’s SBSD method for a different reasoning task: NL-FOL translation. We compare our results against the performance of LogicLLaMa’s standard fine-tuning approach.

Approach

Our modified version of Google’s SBSD framework adapted to the NL-FOL translation task is as follows:

Rationale Generation using GPT-4o

First, we prompt GPT-4o to generate rationales given a NL-FOL pair. We use GPT-4o since it is strong at reasoning, fast, and inexpensive. Rationales are step-by-step explanations as to how to arrive at the given FOL translation based on the NL statement. We construct two different types of rationales: (1) Class-A and (2) Class-B.

Class-A Rationales

Our prompt for generating Class-A rationales is optimized using our original approach: clustering-based diverse selection of FOL examples for few-shot learning (CSFE). In few-shot prompting, variety in chosen examples is crucial for better generalization. Our method CSFE aims to diversify the logical structures of the examples chosen to few-shot GPT-4o. We create a dataset of masked/abstract FOL statements, replacing specific predicates and objects with generic labels (Ps and Os) while maintaining logical operators as-is using regular expressions. For instance, isYoung(daniel) ∧ isOld(danielsDad) is abstracted to P(O) ∧ P(O). This approach ensures clustering focuses on the underlying logical structure rather than variable names. We implemented a tokenizer based on logical connectors, used a count vectorizer, and performed k-means clustering on the abstracted statements. From each cluster, we randomly sampled an NL-FOL pair.

Qualitatively, we observed that our clustering algorithm successfully learned to form clusters of distinct logical structures. We observed that FOL statements in different clusters varied in terms of number of predicates, number of arguments, number of logical connectors, and types of logical connectors. However, there was a lack of variety with respect to statements that are existentially quantified, universally quantified, or neither—that is to say, there was no clear cluster where we could draw existentially-quantified statements or non-variable-based statements from. We will refer to this type of variety in FOL statements as Existential-Universal-Neither-diversity, or EUN-diversity for short. Since EUN-diversity would also be important for GPT-4o to generalize its performance, we decided to also draw samples such that the final samples had diversity with respect to cluster (and consequently, logical structure) as well as EUN-diversity. Finally, we include 3 diverse examples in our Class-A prompt.

Class-B Rationales

Due to the existence of multiple examples in the prompt, the Class-A prompt was long. Therefore, we augmented the Class-A rationales with Class-B rationales that only contained one randomly-selected example in the prompt.

Overall, we generated 2667 Class-A rationales and 5456 Class-B rationales, where the Class-A rationales were of higher quality due to an optimized prompt, but were also more expensive. The following is an example of a GPT-4o generated rationale:

Natural Language First Order Logic Rationale
If a person is a judge, they preside over court proceedings and make legal decisions. ∀x(Person(x) ∧ Judge(x) → (PresideOverCourtProceedings(x) ∧ MakeLegalDecisions(x))) Identify Key Entities and Relationships: The key entities are “person” and “judge.” The relationships are “preside over court proceedings” and “make legal decisions,” which are conditions associated with being a judge. Use the variable x to represent a person. Logical Formulation Structure: ∀x(Person(x) ∧ Judge(x) → (PresideOverCourtProceedings(x) ∧ MakeLegalDecisions(x))). Explanation: The universal quantifier ∀x ensures the statement applies to all persons. The predicate Person(x) and Judge(x) identify the individual as a person and specifically as a judge. The implication operator → conveys the conditional nature of the statement. The predicates PresideOverCourtProceedings(x) and MakeLegalDecisions(x) describe the activities associated with being a judge. The conjunction ∧ connects these activities, reflecting the cumulative conditions laid out in the natural language statement.

Multi-task Fine-tuning

After generating the rationales we simultaneously fine-tune CodeLLaMa-7B on both generating the label (i.e. the FOL statement) as well as the rationale. We use CodeLLaMa since we hypothesize that CodeLLaMa’s enhanced coding abilities will translate to enhanced logical problem-solving abilities. We use 7B parameters since it is much smaller than the models used by LogicLM (GPT-3.5 and GPT-4) and is also the same size as our baseline, LogicLLaMa. We conduct the fine-tuning on MALLS v0.1, consisting of 27K NL-FOL pairs, i.e. 27K question-label pairs as well as 8123 rationales. We ensure that the NL-FOL pairs that the 8123 rationales are based on are a strict subset of the 27K NL-FOL pairs. This ensures that the rationales cannot improve model performance by providing additional examples of NL-FOL pairs that the model can learn from. This is to keep our comparison to our baseline fairer.

Based on Google’s success, we hypothesize that the rationales contain important and widely-applicable principles and reasoning steps of how to conduct NL-FOL translation that the model learns through the rationale-generation task. For instance, the model could learn that the natural language phrasing “If..., then...” requires the logical connector ‘→’ from our rationales. This is because in our rationale-generation prompts, we include the instruction to directly reference the NL statement when justifying a choice of logical operator. Like Google, we prepend the task prefixes [LABEL] and [RATIONALE] to the input data that we fine-tune on so as to allow the model to differentiate between the two tasks. We conduct fine-tuning on subsets of the MALLS dataset released by the LogicLLaMA authors.

Baselines and Evaluation

We compare our approach which uses fine-tuning with SBSD on CodeLLaMa-7B on MALLS v0.1 and 8123 rationales against our baseline of LogicLLaMa which uses standard fine-tuning (without SBSD) with LoRA on MALLS v0 and no rationales. MALLS v0.1 and MALLS v0 are similar synthetic NL-FOL datasets released by the LogicLLaMa authors. The difference is that the former has 27K data points which are auto-verified whereas the latter has 34K data-points without auto-verification. We use the former to increase accuracy of our training data, which is an important concern since the dataset is synthetically generated. We evaluate the models on FOLIO-parsed, a subset of the FOLIO dataset parsed by the LogicLLaMa authors. FOLIO-parsed contains 2195 human-expert-written NL-FOL pairs.

Experiments

Data

We use two datasets for our experiments: MALLS v0.1 and MALLS v0 by Yang et al. (2023).

Additionally, we generate 8,123 rationales from randomly sampled NL-FOL pairs within the MALLS v0.1 dataset. Of these, 2,667 rationales are Class-A rationales, generated using our CSFE approach, which aims to improve the diversity and representativeness of the rationale set.

We evaluate all models on folio-parsed, a parsed version of the original FOLIO dataset by Han et al. (2024), parsed by the authors of the LogicLLaMa paper by Yang et al. (2023).

For our experiments, we construct three distinct training sets:

By comparing these training sets, we aim to evaluate the impact of dataset size and verification on model performance.

Evaluation Method

The predicted FOL pairs are evaluated against the true FOL pairs using two metrics: BLEU and Logical Equivalence (LE).

BLEU: The BLEU score measures the ability of a language model to accurately extract relevant predicates and objects, and to make appropriate choices in naming them.

Logical Equivalence (LE): The LE score evaluates the equivalence of two logical statements based on their truth tables, computing the overlap ratio. This metric compares the logical structure and meaning between the predicted and true FOL pairs.

We adopt LogicLlama’s approach for evaluating both BLEU and LE scores. By comparing these metrics, we evaluate the performance of the models in terms of both linguistic closeness to True FOL and logical correctness in comparison to True FOL.

Experimental Details

We fine-tuned the CodeLlama 2-7B model on the training sets with a learning rate of 3 × 10−5 and a batch size of 8. For generation, we used a maximum token limit of 100, a temperature setting of 0.7, top_k of 50, and top_p of 0.7. All experiments were conducted on Together AI’s cloud service using an A100 GPU. We also experimented with the number of epochs to determine its effect on the model’s performance, measured using the BLEU and Logical Equivalence (LE) metrics.

Results

The following table shows the results of fine-tuned CodeLlama 2-7B with different datasets and number of epochs on FOLIO Parsed, our test set. We compare our results to the Direct Translation finetune results of Llama 2-7B from LogicLlama by Yang et al. (2023).

Dataset Model Name Epochs Median BLEU Mean BLEU Median LE Mean LE
D1 CodeLlama 2-7B 2 0.2320 0.3163 0.8125 0.7695
D2 CodeLlama 2-7B 1 0.2215 0.3134 0.7813 0.7629
D2 CodeLlama 2-7B 2 0.2215 0.3130 0.7813 0.7614
D2 CodeLlama 2-7B 3 0.2218 0.3157 0.8125 0.7636
D3 CodeLlama 2-7B 2 0.2215 0.3079 0.7813 0.7545
MALLS Llama 2-7B (Direct Translation) 0.3720 0.8180

Analysis

Ablation Result Analysis

From the results table, we see that epoch size has minimal effect on accuracy. We see that using the 27K auto-verified dataset leads to a slight performance improvement over the 34K un-verified dataset. Most interestingly, we see that in one of our experiments, despite reducing the number of FOL labels to 8K (D3), we maintain excellent performance, achieving 1% less LE than D1, with 30% of the labels. This is similar to Google’s findings in their SBSD paper where they were able to leverage SBSD to achieve higher performance with significantly less training data. This result is especially promising since there are very few English NL-FOL datasets, and even fewer non-English datasets. Consequently, SBSD has the potential to augment equity in terms of language access for this task.

However, while this result may be promising, it is unfortunately inconclusive at this point. When running our main experiment with 27K labels and 8123 rationales, we ensured that the 8123 rationales were based on FOL labels that were a strict subset of the 27K FOL labels to ensure that the rationales cannot provide any new FOL information. Unfortunately, we made the mistake of not enforcing this rule when conducting the aforementioned ablation experiments on the 8K labels subset and randomly selected the 8K labels instead. We attempted to re-run this ablation experiment with the rule enforcement, however our compute provider, TogetherAI, consistently failed to deploy our model. Nevertheless, the result of the experiment is still promising since in the unlikely worst case, if the sets of FOL statements for our labels and rationales were completely disjoint (which is unlikely), we would have 8000 + 8123 = 16123 distinct FOL statements, which is less than 27K, while maintaining comparable performance.

Error Analysis

We identify 2 categories of errors: E1 and E2.

E1: Subjectivity of Object vs. Predicate Interpretation

Subjectivity of determining if an object is general enough to remain interpreted as an object, or if a predicate needs to exist to state that the object is an object of a specific type.

Consider the following example. NL: “All people who regularly drink coffee are dependent on caffeine.” Predicted FOL: ∀x (Person(x) ∧ RegularlyDrinksCoffee(x) → DependentOnCaffeine(x)). True FOL: ∀x (Drinks(x) → Dependent(x)).

The interpretation of the predicted FOL is that for any object x, if x is a person and regularly drinks coffee, x is dependent on caffeine. The interpretation of the True FOL is that for any object x, if x drinks [coffee], then x is dependent [on caffeine]. We argue that our predicted FOL is in fact more accurate than the True FOL in both LE and BLEU. With respect to LE, we see that logically, it is not necessary that all objects are people. For example, the conclusion that a cat that drinks coffee is dependent on caffeine is inconclusive based on the NL statement since the NL statement only discusses people (humans). However, the True FOL would (arguably) mistakenly mark the conclusion as true. Furthermore, despite our Predicted FOL being logically correct, we receive a lower LE score instead of 1 due to the extra predicate compared to the True FOL. With respect to BLEU, we clearly see that our predicted FOL is in fact more expressive than the True FOL since we specify predicates RegularlyDrinksCoffee as opposed to simply Drinks. Thus, we believe we receive a lower BLEU score than deserved for this category of examples.

E2: Subjectivity of Identifying Universally Quantified Statements

Consider the following example: NL: “Two is positive.” Predicted FOL: ∀x (Two(x) → Positive(x)). True FOL: Positive(two).

This type of error is admittedly more valid to penalize, since given just the context of the NL statement, the more straightforward FOL representation is the True FOL. However, the predicted FOL is interpreted as: for all objects x, if x is two, then x is positive. This, while less straightforward, fundamentally has the same meaning as the True FOL, and should arguably not be penalized as much as it is.

Both E1 and E2 arguably arise from the inherent subjectivity of NL-FOL translation due to the existence of multiple correct FOLs for a given NL statement.

Baseline-Comparison Analysis

While we tried to make our comparisons fair, there still exist certain differences between LogicLLaMa and our approach that should be controlled for, such as (a) us using CodeLLaMa and LogicLLaMa using vanilla LLaMa, (b) LogicLLaMa training on an additional 1K examples from the LogicNLI dataset, (c) us using MALLS v0.1 and them using MALLS v0, and (d) different hyper-parameters for fine-tuning as well as their use of LoRA. Any combination of these differences could affect our conclusion of the effect of SBSD vis-à-vis standard fine-tuning. For more granular conclusions, further ablation studies are required.

Conclusion

While we were unable to beat the challenging baseline set by LogicLLaMa, we demonstrate the potential of the SBSD framework in achieving comparable performance with less data on the NL-FOL translation task. Our original clustering-based approach qualitatively demonstrates significant utility in achieving diversity for few-shot prompting for FOL-related tasks. Our error analysis contributes to the existing research by demonstrating the need for more nuanced evaluation metrics that take into account the subjective nature of NL-FOL translation. In the future, we intend on conducting further ablation studies robustly experimenting with the performance of SBSD on decreasing dataset sizes.