Introduction
Logical reasoning is crucial for tasks such as mathematical problem-solving, automated decision making, coding, and planning. Current LLMs often fail at complex logical reasoning tasks. A recent approach, LogicLM by Pan et al. (2023), mitigates this issue by first using LLMs to translate reasoning problems in natural language to formal logic statements and then using deterministic symbolic logic solvers to infer accurate conclusions. Despite these advances, LogicLM falls short of achieving perfect scores on reasoning benchmarks, primarily due to inaccuracies in the initial translation by the LLM. Additionally, their experiments are limited to large, closed-source LLMs. LogicLLaMa by Yang et al. (2023) addresses the translation of natural language statements to first-order logic statements using standard fine-tuning, achieving performance comparable to GPT-4. Improvement in NL-FOL translation is likely to be directly proportional to improvement of LogicLM’s performance on reasoning questions related to FOL. Consequently, we see that accurate NL-FOL translation can stand to improve LLM-aided state-of-the-art reasoning capabilities. In addition to accuracy, an important desiderata is accessible NL-FOL translation. Thus, we focus on investigating NL-FOL translation by CodeLLaMa-7B, a small, inexpensive, open-source model.
We adapt Google’s step-by-step distillation (SBSD) method by Hsieh et al. (2023), which involves generating rationales for NL-FOL pairs and fine-tuning CodeLLaMA-7B on the multi-task problem of simultaneously predicting the label (the FOL statement) and the rationale. Although our approach achieved performance close to LogicLLaMa, it did not surpass it. Nonetheless, our contributions include:
- Assessing the efficacy of SBSD for the NL-FOL task compared to standard fine-tuning
- Clustering-based Diverse Search of FOL Examples (CSFE), our original method to increase diversity in the logical structure of FOL examples for few-shotting LLMs for FOL-related tasks
- Conducting a granular analysis of specific errors made by our fine-tuned LLM on the NL-FOL task and consequently arguing the need for more nuanced NL-FOL evaluation metrics
- Performing ablation studies to better understand the impact of epoch size, dataset size, and accuracy on performance, where we find preliminary results showing the potential of this approach to maintain strong performance with significantly less labeled data
Related Work
LogicLM
Previous approaches to improving the reasoning abilities of LLMs include few-shot prompting, CoT prompting by Suzgun et al. (2023), least-to-most prompting by Zhou et al. (2023) and fine-tuning by Wang et al. (2023). Recently, the LogicLM framework by Pan et al. (2023) outperformed other approaches. It consists of the following steps:
- Problem Formulation: Depending on the type of reasoning problem, the LLM is instructed to convert the natural language reasoning problem into one of four archetypes of formal reasoning: deductive reasoning, first-order logic, constraint satisfaction, and analytic reasoning.
- Symbolic Reasoning: The formal formulation produced by the LLM is deterministically, and consequently accurately, solved using the corresponding suitable symbolic logic solver. For first-order logic, this is employing a first-order logic prover.
- Result Interpretation: The output from the symbolic logic solver is processed by an LLM or a result interpreter, which then generates the final answer to the initial reasoning problem.
LogicLM with GPT-4 outperforms GPT-4 with CoT prompting by 8.34% on the FOLIO dataset by Han et al. (2024). However, LogicLM only experiments with GPT-4 and GPT-3.5 as the LLM of choice for step (1). We are interested in investigating the performance of a smaller open-source model, specifically for the archetype of first-order-logic. Our motivation for this is the speed, affordability, and privacy afforded by smaller open-source language models.
LogicLLaMa
The approach closest to ours is LogicLLaMa by Yang et al. (2023). In the LogicLLaMa paper, LLaMa-7B was fine-tuned on 35K examples of NL-FOL translations for the task of NL-FOL translation. They use the MALLS dataset consisting of 34K synthetically generated examples of NL-FOL pairs and 1K pairs from the training set of LogicNLI by Tian et al. (2021). The authors also developed an evaluation metric, logical equivalence (LE), to evaluate their performance and found that their fine-tuned version of LLaMa-7B was shy of GPT-4 5-shot by just 3.7% on LE.
Google’s Step-by-Step Distillation
Recently, Hsieh et al. (2023) at Google proposed a framework named Step-by-Step Distillation (SBSD) to improve performance on reasoning tasks, including natural language inference, commonsense question answering, and arithmetic word problems.
SBSD involves the following steps:
- Rationale Generation: Few-shot CoT prompting of a high-capacity LLM like GPT-4 to generate rationales that explain the reasoning process from question to label.
- Multi-task Fine-tuning: Fine-tuning a smaller model to simultaneously generate both rationales and labels.
Google’s results show that SBSD outperforms standard fine-tuning using less training data for the aforementioned reasoning tasks. In our research, we adapt Google’s SBSD method for a different reasoning task: NL-FOL translation. We compare our results against the performance of LogicLLaMa’s standard fine-tuning approach.
Approach
Our modified version of Google’s SBSD framework adapted to the NL-FOL translation task is as follows:
Rationale Generation using GPT-4o
First, we prompt GPT-4o to generate rationales given a NL-FOL pair. We use GPT-4o since it is strong at reasoning, fast, and inexpensive. Rationales are step-by-step explanations as to how to arrive at the given FOL translation based on the NL statement. We construct two different types of rationales: (1) Class-A and (2) Class-B.
Class-A Rationales
Our prompt for generating Class-A rationales is optimized using our original approach: clustering-based diverse selection of FOL examples for few-shot learning (CSFE). In few-shot prompting, variety in chosen examples is crucial for better generalization. Our method CSFE aims to diversify the logical structures of the examples chosen to few-shot GPT-4o. We create a dataset of masked/abstract FOL statements, replacing specific predicates and objects with generic labels (Ps and Os) while maintaining logical operators as-is using regular expressions. For instance, isYoung(daniel) ∧ isOld(danielsDad) is abstracted to P(O) ∧ P(O). This approach ensures clustering focuses on the underlying logical structure rather than variable names. We implemented a tokenizer based on logical connectors, used a count vectorizer, and performed k-means clustering on the abstracted statements. From each cluster, we randomly sampled an NL-FOL pair.
Qualitatively, we observed that our clustering algorithm successfully learned to form clusters of distinct logical structures. We observed that FOL statements in different clusters varied in terms of number of predicates, number of arguments, number of logical connectors, and types of logical connectors. However, there was a lack of variety with respect to statements that are existentially quantified, universally quantified, or neither—that is to say, there was no clear cluster where we could draw existentially-quantified statements or non-variable-based statements from. We will refer to this type of variety in FOL statements as Existential-Universal-Neither-diversity, or EUN-diversity for short. Since EUN-diversity would also be important for GPT-4o to generalize its performance, we decided to also draw samples such that the final samples had diversity with respect to cluster (and consequently, logical structure) as well as EUN-diversity. Finally, we include 3 diverse examples in our Class-A prompt.
Class-B Rationales
Due to the existence of multiple examples in the prompt, the Class-A prompt was long. Therefore, we augmented the Class-A rationales with Class-B rationales that only contained one randomly-selected example in the prompt.
Overall, we generated 2667 Class-A rationales and 5456 Class-B rationales, where the Class-A rationales were of higher quality due to an optimized prompt, but were also more expensive. The following is an example of a GPT-4o generated rationale:
| Natural Language | First Order Logic | Rationale |
|---|---|---|
| If a person is a judge, they preside over court proceedings and make legal decisions. | ∀x(Person(x) ∧ Judge(x) → (PresideOverCourtProceedings(x) ∧ MakeLegalDecisions(x))) |
Identify Key Entities and Relationships: The key entities are “person” and “judge.” The relationships are “preside over court proceedings” and “make legal decisions,” which are conditions associated with being a judge. Use the variable x to represent a person. Logical Formulation Structure: ∀x(Person(x) ∧ Judge(x) → (PresideOverCourtProceedings(x) ∧ MakeLegalDecisions(x))). Explanation: The universal quantifier ∀x ensures the statement applies to all persons. The predicate Person(x) and Judge(x) identify the individual as a person and specifically as a judge. The implication operator → conveys the conditional nature of the statement. The predicates PresideOverCourtProceedings(x) and MakeLegalDecisions(x) describe the activities associated with being a judge. The conjunction ∧ connects these activities, reflecting the cumulative conditions laid out in the natural language statement. |
Multi-task Fine-tuning
After generating the rationales we simultaneously fine-tune CodeLLaMa-7B on both generating the label (i.e. the FOL statement) as well as the rationale. We use CodeLLaMa since we hypothesize that CodeLLaMa’s enhanced coding abilities will translate to enhanced logical problem-solving abilities. We use 7B parameters since it is much smaller than the models used by LogicLM (GPT-3.5 and GPT-4) and is also the same size as our baseline, LogicLLaMa. We conduct the fine-tuning on MALLS v0.1, consisting of 27K NL-FOL pairs, i.e. 27K question-label pairs as well as 8123 rationales. We ensure that the NL-FOL pairs that the 8123 rationales are based on are a strict subset of the 27K NL-FOL pairs. This ensures that the rationales cannot improve model performance by providing additional examples of NL-FOL pairs that the model can learn from. This is to keep our comparison to our baseline fairer.
Based on Google’s success, we hypothesize that the rationales contain important and widely-applicable principles and reasoning steps of how to conduct NL-FOL translation that the model learns through the rationale-generation task. For instance, the model could learn that the natural language phrasing “If..., then...” requires the logical connector ‘→’ from our rationales. This is because in our rationale-generation prompts, we include the instruction to directly reference the NL statement when justifying a choice of logical operator. Like Google, we prepend the task prefixes [LABEL] and [RATIONALE] to the input data that we fine-tune on so as to allow the model to differentiate between the two tasks. We conduct fine-tuning on subsets of the MALLS dataset released by the LogicLLaMA authors.
Baselines and Evaluation
We compare our approach which uses fine-tuning with SBSD on CodeLLaMa-7B on MALLS v0.1 and 8123 rationales against our baseline of LogicLLaMa which uses standard fine-tuning (without SBSD) with LoRA on MALLS v0 and no rationales. MALLS v0.1 and MALLS v0 are similar synthetic NL-FOL datasets released by the LogicLLaMa authors. The difference is that the former has 27K data points which are auto-verified whereas the latter has 34K data-points without auto-verification. We use the former to increase accuracy of our training data, which is an important concern since the dataset is synthetically generated. We evaluate the models on FOLIO-parsed, a subset of the FOLIO dataset parsed by the LogicLLaMa authors. FOLIO-parsed contains 2195 human-expert-written NL-FOL pairs.
Experiments
Data
We use two datasets for our experiments: MALLS v0.1 and MALLS v0 by Yang et al. (2023).
- MALLS v0.1: The dataset comprises 27,000 NL-FOL pairs that have been auto verified.
- MALLS v0: The dataset includes a total of 34,000 NL-FOL pairs that are non verified.
Additionally, we generate 8,123 rationales from randomly sampled NL-FOL pairs within the MALLS v0.1 dataset. Of these, 2,667 rationales are Class-A rationales, generated using our CSFE approach, which aims to improve the diversity and representativeness of the rationale set.
We evaluate all models on folio-parsed, a parsed version of the original FOLIO dataset by Han et al. (2024), parsed by the authors of the LogicLLaMa paper by Yang et al. (2023).
For our experiments, we construct three distinct training sets:
- D1: MALLS v0.1 and Rationales
- D2: MALLS v0 and Rationales
- D3: 8,000 randomly-sampled NL-FOL pairs from MALLS v0.1 and Rationales
By comparing these training sets, we aim to evaluate the impact of dataset size and verification on model performance.
Evaluation Method
The predicted FOL pairs are evaluated against the true FOL pairs using two metrics: BLEU and Logical Equivalence (LE).
BLEU: The BLEU score measures the ability of a language model to accurately extract relevant predicates and objects, and to make appropriate choices in naming them.
Logical Equivalence (LE): The LE score evaluates the equivalence of two logical statements based on their truth tables, computing the overlap ratio. This metric compares the logical structure and meaning between the predicted and true FOL pairs.
We adopt LogicLlama’s approach for evaluating both BLEU and LE scores. By comparing these metrics, we evaluate the performance of the models in terms of both linguistic closeness to True FOL and logical correctness in comparison to True FOL.
Experimental Details
We fine-tuned the CodeLlama 2-7B model on the training sets with a learning rate of 3 × 10−5 and a batch size of 8. For generation, we used a maximum token limit of 100, a temperature setting of 0.7, top_k of 50, and top_p of 0.7. All experiments were conducted on Together AI’s cloud service using an A100 GPU. We also experimented with the number of epochs to determine its effect on the model’s performance, measured using the BLEU and Logical Equivalence (LE) metrics.
Results
The following table shows the results of fine-tuned CodeLlama 2-7B with different datasets and number of epochs on FOLIO Parsed, our test set. We compare our results to the Direct Translation finetune results of Llama 2-7B from LogicLlama by Yang et al. (2023).
| Dataset | Model Name | Epochs | Median BLEU | Mean BLEU | Median LE | Mean LE |
|---|---|---|---|---|---|---|
| D1 | CodeLlama 2-7B | 2 | 0.2320 | 0.3163 | 0.8125 | 0.7695 |
| D2 | CodeLlama 2-7B | 1 | 0.2215 | 0.3134 | 0.7813 | 0.7629 |
| D2 | CodeLlama 2-7B | 2 | 0.2215 | 0.3130 | 0.7813 | 0.7614 |
| D2 | CodeLlama 2-7B | 3 | 0.2218 | 0.3157 | 0.8125 | 0.7636 |
| D3 | CodeLlama 2-7B | 2 | 0.2215 | 0.3079 | 0.7813 | 0.7545 |
| MALLS | Llama 2-7B (Direct Translation) | — | — | 0.3720 | — | 0.8180 |
Analysis
Ablation Result Analysis
From the results table, we see that epoch size has minimal effect on accuracy. We see that using the 27K auto-verified dataset leads to a slight performance improvement over the 34K un-verified dataset. Most interestingly, we see that in one of our experiments, despite reducing the number of FOL labels to 8K (D3), we maintain excellent performance, achieving 1% less LE than D1, with 30% of the labels. This is similar to Google’s findings in their SBSD paper where they were able to leverage SBSD to achieve higher performance with significantly less training data. This result is especially promising since there are very few English NL-FOL datasets, and even fewer non-English datasets. Consequently, SBSD has the potential to augment equity in terms of language access for this task.
However, while this result may be promising, it is unfortunately inconclusive at this point. When running our main experiment with 27K labels and 8123 rationales, we ensured that the 8123 rationales were based on FOL labels that were a strict subset of the 27K FOL labels to ensure that the rationales cannot provide any new FOL information. Unfortunately, we made the mistake of not enforcing this rule when conducting the aforementioned ablation experiments on the 8K labels subset and randomly selected the 8K labels instead. We attempted to re-run this ablation experiment with the rule enforcement, however our compute provider, TogetherAI, consistently failed to deploy our model. Nevertheless, the result of the experiment is still promising since in the unlikely worst case, if the sets of FOL statements for our labels and rationales were completely disjoint (which is unlikely), we would have 8000 + 8123 = 16123 distinct FOL statements, which is less than 27K, while maintaining comparable performance.
Error Analysis
We identify 2 categories of errors: E1 and E2.
E1: Subjectivity of Object vs. Predicate Interpretation
Subjectivity of determining if an object is general enough to remain interpreted as an object, or if a predicate needs to exist to state that the object is an object of a specific type.
Consider the following example. NL: “All people who regularly drink coffee are dependent on caffeine.” Predicted FOL: ∀x (Person(x) ∧ RegularlyDrinksCoffee(x) → DependentOnCaffeine(x)). True FOL: ∀x (Drinks(x) → Dependent(x)).
The interpretation of the predicted FOL is that for any object x, if x is a person and regularly drinks coffee, x is dependent on caffeine. The interpretation of the True FOL is that for any object x, if x drinks [coffee], then x is dependent [on caffeine]. We argue that our predicted FOL is in fact more accurate than the True FOL in both LE and BLEU. With respect to LE, we see that logically, it is not necessary that all objects are people. For example, the conclusion that a cat that drinks coffee is dependent on caffeine is inconclusive based on the NL statement since the NL statement only discusses people (humans). However, the True FOL would (arguably) mistakenly mark the conclusion as true. Furthermore, despite our Predicted FOL being logically correct, we receive a lower LE score instead of 1 due to the extra predicate compared to the True FOL. With respect to BLEU, we clearly see that our predicted FOL is in fact more expressive than the True FOL since we specify predicates RegularlyDrinksCoffee as opposed to simply Drinks. Thus, we believe we receive a lower BLEU score than deserved for this category of examples.
E2: Subjectivity of Identifying Universally Quantified Statements
Consider the following example: NL: “Two is positive.” Predicted FOL: ∀x (Two(x) → Positive(x)). True FOL: Positive(two).
This type of error is admittedly more valid to penalize, since given just the context of the NL statement, the more straightforward FOL representation is the True FOL. However, the predicted FOL is interpreted as: for all objects x, if x is two, then x is positive. This, while less straightforward, fundamentally has the same meaning as the True FOL, and should arguably not be penalized as much as it is.
Both E1 and E2 arguably arise from the inherent subjectivity of NL-FOL translation due to the existence of multiple correct FOLs for a given NL statement.
Baseline-Comparison Analysis
While we tried to make our comparisons fair, there still exist certain differences between LogicLLaMa and our approach that should be controlled for, such as (a) us using CodeLLaMa and LogicLLaMa using vanilla LLaMa, (b) LogicLLaMa training on an additional 1K examples from the LogicNLI dataset, (c) us using MALLS v0.1 and them using MALLS v0, and (d) different hyper-parameters for fine-tuning as well as their use of LoRA. Any combination of these differences could affect our conclusion of the effect of SBSD vis-à-vis standard fine-tuning. For more granular conclusions, further ablation studies are required.
Conclusion
While we were unable to beat the challenging baseline set by LogicLLaMa, we demonstrate the potential of the SBSD framework in achieving comparable performance with less data on the NL-FOL translation task. Our original clustering-based approach qualitatively demonstrates significant utility in achieving diversity for few-shot prompting for FOL-related tasks. Our error analysis contributes to the existing research by demonstrating the need for more nuanced evaluation metrics that take into account the subjective nature of NL-FOL translation. In the future, we intend on conducting further ablation studies robustly experimenting with the performance of SBSD on decreasing dataset sizes.