The Power of Pause: Training LLMs to Think Before Making Function Calls

How I Fine-Tuned Gemma 2-2B to Make Better API Calls

As developers, we constantly look for ways to bridge the gap between human language and machine action. The evolution of Large Language Models (LLMs) has opened exciting possibilities for creating interfaces that can understand natural language requests and translate them into structured actions. Recently, I've been exploring one such capability — function calling — and discovered a simple yet powerful enhancement: adding an explicit "thinking step" before function execution.

In this article, I'll share how I fine-tuned Google's Gemma 2-2B model to include this thinking step, why it matters, and how you can use it in your projects.

What is Function Calling?

Function calling is a capability that allows language models to interact with external tools or APIs in a structured way. Instead of just responding with text, a model with function calling can:

  1. Recognize when a user's request requires an external action (like checking the weather or evaluating branching logic according to a business requirement)

  2. Format a proper API call with the right parameters

  3. Receive and process the result of that function call

  4. Continue the conversation with that newly acquired information

For example, consider this user request:

"What's the temperature in New York right now?"

A standard LLM without function calling capabilities might respond with:

"I don't have real-time data, but New York is typically around 70°F in spring."

Or worse, the model might simply hallucinate a specific but incorrect answer, because nothing has instructed it to admit when it doesn't know. In contrast, a model with function calling would:

  1. Recognize this as a weather query

  2. Call a weather API with "New York" as the location parameter

  3. Return the actual current temperature

The difference is substantial — one gives a generic answer, while the other delivers accurate, real-time information.
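
To make step 2 concrete, the structured call behind that weather lookup might look something like the line below. The get_current_weather name and its parameters are hypothetical; the exact schema depends on the tools you expose to the model.

{"name": "get_current_weather", "arguments": {"location": "New York", "unit": "fahrenheit"}}

The weather service's JSON response then flows back into the conversation as new context for step 4.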

The Problem with Standard Function Calling

While function calling represents a significant advancement, standard implementations can still make errors in two critical areas:

  1. Function selection: Choosing the wrong function to execute

  2. Parameter preparation: Passing incorrect or incomplete parameters

These issues often stem from the model making immediate decisions without explicit reasoning. That's where the "thinking step" comes in.
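
To make those two failure modes concrete, imagine the currency request used later in this article, with both a convert_currency tool and a (hypothetical) get_exchange_rate tool available. Without explicit reasoning, a model might produce either of these:

Wrong function: {"name": "get_exchange_rate", "arguments": {"from_currency": "USD", "to_currency": "EUR"}}

Incomplete parameters: {"name": "convert_currency", "arguments": {"from_currency": "USD", "to_currency": "EUR"}}

The first answers a different question (a rate rather than a converted amount); the second silently drops the required amount argument.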

Adding a Thinking Step: The Key Improvement

My enhancement to the standard function calling approach is straightforward but effective: I fine-tuned the model to include an explicit thinking phase before making any function call.

During this thinking step, the model:

  • Analyzes what the user is asking for

  • Identifies the appropriate function to call

  • Determines which parameters are needed and how to format them

  • Considers potential edge cases

Only after this deliberate reasoning process does the model proceed to make the actual function call.

Here's an example of how this works in practice:

User: "I need to convert 500 USD to Euros."

Model (thinking): "The user wants to convert currency from USD to EUR. I should use the convert_currency function. This function requires three parameters: amount (500), from_currency (USD), and to_currency (EUR)."

Model (function call): {"name": "convert_currency", "arguments": {"amount": 500, "from_currency": "USD", "to_currency": "EUR"}}

Function result: {"converted_amount": 420.5, "from_currency": "USD", "to_currency": "EUR"}

Model: "500 USD is approximately 420.50 Euros."

The thinking step might seem trivial, but it significantly improves the model's ability to make correct function calls by forcing it to reason explicitly about its decisions.

How I Fine-Tuned the Model

I based my work on Google's Gemma 2-2B instruction-tuned model, a smaller yet capable language model that offers a good balance of performance and resource requirements.

For training data, I used a modified version of the "NousResearch/hermes-function-calling-v1" dataset, enhanced with a "thinking" step computed from "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B".

This dataset contains examples of:

  • User queries that might require function calls

  • Available functions/tools with their specifications

  • The thinking process before making the function call

  • The actual function calls and their responses

I applied LoRA (Low-Rank Adaptation) fine-tuning with the following key parameters:

  • Rank dimension: 16

  • Alpha: 64

  • Dropout: 0.05

To give you a sense of the horsepower needed for this relatively simple bit of training: I ran it with roughly 150 lines of bespoke Python on my M1 Max Mac Studio with 64GB of RAM, and the entire fine-tuning process took approximately 7 hours. I haven't bogged down a personal computer like that since my earliest days in art school, when I tried to get my Macintosh IIci (with its 25 MHz CPU and 4MB of RAM) to do ray tracing! But giving up my Mac for an evening produced a model that retains Gemma's general capabilities while adding function calling with an explicit thinking step.
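
For readers who want to reproduce something similar, the sketch below shows roughly how those LoRA parameters map onto a PEFT configuration and a supervised fine-tuning run. It is a minimal sketch under assumptions, not my exact training script: the dataset file, the target modules, and the SFTTrainer arguments are assumptions (and TRL's argument names vary between versions).

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Base model and LoRA settings matching the parameters listed above
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")
peft_config = LoraConfig(
    r=16,               # rank dimension
    lora_alpha=64,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections
)

# Assumption: a JSONL file of chat-formatted examples (Gemma turns containing
# <think> and <tool_call> tags) prepared from the hermes-function-calling-v1 data.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=base_model,
    args=SFTConfig(output_dir="gemma-2-2b-it-thinking-function-calling", num_train_epochs=1),
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()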

Why Use This Model?

There are several benefits to using a model with an explicit thinking step for function calling:

  1. Improved Accuracy: The explicit reasoning process leads to fewer errors in function selection and parameter preparation.

  2. Transparency: The thinking step provides insight into why the model is choosing a particular function and parameters, making the system more interpretable.

  3. Better Error Handling: The model can identify potential issues before making a function call, reducing the likelihood of runtime errors.

  4. Resource Efficiency: If you need a model with function calling capabilities, you can use this pre-fine-tuned model rather than investing time and resources in fine-tuning your own.

Real-World Applications

This enhanced function calling capability is particularly valuable for:

1. Virtual Assistants

Create assistants that can perform real actions like booking appointments, checking inventory, or ordering items — all while reasoning through their decisions.

2. Data Analysis Tools

Build interfaces that translate natural language queries into database operations or API calls to data services.

3. Internal Developer Tools

Create natural language interfaces to your company's internal APIs, allowing developers to prototype and test functionality through conversation.

4. Customer Service Automation

Develop systems that can not only answer customer questions but also take actions like initiating returns, checking order status, or updating account information.

Example: A Customer Service Scenario

Let's look at a more complex example where the thinking step proves valuable:

User: "I need to check if the red running shoes I ordered last week have shipped yet."

Model (thinking): "The user is asking about an order status. I need to identify the specific order. They mentioned 'red running shoes' and 'last week,' but I don't have an order number. I should use the search_orders function first with the available information, then I can use get_order_status once I have the order ID."

Model (function call): {"name": "search_orders", "arguments": {"product_description": "red running shoes", "timeframe": "last week"}}

Function result: {"orders": [{"order_id": "ORD-7891234", "date": "2025-02-20", "status": "shipped", "items": ["Red Nike Running Shoes, Size 10"]}]}

Model (thinking): "Now I have the order ID 'ORD-7891234' from the search results. I should check the detailed status using the get_order_status function."

Model (function call): {"name": "get_order_status", "arguments": {"order_id": "ORD-7891234"}}

Function result: {"order_id": "ORD-7891234", "status": "shipped", "shipped_date": "2025-02-22", "tracking_number": "UPS123456789", "estimated_delivery": "2025-02-28"}

Model: "Your red Nike running shoes (Order #ORD-7891234) were shipped on February 22nd. The package has a UPS tracking number of UPS123456789 and is estimated to be delivered by February 28th."

Without the thinking step, the model might have immediately tried to call get_order_status without first determining the order ID, resulting in an error or a request for more information.

Getting Started with the Model

My fine-tuned model is available on Hugging Face at sethderrick/gemma-2-2B-it-thinking-function_calling-V0.

Here's how you can use it in your projects:

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the adapter config to find the base model it was trained on
peft_model_id = "sethderrick/gemma-2-2B-it-thinking-function_calling-V0"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
model.resize_token_embeddings(len(tokenizer))  # match any tokens added during fine-tuning
model = PeftModel.from_pretrained(model, peft_model_id)  # attach the LoRA adapter
model.to(torch.bfloat16)
model.eval()

# Example prompt structure
prompt = """<bos><start_of_turn>human
You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags.
You may call one or more functions to assist with the user query.
Don't make assumptions about what values to plug into functions.

Here are the available tools:
<tools>
[
  {
    "type": "function",
    "function": {
      "name": "convert_currency",
      "description": "Convert from one currency to another",
      "parameters": {
        "type": "object",
        "properties": {
          "amount": {
            "type": "number",
            "description": "The amount to convert"
          },
          "from_currency": {
            "type": "string",
            "description": "The currency to convert from"
          },
          "to_currency": {
            "type": "string",
            "description": "The currency to convert to"
          }
        },
        "required": ["amount", "from_currency", "to_currency"]
      }
    }
  }
]
</tools>

Use the following format for each tool call you will make:
<tool_call>
{"name": "function_name", "arguments": {"arg1": "value1", "arg2": "value2"}}
</tool_call>

Also, before making a call to a function take the time to plan the function to take.
Make that thinking process between <think>{your thoughts}</think>

I need to convert 500 USD to Euros. Can you help me with that?<end_of_turn><eos>
<start_of_turn>model
<think>"""

# Generate response
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # use the model's device (CUDA, MPS, or CPU)
outputs = model.generate(**inputs,
                         max_new_tokens=300,
                         do_sample=True,
                         top_p=0.95,
                         temperature=0.01,
                         repetition_penalty=1.0,
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
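
Once you have the decoded output, you still need to pull the reasoning and the tool call out of the text and route the call to real code. Here is a minimal sketch that continues the example above; the regular expressions assume the <think> and <tool_call> tags from the prompt, while the convert_currency implementation and the TOOLS dispatch table are hypothetical placeholders.

import json
import re

def parse_model_output(text: str):
    """Pull the last <think> block and <tool_call> JSON out of the decoded text.

    The prompt itself contains template <think> and <tool_call> examples, so we
    take the last occurrence of each rather than the first.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", text, re.DOTALL)
    calls = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    thinking = thoughts[-1].strip() if thoughts else None
    tool_call = json.loads(calls[-1]) if calls else None
    return thinking, tool_call

# Hypothetical dispatch table mapping tool names to real implementations
def convert_currency(amount, from_currency, to_currency):
    ...  # call your actual exchange-rate service here

TOOLS = {"convert_currency": convert_currency}

thinking, tool_call = parse_model_output(tokenizer.decode(outputs[0]))
if tool_call is not None:
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    # Append `result` to the conversation as a new turn so the model can compose
    # its final natural-language answer.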

Conclusion

Adding a thinking step to function calling is a simple yet powerful enhancement that improves the reliability and transparency of LLM-powered tools. By fine-tuning the Gemma 2-2B model with this capability, I've created a resource that can help developers build more robust conversational interfaces.

As we continue to explore the integration of LLMs into our applications, techniques like this that bridge the gap between freeform natural language and structured API calls will become increasingly valuable.

I see this work as part of my broader journey to understand the mechanics of AI and LLMs — technologies that I believe are as transformative today as the World Wide Web was in the mid-1990s.

If you're building applications that need to translate user requests into actions, I encourage you to try out the model and see how the thinking step can improve your results. And I'd really love to hear about your experiences and any further enhancements you might suggest.

Go build something!!

Resources