From Text to Tables: LLM‑Based Feature Engineering for Tabular Classification

20 March 2026 by

Suraj Barman

Introduction

Large language models are often associated with chat‑style applications, yet they can also act as a feature extractor for mixed datasets. By prompting a Groq‑hosted LLaMA model to output JSON that matches a Pydantic schema, unstructured ticket text becomes a set of structured columns ready for machine learning.

Preparing a Toy Dataset

We start with a small DataFrame that contains both numeric fields (e.g., priority level) and a free‑form description column. This hybrid layout mimics real‑world support tickets where the numeric score informs urgency while the text holds contextual clues.

LLM Prompt Design and JSON Schema

A Pydantic model defines the expected output shape, for example class TicketFeatures(BaseModel): sentiment: float = Field(...); category: str = Field(...). The prompt sent to the LLaMA endpoint asks for these fields, ensuring the response is valid JSON that can be parsed directly into the schema.

Calling the Groq LLaMA Model

Using the OpenAI‑compatible client, we instantiate client = OpenAI(base_url="https://api.groq.com/openai/v1") and send the ticket text with the prepared prompt. The model returns a JSON string the json library parses it, and Pydantic validates the types.

Merging Extracted Features with Numeric Columns

After extraction, we concatenate the new columns to the original numeric DataFrame using pd.concat. At this point each row contains a full set of numeric and text‑derived attributes, ready for model training.

Training and Evaluation with scikit‑learn

We split the engineered table with train_test_split, scale numeric features via StandardScaler, and fit a RandomForestClassifier. The final classification_report shows precision, recall, and F1 scores, confirming that the LLM‑generated features improve predictive power.

Conclusion and Next Steps

This workflow demonstrates that a pretrained LLaMA model can serve as a reliable preprocessor for text fields, turning messy strings into clean, numeric‑friendly data. Future experiments might explore alternative model providers, larger label sets, or integration with pipelines such as sklearn.pipeline.Pipeline to automate the entire process.