Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles.
We name such transformations LM-Steers and find that they exist in LMs of all sizes. Steering each style requires learning parameters equal to only 0.2% of the original LM's size. On tasks such as language model detoxification and sentiment control, LM-Steers achieve comparable or superior performance to state-of-the-art controlled generation methods while maintaining a better balance with generation quality.
LM-Steer requires only 0.2% of the original LM's parameters to control specific generation styles, making it highly efficient.
Compatible with language models of all sizes.
An LM-Steer is transferable between different language models via an explicit-form calculation.
Train once, apply across different model architectures with minimal adaptation.
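One way such a closed-form transfer could look is sketched below. This is an illustrative sketch under the assumption that the two models share a vocabulary; the least-squares alignment map `M` and the pseudoinverse-based transfer formula are our illustration of an "explicit-form calculation", not necessarily the exact formula used by the method, and all matrices here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d_a, d_b = 200, 12, 10
E_a = rng.standard_normal((vocab, d_a))       # source model's output embeddings
E_b = rng.standard_normal((vocab, d_b))       # target model's output embeddings
W_a = 0.01 * rng.standard_normal((d_a, d_a))  # steer learned on the source model

# Least-squares alignment M with e_a ~= M @ e_b for each shared word,
# solved row-wise as E_a ~= E_b @ M.T
M_T, *_ = np.linalg.lstsq(E_b, E_a, rcond=None)
M = M_T.T                                     # shape (d_a, d_b)

# Map the steer into the target model's embedding space
W_b = np.linalg.pinv(M) @ W_a @ M             # shape (d_b, d_b)
```

The key point is that no gradient training is needed on the target model: the transfer is a single linear-algebra computation.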
The learned LM-Steer serves as a lens into text styles, revealing that word embeddings are interpretable when associated with language model generations.
Can highlight text spans that most indicate style differences.
Multiple LM-Steers can be composed by adding their transformations.
Enables continuous steering by simply scaling the LM-Steer.
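Because each LM-Steer is a linear transformation, composition and continuous control reduce to simple matrix arithmetic. A toy sketch (the steer matrices here are random stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_detox = 0.01 * rng.standard_normal((d, d))     # stand-in for a detox steer
W_positive = 0.01 * rng.standard_normal((d, d))  # stand-in for a sentiment steer

# Compose two styles by adding their transformations
W_combined = W_detox + W_positive

# Continuous steering: scale the steer's strength with a factor eps
# (a negative eps pushes generation in the opposite stylistic direction)
def apply_steer(e, W, eps):
    """Apply a steer of strength eps to one embedding vector e."""
    return e + eps * (W @ e)

e = rng.standard_normal(d)
mild = apply_steer(e, W_detox, 0.5)
strong = apply_steer(e, W_detox, 2.0)
reversed_style = apply_steer(e, W_detox, -1.0)
```

Since the transformation is linear in `eps`, the shift `mild - e` is exactly one quarter of `strong - e`, which is what makes smooth interpolation of style strength possible.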
LM-Steer works by applying linear transformations to word embeddings within language models. This approach is theoretically grounded and empirically effective across various tasks and model sizes.
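The core operation can be sketched in a few lines. This is a toy NumPy sketch with made-up dimensions; `W` and `eps` stand in for the learned steer matrix and the steering strength, and the transformation form `e + eps * W e` is one natural reading of "linear transformations of output word embeddings":

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 100                   # toy hidden size and vocabulary
E = rng.standard_normal((vocab_size, d))  # output word embedding matrix
h = rng.standard_normal(d)                # final hidden state at one position

W = 0.01 * rng.standard_normal((d, d))    # stand-in for a learned steer matrix
eps = 1.0                                 # steering strength

def steered_logits(h, E, W, eps):
    """Transform each output embedding e_v into e_v + eps * W @ e_v,
    which linearly shifts the next-token logits toward the target style."""
    E_steered = E + eps * (E @ W.T)
    return E_steered @ h

base_logits = E @ h                       # unsteered next-token logits
# with eps = 0 the steered model reduces exactly to the original LM
assert np.allclose(steered_logits(h, E, W, 0.0), base_logits)
```

Only `W` is learned; the base LM stays frozen, which is why the added parameter count stays around 0.2% of the model.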
LM-Steer demonstrates excellent performance across various tasks while remaining lightweight and efficient:
| Method | Toxicity ↓ | PPL ↓ | DIST-1 ↑ | DIST-2 ↑ |
|---|---|---|---|---|
| GPT2-Large | 0.27 | 18.5 | 0.45 | 0.83 |
| PPLM | 0.15 | 23.7 | 0.42 | 0.79 |
| GeDi | 0.10 | 27.9 | 0.41 | 0.76 |
| MuCoLa | 0.12 | 22.1 | 0.43 | 0.80 |
| LM-Steer | 0.09 | 19.8 | 0.44 | 0.82 |
LM-Steer achieves state-of-the-art performance in detoxification while maintaining better perplexity and diversity metrics compared to alternative methods.
LM-Steer can be applied to a variety of real-world scenarios:
Automatically reduce toxicity in language model outputs without sacrificing fluency or diversity.
Create safer AI assistants and content generation tools.
Control the sentiment, formality, or style of generated text for different contexts.
Help writers maintain a consistent tone throughout a document.
Analyze what dimensions in word embeddings correspond to specific attributes.
Gain insights into how language models encode style and semantic information.
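One way to probe which embedding dimensions a learned steer acts on is a singular value decomposition of the steer matrix. The sketch below uses a random stand-in for a trained `W`; with a real trained steer, the words scoring highest along the top singular direction tend to be the ones most indicative of the steered style:

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab = 16, 100
E = rng.standard_normal((vocab, d))  # output word embeddings
W = rng.standard_normal((d, d))      # stand-in for a trained steer matrix

# The top singular directions of W are the embedding dimensions
# the steer amplifies most strongly
U, S, Vt = np.linalg.svd(W)
top_direction = Vt[0]                # most-amplified input direction

# Rank vocabulary items by their projection onto that direction
scores = E @ top_direction
top_word_ids = np.argsort(scores)[::-1][:10]
```

Inspecting the actual words behind `top_word_ids` (e.g. via the tokenizer) is what turns the steer into a lens on the style it encodes.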
If you find LM-Steer helpful for your research, please consider citing our paper:
@article{han2023lm,
  title={LM-Steer: Word Embeddings Are Steers for Language Models},
  author={Han, Chi and Xu, Jialiang and Li, Manling and Fung, Yi and Sun, Chenkai and Jiang, Nan and Abdelzaher, Tarek and Ji, Heng},
  journal={arXiv preprint arXiv:2305.12798},
  year={2023}
}