Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles.
We name such transformations LM-Steers and find that they exist in LMs of all sizes. Steering each style requires learning parameters equal to only 0.2% of the original LM's size. On tasks such as language model detoxification and sentiment control, LM-Steers achieve comparable or superior performance to state-of-the-art controlled generation methods while maintaining a better balance with generation quality.
LM-Steer requires only 0.2% of the original LM's parameters to control specific generation styles, making it highly efficient.
Compatible with language models of all sizes.
An LM-Steer is transferable between different language models via an explicit-form calculation.
Train once, apply across different model architectures with minimal adaptation.
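One way such a closed-form transfer could look is sketched below. This is an illustrative sketch under the assumption that the two models share a vocabulary; the least-squares alignment map `M` and the pseudoinverse-based transfer formula are our illustration of an "explicit-form calculation", not necessarily the exact formula used by the method, and all matrices here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d_a, d_b = 200, 12, 10
E_a = rng.standard_normal((vocab, d_a))       # source model's output embeddings
E_b = rng.standard_normal((vocab, d_b))       # target model's output embeddings
W_a = 0.01 * rng.standard_normal((d_a, d_a))  # steer learned on the source model

# Least-squares alignment M with e_a ~= M @ e_b for each shared word,
# solved row-wise as E_a ~= E_b @ M.T
M_T, *_ = np.linalg.lstsq(E_b, E_a, rcond=None)
M = M_T.T                                     # shape (d_a, d_b)

# Map the steer into the target model's embedding space
W_b = np.linalg.pinv(M) @ W_a @ M             # shape (d_b, d_b)
```

The key point is that no gradient training is needed on the target model: the transfer is a single linear-algebra computation.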
The learned LM-Steer serves as a lens into text styles, revealing that word embeddings are interpretable when associated with language model generations.
Can highlight text spans that most indicate style differences.
Multiple LM-Steers can be composed by adding their transformations.
Enables continuous steering by simply scaling the LM-Steer.
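Because each LM-Steer is a linear transformation, composition and continuous control reduce to simple matrix arithmetic. A toy sketch (the steer matrices here are random stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_detox = 0.01 * rng.standard_normal((d, d))     # stand-in for a detox steer
W_positive = 0.01 * rng.standard_normal((d, d))  # stand-in for a sentiment steer

# Compose two styles by adding their transformations
W_combined = W_detox + W_positive

# Continuous steering: scale the steer's strength with a factor eps
# (a negative eps pushes generation in the opposite stylistic direction)
def apply_steer(e, W, eps):
    """Apply a steer of strength eps to one embedding vector e."""
    return e + eps * (W @ e)

e = rng.standard_normal(d)
mild = apply_steer(e, W_detox, 0.5)
strong = apply_steer(e, W_detox, 2.0)
reversed_style = apply_steer(e, W_detox, -1.0)
```

Since the transformation is linear in `eps`, the shift `mild - e` is exactly one quarter of `strong - e`, which is what makes smooth interpolation of style strength possible.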
LM-Steer works by applying linear transformations to word embeddings within language models. This approach is theoretically grounded and empirically effective across various tasks and model sizes.
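The core operation can be sketched in a few lines. This is a toy NumPy sketch with made-up dimensions; `W` and `eps` stand in for the learned steer matrix and the steering strength, and the transformation form `e + eps * W e` is one natural reading of "linear transformations of output word embeddings":

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 16, 100                   # toy hidden size and vocabulary
E = rng.standard_normal((vocab_size, d))  # output word embedding matrix
h = rng.standard_normal(d)                # final hidden state at one position

W = 0.01 * rng.standard_normal((d, d))    # stand-in for a learned steer matrix
eps = 1.0                                 # steering strength

def steered_logits(h, E, W, eps):
    """Transform each output embedding e_v into e_v + eps * W @ e_v,
    which linearly shifts the next-token logits toward the target style."""
    E_steered = E + eps * (E @ W.T)
    return E_steered @ h

base_logits = E @ h                       # unsteered next-token logits
# with eps = 0 the steered model reduces exactly to the original LM
assert np.allclose(steered_logits(h, E, W, 0.0), base_logits)
```

Only `W` is learned; the base LM stays frozen, which is why the added parameter count stays around 0.2% of the model.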
LM-Steer demonstrates excellent performance across various tasks while remaining lightweight and efficient:
| Method | Toxicity ↓ | PPL ↓ | DIST-1 ↑ | DIST-2 ↑ |
|---|---|---|---|---|
| GPT2-Large | 0.27 | 18.5 | 0.45 | 0.83 |
| PPLM | 0.15 | 23.7 | 0.42 | 0.79 |
| GeDi | 0.10 | 27.9 | 0.41 | 0.76 |
| MuCoLa | 0.12 | 22.1 | 0.43 | 0.80 |
| LM-Steer | 0.09 | 19.8 | 0.44 | 0.82 |
LM-Steer achieves state-of-the-art performance in detoxification while maintaining better perplexity and diversity metrics compared to alternative methods.
LM-Steer can be applied to a variety of real-world scenarios:
Automatically reduce toxicity in language model outputs without sacrificing fluency or diversity.
Create safer AI assistants and content generation tools.
Control the sentiment, formality, or style of generated text for different contexts.
Help writers maintain a consistent tone throughout a document.
Analyze what dimensions in word embeddings correspond to specific attributes.
Gain insights into how language models encode style and semantic information.
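One way to probe which embedding dimensions a learned steer acts on is a singular value decomposition of the steer matrix. The sketch below uses a random stand-in for a trained `W`; with a real trained steer, the words scoring highest along the top singular direction tend to be the ones most indicative of the steered style:

```python
import numpy as np

rng = np.random.default_rng(3)
d, vocab = 16, 100
E = rng.standard_normal((vocab, d))  # output word embeddings
W = rng.standard_normal((d, d))      # stand-in for a trained steer matrix

# The top singular directions of W are the embedding dimensions
# the steer amplifies most strongly
U, S, Vt = np.linalg.svd(W)
top_direction = Vt[0]                # most-amplified input direction

# Rank vocabulary items by their projection onto that direction
scores = E @ top_direction
top_word_ids = np.argsort(scores)[::-1][:10]
```

Inspecting the actual words behind `top_word_ids` (e.g. via the tokenizer) is what turns the steer into a lens on the style it encodes.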
If you find LM-Steer helpful for your research, please consider citing our paper:
@article{han2023lm,
  title={LM-Steer: Word Embeddings Are Steers for Language Models},
  author={Han, Chi and Xu, Jialiang and Li, Manling and Fung, Yi and Sun, Chenkai and Jiang, Nan and Abdelzaher, Tarek and Ji, Heng},
  journal={arXiv preprint arXiv:2305.12798},
  year={2023}
}