
On eliminating MatMul from LLMs, using Whisper in your browser, running your own LLMs, 15,140 ChatGPT prompts, and Apple's latest announcements.


AlphaSignal

Hey,

Welcome to today's edition of AlphaSignal. 


Whether you are a researcher, engineer, developer, or data scientist, our summaries are there to keep you up-to-date with the latest breakthroughs in AI. 


Let's get into it,


Lior


IN TODAY'S SIGNAL

πŸ“° Top News

πŸ“Œ Snorkel AI Webinar

⚑️ Top 5 Signals

πŸ› οΈ Top Papers

  • Autoguidance: guides a diffusion model with a smaller, less-trained version of itself, achieving an FID of 1.25 on ImageNet-512.

  • Vision-LSTM: adapts xLSTM as a vision backbone, reaching 77.3% ImageNet-1K accuracy with linear computational complexity.

  • Short Circuiting: reroutes internal model representations to block harmful outputs, cutting compliance with harmful requests by up to 90%.

🧠 Tutorial

  • How to run a 100% local, fully private LLM with llama.cpp 

Read Time: 4 min 18 sec

Enjoying this newsletter?
Please forward it to a friend or colleague. It helps us keep this content free.

TOP NEWS

Architecture

Eliminating matrix multiplication (MatMul) from LLMs

⇧ 5,210 Likes

What's New

The paper "Scalable MatMul-free Language Modeling" went viral on Twitter, generating 2.3 million impressions with its approach that eliminates matrix multiplication (MatMul) from LLMs.

Traditionally, MatMul is essential for processing dense layers and implementing self-attention mechanisms in neural networks, but it demands substantial computational power and memory.


Problem Addressed

LLMs typically depend on MatMul for their core operations; the resulting compute and memory demands restrict their deployment to environments with high-end hardware. 


Solution

The research introduces a method that replaces MatMul with simpler computational techniques, dramatically reducing resource consumption while maintaining model performance.


How It Works

  • In Dense Layers: The method substitutes MatMul with ternary accumulations, where weights take only the values -1, 0, or +1, reducing each dot product to additions and subtractions (see the sketch after this list).

  • For Self-Attention Mechanisms: It utilizes a MatMul-free Linear Gated Recurrent Unit (MLGRU) that operates solely on element-wise products.

  • In Channel Mixing: It employs modified Gated Linear Units (GLUs) that integrate BitLinear layers with ternary weights, efficiently managing data integration across channels with reduced computational overhead.
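
To make the dense-layer substitution concrete, here is a minimal NumPy sketch of a ternary accumulation layer. It is an illustration of the technique under stated assumptions (per-tensor absmean quantization, illustrative names), not the paper's released code:

import numpy as np

# Quantize full-precision weights to {-1, 0, +1}
# (the absmean scaling rule here is an assumption).
def ternary_quantize(w):
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1)

# Dense layer whose "matmul" degenerates to signed accumulation:
# +1 weights add the input, -1 weights subtract it, 0 weights skip it.
def ternary_dense(x, w_ternary):
    out = np.zeros((x.shape[0], w_ternary.shape[1]))
    for j in range(w_ternary.shape[1]):
        plus = x[:, w_ternary[:, j] == 1].sum(axis=1)
        minus = x[:, w_ternary[:, j] == -1].sum(axis=1)
        out[:, j] = plus - minus
    return out

x = np.random.randn(4, 16)                   # batch of 4, 16 features
w = ternary_quantize(np.random.randn(16, 8))
y = ternary_dense(x, w)                      # matches x @ w, with no multiplications

The loop form makes the point explicit; a real kernel would fuse these accumulations, which is where the memory and energy savings come from.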

Impact of the Innovation

Removing MatMul from the calculations in large language models means these models don't need powerful computers to run. This change allows them to work on simpler devices, like smaller servers or even some personal computers, making advanced AI tools available to more people and places.


Performance Metrics

  • Memory Reduction: Inference memory usage drops by more than 10x compared to unoptimized models.

  • Efficiency Gains: Training speed increases by 25.6%, and overall memory requirements drop by 61% relative to conventional approaches.

  • Hardware Optimization: Custom FPGA accelerators demonstrate the practicality of this method by processing billion-parameter models with just 13 watts of power.

READ THE PAPER

Webinar: How to fine-tune LLMs to perform specialized tasks accurately

The key to transforming foundation models such as Meta's Llama 3 into specialized LLMs is high-quality training data, which can be applied via fine-tuning and alignment.


In this webinar, Snorkel AI will give an overview of fine-tuning methods such as DPO, ORPO, and SPIN; explain how to curate high-quality instruction and preference data 10-100x faster (and at scale); and give a live demo showing how we fine-tune, align, and evaluate LLMs.

Join for a live demo and to learn more about:

  • Curating high-quality training data 10-100x faster

  • Emerging LLM fine-tuning and alignment methods

  • Evaluating LLM accuracy for production deployment

Can't attend live? All registrants will receive an on-demand recording.

REGISTER NOW

partner with us

TRENDING SIGNALS

Audio

Whisper can now do real-time transcription locally in your browser (open-source)

⇧ 2,103 Likes

Database

pgvector, an open-source vector-search extension for PostgreSQL, is now faster and 75% cheaper than Pinecone

⇧ 358 Likes

LLMs

Yandex releases a new open-source tool to pre-train LLMs 20% faster

⇧ 59 Likes

Prompting

New repo collects 15,140 ChatGPT prompts from Reddit, Discord, websites, and open-source datasets

⇧ 402 Likes

Industry

Apple introduces on-device and server foundation models, including a ~3B-parameter on-device LLM using fine-tuned LoRA adapters

⇧ 1,514 Likes

TOP PAPERS

Image Generation

Guiding a Diffusion Model with a Bad Version of Itself

⇧ 1,224

Problem

Diffusion models for image generation often struggle with maintaining image diversity and quality, especially in lower-probability regions of the data distribution. Existing methods like classifier-free guidance (CFG) increase prompt alignment and image quality but reduce variation.

Solution
The paper introduces autoguidance, a method where a diffusion model is guided by a less trained or smaller version of itself. This approach aims to improve control over image quality without compromising image diversity, unlike traditional CFG.
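
The guidance rule itself is a one-line extrapolation. A minimal sketch, assuming main_model and weak_model are placeholder callables that return denoised estimates for the same noisy input (the names and the weight value are illustrative):

# Autoguidance: extrapolate away from a weaker version of the same model.
# Unlike CFG, both passes keep the conditioning; only model quality differs.
def autoguided_denoise(main_model, weak_model, x_noisy, sigma, cond, w=2.0):
    d_main = main_model(x_noisy, sigma, cond)   # strong model's estimate
    d_weak = weak_model(x_noisy, sigma, cond)   # smaller / less-trained version
    return d_weak + w * (d_main - d_weak)       # w > 1 pushes past the weak model

Because the weak model makes the same kinds of errors as the strong one, only larger, the extrapolation cancels those errors without collapsing diversity the way CFG does.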

Results
Autoguidance achieved state-of-the-art results on ImageNet-512 with a Fréchet Inception Distance (FID) of 1.25. It also set a new benchmark on ImageNet-64 with an FID of 1.01, significantly enhancing image quality while preserving diversity.

Vision

Vision-LSTM: xLSTM as Generic Vision Backbone

⇧ 898

Problem
Transformers, while effective in computer vision, suffer from high computational costs due to the quadratic complexity of self-attention, especially with high-resolution images.


Solution
Vision-LSTM (ViL) adapts the xLSTM architecture for vision tasks, using a sequence of alternating bi-directional mLSTM blocks to process image patch tokens efficiently with linear computational complexity.
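
A simplified PyTorch sketch of the alternating bi-directional pattern follows, with a plain nn.LSTM standing in for the paper's mLSTM block and all sizes illustrative (an assumption for brevity, not the ViL code):

import torch
import torch.nn as nn

class AlternatingLSTMBackbone(nn.Module):
    def __init__(self, patch=16, dim=192, depth=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        self.blocks = nn.ModuleList(
            [nn.LSTM(dim, dim, batch_first=True) for _ in range(depth)]
        )

    def forward(self, x):
        tokens = self.embed(x).flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        for i, block in enumerate(self.blocks):
            if i % 2 == 1:                     # odd blocks scan patches in reverse
                tokens = torch.flip(tokens, dims=[1])
            out, _ = block(tokens)             # linear-cost recurrence over patches
            tokens = tokens + out              # residual connection
            if i % 2 == 1:                     # restore original patch order
                tokens = torch.flip(tokens, dims=[1])
        return tokens.mean(dim=1)              # pooled feature for classification

feats = AlternatingLSTMBackbone()(torch.randn(2, 3, 224, 224))  # -> (2, 192)

Alternating the scan direction gives every patch a view of both its predecessors and successors without the quadratic token-pair interactions of self-attention.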


Results
ViL outperforms standard vision transformers on ImageNet-1K classification. ViL-T achieves 77.3% accuracy, outdoing DeiT-T at 72.2%. Even in heavily optimized transformer setups, ViL demonstrates competitive performance, with ViL-B reaching 81.6% accuracy versus DeiT-B's 81.8%.

Alignment

Improving Alignment and Robustness with Short Circuiting

⇧ 532

Problem
AI models are vulnerable to adversarial attacks, which compromise model outputs, posing a significant reliability and safety issue. Current defenses like adversarial training fail to generalize against novel attacks and often degrade model performance.


Solution
The paper introduces "Short Circuiting," a technique that manipulates internal model representations to prevent harmful outputs without specific attack training. This method, based on representation engineering, disrupts harmful processes by rerouting them towards safe states, effectively making the model attack-agnostic.
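
A conceptual sketch of the rerouting objective, assuming access to hidden states from a frozen copy of the model (h_orig) and the copy being tuned (h_new); the two loss terms follow the paper's description, but the function itself is illustrative, not the released code:

import torch.nn.functional as F

def short_circuit_loss(h_new, h_orig, is_harmful):
    """h_new: hidden states of the model being tuned; h_orig: frozen original."""
    if is_harmful:
        # Reroute: push harmful-context representations orthogonal to the
        # originals, so the harmful generation process cannot complete.
        cos = F.cosine_similarity(h_new, h_orig, dim=-1)
        return F.relu(cos).mean()
    # Retain: keep benign-context representations close to the original (L2).
    return (h_new - h_orig).norm(dim=-1).mean()

Training mixes both terms, so the model stays useful on benign inputs while any internal trajectory toward harmful content is short-circuited, regardless of which attack produced it.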


Results
Short Circuiting demonstrated a significant reduction in compliance to harmful requests by up to 90% on Llama-3-8B-Instruct models, with minimal performance impact (less than 1% decrease in capability tests). The technique outperforms traditional refusal and adversarial training, maintaining robustness against a wide range of unseen adversarial attacks.

TUTORIAL

How to run a 100% local, fully private LLM with llama.cpp 

llama.cpp is a powerful, MIT-licensed framework designed to run large language models (LLMs) like Meta's LLaMA locally in pure C/C++, ensuring full privacy and data security.

Requirements

  • Operating Systems: macOS, Linux, Windows (via CMake), Docker, FreeBSD

  • Hardware: Support for Apple silicon, AVX architectures, and GPUs via custom CUDA kernels and a Vulkan backend

Step 1: Installation

  • Install llama.cpp using Homebrew:
    brew install llama.cpp

Step 2: Setting Up the Server

  • Start the server with the desired model:
    llama-server --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf --hf-file Phi-3-mini-4k-instruct-q4.gguf

Step 3: Interacting with the Model

  • Send requests to the model with curl (the server exposes an OpenAI-compatible API):
    curl localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello"}]}'
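
For programmatic use, the same endpoint speaks the OpenAI-compatible chat-completions format. A minimal Python example, assuming the server from Step 2 is running on its default port 8080 (standard library only):

import json
import urllib.request

payload = {"messages": [{"role": "user", "content": "Summarize llama.cpp in one sentence."}]}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])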

llama.cpp REPO
