IN TODAY'S SIGNAL
Read time: 3 min 34 sec
🎖️ Top News
📌 Gretel
⚡️ Trending Signals
🛠️ Top Repos
- Stanford Storm: Write Wikipedia-like articles with LLMs and retrieval methods.
- llm-graph-builder: Create knowledge graphs from unstructured data using LLMs.
- NVIDIA Warp: High-performance Python for simulations, integrates with PyTorch and JAX.
🧠 Tutorial
If you're enjoying AlphaSignal, please forward this email to a colleague. It helps us keep this content free.
TOP NEWS
Language Models
FlashAttention-3: Making Attention 16x Faster for Language Models
⇧ 1914 Likes
What's New
FlashAttention is widely used to accelerate Transformers, already making attention 4-8x faster, but it had yet to take full advantage of modern GPU hardware.
FlashAttention-3 achieves a 1.5-2x speedup over FlashAttention-2, reaching up to 740 TFLOPS on FP16 and nearly 1.2 PFLOPS on FP8. This raises GPU utilization to 75% of the theoretical maximum on H100 GPUs, up from 35%.
Core Innovations and Techniques
FlashAttention-3 introduces three main techniques to boost performance:
- Overlapping computation and data movement
- Interleaving matrix multiplication (matmul) and softmax operations
- Using low-precision FP8
These techniques leverage new hardware features of Hopper GPUs like WGMMA (Warpgroup Matrix Multiply-Accumulate) and TMA (Tensor Memory Accelerator), which enhance throughput and efficiency.
Overlapping Operations for Increased Efficiency
FlashAttention-3 uses asynchrony to perform multiple operations simultaneously. By overlapping the main operations (matrix multiplications and softmax), it ensures the GPU stays busy and efficient.
Warp specialization means different groups of GPU threads (warps) handle different tasks at the same time.
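As a loose, framework-level analogy (not FlashAttention-3's warp-level pipelining, and not code from the release), here is a minimal PyTorch sketch of the same overlap idea: staging the next batch's data transfer on a side CUDA stream while the current batch computes. All tensor names and sizes are invented for illustration.

# Hypothetical sketch: overlap host-to-device copies with compute using
# CUDA streams. An analogy to FA3's producer/consumer overlap, not its
# actual warp-specialized kernel code.
import torch

assert torch.cuda.is_available()
copy_stream = torch.cuda.Stream()  # side stream for data movement

batches = [torch.randn(4096, 4096, pin_memory=True) for _ in range(4)]
weight = torch.randn(4096, 4096, device="cuda")

cur = batches[0].to("cuda", non_blocking=True)
for i in range(len(batches)):
    if i + 1 < len(batches):
        with torch.cuda.stream(copy_stream):  # prefetch the next batch
            nxt = batches[i + 1].to("cuda", non_blocking=True)
    out = cur @ weight  # compute on the default stream overlaps the copy
    if i + 1 < len(batches):
        torch.cuda.current_stream().wait_stream(copy_stream)
        cur = nxt
torch.cuda.synchronize()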
Key Highlights and Performance Metrics
- Speedup: 1.5-2 times faster than FlashAttention-2
- Throughput: Up to 740 TFLOPS on FP16, nearly 1.2 PFLOPS on FP8
- GPU Utilization: Increased to 75% on H100 GPUs
- Quantization Error Reduction: Up to 2.6 times lower with incoherent processing
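To try the kernels yourself, FlashAttention ships as the flash-attn Python package. A minimal usage sketch, assuming a CUDA GPU and `pip install flash-attn` (the FlashAttention-3 beta targets H100s; the call below uses the stable flash_attn_func interface, with invented shapes):

import torch
from flash_attn import flash_attn_func

# FlashAttention expects (batch, seqlen, nheads, headdim) in fp16/bf16
batch, seqlen, nheads, headdim = 2, 4096, 16, 64
q = torch.randn(batch, seqlen, nheads, headdim,
                device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
print(out.shape)  # torch.Size([2, 4096, 16, 64])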
Read the blog post
CHECK THE REPO
Webinar: Learn How To Accelerate AI Development with Synthetic Data
Discover how high-quality synthetic data is revolutionizing AI development.
In this upcoming webinar, Alex Watson, Co-Founder and CPO of Gretel, and Gretel's product team will dive into the pivotal role synthetic data plays in generative AI and preview the latest Gretel platform additions that make data generation easier than ever.
The webinar will conclude with a live Q&A and open discussion with genAI and synthetic data experts.
REGISTER TODAY
partner with us
TRENDING SIGNALS
Language Models
⇧ 6104 Likes
Text-to-Speech
⇧ 260 Likes
Anthropic
⇧ 1429 Likes
No-Code
⇧ 166 Likes
AGI
⇧ 402 Likes
TOP OF GITHUB
Knowledge Curation
STORM helps you write Wikipedia-like articles by researching topics, generating outlines, and creating full-length reports with citations, using large language models and retrieval from search engines like You.com and Bing Search.
☆ 5666
Graphs
llm-graph-builder: Turn unstructured data (PDFs, docs, txt, YouTube videos, web pages, etc.) into a knowledge graph stored in Neo4j. It uses large language models (OpenAI, Gemini, etc.) to extract nodes.
☆ 556
Simulations
Warp helps you write high-performance Python code for simulations and graphics, running efficiently on CPU or GPU. It supports physics, robotics, and geometry processing, and integrates with PyTorch and JAX for ML.
☆ 3701
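A minimal sketch of Warp's kernel model, assuming `pip install warp-lang`; the kernel and sizes here are invented for illustration:

import warp as wp
import numpy as np

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), factor: float):
    tid = wp.tid()            # one thread per element
    x[tid] = x[tid] * factor  # runs in parallel on CPU or GPU

n = 1024
x = wp.array(np.arange(n, dtype=np.float32), dtype=float, device="cpu")
wp.launch(scale, dim=n, inputs=[x, 2.0])  # swap device="cuda" for GPU
print(x.numpy()[:5])  # [0. 2. 4. 6. 8.]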
TUTORIAL |
How to Embed 100M Docs with 300MB of Memory
GPU-poor and memory-poor, without 500GB of RAM to embed and index 100M docs?
DiskVectorIndex is a vector search solution for memory-constrained environments.
Traditional vector databases require about 500GB of memory for such tasks, but DiskVectorIndex uses advanced vector compression techniques, such as Product Quantization (PQ), to reduce this requirement to around 300MB.
It is ideal for applications with limited GPU and memory resources, enabling efficient participation in large-scale search tasks like TREC-RAG 2024. The system leverages faiss for memory-mapped Inverted File (IVF) indexing, ensuring fast search performance with minimal memory overhead.
# pip install DiskVectorIndex
from DiskVectorIndex import DiskVectorIndex

# 114M embeddings with just 100MB of RAM
index = DiskVectorIndex("Cohere/trec-rag-2024-index")

while True:
    query = input("\nEnter a question: ")
    docs = index.search(query, top_k=3)
    for doc in docs:
        print(doc)
        print("========")
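For intuition on the compression, here is a sketch of the underlying IVF + Product Quantization idea in raw faiss. This is illustrative only, not DiskVectorIndex's actual internals; the dimensions and parameters are invented. Assumes `pip install faiss-cpu`.

import faiss
import numpy as np

d = 768                        # embedding dimension (assumed)
nlist, m, nbits = 1024, 64, 8  # IVF cells; PQ: 64 subvectors x 8 bits

quantizer = faiss.IndexFlatL2(d)  # coarse quantizer for IVF
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(100_000, d).astype("float32")
index.train(xb)  # learn IVF centroids and PQ codebooks
index.add(xb)    # each vector stored in 64 bytes instead of 3072

index.nprobe = 16               # visit 16 of the 1024 cells per query
D, I = index.search(xb[:1], 3)  # distances and ids of top-3 matches
print(I)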
CHECK THE REPO
LAST WEEK'S GREATEST HITS
- Master unstructured data and deep document understanding with RAGFlow
- The ultimate course for mastering retrieval-augmented generation (RAG)
- Real-time Detection Transformer (RT-DETR) models now available