Issue #8: How Apple built its Apple Intelligence Foundation Models

Plus Llama 3, protecting copyrighted data, AI that debugs code, agent tool usage, and mitigating hallucinations

Hello readers! In this issue we cover:

  1. How Apple designed and built its Apple Intelligence Foundation Models

  2. How LLMs can prevent generating copyrighted material

  3. How LLMs can be better at debugging code

  4. The paper introducing the Llama 3 405B and family of models

  5. Cost-effectively mitigating hallucinations

  6. How to give agents a large suite of tools

🍎 How Apple built its Apple Intelligence Foundation Models

Apple Intelligence Architecture

Apple details how two key foundation models, AFM-on-device and AFM-server (AFM stands for Apple Foundation Model), are built and adapted to perform specialized tasks.

They discuss some core principles they follow, such as privacy protection, as well as the architecture and training steps. They use a bot called Applebot to crawl the web for data, crawl GitHub for code, and use public datasets to train their models. They also discuss their compute requirements, including 8,192 TPUv4 chips.

A fascinating read if you want to look under the hood of Apple Intelligence.

🦙 The Llama 3 Herd of Models

This paper introduces Llama 3, a new set of foundation models for artificial intelligence. Llama 3 is described as a "herd" of language models designed to support multiple languages, coding, reasoning, and tool usage. The largest model in this set is a dense Transformer with 405 billion parameters and can process up to 128,000 tokens in its context window.

The authors report that Llama 3 performs comparably to leading language models like GPT-4 across various tasks. They are publicly releasing several versions of Llama 3, including pre-trained and post-trained versions of the 405B parameter model, as well as Llama Guard 3 for input and output safety.

The paper also details experiments integrating image, video, and speech capabilities into Llama 3 using a compositional approach. This integrated version reportedly performs competitively with state-of-the-art models on image, video, and speech recognition tasks. However, these multimodal versions are still under development and not yet ready for broad release.

✍️ Preventing LLMs from Generating Copyrighted Materials

Researchers introduce a novel approach to address the challenge of language models inadvertently reproducing copyrighted material from their training data. The authors propose an algorithm named Copyright-Protecting Fusion (CP-Fuse) as an effective safeguard against copyright infringement.

CP-Fuse is designed to adaptively combine language models to minimize the reproduction of protected materials. It builds upon the recently proposed Near-Access Free (NAF) framework and incorporates a balancing property that prevents the reproduction of memorized training data.

The authors' results indicate that CP-Fuse significantly reduces the memorization of copyrighted content while maintaining high-quality text and code generation capabilities.
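To make the "adaptively combine language models" idea concrete, here is a minimal sketch of fusing two models' next-token distributions with a balancing weight. The fixed `alpha`, the dictionary representation of distributions, and the function name are all illustrative assumptions; CP-Fuse itself chooses the weight adaptively at each step, and this is not the authors' implementation.

```python
def fuse_distributions(p1, p2, alpha=0.5):
    """Combine two models' next-token distributions as a weighted
    geometric mean, then renormalize. A token memorized strongly by
    only one model gets a lower fused probability than that model
    alone would assign. Illustrative sketch, not the CP-Fuse paper's
    actual algorithm (which picks the weight adaptively per step)."""
    vocab = set(p1) | set(p2)
    fused = {t: (p1.get(t, 1e-12) ** alpha) * (p2.get(t, 1e-12) ** (1 - alpha))
             for t in vocab}
    z = sum(fused.values())  # renormalize so probabilities sum to 1
    return {t: v / z for t, v in fused.items()}
```

Note how a token one model assigns 0.9 and the other 0.1 ends up at the geometric mean, dampening any single model's memorized output.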

👾 Nvidia researchers improve how LLMs can debug code

BESTER algorithm

LLMs show potential in code generation but struggle with iteratively debugging programs. This research proposes an algorithm to enhance LLMs' debugging capabilities through self-reflection and search, in which the model identifies its previous mistakes. The key ideas in the paper are:

1. A best-first tree search algorithm with self-reflections (BESTER) that achieves state-of-the-art scores on three code generation benchmarks and maintains high pass rates even when accounting for the additional inference costs.

2. A novel interpretability study on self-reflections in buggy programs and their impact on bug fixes, offering deeper insights into the debugging process.

3. An extensive study on the effectiveness of self-reflections in finding bugs.
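The best-first search loop at the heart of this idea can be sketched in a few lines. Everything here is a hypothetical stand-in: `score_fn` plays the role of running the candidate program against tests, and `reflect_fn` plays the role of the LLM proposing repairs from a self-reflection; this is not NVIDIA's implementation.

```python
import heapq

def best_first_debug(initial_candidates, score_fn, reflect_fn, max_expansions=10):
    """Best-first tree search over candidate programs, in the spirit of
    BESTER. score_fn returns a pass rate in [0, 1]; reflect_fn(candidate)
    proposes repaired candidates derived from a self-reflection on the
    failures. Hypothetical sketch, not the paper's code."""
    counter = 0                 # tie-breaker so the heap never compares candidates
    heap = []
    for cand in initial_candidates:
        heapq.heappush(heap, (-score_fn(cand), counter, cand))  # max-heap via negation
        counter += 1
    best = None
    for _ in range(max_expansions):
        if not heap:
            break
        neg, _, cand = heapq.heappop(heap)      # expand the most promising candidate
        if best is None or -neg > best[0]:
            best = (-neg, cand)
        if -neg == 1.0:                         # all tests pass: done
            return cand
        for child in reflect_fn(cand):          # "self-reflect" and push repairs
            heapq.heappush(heap, (-score_fn(child), counter, child))
            counter += 1
    return best[1] if best else None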

👻 AWS Researchers Introduce Cost-Effective Hallucination Detection for LLMs

Proposed hallucination detection approach.

This paper addresses hallucinations in LLMs by proposing a three-step pipeline: generating a confidence score, calibrating it based on inputs and responses, and applying a threshold for detection. They benchmark various scoring methods across different tasks and LLMs, emphasizing the importance of score calibration. Finding that no single method excels in all situations, they introduce a multi-scoring framework that combines different scores, achieving top performance across datasets. The researchers also develop a cost-effective version of this approach, which maintains high performance while reducing computational overhead.
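The three-step pipeline (score, calibrate, threshold) can be sketched as follows. The Platt-style sigmoid calibrator and the parameter names are assumptions for illustration; the paper benchmarks several scoring and calibration methods rather than prescribing this one.

```python
import math

def sigmoid_calibrator(a=1.0, b=0.0):
    """Platt-style scaling: maps a raw confidence score to a probability
    of correctness. In practice a and b would be fit on held-out labeled
    responses; defaults here are illustrative."""
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

def detect_hallucination(raw_score, calibrator, threshold=0.5):
    """Step 1: a raw confidence score is produced for a response.
    Step 2: the calibrator maps it to a probability of correctness.
    Step 3: responses below the threshold are flagged as likely
    hallucinations. Sketch of the pipeline shape, not AWS's code."""
    p_correct = calibrator(raw_score)
    return p_correct < threshold  # True means "flag as likely hallucination"
```

The multi-scoring framework from the paper would replace the single `raw_score` with a combination of several scores before calibration.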

🌷 Giving agents a large tool library (with code)

Tulip agent architecture and information flow

Researchers at Honda develop the tulip agent, a new architecture for LLM-based autonomous agents that efficiently manages a large, extensible tool library. Unlike current systems that include all tool descriptions in the prompt, the tulip agent uses a recursive search method to find suitable tools in a vector store. This approach reduces inference costs, allows for larger tool libraries, and enables the agent to adapt its toolset. The architecture was evaluated through mathematics-based ablation studies and a robotics application, demonstrating its effectiveness and versatility. The authors have made the implementation and benchmark publicly available on GitHub.
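The core trick — retrieving only the relevant tool descriptions instead of prompting with all of them — can be sketched with a toy similarity search. The bag-of-words "embedding" below is a stand-in for a real sentence-embedding model and vector store, and all names are illustrative, not the tulip agent's API.

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would use a
    sentence-embedding model backed by a vector store."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query, tool_descriptions, top_k=2):
    """Return the top_k tool names most similar to the task query, so
    only those descriptions are placed in the agent's prompt."""
    q = embed(query)
    ranked = sorted(tool_descriptions.items(),
                    key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

With thousands of tools in the library, only the handful returned here would ever enter the prompt, which is where the inference-cost savings come from.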

🤯 Today I Learned

Every issue, we highlight new AI concepts and terminology to help educate our readers. This issue we learned about:

Adaptive Model Fusion

An adaptive fusion model combines information from multiple sources (like text, images, and audio) in a flexible way, adjusting how it integrates this information based on the specific input. This approach improves performance by leveraging different types of data effectively. It's used in areas like multimodal learning, sensor fusion in autonomous vehicles, and medical diagnosis. While these models offer better accuracy and versatility, they are also more complex to design and require large amounts of diverse data.
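A minimal sketch of the "adjusts how it integrates this information based on the specific input" part: a gate weights two modality feature vectors before combining them. Using a sigmoid of a scalar relevance score as the gate is an illustrative assumption; real systems learn the gate from the inputs themselves.

```python
import math

def gated_fusion(text_feat, image_feat, gate_score):
    """Adaptive fusion sketch: a gate in (0, 1), here a sigmoid of a
    scalar relevance score, decides per input how much weight the text
    features get versus the image features. Illustrative only."""
    g = 1.0 / (1.0 + math.exp(-gate_score))
    return [g * t + (1 - g) * i for t, i in zip(text_feat, image_feat)]
```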

Dense Decoder-Only Model

A dense decoder-only transformer model is a specific type of neural network architecture used in natural language processing and other sequence-based tasks. Here are the key aspects:

  1. Transformer-based: It uses the transformer architecture, which relies on self-attention mechanisms to process input sequences.

  2. Decoder-only: Unlike encoder-decoder models, it only uses the decoder part of the transformer. This means it's primarily designed for tasks that generate output sequences, such as text generation.

  3. Dense: The model uses dense (fully connected) layers throughout its architecture, as opposed to sparse models that might use techniques like mixture-of-experts.

  4. Autoregressive: It generates output tokens one at a time, each based on all previously generated tokens.

  5. Typically used for: Language modeling, text generation, and completion tasks.

  6. Examples: GPT (Generative Pre-trained Transformer) family of models are well-known examples of dense decoder-only transformers.

These models are powerful for generating coherent and contextually relevant text, but they can be computationally intensive due to their dense nature. They've shown impressive results in various natural language tasks, particularly when scaled to large sizes.
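Point 4 above, autoregressive generation, is the part most easily shown in code: the model scores candidate next tokens given everything generated so far, and the loop appends one token at a time. The `next_token_logits_fn` below is a stand-in for a decoder-only transformer; its dictionary interface is an assumption for illustration.

```python
def generate(next_token_logits_fn, prompt, max_new_tokens=5, eos=None):
    """Greedy autoregressive decoding: at each step the model scores
    candidate next tokens given all tokens so far; we append the
    highest-scoring one and repeat. next_token_logits_fn stands in for
    a decoder-only transformer and returns a dict of token -> score."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits_fn(tokens)   # conditioned on full history
        next_tok = max(logits, key=logits.get)  # greedy argmax
        tokens.append(next_tok)
        if next_tok == eos:
            break
    return tokens
```

A real model would also apply sampling or beam search instead of the pure argmax shown here.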

Log-likelihood

Log likelihood is a fundamental concept in statistics and machine learning, often used in parameter estimation and model evaluation. Here's a concise explanation:

Log likelihood is the natural logarithm of the likelihood function. The likelihood function measures how well a statistical model fits observed data for different values of the model's parameters. By taking the logarithm of this function, we get the log likelihood.

Key points about log likelihood:

  1. It's used to simplify calculations, as multiplication becomes addition when using logarithms.

  2. It's often maximized to find the best parameter estimates for a model (Maximum Likelihood Estimation).

  3. Higher log likelihood values generally indicate better model fit.

  4. It's useful in comparing different models' performance on the same data.

  5. In machine learning, negative log likelihood is commonly used as a loss function.

Log likelihood is particularly useful because it can handle very small probability values without running into computational underflow issues that might occur with regular likelihood calculations.
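The points above are easy to verify numerically. Here is a log likelihood for Bernoulli (coin-flip) data; the function name is our own, but the math is the standard definition.

```python
import math

def bernoulli_log_likelihood(data, p):
    """Log likelihood of binary outcomes under Bernoulli(p): the sum of
    log p for each 1 and log(1 - p) for each 0. Summing logs avoids the
    underflow that multiplying many small probabilities would cause."""
    return sum(math.log(p) if x else math.log(1 - p) for x in data)
```

For data like `[1, 1, 1, 0]`, the log likelihood is higher at the maximum likelihood estimate p = 3/4 than at p = 1/2, illustrating points 2 and 3 above.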

Tool

In AI agents, a "tool" refers to external resources or capabilities that enhance the agent's performance and task-solving ability. These tools can include APIs, databases, web browsing, plugins, knowledge graphs, mathematical tools, NLP tools, and vision and speech recognition. AI agents use these tools for tasks like data retrieval, enhanced understanding, and improved user interaction. Examples include chatbots using APIs for order information and virtual assistants controlling smart devices. Tools increase versatility, accuracy, and efficiency but can pose integration, dependency, and security challenges.
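Mechanically, tool use usually reduces to dispatch: the agent parses a tool call (name plus arguments) out of the LLM's output and routes it to a registered function. The sketch below is a generic pattern with illustrative names, not any particular framework's API.

```python
def make_agent(tools):
    """Minimal tool-use dispatch: maps a parsed tool call (name plus
    keyword arguments) to a registered Python function. In a real agent
    the name and arguments come from the LLM's structured output."""
    def call(name, **kwargs):
        if name not in tools:
            raise KeyError(f"unknown tool: {name}")
        return tools[name](**kwargs)
    return call
```

Registering a tool is then just adding an entry to the dictionary, which is what makes tool libraries easy to extend (and why retrieval over large libraries, as in the tulip agent above, becomes useful).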