1. Background and Evolution of Large Language Models (LLMs)
An LLM is a generative model that takes in large sets of unstructured data and can generate large volumes of textual output.
1.1 A timeline of LLM evolution :-
The development of LLMs is not new; they went through a gradual evolution starting around 2002-03:
- 2003 - Bag of Words : ML for Natural Language Processing (NLP)
- 2008 - TF-IDF : Multi-task Learning
- 2013 - Co-occurrence Matrix : Word embeddings
- 2013 - Word2Vec/GloVe : NLP neural nets
- 2014 - Seq to Seq Learning
- 2015 - Attention mechanisms; 2017 - Transformer models
And then comes the explosion of development on LLM:
- 2018-19 - ELMo/BERT/XLNet : Pre-trained models
- Nov 2022 - OpenAI's GPT-3.5
- Dec 2022 - Google's Med-PaLM
- Feb 2023 - Amazon's Multimodal-CoT
- Feb 2023 - Meta's LLaMA
- Feb 2023 - Microsoft's Kosmos-1
- Mar 2023 - Salesforce's Einstein GPT
- Mar 2023 - OpenAI's GPT-4
- Mar 2023 - Google's Bard
- Mar 2023 - Bloomberg's BloombergGPT
- Apr 2023 - Amazon's Bedrock
1.2 Genesis of the Transformer Model (Ref: Google's research paper "Attention Is All You Need", 2017)
Before 2016, deep learning models for language were largely based on Recurrent Neural Networks (RNNs). These were hard to scale: the architecture was linear and sequential, computing one output and passing it into the next input. Google introduced transformer blocks, which model the input non-sequentially. Instead of processing a sentence word by word, a transformer uses attention to build relationships between each word and the other words in the input sequence as a block. This enables parallel computation and much faster scaling, and it was a revolution in architecture. The volume of training data increased manifold from GPT-1 to GPT-2 and now GPT-3 and GPT-4, with the corpus used to train the models growing to billions of examples. Transformers brought the key revolution to LLMs: while they still implement an encoder-decoder architecture, they do not rely on recurrent neural networks.
The transformer architecture dispenses of any recurrence and instead relies solely on a self-attention (or intra-attention) mechanism.
In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d …
– Advanced Deep Learning with Python, 2019.
Transformers can capture global/long range dependencies between input and output, support parallel processing, require minimal inductive biases (prior knowledge), demonstrate scalability to large sequences and datasets, and allow domain-agnostic processing of multiple modalities (text, images, speech) using similar processing blocks.
1.3 Three basic sorts of LLM architectures (building on the transformer from the "Attention Is All You Need" paper*):-
The encoder-decoder architecture has been extensively applied to sequence-to-sequence (seq2seq) tasks for language processing. Examples of such tasks within the domain of language processing include machine translation and image captioning.
The earliest use of attention was as part of RNN based encoder-decoder framework to encode long input sentences [Bahdanau et al. 2015]. Consequently, attention has been most widely used with this architecture.
– An Attentive Survey of Attention Models, 2021.
- Encoder-Only
- Decoder-Only
- Encoder-Decoder
Let's see what these are :
1.3.1 Encoder-Only: Ex: BERT
Compacts/encodes input text into a representation; useful when you have a lot of text data and need one particular output, such as the sentiment or topic of a discussion
- Popularized via successful architectures like BERT*
- Very good for predictive use on unstructured data
Encoder-only models are still very useful for training predictive models based on text embeddings versus generating texts.
BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only architecture based on the Transformer's encoder module. The BERT model is pretrained on a large text corpus using masked language modeling and next-sentence prediction tasks.
1.3.2 Decoder-Only: Ex: GPT architecture: takes a prompt and generates new text token by token, e.g. answering a question or drafting a passage
- Popularised via original GPT models
- Driving the Gen AI market buzz
Decoder-only models are used for generative tasks including Q&A.
The GPT (Generative Pre-trained Transformer) series are decoder-only models pretrained on large-scale unsupervised text data and finetuned for specific tasks such as text classification, sentiment analysis, question-answering, and summarization. The GPT models, including GPT-2, GPT-3 ("Language Models are Few-Shot Learners", 2020), and the more recent GPT-4, have shown remarkable performance in various benchmarks and are currently the most popular architecture for natural language processing.
1.3.3 Encoder-Decoder: Maps an input sequence into an output sequence. Ex: French-to-English translation
- The original transformer architecture from the "Attention Is All You Need" paper
- Used for translation tasks; introduces cross-attention between encoder and decoder
Encoder-decoder models are typically used for natural language processing tasks that involve understanding input sequences and generating output sequences, often with different lengths and structures. They are particularly good at tasks where there is a complex mapping between the input and output sequences and where it is crucial to capture the relationships between the elements in both sequences. Some common use cases for encoder-decoder models include text translation and summarization.
Some notable examples of these encoder-decoder models include BART and T5.
*Ref:
https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder
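To make the three architecture families above concrete, here is a minimal sketch using the Hugging Face transformers library (assuming it is installed; the model names are illustrative defaults, not the only choices):

```python
from transformers import pipeline

# Encoder-only (BERT-style): predictive tasks on text, e.g. sentiment analysis
classifier = pipeline("sentiment-analysis")  # defaults to a DistilBERT checkpoint
print(classifier("The claims process was quick and painless."))

# Decoder-only (GPT-style): autoregressive text generation from a prompt
generator = pipeline("text-generation", model="gpt2")
print(generator("Underwriting is the process of", max_new_tokens=20))

# Encoder-decoder (T5/BART-style): sequence-to-sequence, e.g. translation
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The policy covers fire and theft."))
```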
1.5 LLMs are based on three types of building blocks:-
1.5.1 Attention:
In the context of LLM, attention is defined as a mechanism that allows the model to selectively focus on different parts of the input text. This mechanism helps the model attend to the input text’s most relevant parts and generate more accurate predictions
The use of attention in LLMs is to improve the model’s ability to understand the context of the input text and generate more coherent and relevant output. Attention mechanisms in LLMs, particularly the self-attention mechanism used in transformers, allow the model to weigh the importance of different words or phrases in a given context.
There are two types of attention mechanisms in LLMs:
- Self-attention is used to weigh the importance of different words or phrases within the same input text.
- Cross-attention is used to weigh the importance of different words or phrases between two different input texts.
The measurement of attention in LLMs is done by calculating the attention weights assigned to each word or phrase in the input text. These weights are calculated using a softmax function, which normalizes the weights and ensures that they sum up to 1
Here are a couple of examples of how attention is used in LLMs:
- In machine translation, attention is used to align the source and target sentences and generate more accurate translations
- In question answering, attention is used to identify the most relevant parts of the input text that can help answer the question
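As a rough illustration of how the attention weights described above are computed, here is a minimal NumPy sketch of scaled dot-product self-attention (the Q, K, V matrices are toy values; real models learn them):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights with a softmax over scaled dot products."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                # weighted sum of values, plus the weights

# Toy example: 3 tokens, embedding dimension 4
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))            # self-attention: Q, K, V come from the same tokens
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))                    # confirms each row of weights sums to 1
```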
1.5.2 Parallelism and Scalability:
Training an LLM involves enormous amounts of data and compute: the models are trained on large text corpora for tasks such as language translation, text summarization, and question answering, so parallelism and scalability are central concerns.
Parallelism is used to train the model faster by distributing the workload across multiple processors or GPUs. There are two types of parallelism: data parallelism and model parallelism.
Data parallelism involves splitting the data into smaller batches and processing them in parallel across multiple processors or GPUs. This technique is useful for speeding up training when the dataset is large and the model still fits into a single GPU's memory.
Model parallelism involves splitting the model into smaller parts and processing them in parallel across multiple processors or GPUs. This technique is useful when the model is too large to fit into a single processor or GPU memory.
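As a rough sketch of data parallelism (assuming PyTorch is available; the model and batch here are toy stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Data parallelism: replicate the model on each available GPU and split each
# batch across them (DistributedDataParallel is preferred for real training).
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

device = next(model.parameters()).device
batch = torch.randn(256, 512).to(device)
out = model(batch)   # the batch is scattered across the GPUs and results gathered back
print(out.shape)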
Scalability is used to train the model on larger datasets or with more complex architectures. Scalability can be measured in terms of speedup and efficiency.
Speedup is the ratio of the time taken to complete a task on a single processor or GPU to the time taken to complete the same task on multiple processors or GPUs. A higher speedup indicates better scalability.
Efficiency is the ratio of the speedup to the number of processors or GPUs used. A higher efficiency indicates better scalability.
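For illustration (hypothetical numbers): if a training run takes 100 hours on a single GPU and 16 hours on 8 GPUs, the speedup is 100 / 16 = 6.25x and the efficiency is 6.25 / 8 ≈ 0.78, i.e. about 78% of ideal linear scaling.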
Here are a couple of examples of LLMs:
GPT-3: A state-of-the-art LLM developed by OpenAI with 175 billion parameters, used for various natural language processing tasks such as language translation, text summarization, and question answering.
BERT: Another popular LLM, developed by Google, with about 340 million parameters, used for tasks such as sentiment analysis, named entity recognition, and question answering.
1.5.3 Sequence Modeling:
Sequence modeling is the task of predicting the next element of a sequence given the elements before it. LLMs are trained as sequence models over text: given the tokens seen so far, the model learns a probability distribution over the next token, and generation repeats this prediction step token by token.
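As a toy illustration of next-token sequence modeling, the sketch below uses a simple bigram frequency model as a stand-in for a real neural LLM:

```python
from collections import Counter, defaultdict

corpus = "the policy covers fire and the policy covers theft".split()

# Count how often each word follows each other word (a bigram model:
# the simplest possible sequence model, standing in for a neural LLM).
next_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_counts[current][nxt] += 1

def predict_next(word):
    """Greedily predict the most likely next word given the current word."""
    return next_counts[word].most_common(1)[0][0]

# Autoregressive generation: feed each prediction back in as the next input.
token = "the"
generated = [token]
for _ in range(4):
    token = predict_next(token)
    generated.append(token)
print(" ".join(generated))   # e.g. "the policy covers fire and"
```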
2. GenAI - Popular ones being used in 2023-24 :
Among the popular models, LLaMA is a good open-source model for text output, and codellama-7b-instruct is a 7-billion-parameter model that can be used for writing code.
Prompts are the inputs given to an LLM to obtain the desired output. Crafting prompts well to get the best output is an engineering discipline called Prompt Engineering.
3. What does Generative AI do?
Traditional AI took time for development, iterations, deployment, consumption, data training, etc.
Gen AI solutions can be stood up in a matter of weeks.
Gen AI can basically perform one of the following functions, with some examples:-
3.1. Summarization : Regulatory Guidelines/ Risk reports/ UW/Claims/policy/ Corporate Functions
3.2. Reference & Co-Pilot : Extract key information like Information extraction/ Risk Analysis/ Sentiment mining/ Fraud/ Event detection/ Web mining/ CX/CJ Insights
3.3. Expansion: Automated mails/ descriptions/ qualitative reports/ Synthetic data/ Advisory-B2B/ B2C
3.4. Transformation : Change language or structure, e.g. translation/ code writing/ data format change/ tone change/ AI-driven BI
4. Ways to use LLM APIs
- Integrate into other apps (see the API-call sketch after this list)
- Virtual Assistants
- Developer Co-Pilot
- Custom Applications
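As a minimal sketch of calling an LLM over an HTTP API from another application (the endpoint shown is OpenAI's chat-completions API; the model name and key are placeholders, and error handling is omitted):

```python
import os
import requests

API_KEY = os.environ.get("OPENAI_API_KEY", "sk-...")   # placeholder key

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-4o-mini",           # illustrative model name
        "messages": [
            {"role": "system", "content": "You are an insurance underwriting assistant."},
            {"role": "user", "content": "Summarize the key risk factors in this application: ..."},
        ],
        "temperature": 0.2,
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```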
5. Example use cases from Insurance:-
- Content Generation: Lead generation/ Onboarding/ Customer Management/ Delinquency & Foreclosures
- Workflow management:
- Client Experience and Interaction
- Security Compliance
- Workflow Optimization
6. There are 2 broad patterns in which use cases fall:-
6.1. Retrieval Augmented Generation (RAG) - Retrieve and answer in context (in-context learning)
Variations include agent-based architectures and fine-tuning for optimization.
Question/Task > LLM > Indexed Query (Indexing) > Vector Store > Retrieved Context > Contextual Prompt Creation > LLM > Output > Output Parsing > Answer
The LLM itself is a fixed, out-of-the-box, domain-agnostic model (e.g. GPT-3/4); you supply context along with the question.
Context can be given zero-shot or few-shot (e.g., with examples of the desired output).
Because your data stays outside the model, there is no data leakage problem.
If instead you pass your data directly into the LLM (fine-tuning: "given this context, here are examples of the right answer", so the model aligns to it), the model can give direct answers, but this raises security challenges. Fine-tuning increases accuracy and alignment, but the data can become stale because it is not always live. It is therefore recommended to go with RAG, which keeps your data outside the LLM, instead of fine-tuning the model on your own data.
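A minimal sketch of the RAG flow above, with hypothetical helpers embed(), VectorStore, and llm() standing in for a real embedding model, vector database, and LLM API:

```python
# Hypothetical building blocks: embed(), VectorStore, and llm() stand in for a
# real embedding model, vector database, and LLM API call.

def embed(text: str) -> list[float]:
    """Placeholder embedding: in practice call an embedding model/API."""
    return [float(ord(c)) for c in text[:8]]

class VectorStore:
    def __init__(self):
        self.docs = []                       # list of (embedding, text) pairs

    def add(self, text: str):
        self.docs.append((embed(text), text))

    def search(self, query: str, k: int = 2):
        """Return the k documents closest to the query (toy dot-product similarity)."""
        q = embed(query)
        scored = sorted(self.docs, key=lambda d: -sum(a * b for a, b in zip(q, d[0])))
        return [text for _, text in scored[:k]]

def llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    return f"[LLM answer based on prompt of {len(prompt)} chars]"

# 1. Index your documents into the vector store
store = VectorStore()
store.add("Underwriting guideline: self-employed income needs two years of returns.")
store.add("Claims policy: water damage is covered only for sudden events.")

# 2. Retrieve context for the question, 3. build a contextual prompt, 4. call the LLM
question = "How do we qualify self-employed income?"
context = "\n".join(store.search(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(llm(prompt))
```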
So in the Underwriting example:-
a) Agent Assistance: a smart bot to enable agents to answer different customer/prospect questions by tapping into existing contractual docs, guidelines, benefits, calculations, research & insights, e.g. knowledge management, chatbots, ...
b) Underwriting/Risk Co-pilots for Mortgage: The co-pilot helps the underwriter go through several steps to assess risk: summarize qualifying income by reviewing and identifying the various sources of income, support appraisals, and contextualize credit and assets. An LLM can curate this much better than analytic AI, automatically analyzing live data, writing output, and generally helping the underwriter.
6.2. Multi-Hop/ Multi-stage Problem Solving :
Insight Agents: conversational business intelligence, data quality assurance, analytics and insights co-pilot, decision support agent.
Multi-hop problem solving is different from RAG. Some large LLMs have logic and reasoning built in, so you can build agents for, say, conversational BI instead of a purely textual conversational bot. Multi-hop approaches can build insights, cross-match data, and apply chains of thought. ReAct (Reason and Act) runs a sequence of reasoning and action steps to accomplish a task before the final output is given, and more complex reasoning can also be attempted. You may need graph-style analysis, for example a concentration of claims that points to one garage, or relationships between claims; knowledge graphs across data sets can surface unique intelligence from simple textual question prompts and can help in decision support systems.
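A rough sketch of a ReAct-style agent loop, with a hypothetical llm() and a toy tool standing in for real ones:

```python
# Hypothetical llm() and a toy tool standing in for a real LLM and data source.
def llm(prompt: str) -> str:
    """Placeholder: a real implementation would call an LLM API here."""
    if "Observation" not in prompt:
        return "Thought: I need claim counts per garage.\nAction: query_claims[garage]"
    return "Thought: Garage G42 has an unusual concentration.\nFinal Answer: Investigate garage G42."

TOOLS = {
    "query_claims": lambda arg: "G42: 57 claims, G17: 4 claims, G09: 3 claims",
}

def react_agent(task: str, max_steps: int = 5) -> str:
    prompt = f"Task: {task}"
    for _ in range(max_steps):
        step = llm(prompt)                                   # Reason: model produces a thought + action
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        action = step.split("Action:")[1].strip()            # e.g. "query_claims[garage]"
        tool, arg = action.split("[")
        observation = TOOLS[tool.strip()](arg.rstrip("]"))   # Act: run the chosen tool
        prompt += f"\n{step}\nObservation: {observation}"    # feed the observation back in
    return "No answer within step limit."

print(react_agent("Which garage shows a suspicious concentration of claims?"))
```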
7. Some concepts to know here:-
- May require fine-tuning or refinement of the RAG setup: this means adapting the model to your data by using your own data sets to further train it. Private training options are available for GPT-4.
- SFT and RLHF
In the context of LLMs, SFT stands for Supervised Fine-Tuning. It is a technique used to fine-tune a pre-trained LLM on a specific task by providing it with labeled examples.
RLHF stands for Reinforcement Learning from Human Feedback. It is a method used to train LLMs to align their output with human intentions and values. RLHF involves teaching an LLM to understand human preferences by assigning scores to different responses from the base model. The goal is to use the preference model to alter the behavior of the base model in response to a prompt.
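As a rough sketch of the preference-scoring idea behind RLHF (a pairwise reward-model loss on chosen vs. rejected responses; the scores here are toy tensors, not real model outputs):

```python
import torch
import torch.nn.functional as F

# Toy reward scores the reward model assigns to a human-preferred ("chosen")
# response and a less-preferred ("rejected") response for the same prompts.
reward_chosen = torch.tensor([1.8, 0.4, 2.1])
reward_rejected = torch.tensor([0.3, 0.9, -0.5])

# Pairwise (Bradley-Terry style) loss: push chosen scores above rejected ones.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())
```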
Here are some examples of how RLHF is used in LLMs:
ChatGPT: A state-of-the-art LLM-based assistant developed by OpenAI that uses RLHF to learn human preferences and provide a more controlled user experience.
NVIDIA SteerLM: A technique developed by NVIDIA, related to RLHF-style alignment, that allows attributes of a model's responses to be adjusted at inference time.
- Foundational Model shift
- If training or grounding is not done properly, an LLM can hallucinate or miss context.
8. SUMMARY of ARCHITECTURES
For completeness, these are the main attention-based architectures:-
8.1. Encoder-Decoder (RNN based)
8.2. Transformer Model (non-RNN)
8.3. Graph Neural Networks (GNN)
8.4. Memory Augmented Neural Networks
A graph is a versatile data structure that lends itself well to the way data is organized in many real-world scenarios. We can think of an image as a graph, where each pixel is a node, directly connected to its neighboring pixels …
– Advanced Deep Learning with Python, 2019.
Of particular interest are the Graph Attention Networks (GAT) that employ a self-attention mechanism within a graph convolutional network (GCN), where the latter updates the state vectors by performing a convolution over the nodes of the graph.
In the encoder-decoder attention-based architectures, the set of vectors that encode the input sequence can be considered external memory, to which the encoder writes and from which the decoder reads. However, a limitation arises because the encoder can only write to this memory, and the decoder can only read.
Memory-Augmented Neural Networks (MANNs) are recent algorithms that aim to address this limitation.
The Neural Turing Machine (NTM) is one type of MANN. It consists of a neural network controller that takes an input to produce an output and performs read and write operations to memory. Examples of applications for MANNs include question-answering and chat bots, where an external memory stores a large database of sequences (or facts) that the neural network taps into.
9. Policy and Principles around GenAI:-
Guardrails, policy enforcement, and governance of inputs to GenAI remain an open subject. LLM tools and technologies are emerging that can enforce such policy guardrails to help ensure security, privacy, etc.