About Gloo AI Gateway
Explore the capabilities of Gloo AI Gateway and how you can use it to access and consume AI services from multiple LLM providers.
Gloo AI Gateway is an Enterprise-only feature that requires a Gloo Gateway Enterprise license. To access the AI Gateway capabilities, your Gloo Gateway Enterprise license must also include the AI Gateway add-on. Contact your account representative to obtain an updated license key.
Interested in trying out the capabilities of Gloo AI Gateway? Check out the Gloo AI Gateway tutorials.
About Large Language Models (LLMs)
A Large Language Model (LLM) is a type of artificial intelligence (AI) model that is designed to understand, generate, and manipulate human language in a way that is both coherent and contextually relevant. In recent years, the number of LLM providers and open source LLM projects, such as OpenAI, Llama2, and Mistral, has increased significantly. These providers distribute LLMs in various ways, such as through APIs and cloud-based platforms.
Because the AI technology landscape is fragmented, access to LLMs varies widely from provider to provider. Developers must learn each AI API and platform, and implement provider-specific code so that the app can consume the AI services of each LLM provider. This redundancy can significantly decrease developer efficiency and make it difficult to scale and upgrade the app or integrate it with other platforms.
About Gloo AI Gateway
Gloo AI Gateway unleashes developer productivity and accelerates AI innovation by providing a unified API interface that developers can use to access and consume AI services from multiple LLM providers. Because the API is part of the gateway proxy, you can leverage and apply additional traffic management, security, and resiliency policies to the requests to your LLM provider. This set of policies allows you to centrally govern, secure, control, and audit access to your LLM providers.
Key capabilities
Learn more about the key capabilities of Gloo AI Gateway.
Centralized credential management
With Gloo AI Gateway, you can centrally secure and store the API keys for accessing your AI providers in a Kubernetes secret in the cluster. The gateway proxy uses these credentials to authenticate with the AI provider and consume AI services. To further secure access to the AI credentials, you can apply fine-grained RBAC controls to the secret.
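For example, you might store an OpenAI API key in a standard Kubernetes secret similar to the following minimal sketch. The secret name, namespace, and data key are illustrative assumptions; check the Gloo AI Gateway setup guides for the exact format that your provider configuration expects.

```yaml
# Minimal sketch: the secret name, namespace, and data key are assumptions.
apiVersion: v1
kind: Secret
metadata:
  name: openai-secret
  namespace: gloo-system
type: Opaque
stringData:
  Authorization: <your-openai-api-key>   # replace with your provider API key
```

The Upstream resource that represents your LLM provider can then reference this secret, so that the gateway proxy attaches the credential to outgoing requests and app developers never handle the raw API key.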
Access control
Controlling access is crucial to prevent unauthorized access to your LLM provider, protect sensitive data, maintain model integrity, and keep audit trails. With Gloo AI Gateway, you can leverage security policies, such as access logging, JSON Web Tokens (JWTs), or external auth, to ensure that only authenticated and authorized users can access the AI API. For example, you can integrate a JWT provider to authenticate users. In addition, you can extract claims from the JWT to enforce fine-grained access controls and restrict access to the AI API based on those claims. That way, you can ensure that access to the LLM provider is granted only if the user is allowed to use the LLM or is part of a specific role, group, or organization.
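As a hedged illustration only, JWT verification attached at the gateway level might look like the following sketch. The VirtualHostOption resource, the jwt option, the provider name, the issuer, and the key placeholder are assumptions based on the Gloo JWT policy; see the JWT guides for the exact resource kinds and schema.

```yaml
# Sketch with assumed field names: verify against the Gloo JWT policy reference.
apiVersion: gateway.solo.io/v1
kind: VirtualHostOption
metadata:
  name: ai-jwt-auth
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: ai-gateway                       # illustrative gateway name
  options:
    jwt:
      providers:
        example-idp:                       # hypothetical provider name
          issuer: https://idp.example.com  # accept only tokens from this issuer
          jwks:
            local:
              key: <public-key-for-signature-verification>
```

Claims from the verified token can then be used in additional policies to allow or deny requests per role, group, or organization.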
Prompt enrichment
Prompts are the basic building blocks for guiding LLMs to produce relevant and accurate responses. By effectively managing both system prompts, which set initial guidelines, and user prompts, which provide specific context, you can significantly enhance the quality and coherence of the model’s outputs. Gloo AI Gateway allows you to pre-configure and refactor system and user prompts, extract common AI provider settings so that you can reuse them across requests, dynamically append or prepend prompts where you need them, and overwrite default settings at the per-route level.
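For instance, a system prompt can be prepended to every request on a route with AI options similar to the following sketch. The promptEnrichment field names are based on the Gloo AI Gateway route options and can differ by version; the route name and prompt content are illustrative.

```yaml
# Sketch: prepend a system prompt to every request on the target route.
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: openai-prompt-enrichment
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: openai                 # illustrative HTTPRoute name
  options:
    ai:
      promptEnrichment:
        prepend:
        - role: SYSTEM
          content: "Parse the unstructured text into CSV format."   # example system prompt
```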
Prompt guards
Prompt guards are mechanisms that ensure that prompt-based interactions with a language model are secure, appropriate, and aligned with the intended use. These mechanisms help to filter, block, monitor, and control LLM inputs and outputs to filter offensive content, prevent misuse, and ensure ethical and responsible AI usage. With Gloo AI Gateway, you can set up prompt guards to block unwanted requests to the LLM provider and mask sensitive data.
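As an illustration, a prompt guard that rejects requests mentioning credit cards and masks credit card numbers in responses might look similar to the following sketch. The promptGuard field names are approximations of the Gloo AI Gateway route options; verify them against the prompt guard reference for your version.

```yaml
# Sketch: block unwanted request content and mask sensitive response data.
options:
  ai:
    promptGuard:
      request:
        customResponse:
          message: "Request rejected due to inappropriate content."  # returned instead of calling the LLM
        regex:
          action: REJECT
          matches:
          - pattern: "credit card"     # reject prompts that match this pattern
      response:
        regex:
          action: MASK
          builtins:
          - CREDIT_CARD                # mask credit card numbers in LLM responses
```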
Rate limiting
Rate limiting on LLM provider token usage is primarily a matter of cost management, security, and service stability. LLM providers charge based on the number of input tokens (user prompts and system prompts) and output tokens (responses from the model), which can make uncontrolled usage very expensive. With Gloo AI Gateway, you can configure rate limiting based on LLM token usage so that organizations can enforce budget constraints across groups, teams, departments, and individuals, and ensure that their usage remains within predictable bounds. That way, you can avoid unexpected costs and prevent malicious attacks against your LLM provider.
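For example, a per-user budget can be expressed with the Gloo RateLimitConfig resource, as in the following sketch. The descriptor key, header name, and limit values are illustrative, and how LLM token counts (rather than plain request counts) are applied against the budget is specific to the AI rate limiting setup that is described in the rate limiting guides.

```yaml
# Sketch: a per-user budget; the header name and limit values are assumptions.
apiVersion: ratelimit.solo.io/v1alpha1
kind: RateLimitConfig
metadata:
  name: per-user-ai-limit
  namespace: gloo-system
spec:
  raw:
    descriptors:
    - key: user-id
      rateLimit:
        requestsPerUnit: 1000    # budget per user per hour
        unit: HOUR
    rateLimits:
    - actions:
      - requestHeaders:
          headerName: x-user-id  # illustrative: derive the user from a header or JWT claim
          descriptorKey: user-id
```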
Retrieval augmented generation (RAG)
Retrieval augmented generation (RAG) is a technique of providing relevant context by retrieving data from one or more datasets and augmenting the prompt with the retrieved information. This approach helps LLMs generate more accurate and relevant responses and, to a certain extent, prevents hallucinations.
Gloo AI Gateway’s RAG feature allows you to configure the system to retrieve data from a specified datastore and use it to augment the prompt before sending it to the model. This can be particularly useful in scenarios where the model requires additional context to generate accurate responses.
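A hedged sketch of such a configuration follows, assuming a Postgres-backed vector datastore and OpenAI embeddings; the rag field names, connection string, and collection name are illustrative and should be checked against the RAG guide for your version.

```yaml
# Sketch: augment prompts with context retrieved from a vector datastore.
options:
  ai:
    rag:
      datastore:
        postgres:
          connectionString: postgresql+psycopg://gloo:gloo@vector-db.gloo-system.svc.cluster.local:5432/gloo  # illustrative
          collectionName: default
      embedding:
        openai:
          authToken:
            secretRef:
              name: openai-secret      # embedding credentials, as in the secret example above
              namespace: gloo-system
```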
Semantic caching
Semantic caching is a feature in Gloo AI Gateway that caches LLM responses and reuses them for semantically similar queries. If two prompts are semantically similar, the LLM response to the first prompt can be served for the second prompt without sending a request to the LLM. This reduces the number of requests that are made to the LLM provider, improves response times, and lowers cost.
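A sketch of what this might look like as route-level AI options follows, assuming a Redis datastore and OpenAI embeddings; the semanticCache field names and connection string are assumptions to verify against the semantic caching guide.

```yaml
# Sketch: cache responses and reuse them for semantically similar prompts.
options:
  ai:
    semanticCache:
      datastore:
        redis:
          connectionString: redis://redis.gloo-system.svc.cluster.local:6379   # illustrative
      embedding:
        openai:
          authToken:
            secretRef:
              name: openai-secret
              namespace: gloo-system
```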
Supported LLM providers
The examples throughout the Gloo AI Gateway docs use OpenAI as the LLM provider, but you can use other providers supported by Gloo AI Gateway.
Gloo Gateway supports several AI providers, including OpenAI, Gemini, and Vertex AI. For the full list of currently supported providers, see the AI options in the Upstream reference.
Note the following differences in how AI Gateway features function for each provider.
RAG
Retrieval augmented generation (RAG) is currently not supported for the Gemini and Vertex AI providers.
Chat streaming
Gloo AI Gateway supports chat streaming, which allows the LLM to stream out tokens as they are generated. Some providers, such as OpenAI, send the `is-streaming` boolean as part of the request to determine whether a request should receive a streamed response. However, the Gemini and Vertex AI providers change the request path to indicate streaming, such as the `streamGenerateContent` segment of the path in the Vertex AI streaming endpoint `https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-latest:streamGenerateContent?key=<key>`. To prevent the path that you defined in your HTTPRoute from being overwritten by this streaming path, you instead indicate chat streaming for Gemini and Vertex AI by setting `spec.options.ai.routeType: CHAT_STREAMING` in your RouteOptions resource.
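For example, a RouteOption resource for a Gemini or Vertex AI route might look like the following minimal sketch, where the targetRefs block and the HTTPRoute name are illustrative:

```yaml
# Minimal sketch: enable chat streaming for a Gemini or Vertex AI route.
apiVersion: gateway.solo.io/v1
kind: RouteOption
metadata:
  name: vertex-ai-streaming
  namespace: gloo-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: vertex-ai               # illustrative HTTPRoute name
  options:
    ai:
      routeType: CHAT_STREAMING   # indicate streaming without changing the request path
```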