Robert Terhaar

How to Improve AI App Performance



In this post, we'll explain how proxies and caching proxies can improve your app's performance and how semantic caching can reduce traffic to Large Language Model (LLM) providers.


What Are Proxies and Caching Proxies?


A proxy server acts as an intermediary between your application and another server, handling network traffic and providing a layer of separation for your app. Caching proxies take this further by storing copies of frequently requested data, making retrieval faster (and cheaper).


Different Types of Proxies


When discussing network architecture, the term 'proxy' can encompass various elements. These include forward proxies, such as Squid, and reverse proxies like HAProxy, Envoy, and Nginx. Forward proxies can be further subdivided into two primary categories: explicit proxies and man-in-the-middle (MiTM) proxies. A MiTM proxy operates transparently, intercepting all content to and from the client application without requiring any client-side configuration. Conversely, an explicit proxy must be configured within each client application so that requests are routed through the proxy rather than sent directly to the target server.
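To make the explicit-proxy case concrete, here is a minimal sketch using Python's standard library. The proxy address is a placeholder, not a real endpoint; any client library (and any language) has an equivalent setting:

```python
import urllib.request

# Explicit proxy configuration: the client is told where the proxy lives.
# The host and port below are placeholders -- substitute your own proxy.
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://proxy.internal.example:3128",
    "https": "http://proxy.internal.example:3128",
})
opener = urllib.request.build_opener(proxy_handler)

# Every request made through `opener` is now routed via the proxy rather
# than connecting directly to the target server, e.g.:
# opener.open("https://api.example.com/v1/models")
```

A MiTM proxy needs none of this: it sits on the network path and intercepts traffic whether or not the client is aware of it.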


Corporate infrastructure teams may deploy multiple types of forward proxies or gateways. For instance, large-scale organizations might utilize a MiTM proxy to decrypt and analyze all HTTPS encrypted web traffic. The chain of HTTPS trust is extended by installing a custom root certificate on all client machines.


Reducing Network Connection Overhead


Network connection overhead is the extra time and resources needed to establish and maintain network connections. This can slow down AI applications that frequently exchange data with external servers over the internet. Proxati helps in the following ways:


  1. Connection Reuse: Proxati keeps connections with servers open, so it doesn't need to start a new connection for each request. This reduces latency and speeds up data transfers.

  2. Load Balancing: Proxati can dispatch requests to multiple APIs. If one API is performing slowly, Proxati can be configured to connect to an alternative provider.

  3. Compression & TCP Optimization: Proxati uses HTTP compression and applies TCP optimizations such as TCP_NODELAY and buffer size adjustments to speed up transfers and reduce latency.
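The TCP-level tuning mentioned in point 3 can be illustrated with a raw client socket. This is a generic sketch, not Proxati's internals: TCP_NODELAY disables Nagle's algorithm so small request payloads (such as short JSON bodies) are sent immediately, and larger buffers are requested for bulk transfers (the OS may cap the actual sizes):

```python
import socket

def tune_socket(sock: socket.socket) -> socket.socket:
    """Apply latency- and throughput-oriented TCP options to a client socket."""
    # Send small writes immediately instead of coalescing them (Nagle off).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Request larger send/receive buffers for bulk transfers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
    return sock

# Connection reuse (point 1) is the same idea at the HTTP layer: keep one
# tuned, established connection open and send many requests over it, instead
# of paying the TCP (and TLS) handshake cost on every call.
```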

Speeding Up AI Software with Caching Proxies


Caching proxies can enhance AI applications by quickly retrieving frequently accessed data. Proxati's caching features include:


  1. Literal Caching: Stores exact copies of frequently requested data, allowing fast retrieval without re-fetching from the API.

  2. Semantic Caching: Proxati also has an advanced caching mode that uses multiple layers of cache for efficient data retrieval. It employs a local, fast L1 key-value (k/v) cache for immediate access to frequently used data. For more complex queries and larger datasets, Proxati leverages a remote vector database as an L2 cache.
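Literal caching is the simpler of the two and fits in a few lines. The sketch below is a generic illustration under the same idea (not Proxati's implementation): responses are stored under a hash of the exact request payload, so a byte-identical request never re-hits the upstream API:

```python
import hashlib
import json

_cache = {}  # in-memory store: payload hash -> cached response

def cache_key(payload: dict) -> str:
    # Canonical JSON (sorted keys) so key order doesn't change the hash.
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def cached_call(payload: dict, call_api) -> str:
    """Serve from cache on an exact match; otherwise fetch once and store."""
    key = cache_key(payload)
    if key not in _cache:          # miss: go upstream exactly once
        _cache[key] = call_api(payload)
    return _cache[key]             # hit: served locally, no network round-trip
```

The limitation is obvious: "How can I reset my password?" and "How do I recover my password?" hash differently, which is exactly the gap semantic caching closes.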

Semantic Caching Overview


Semantic caching stores and retrieves data based on its meaning and context rather than just its literal form. Proxati's implementation of semantic caching uses a multi-layered approach, incorporating both a local L1 cache and a remote vector database as an L2 cache. Here's an example:

Scenario: Handling Zero-Shot Queries for an LLM

Imagine you're using an AI application that relies on a Large Language Model API to answer customer queries. These queries can be highly variable, often requiring zero-shot responses where the model hasn't been specifically trained on the exact question.


Step-by-Step Process:


1. Initial Query Handling

  • A user asks, "How can I reset my password?"

  • The query is first checked against the L1 cache, a fast key-value store. If an exact or closely matching response is found, it is returned immediately.

2. Vector Database L2 Cache

  • If the L1 cache doesn't have a suitable response, Proxati uses a vector database as an L2 cache.

  • The query is converted into an embedding, a high-dimensional vector that represents the semantic meaning of the query.

  • This embedding is then used to search the vector database for semantically similar embeddings of previous queries and responses.

3. Finding Semantically Similar Responses

  • The vector database might find an embedding for a similar query, such as "How do I recover my account password?"

  • Even though the exact words differ, the meaning is similar. The L2 cache returns the response associated with this similar query: "To reset your password, go to the settings page, click 'Forgot Password,' and follow the instructions."

4. Zero-Shot Query Advantage

  • This approach is particularly useful for zero-shot queries where exact matches are unlikely. By understanding the semantic content, Proxati can provide relevant answers even if the exact question hasn't been seen before.

  • For example, a new query like "What should I do if I can't remember my password?" can be effectively matched with the response for "How can I reset my password?" using the embedding technique.

5. Caching the New Response

  • Once a response is retrieved and served, it can be added to the L1 cache for faster future access, improving the performance for similar queries.
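The L2 lookup in the steps above can be sketched as follows. This is a toy illustration: a production system would use a learned embedding model and a real vector database, while here a bag-of-words vector and a linear scan stand in for both, purely to show the control flow. The 0.5 similarity threshold is an assumed value, not Proxati's default:

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.5  # assumed cut-off for "semantically similar"

def embed(text):
    """Stand-in embedding: word counts instead of a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_lookup(query, store):
    """Return the cached response for the most similar stored query, or None."""
    q = embed(query)
    best_score, best_response = 0.0, None
    for stored_vec, response in store:
        score = cosine(q, stored_vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

# "How can I reset my password?" matches the stored entry for
# "How do I recover my account password?" even though the words differ:
store = [(embed("how do i recover my account password"),
          "To reset your password, go to the settings page and click 'Forgot Password.'")]
answer = semantic_lookup("how can i reset my password", store)
```

An unrelated query such as "What is the weather today?" falls below the threshold and misses the cache, so it would be forwarded to the LLM provider as usual.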

Real-World Impact: Faster AI Apps


Integrating Proxati into your AI software can lead to noticeable performance improvements. Here are some examples:


  • Chatbots and Virtual Assistants: Proxati caches common responses, reducing response times and improving user experience.

  • Data-Intensive Applications: Apps that frequently access large datasets benefit from faster data retrieval and processing with Proxati's caching.

  • LLM-Based Services: Services relying on LLMs for natural language understanding and generation see fewer requests to LLM providers, lowering costs and speeding up responses.

Conclusion and Future Work


Proxati is a powerful tool designed to boost your custom AI software's performance. By reducing network connection overhead and implementing advanced caching strategies, Proxati ensures your applications run faster and more efficiently. We're also working on improving semantic caching by adding options to control the similarity threshold for responses. Additionally, we are implementing deduplication and grounding/moderation scanning to enhance the quality and reliability of cached results.
