How to run language models and other AI tools locally on your computer | Kaspersky official blog

Credit to Author: Stan Kaminsky | Date: Fri, 16 Feb 2024 11:08:41 +0000

Many people are already experimenting with generative neural networks and finding regular use for them, including at work. For example, ChatGPT and its analogs are regularly used by almost 60% of Americans (and not always with permission from management). However, all the data involved in such operations — both user prompts and model responses — are stored on servers of OpenAI, Google, and the rest. For tasks where such information leakage is unacceptable, you don’t need to abandon AI completely — you just need to invest a little effort (and perhaps money) to run the neural network locally on your own computer – even a laptop.

Cloud threats

The most popular AI assistants run on the cloud infrastructure of large companies. It’s efficient and fast, but your data processed by the model may be accessible to both the AI service provider and completely unrelated parties, as happened last year with ChatGPT.

Such incidents present varying levels of threat depending on what these AI assistants are used for. If you’re generating cute illustrations for some fairy tales you’ve written, or asking ChatGPT to create an itinerary for your upcoming weekend city break, it’s unlikely that a leak will lead to serious damage. However, if your conversation with a chatbot contains confidential info — personal data, passwords, or bank card numbers — a possible leak to the cloud is no longer acceptable. Thankfully, it’s relatively easy to prevent by pre-filtering the data — we’ve written a separate post about that.

However, in cases where either all the correspondence is confidential (for example, medical or financial information), or the reliability of pre-filtering is questionable (you need to process large volumes of data that no one will preview and filter), there’s only one solution: move the processing from the cloud to a local computer. Of course, running your own version of ChatGPT or Midjourney offline is unlikely to be successful, but other neural networks working locally provide comparable quality with less computational load.

What hardware do you need to run a neural network?

You’ve probably heard that working with neural networks requires a super-powerful graphics card, but in practice this isn’t always the case. Depending on their specifics, different AI models can place demands on various computer components: RAM, video memory, storage, and the CPU (where not only clock speed matters, but also support for certain vector instructions). The amount of RAM determines whether the model can be loaded at all, while the amount of video memory limits the size of the “context window” — that is, the memory of the previous conversation. Typically, with a weak graphics card and CPU, generation proceeds at a snail’s pace (one to two words per second for text models), so a computer with such a minimal setup is only appropriate for getting acquainted with a particular model and evaluating its basic suitability. For full-fledged everyday use, you’ll need to increase the RAM, upgrade the graphics card, or choose a faster AI model.

As a starting point, you can try working with computers that were considered relatively powerful back in 2017: a processor no lower than a Core i7 with AVX2 instruction support, 16GB of RAM, and a graphics card with at least 4GB of memory. For Mac enthusiasts, machines based on the Apple M1 chip or later will do; the memory requirements are the same.
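
If you’re not sure what your machine offers, a quick script can tell you. Here’s a minimal sketch in Python (assuming the third-party psutil and py-cpuinfo packages are installed) that checks the two baseline items above: total RAM and AVX2 support.

```python
# Rough local-hardware check: total RAM and AVX2 support.
# Assumes the third-party packages psutil and py-cpuinfo are installed:
#   pip install psutil py-cpuinfo
import psutil
import cpuinfo

ram_gb = psutil.virtual_memory().total / 2**30
flags = cpuinfo.get_cpu_info().get("flags", [])

print(f"RAM: {ram_gb:.1f} GB ({'OK' if ram_gb >= 16 else 'below the 16 GB baseline'})")
print(f"AVX2 support: {'yes' if 'avx2' in flags else 'no'}")
```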

When choosing an AI model, you should first familiarize yourself with its system requirements. A search query like “model_name requirements” will help you assess whether it’s worth downloading this model given your available hardware. There are detailed studies available on the impact of memory size, CPU, and GPU on the performance of different models; for example, this one.

Good news for those who don’t have access to powerful hardware — there are simplified AI models that can perform practical tasks even on old machines. Even if your graphics card is very basic and weak, it’s possible to run models, and the environments that launch them, using the CPU alone. Depending on your tasks, these can even work acceptably well.

GPU throughput tests: examples of how various computer builds work with popular language models

Choosing an AI model and the magic of quantization

A wide range of language models are available today, but many of them have limited practical applications. Nevertheless, there are easy-to-use and publicly available AI tools that are well-suited for specific tasks, be they generating text (for example, Mistral 7B), or creating code snippets (for example, Code Llama 13B). Therefore, when selecting a model, narrow down the choice to a few suitable candidates, and then make sure that your computer has the necessary resources to run them.

In any neural network, most of the memory strain is courtesy of weights — numerical coefficients describing the operation of each neuron in the network. Initially, when training the model, the weights are computed and stored as high-precision fractional numbers. However, it turns out that rounding the weights in the trained model allows the AI tool to be run on regular computers while only slightly decreasing the performance. This rounding process is called quantization, and with its help the model’s size can be reduced considerably — instead of 16 bits, each weight might use eight, four, or even two bits.
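
To get a feel for the savings, here’s a back-of-the-envelope calculation of the memory needed just to store the weights of a seven-billion-parameter model at different precisions; real model files are somewhat larger because of metadata and layers that aren’t quantized.

```python
# Approximate memory needed just for the weights of a 7B-parameter model
# at different quantization levels (ignores overhead such as the context cache).
params = 7_000_000_000

for bits in (16, 8, 4, 2):
    size_gb = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit weights: ~{size_gb:.1f} GB")

# Prints roughly: 16-bit ~13.0 GB, 8-bit ~6.5 GB, 4-bit ~3.3 GB, 2-bit ~1.6 GB
```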

According to current research, a larger model with more parameters and quantization can sometimes give better results than a model with precise weight storage but fewer parameters.

Armed with this knowledge, you’re now ready to explore the treasure trove of open-source language models, namely the Open LLM Leaderboard. In this list, AI tools are ranked by several generation-quality metrics, and filters make it easy to exclude models that are too large, too small, or stored at too high a precision (that is, insufficiently quantized) for your hardware.

List of language models sorted by filter set
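
If you prefer searching from code rather than clicking through the leaderboard, the Hugging Face Hub can also be queried programmatically. Below is a hedged sketch using the huggingface_hub package; the search string is just an example of looking for quantized (GGUF) builds of a model you’ve shortlisted.

```python
# List a few popular GGUF (quantized) builds of a shortlisted model on the Hugging Face Hub.
# Assumes: pip install huggingface_hub; the search string is only an example.
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(search="Mistral-7B GGUF", sort="downloads", limit=5):
    print(m.id)
```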

After reading the model description and making sure it’s potentially a fit for your needs, test its performance in the cloud using Hugging Face or Google Colab services. This way, you can avoid downloading models which produce unsatisfactory results, saving you time. Once you’re satisfied with the initial test of the model, it’s time to see how it works locally!
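
For such a quick test, a free Google Colab notebook with a GPU runtime is usually enough. Here’s a minimal sketch using the transformers library; the model name is only an example, and whether it fits in the free tier depends on the model’s size.

```python
# Quick sanity test of a candidate model in the cloud (e.g., a Colab GPU runtime)
# before downloading it locally.
# Assumes: pip install transformers accelerate; the model name is only an example.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example checkpoint
    device_map="auto",                           # place layers on the GPU if one is available
    torch_dtype="auto",                          # use the checkpoint's native precision
)

result = pipe("Explain quantization of neural networks in two sentences.",
              max_new_tokens=80)
print(result[0]["generated_text"])
```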

Required software

Most of the open-source models are published on Hugging Face, but simply downloading them to your computer isn’t enough. To run them, you have to install specialized software, such as LLaMA.cpp, or — even easier — its “wrapper”, LM Studio. The latter allows you to select your desired model directly from the application, download it, and run it in a dialog box.
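
LM Studio needs no code at all, but if you prefer scripts, LLaMA.cpp also comes with Python bindings. Below is a rough sketch assuming the llama-cpp-python package and a quantized GGUF file already downloaded from Hugging Face; the file name and the number of GPU layers are placeholders to adjust for your hardware.

```python
# Run a quantized GGUF model locally via the llama.cpp Python bindings.
# Assumes: pip install llama-cpp-python; the model path is a placeholder for a GGUF file
# you have already downloaded from Hugging Face.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # placeholder file name
    n_ctx=2048,        # context window size
    n_gpu_layers=20,   # layers offloaded to the GPU; set to 0 for CPU-only operation
)

out = llm("Q: Name three ways to reduce a language model's memory footprint. A:",
          max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```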

Another “out-of-the-box” way to use a chatbot locally is GPT4All. Here, the choice is limited to about a dozen language models, but most of them will run even on a computer with just 8GB of memory and a basic graphics card.
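
GPT4All is primarily a desktop application, but it also ships Python bindings if you want to call the same models from your own scripts. A short, hedged sketch (assuming the gpt4all package is installed; the model name is an example from the project’s catalog and is downloaded on first use):

```python
# Chat with a small local model through the GPT4All Python bindings (CPU-friendly).
# Assumes: pip install gpt4all; the model name is an example and is fetched on first use.
from gpt4all import GPT4All

model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")  # example model name

with model.chat_session():
    reply = model.generate("Summarize why local AI models help with confidentiality.",
                           max_tokens=120)
    print(reply)
```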

If generation is too slow, then you may need a model with coarser quantization (two bits instead of four). If generation is interrupted or execution errors occur, the problem is often insufficient memory — it’s worth looking for a model with fewer parameters or, again, with coarser quantization.

Many models on Hugging Face have already been quantized to varying degrees of precision, but if no one has quantized the model you want with the desired precision, you can do it yourself using GPTQ.
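
Quantizing a model yourself is heavier work, since it means loading the full-precision weights and running a calibration dataset through them, but the tooling is fairly accessible. Below is a rough sketch of 4-bit GPTQ quantization via the transformers integration (which relies on the optimum and auto-gptq packages); the model name and settings are examples, and a GPU with enough memory for the original weights is assumed.

```python
# Quantize a full-precision model to 4-bit GPTQ using the transformers integration.
# Assumes: pip install transformers optimum auto-gptq; the model name and settings are
# examples, and a GPU able to hold the original (unquantized) weights is available.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "mistralai/Mistral-7B-v0.1"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)  # 4-bit, calibrated on C4

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,  # quantization happens while the model is loaded
)

model.save_pretrained("mistral-7b-gptq-4bit")      # save the quantized weights
tokenizer.save_pretrained("mistral-7b-gptq-4bit")
```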

This week, another promising tool was released to public beta: Chat With RTX from NVIDIA. The manufacturer of the most sought-after AI chips has released a local chatbot capable of summarizing the content of YouTube videos, processing sets of documents, and much more — provided the user has a Windows PC with 16GB of RAM and an NVIDIA RTX 30- or 40-series graphics card with 8GB or more of video memory. “Under the hood” are the same varieties of Mistral and Llama 2 from Hugging Face. Of course, powerful graphics cards can improve generation performance, but according to feedback from the first testers, the existing beta is quite cumbersome (about 40GB) and difficult to install. However, NVIDIA’s Chat With RTX could become a very useful local AI assistant in the future.

The code for the game “Snake”, written by the quantized language model TheBloke/CodeLlama-7B-Instruct-GGUF

The applications listed above perform all computations locally, don’t send data to servers, and can run offline so you can safely share confidential information with them. However, to fully protect yourself against leaks, you need to ensure not only the security of the language model but also that of your computer – and that’s where our comprehensive security solution comes in. As confirmed in independent tests, Kaspersky Premium has practically no impact on your computer’s performance — an important advantage when working with local AI models.

