LLM Comparison Background

I promised a comparison between OpenAI ChatGPT and DeepSeek to show how incredibly similar they are. But before that, I want to give some background on my testing method.

Given my deep background in security, I am always less interested in how something behaves when it functions properly. I hunt for the edges, the places where a system breaks down or fails.

LLMs are no different, and those breaking points sit at the edges of the training set. An LLM doesn’t know what it doesn’t know. A naïve way to look at an LLM is as a superb token learner and predictor. It is trained on lots and lots and lots and lots, did I mention lots, of data, and the more often it sees a combination of tokens, the more likely it is to predict that combination.

An LLM’s “failures” lie at the edges of its training, where it has seen only a very limited set of combinations. What is the best way to find these edges, you ask? Start with something that requires significant precision. Natural English is flexible: humans tolerate, read through, and sometimes ignore syntax and grammatical errors, including punctuation issues. What isn’t flexible? Programming languages. They are incredibly fragile compared to written, human-interpretable text. This is how I like to test the accuracy of an LLM.

LLMs can be very impressive when handling programming languages with immense training sets, like JavaScript and Python. The quality nosedives on more esoteric languages, or on rarely used libraries within JavaScript and Python. A fun one for me is Pine Script, the language used to program indicators and strategies on a trading platform called TradingView. Relatively speaking, it is rarely used and has limited examples on the internet, and as a result, LLMs can be awful at it.

Just a few examples. If you ask ChatGPT to fetch data for several tickers, it will create a loop and put a ‘request.security("Ticker", "D", close)’ call inside it. Pine Script doesn’t allow a data request like this in a loop, so the script blows up. ChatGPT also fairly consistently botches switch statements in Pine Script and flip-flops the order of the column and row arguments in the ‘table.cell(…)’ command.
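To make those failure modes concrete, here is a minimal sketch of what the working patterns look like. It assumes Pine Script v5, where request.security must be called at global scope rather than inside a loop; the ticker symbols and table layout are placeholders I chose for illustration.

```
//@version=5
indicator("Multi-ticker demo")

// 1. Each request.security call sits at global scope. Pine Script v5
//    rejects request.* calls constructed inside a loop, which is
//    exactly the pattern LLMs keep generating.
aaplClose = request.security("NASDAQ:AAPL", "D", close)
msftClose = request.security("NASDAQ:MSFT", "D", close)

// 2. A Pine switch returns a value. Each branch is `condition => result`,
//    and a bare `=>` supplies the default.
cellColor = switch
    aaplClose > msftClose => color.green
    aaplClose < msftClose => color.red
    => color.gray

// 3. Columns come before rows, both in table.new(position, columns, rows)
//    and in table.cell(table_id, column, row, text).
var t = table.new(position.top_right, 2, 2)
if barstate.islast
    table.cell(t, 0, 0, "AAPL")
    table.cell(t, 1, 0, str.tostring(aaplClose), bgcolor=cellColor)
    table.cell(t, 0, 1, "MSFT")
    table.cell(t, 1, 1, str.tostring(msftClose))

plot(aaplClose, "AAPL close")
plot(msftClose, "MSFT close")
```

Note the argument order in ‘table.cell(…)’: column first, then row. That is exactly the pair LLMs keep flipping.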

That is the background, and I am sure you can see where this is going. I use the same prompt for each LLM I test, and I watch it create garbage. GIGO: garbage in, garbage out.
