Last updated: July 11, 2023 (morning)

From a Twitter thread by Yam Peleg.

GPT-4's details are leaked.

It is over. Everything is here:

Parameter Count

GPT-4 is more than 10x the size of GPT-3. We believe it has a total of ~1.8 trillion parameters across 120 layers.


Mixture Of Experts - Confirmed

OpenAI was able to keep costs reasonable by utilizing a mixture of experts (MoE) model.

They utilize 16 experts within their model, each with ~111B parameters for the MLP. Two of these experts are routed to per forward pass.



MoE Routing

While the literature talks a lot about advanced routing algorithms for choosing which experts to route each token to, OpenAI's is allegedly quite simple, for the current GPT-4 model.

There are roughly ~55B shared parameters for attention.




Each forward pass inference (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOPs that would be required per forward pass of a purely dense model.
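A quick back-of-the-envelope check of the active-parameter figure, using only the numbers quoted in the thread (16 experts, ~111B MLP parameters each, top-2 routing, ~55B shared attention parameters) — these are leaked values, not verified:

```python
# All constants below are the figures quoted in the leak, not verified.
N_EXPERTS = 16              # total MLP experts
EXPERTS_PER_TOKEN = 2       # top-2 routing
MLP_PARAMS_PER_EXPERT = 111e9
SHARED_ATTN_PARAMS = 55e9

# Parameters touched when generating one token: 2 experts + shared attention.
active = EXPERTS_PER_TOKEN * MLP_PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS
# Total parameters: all 16 experts + shared attention.
total = N_EXPERTS * MLP_PARAMS_PER_EXPERT + SHARED_ATTN_PARAMS

print(f"active params per token: ~{active / 1e9:.0f}B")   # ~277B, consistent with ~280B
print(f"total params:            ~{total / 1e12:.2f}T")   # ~1.83T, consistent with ~1.8T
```

The two results line up with the ~280B active and ~1.8T total figures quoted above.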



GPT-4 is trained on ~13T tokens. These are not unique tokens; repeated epochs are counted as additional tokens.

Epoch number: 2 epochs for text-based data and 4 for code-based data.

There are millions of rows of instruction fine-tuning data from ScaleAI and internal sources.




GPT-4 32K

There was an 8k context length (seq len) for the pre-training phase. The 32k seq len version of GPT-4 is based on fine-tuning of the 8k after the pre-training.


Batch Size

The batch size was gradually ramped up over a number of days on the cluster, but by the end, OpenAI was using a batch size of 60 million tokens! This, of course, is "only" a batch size of ~7.5 million tokens per expert, since not every expert sees every token.


For the real batch size

Divide this number by the seq len to get the real batch size. Just stop with these misleading numbers already.
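Doing that arithmetic with the quoted figures (60M-token batch, 8k pre-training seq len, 16 experts, top-2 routing) — all assumed from the thread:

```python
# Figures quoted in the leak; the division is the "real batch size" the author asks for.
TOKENS_PER_BATCH = 60_000_000
SEQ_LEN = 8192
N_EXPERTS = 16
EXPERTS_PER_TOKEN = 2

# Real batch size = number of sequences per batch.
sequences_per_batch = TOKENS_PER_BATCH // SEQ_LEN
# Each token reaches 2 of 16 experts, so each expert sees 2/16 of the tokens.
tokens_per_expert = TOKENS_PER_BATCH * EXPERTS_PER_TOKEN // N_EXPERTS

print(f"sequences per batch: ~{sequences_per_batch:,}")        # ~7,324 sequences
print(f"tokens per expert:   ~{tokens_per_expert / 1e6:.1f}M") # 7.5M tokens
```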


Parallelism Strategies

To parallelize across all of their A100 GPUs:

They utilized 8-way tensor parallelism as that is the limit for NVLink. Beyond that, they are using 15-way pipeline parallelism.

(They likely used ZeRO Stage 1, and possibly block-level FSDP.)
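A rough sketch of how those parallelism degrees would tile the cluster. The data-parallel degree is not given in the thread, so the final division is an assumption based on the ~25,000 A100s quoted later:

```python
# Parallelism degrees quoted in the leak.
TENSOR_PARALLEL = 8       # NVLink domain limit per node
PIPELINE_PARALLEL = 15
TOTAL_GPUS = 25_000       # approximate A100 count quoted for training

# GPUs needed for one full model replica.
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL  # 120 GPUs
# Remaining parallelism would be data-parallel replicas (assumed layout).
data_parallel = TOTAL_GPUS // gpus_per_replica          # ~208 replicas

print(f"GPUs per replica:      {gpus_per_replica}")
print(f"data-parallel degree: ~{data_parallel}")
```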


Training Cost

OpenAI's training compute for GPT-4 was ~2.15e25 FLOPs, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU.

Part of this extremely low utilization is due to an absurd number of failures requiring restarts from checkpoints.

If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.

(Today, the pre-training could be done with ~8,192 H100s in ~55 days for $21.5 million at $2 per H100 hour.)
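A hedged cross-check of the quoted training numbers: does 2.15e25 FLOPs on ~25,000 A100s in 90-100 days actually imply ~32-36% MFU? Using the A100's 312 TFLOPs dense BF16 tensor-core peak (a published spec, not from the thread):

```python
# Figures from the leak plus the A100's published dense BF16 peak.
TOTAL_FLOPS = 2.15e25
N_GPUS = 25_000
A100_PEAK_FLOPS = 312e12  # dense BF16/FP16 tensor-core peak per GPU

for days in (90, 100):
    seconds = days * 24 * 3600
    mfu = TOTAL_FLOPS / (N_GPUS * A100_PEAK_FLOPS * seconds)
    print(f"{days} days -> MFU ~{mfu:.0%}")   # ~35% and ~32%, matching 32-36%

# Cost at $1 per A100-hour over the 100-day run:
hours = 100 * 24 * N_GPUS
print(f"GPU-hours: {hours:,} -> ~${hours / 1e6:.0f}M")  # ~$60M, in line with ~$63M quoted
```

The implied 32-35% MFU and ~$60M cost are consistent with the thread's figures.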



Mixture of Expert Tradeoffs

There are multiple MoE tradeoffs taken:

For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.

This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.

Researchers have shown that using 64 to 128 experts achieves better loss than 16 experts, but that's purely research.

There are multiple reasons to go with fewer experts. One reason OpenAI chose 16 experts is that it is harder to get more experts to generalize across many tasks. More experts can also make convergence harder to achieve.

With such a large training run, OpenAI instead chose to be more conservative with the number of experts.







GPT-4 Inference Cost

GPT-4 costs 3x that of the 175B parameter Davinci.

This is largely due to the larger clusters required for GPT-4 and much lower utilization achieved.

An estimate of its cost is $0.0049 per 1k tokens for 128 A100s to inference GPT-4 with an 8k seq len, and $0.0021 per 1k tokens for 128 H100s to inference GPT-4 with an 8k seq len. It should be noted that this assumes decently high utilization and high batch sizes.
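As a sanity check, we can invert those per-token costs into an implied cluster throughput, under assumed rental rates of ~$1/hr per A100 and ~$2/hr per H100 (the rates the thread uses for training, not values it states for inference):

```python
def implied_tokens_per_sec(n_gpus: int, dollars_per_gpu_hour: float,
                           cost_per_1k_tokens: float) -> float:
    """Throughput the cluster must sustain for the quoted cost to break even."""
    cluster_cost_per_sec = n_gpus * dollars_per_gpu_hour / 3600
    return cluster_cost_per_sec / cost_per_1k_tokens * 1000

# Quoted costs: $0.0049/1k tokens on 128 A100s, $0.0021/1k tokens on 128 H100s.
print(f"A100 cluster: ~{implied_tokens_per_sec(128, 1.0, 0.0049):,.0f} tok/s")
print(f"H100 cluster: ~{implied_tokens_per_sec(128, 2.0, 0.0021):,.0f} tok/s")
```

So the estimate implies roughly 7k tokens/sec from the A100 cluster and ~34k tokens/sec from the H100 cluster.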




Multi-Query Attention

OpenAI is using MQA just like everyone else.

Because of that, only 1 KV head is needed and the memory capacity of the KV cache can be significantly reduced. Even then, the 32k seq len GPT-4 definitely cannot run on 40GB A100s, and the 8k version is capped on max batch size.
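A sketch of why MQA shrinks the KV cache: the cache scales with the number of K/V heads, so sharing one K/V head across all query heads divides it by the head count. The head count and head dimension below are illustrative assumptions; the leak gives neither.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB; the leading 2 accounts for both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value / 1e9

# 120 layers is quoted in the leak; 96 heads of dim 128 are illustrative guesses.
LAYERS, HEADS, HEAD_DIM, SEQ, BATCH = 120, 96, 128, 32_768, 1

print(f"MHA cache (96 KV heads): {kv_cache_gb(LAYERS, HEADS, HEAD_DIM, SEQ, BATCH):.1f} GB")
print(f"MQA cache (1 KV head):   {kv_cache_gb(LAYERS, 1, HEAD_DIM, SEQ, BATCH):.1f} GB")
```

Under these assumed dimensions, a single 32k-token sequence needs ~193 GB of KV cache with full multi-head attention but only ~2 GB with MQA, which is the difference between "impossible" and "tight" on a 40GB card.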



Continuous batching

OpenAI implements both variable batch sizes and continuous batching. This allows them to cap maximum latency while also optimizing inference costs.
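A minimal toy sketch of the continuous batching idea: finished sequences leave the batch mid-flight and queued requests join immediately, instead of the whole batch draining before new work is admitted. All names here are illustrative; this is not OpenAI's implementation.

```python
from collections import deque

def continuous_batching(requests, max_batch, step_fn):
    """step_fn(seq) performs one decode step; returns True when seq is finished."""
    queue, active, done = deque(requests), [], []
    while queue or active:
        # Admit new requests every step, not only when the batch is empty.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step for the whole batch; finished sequences exit early.
        for seq in list(active):
            if step_fn(seq):
                active.remove(seq)
                done.append(seq)
    return done
```

A static batch would instead hold all slots until the longest sequence finishes, which is exactly the utilization loss continuous batching avoids.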


Vision Multi-Modal

It is a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Flamingo. This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens after the text-only pre-training.

On the vision model, OpenAI wanted to train it from scratch, but the approach wasn't mature enough, so they chose to de-risk it by starting with text.

One of the primary purposes of this vision capability is for autonomous agents able to read web pages and transcribe what's in images and video.

Some of the data they train on is joint data (rendered LaTeX/text), screenshots of web pages, and YouTube videos: sampling frames and running Whisper on the audio to get transcripts.





Speculative Decoding

OpenAI might be using speculative decoding for GPT-4's inference (not 100% sure).

The idea is to use a smaller, faster model to decode several tokens in advance, and then feed them into a large oracle model as a single batch.

If the small model was right about its predictions, the larger model agrees and we can decode several tokens in a single batch.

But if the larger model rejects the tokens predicted by the draft model, then the rest of the batch is discarded and we continue with the larger model.
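The loop described above can be sketched as follows, for the simple greedy case. The draft and oracle models are stand-in next-token functions; in a real system the verification step is one batched forward pass of the large model, which is where the speedup comes from.

```python
def speculative_decode(draft_next, oracle_next, prompt, k, max_new):
    """Greedy speculative decoding sketch.
    draft_next / oracle_next: functions mapping a token context to the next token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. The oracle verifies all k positions (one batched pass in practice).
        accepted = []
        for i in range(k):
            tok = oracle_next(out + proposal[:i])
            if tok == proposal[i]:
                accepted.append(tok)   # draft was right: token accepted for free
            else:
                accepted.append(tok)   # first mismatch: keep the oracle's token
                break                  # and discard the rest of the draft batch
        out.extend(accepted)
    return out[:len(prompt) + max_new]
```

When the draft agrees with the oracle, each oracle pass yields up to k tokens; when it disagrees, the loop degrades gracefully to one oracle token per pass.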

The conspiracy theory that the new GPT-4's quality has deteriorated might simply be because they are letting the oracle model accept lower-probability sequences from the speculative decoding model.







Inference Architecture

The inference runs on a cluster of 128 GPUs.

There are multiple of these clusters in multiple datacenters in different locations.

Inference uses 8-way tensor parallelism and 16-way pipeline parallelism.

Each node of 8 GPUs has only ~130B parameters, or less than 30GB per GPU at FP16 and less than 15GB at FP8/int8.

The model has 120 layers, so it fits in 15 different nodes. [Possibly there are fewer layers on the first node, since it also needs to compute the embeddings.]

According to these numbers, OpenAI should have trained on 2x the tokens if they were trying to be Chinchilla-optimal.

[let alone surpass it like we do]

This goes to show that they are struggling to get high-quality data.








Why no FSDP?

A possible reason for this could be that some of the hardware infra they secured is of an older generation.

This is pretty common in local compute clusters, as the organisation usually upgrades the infra in several "waves" to avoid a complete pause of operation.

With such a high degree of pipeline parallelism, it is very likely that, just like the rest of us, they suffer from the "batch bubble": slight idle time between batches.

Again: There is no magic.

They know what they are doing but it is not magic.