Path: blob/main/a5/mingpt-demo/mingpt/__pycache__/model.cpython-310.pyc
"""
GPT model:
- the initial stem consists of a combination of token encoding and a positional encoding
- the meat of it is a uniform sequence of Transformer blocks
- each Transformer is a sequential combination of a 1-hidden-layer MLP block and a self-attention block
- all blocks feed into a central residual pathway similar to resnets
- the final decoder is a linear projection into a vanilla Softmax classifier
"""

import math
import logging

import torch
import torch.nn as nn
from torch.nn import functional as F

logger = logging.getLogger(__name__)

class GPTConfig:
    """ base GPT config, params common to all GPT versions """
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1

    def __init__(self, vocab_size, block_size, **kwargs):
        self.vocab_size = vocab_size
        self.block_size = block_size
        for k, v in kwargs.items():
            setattr(self, k, v)

class GPT1Config(GPTConfig):
    """ GPT-1 like network roughly 125M params """
    n_layer = 12
    n_head = 12
    n_embd = 768

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x, layer_past=None):
        B, T, C = x.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)    # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        # causal self-attention: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y
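# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original minGPT file): a minimal smoke
# test of CausalSelfAttention. The config values below (vocab_size=100,
# block_size=16, n_embd=64, n_head=4) are hypothetical choices for the demo,
# not settings from the original source.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    config = GPTConfig(vocab_size=100, block_size=16, n_embd=64, n_head=4)
    attn = CausalSelfAttention(config)
    x = torch.randn(2, 16, 64)   # (batch=2, seq_len=block_size, n_embd)
    y = attn(x)
    assert y.shape == x.shape    # self-attention preserves the (B, T, C) shape
    print("CausalSelfAttention output shape:", tuple(y.shape))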