GitHub Repository: yiming-wange/cs224n-2023-solution
Path: blob/main/a5/mingpt-demo/mingpt/__pycache__/model.cpython-310.pyc
"""
GPT model:
- the initial stem consists of a combination of token encoding and a positional encoding
- the meat of it is a uniform sequence of Transformer blocks
    - each Transformer is a sequential combination of a 1-hidden-layer MLP block and a self-attention block
    - all blocks feed into a central residual pathway similar to resnets
- the final decoder is a linear projection into a vanilla Softmax classifier
"""

import math
import logging

import torch
import torch.nn as nn
from torch.nn import functional as F

logger = logging.getLogger(__name__)

class GPTConfig:
    """ base GPT config, params common to all GPT versions """
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1

    def __init__(self, vocab_size, block_size, **kwargs):
        self.vocab_size = vocab_size
        self.block_size = block_size
        # any extra keyword arguments simply become attributes of the config
        for k, v in kwargs.items():
            setattr(self, k, v)

class GPT1Config(GPTConfig):
    """ GPT-1 like network roughly 125M params """
    n_layer = 12
    n_head = 12
    n_embd = 768

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x, layer_past=None):
        B, T, C = x.size()  # batch size, sequence length, embedding dimensionality

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)    # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        # causal self-attention: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y
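
# ---------------------------------------------------------------------------
# Illustrative sketch: the docstring above notes that torch.nn.MultiheadAttention
# could be used instead of the explicit implementation. A rough stand-in for the
# same causal attention pattern might look like the helper below (the function
# name, the dropout default and the on-the-fly module construction are
# assumptions for demonstration; it also omits the residual dropout that
# CausalSelfAttention applies after its output projection).
def _causal_mha_sketch(x, n_head, attn_pdrop=0.1):
    """Causally-masked multi-head self-attention over x of shape (B, T, C)."""
    B, T, C = x.size()
    mha = nn.MultiheadAttention(embed_dim=C, num_heads=n_head,
                                dropout=attn_pdrop, batch_first=True)
    # boolean mask, True above the diagonal: future positions may not be attended to
    causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    y, _ = mha(x, x, x, attn_mask=causal_mask, need_weights=False)
    return y  # (B, T, C)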
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        # pre-LayerNorm residual connections around attention and MLP
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
class GPT(nn.Module):
    """  the full GPT language model, with a context size of block_size """

    def __init__(self, config):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))

    def get_block_size(self):
        return self.block_size

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def configure_optimizers(self, train_config):
        """
        This long function is unfortunately doing something very simple and is being very defensive:
        We are separating out all parameters of the model into two buckets: those that will experience
        weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
        We are then returning the PyTorch optimizer object.
        """

        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn  # full param name

                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)

        # special case the position embedding parameter in the root GPT module as not decayed
        no_decay.add('pos_emb')

        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
        assert len(param_dict.keys() - union_params) == 0, \
            "parameters %s were not separated into either decay/no_decay set!" % (str(param_dict.keys() - union_params), )

        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer

    def forward(self, idx, targets=None):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."

        # forward the GPT model
        token_embeddings = self.tok_emb(idx)  # each index maps to a (learnable) vector
        position_embeddings = self.pos_emb[:, :t, :]  # each position maps to a (learnable) vector
        x = self.drop(token_embeddings + position_embeddings)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss
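
# ---------------------------------------------------------------------------
# Illustrative usage sketch, tying together the pieces described in the module
# docstring: a config, the GPT model, and a forward pass that returns logits
# and (when targets are given) a cross-entropy loss. The sizes below are
# arbitrary assumptions for demonstration, not values from the original file.
def _usage_sketch():
    config = GPTConfig(vocab_size=100, block_size=32, n_layer=2, n_head=2, n_embd=64)
    model = GPT(config)

    idx = torch.randint(0, config.vocab_size, (4, 16))      # (batch, time) token indices
    targets = torch.randint(0, config.vocab_size, (4, 16))  # next-token targets, same shape

    logits, loss = model(idx, targets)  # logits: (4, 16, vocab_size); loss: scalar tensor
    return logits, loss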
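
# Illustrative sketch of GPT.configure_optimizers: it accepts any object that
# exposes weight_decay, learning_rate and betas (a SimpleNamespace stands in
# for a trainer config here; the values are assumptions for demonstration).
# Linear weights go into the weight-decay group; biases, LayerNorm/Embedding
# weights and pos_emb go into the no-decay group.
def _optimizer_sketch(model):
    from types import SimpleNamespace
    train_config = SimpleNamespace(weight_decay=0.1, learning_rate=3e-4, betas=(0.9, 0.95))
    return model.configure_optimizers(train_config)  # torch.optim.AdamW with two param groups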