Path: blob/main/a5/mingpt-demo/mingpt/__pycache__/model.cpython-310.pyc
"""
GPT model:
- the initial stem consists of a combination of token encoding and a positional encoding
- the meat of it is a uniform sequence of Transformer blocks
    - each Transformer is a sequential combination of a 1-hidden-layer MLP block and a self-attention block
    - all blocks feed into a central residual pathway similar to resnets
- the final decoder is a linear projection into a vanilla Softmax classifier
"""

import math
import logging

import torch
import torch.nn as nn
from torch.nn import functional as F

logger = logging.getLogger(__name__)

class GPTConfig:
    """ base GPT config, params common to all GPT versions """
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1

    def __init__(self, vocab_size, block_size, **kwargs):
        self.vocab_size = vocab_size
        self.block_size = block_size
        for k, v in kwargs.items():
            setattr(self, k, v)

class GPT1Config(GPTConfig):
    """ GPT-1 like network roughly 125M params """
    n_layer = 12
    n_head = 12
    n_embd = 768

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x, layer_past=None):
        B, T, C = x.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2)   # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y

# The compiled module also declares a Block(nn.Module) and a GPT(nn.Module) class,
# but their bodies are truncated in this fragment.
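For reference, a minimal usage sketch of the pieces above, assuming the source is importable as mingpt/model.py (as in the mingpt-demo layout). The config values here (vocab_size=100, block_size=32, n_layer=4, n_head=4, n_embd=64) are arbitrary illustration values, not values taken from the assignment.

import torch
from mingpt.model import GPTConfig, CausalSelfAttention  # assumed import path

# build a small config; dropout rates default to the GPTConfig class attributes (0.1)
config = GPTConfig(vocab_size=100, block_size=32, n_layer=4, n_head=4, n_embd=64)

attn = CausalSelfAttention(config)
attn.eval()                           # disable dropout for a deterministic check
x = torch.randn(2, 32, 64)            # (batch, sequence length, n_embd)
with torch.no_grad():
    y = attn(x)
print(y.shape)                        # torch.Size([2, 32, 64])

Because embd_pdrop, resid_pdrop, and attn_pdrop are class attributes of GPTConfig, any of them can be overridden per-instance through the **kwargs path in __init__, while n_layer, n_head, and n_embd must be supplied unless a subclass such as GPT1Config provides them.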