GitHub Repository: yiming-wange/cs224n-2023-solution
Path: blob/main/a4/__pycache__/utils.cpython-310.pyc
"""
CS224N 2022-23: Homework 4
utils.py: Utility Functions
Pencheng Yin <[email protected]>
Sahil Chopra <[email protected]>
Vera Lin <[email protected]>
Siyan Li <[email protected]>
"""

import math
from typing import List

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import nltk
import sentencepiece as spm


def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
        The paddings should be at the end of each sentence.
    @param sents (list[list[str]]): list of sentences, where each sentence
                                    is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentences in the batch now has equal length.
    """
    sents_padded = []

    max_length = max([len(sent) for sent in sents])
    sents_padded = [sentence + [pad_token] * (max_length - len(sentence)) for sentence in sents]

    return sents_padded


def read_corpus(file_path, source, vocab_size=2500):
    """ Read file, where each sentence is dilineated by a `\n`.
    @param file_path (str): path to file containing corpus
    @param source (str): "tgt" or "src" indicating whether text
        is of the source language or target language
    @param vocab_size (int): number of unique subwords in
        vocabulary when reading and tokenizing
    """
    data = []
    sp = spm.SentencePieceProcessor()
    sp.load('{}.model'.format(source))

    with open(file_path, 'r', encoding='utf8') as f:
        for line in f:
            subword_tokens = sp.encode_as_pieces(line)
            # only wrap the target sentence with <s> and </s> tokens
            if source == 'tgt':
                subword_tokens = ['<s>'] + subword_tokens + ['</s>']
            data.append(subword_tokens)

    return data
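
# Illustrative call (hypothetical file name; assumes a trained SentencePiece model
# named 'src.model' or 'tgt.model' sits in the working directory, matching the
# '{}.model'.format(source) convention used above):
#   tgt_data = read_corpus('train.en', source='tgt')  # sentences wrapped in <s> ... </s>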


def autograder_read_corpus(file_path, source):
    """ Read file, where each sentence is dilineated by a `\n`.
    @param file_path (str): path to file containing corpus
    @param source (str): "tgt" or "src" indicating whether text
        is of the source language or target language
    """
    data = []
    for line in open(file_path):
        sent = nltk.word_tokenize(line)
        # only wrap the target sentence with <s> and </s> tokens
        if source == 'tgt':
            sent = ['<s>'] + sent + ['</s>']
        data.append(sent)

    return data
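
# Note: nltk.word_tokenize depends on the NLTK 'punkt' tokenizer models; if they are
# missing, a one-time nltk.download('punkt') (run outside this module) should fetch them.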


def batch_iter(data, batch_size, shuffle=False):
    """ Yield batches of source and target sentences reverse sorted by length (largest to smallest).
    @param data (list of (src_sent, tgt_sent)): list of tuples containing source and target sentence
    @param batch_size (int): batch size
    @param shuffle (boolean): whether to randomly shuffle the dataset
    """
    batch_num = math.ceil(len(data) / batch_size)
    index_array = list(range(len(data)))

    if shuffle:
        np.random.shuffle(index_array)

    for i in range(batch_num):
        indices = index_array[i * batch_size: (i + 1) * batch_size]
        examples = [data[idx] for idx in indices]

        examples = sorted(examples, key=lambda e: len(e[0]), reverse=True)
        src_sents = [e[0] for e in examples]
        tgt_sents = [e[1] for e in examples]

        yield src_sents, tgt_sents
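

# Illustrative smoke test of pad_sents and batch_iter. The toy sentence pairs and the
# '<pad>' token below are made-up assumptions for demonstration only; they are not part
# of the assignment pipeline.
if __name__ == '__main__':
    toy_pairs = [
        (['a', 'b', 'c'], ['x', 'y']),
        (['d'], ['z']),
        (['e', 'f'], ['w', 'v', 'u']),
    ]
    # batches come back reverse-sorted by source-sentence length
    for src_batch, tgt_batch in batch_iter(toy_pairs, batch_size=2, shuffle=True):
        print('src:', pad_sents(src_batch, '<pad>'))
        print('tgt:', pad_sents(tgt_batch, '<pad>'))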