GitHub Repository: yiming-wange/cs224n-2023-solution
Path: blob/main/a4/__pycache__/vocab.cpython-310.pyc
"""
CS224N 2022-23: Homework 4
vocab.py: Vocabulary Generation
Pencheng Yin <[email protected]>
Sahil Chopra <[email protected]>
Vera Lin <[email protected]>
Siyan Li <[email protected]>

Usage:
    vocab.py --train-src=<file> --train-tgt=<file> [options] VOCAB_FILE

Options:
    -h --help                  Show this screen.
    --train-src=<file>         File of training source sentences
    --train-tgt=<file>         File of training target sentences
    --size=<int>               vocab size [default: 50000]
    --freq-cutoff=<int>        frequency cutoff [default: 2]
"""

from collections import Counter
from docopt import docopt
from itertools import chain
import json
import torch
from typing import List
from utils import read_corpus, pad_sents
import sentencepiece as spm


class VocabEntry(object):
    """ Vocabulary Entry, i.e. structure containing either
    src or tgt language terms.
    """
    def __init__(self, word2id=None):
        """ Init VocabEntry Instance.
        @param word2id (dict): dictionary mapping words 2 indices
        """
        if word2id:
            self.word2id = word2id
        else:
            self.word2id = dict()
            self.word2id['<pad>'] = 0   # Pad Token
            self.word2id['<s>'] = 1     # Start Token
            self.word2id['</s>'] = 2    # End Token
            self.word2id['<unk>'] = 3   # Unknown Token
        self.unk_id = self.word2id['<unk>']
        self.id2word = {v: k for k, v in self.word2id.items()}

    def __getitem__(self, word):
        """ Retrieve word's index. Return the index for the unk
        token if the word is out of vocabulary.
        @param word (str): word to look up.
        @returns index (int): index of word
        """
        return self.word2id.get(word, self.unk_id)

    def __contains__(self, word):
        """ Check if word is captured by VocabEntry.
        @param word (str): word to look up
        @returns contains (bool): whether word is contained
        """
        return word in self.word2id

    def __setitem__(self, key, value):
        """ Raise error, if one tries to edit the VocabEntry.
        """
        raise ValueError('vocabulary is readonly')

    def __len__(self):
        """ Compute number of words in VocabEntry.
        @returns len (int): number of words in VocabEntry
        """
        return len(self.word2id)

    def __repr__(self):
        """ Representation of VocabEntry to be used
        when printing the object.
        """
        return 'Vocabulary[size=%d]' % len(self)

    def id2word(self, wid):
        """ Return mapping of index to word.
        @param wid (int): word index
        @returns word (str): word corresponding to index
        """
        return self.id2word[wid]

    def add(self, word):
        """ Add word to VocabEntry, if it is previously unseen.
        @param word (str): word to add to VocabEntry
        @return index (int): index that the word has been assigned
        """
        if word not in self:
            wid = self.word2id[word] = len(self)
            self.id2word[wid] = word
            return wid
        else:
            return self[word]

    def words2indices(self, sents):
        """ Convert list of words or list of sentences of words
        into list or list of list of indices.
        @param sents (list[str] or list[list[str]]): sentence(s) in words
        @return word_ids (list[int] or list[list[int]]): sentence(s) in indices
        """
        if type(sents[0]) == list:
            return [[self[w] for w in s] for s in sents]
        else:
            return [self[w] for w in sents]

    def indices2words(self, word_ids):
        """ Convert list of indices into words.
        @param word_ids (list[int]): list of word ids
        @return sents (list[str]): list of words
        """
        return [self.id2word[w_id] for w_id in word_ids]

    def to_input_tensor(self, sents: List[List[str]], device: torch.device) -> torch.Tensor:
        """ Convert list of sentences (words) into tensor with necessary padding for
        shorter sentences.

        @param sents (List[List[str]]): list of sentences (words)
        @param device: device on which to load the tensor, i.e. CPU or GPU

        @returns sents_var: tensor of (max_sentence_length, batch_size)
        """
        word_ids = self.words2indices(sents)
        sents_t = pad_sents(word_ids, self['<pad>'])
        sents_var = torch.tensor(sents_t, dtype=torch.long, device=device)
        return torch.t(sents_var)

    @staticmethod
    def from_corpus(corpus, size, freq_cutoff=2):
        """ Given a corpus construct a Vocab Entry.
        @param corpus (list[str]): corpus of text produced by read_corpus function
        @param size (int): # of words in vocabulary
        @param freq_cutoff (int): if word occurs n < freq_cutoff times, drop the word
        @returns vocab_entry (VocabEntry): VocabEntry instance produced from provided corpus
        """
        vocab_entry = VocabEntry()
        word_freq = Counter(chain(*corpus))
        valid_words = [w for w, v in word_freq.items() if v >= freq_cutoff]
        print('number of word types: {}, number of word types w/ frequency >= {}: {}'
              .format(len(word_freq), freq_cutoff, len(valid_words)))
        top_k_words = sorted(valid_words, key=lambda w: word_freq[w], reverse=True)[:size]
        for word in top_k_words:
            vocab_entry.add(word)
        return vocab_entry

    @staticmethod
    def from_subword_list(subword_list):
        vocab_entry = VocabEntry()
        for subword in subword_list:
            vocab_entry.add(subword)
        return vocab_entry


class Vocab(object):
    """ Vocab encapsulating src and target languages.
    """
    def __init__(self, src_vocab: VocabEntry, tgt_vocab: VocabEntry):
        """ Init Vocab.
        @param src_vocab (VocabEntry): VocabEntry for source language
        @param tgt_vocab (VocabEntry): VocabEntry for target language
        """
        self.src = src_vocab
        self.tgt = tgt_vocab

    @staticmethod
    def build(src_sents, tgt_sents) -> 'Vocab':
        """ Build Vocabulary.
        @param src_sents (list[str]): Source subwords provided by SentencePiece
        @param tgt_sents (list[str]): Target subwords provided by SentencePiece
        """
        print('initialize source vocabulary ..')
        src = VocabEntry.from_subword_list(src_sents)

        print('initialize target vocabulary ..')
        tgt = VocabEntry.from_subword_list(tgt_sents)

        return Vocab(src, tgt)

    def save(self, file_path):
        """ Save Vocab to file as JSON dump.
        @param file_path (str): file path to vocab file
        """
        with open(file_path, 'w') as f:
            json.dump(dict(src_word2id=self.src.word2id, tgt_word2id=self.tgt.word2id), f, indent=2)

    @staticmethod
    def load(file_path):
        """ Load vocabulary from JSON dump.
        @param file_path (str): file path to vocab file
        @returns Vocab object loaded from JSON dump
        """
        entry = json.load(open(file_path, 'r'))
        src_word2id = entry['src_word2id']
        tgt_word2id = entry['tgt_word2id']

        return Vocab(VocabEntry(src_word2id), VocabEntry(tgt_word2id))

    def __repr__(self):
        """ Representation of Vocab to be used
        when printing the object.
        """
        return 'Vocab(source %d words, target %d words)' % (len(self.src), len(self.tgt))


def get_vocab_list(file_path, source, vocab_size):
    """ Use SentencePiece to tokenize and acquire list of unique subwords.
    @param file_path (str): file path to corpus
    @param source (str): tgt or src
    @param vocab_size: desired vocabulary size
    """
    # Train a SentencePiece model on the corpus, writing {source}.model / {source}.vocab
    spm.SentencePieceTrainer.Train(input=file_path, model_prefix=source, vocab_size=vocab_size)
    sp = spm.SentencePieceProcessor()
    sp.Load('{}.model'.format(source))
    # Collect every learned subword piece as the vocabulary list
    sp_list = [sp.IdToPiece(piece_id) for piece_id in range(sp.GetPieceSize())]
    return sp_list


if __name__ == '__main__':
    args = docopt(__doc__)

    print('read in source sentences: %s' % args['--train-src'])
    print('read in target sentences: %s' % args['--train-tgt'])

    src_sents = get_vocab_list(args['--train-src'], source='src', vocab_size=21000)
    tgt_sents = get_vocab_list(args['--train-tgt'], source='tgt', vocab_size=8000)

    vocab = Vocab.build(src_sents, tgt_sents)
    print('generated vocabulary, source %d words, target %d words' % (len(vocab.src), len(vocab.tgt)))

    vocab.save(args['VOCAB_FILE'])
    print('vocabulary saved to %s' % args['VOCAB_FILE'])
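# ---------------------------------------------------------------------------
# Usage sketch (illustrative only, not part of the module above). The corpus
# paths and the vocab file name are hypothetical placeholders; substitute the
# training files shipped with the assignment.
#
#   python vocab.py --train-src=<path-to-source-corpus> --train-tgt=<path-to-target-corpus> vocab.json
#
# Reloading the saved vocabulary and batching a toy pair of tokenized
# sentences through VocabEntry.to_input_tensor, which pads with '<pad>' and
# returns a (max_sentence_length, batch_size) tensor:
#
#   import torch
#   from vocab import Vocab
#
#   vocab = Vocab.load('vocab.json')
#   batch = [['<s>', 'hello', 'world', '</s>'], ['<s>', 'hi', '</s>']]
#   x = vocab.src.to_input_tensor(batch, device=torch.device('cpu'))
#   print(x.shape)  # torch.Size([4, 2]); tokens missing from the vocab map to '<unk>'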