% Path: blob/main/a2/a2_latex_template/a2_latex_template.tex
\documentclass{article}

\newif\ifanswers
\answerstrue % comment out to hide answers

\usepackage[compact]{titlesec}
\usepackage{fancyhdr} % Required for custom headers
\usepackage{lastpage} % Required to determine the last page for the footer
\usepackage{extramarks} % Required for headers and footers
\usepackage[usenames,dvipsnames]{color} % Required for custom colors
\usepackage{graphicx} % Required to insert images
\usepackage{listings} % Required for insertion of code
\usepackage{courier} % Required for the courier font
\usepackage{lipsum} % Used for inserting dummy 'Lorem ipsum' text into the template
\usepackage{enumerate}
\usepackage{enumitem}
\usepackage{subfigure}
\usepackage{booktabs}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{caption}
\usepackage{hyperref}
\captionsetup[table]{skip=4pt}
\usepackage{framed}
\usepackage{bm}
\usepackage{minted}
\usepackage{soul}

\usepackage{tikz}
\usetikzlibrary{positioning,patterns,fit}

% Margins
\topmargin=-0.45in
\evensidemargin=0in
\oddsidemargin=0in
\textwidth=6.5in
\textheight=9.0in
\headsep=0.25in

\linespread{1.1} % Line spacing

% Set up the header and footer
\pagestyle{fancy}
\rhead{\hmwkAuthorName} % Top right header
\lhead{\hmwkClass: \hmwkTitle} % Top left header
\lfoot{\lastxmark} % Bottom left footer
\cfoot{} % Bottom center footer
\rfoot{Page\ \thepage\ of\ \protect\pageref{LastPage}} % Bottom right footer
\renewcommand\headrulewidth{0.4pt} % Size of the header rule
\renewcommand\footrulewidth{0.4pt} % Size of the footer rule

\setlength\parindent{0pt} % Removes all indentation from paragraphs

\newenvironment{answer}{
% Use this environment to write out your solutions.
{\bf Answer:} \sf \begingroup\color{red}
}{\endgroup}%
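% Example usage in the document body (this mirrors the shaded answer blocks that
% appear throughout this template; the text inside is placeholder content):
%
%   \begin{shaded}
%   \begin{answer}
%   Your written solution goes here.
%   \end{answer}
%   \end{shaded}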
%----------------------------------------------------------------------------------------
% CODE INCLUSION CONFIGURATION
%----------------------------------------------------------------------------------------

\definecolor{MyDarkGreen}{rgb}{0.0,0.4,0.0} % This is the color used for comments
\definecolor{shadecolor}{gray}{0.9}

\lstloadlanguages{Python} % Load Python syntax for listings; for a list of other supported languages see: ftp://ftp.tex.ac.uk/tex-archive/macros/latex/contrib/listings/listings.pdf
\lstset{language=Python, % Use Python for listings
frame=single, % Single frame around code
basicstyle=\footnotesize\ttfamily, % Use small true type font
keywordstyle=[1]\color{Blue}\bf, % Built-in functions bold and blue
keywordstyle=[2]\color{Purple}, % Function arguments purple
keywordstyle=[3]\color{Blue}\underbar, % Custom functions underlined and blue
identifierstyle=, % Nothing special about identifiers
commentstyle=\usefont{T1}{pcr}{m}{sl}\color{MyDarkGreen}\small, % Comments small dark green courier font
stringstyle=\color{Purple}, % Strings are purple
showstringspaces=false, % Don't put marks in string spaces
tabsize=5, % 5 spaces per tab
%
% Put standard functions not included in the default language here
morekeywords={rand},
%
% Put function parameters here
morekeywords=[2]{on, off, interp},
%
% Put user-defined functions here
morekeywords=[3]{test},
%
morecomment=[l][\color{Blue}]{...}, % Line continuation (...) like blue comment
numbers=left, % Line numbers on left
firstnumber=1, % Line numbers start with line 1
numberstyle=\tiny\color{Blue}, % Line numbers are blue and small
stepnumber=5 % Line numbers go in steps of 5
}

% Creates a new command to include a Perl script; the first parameter is the filename of the script (without .pl), the second is the caption
\newcommand{\perlscript}[2]{
\begin{itemize}
\item[]\lstinputlisting[caption=#2,label=#1]{#1.pl}
\end{itemize}
}

%----------------------------------------------------------------------------------------
% NAME AND CLASS SECTION
%----------------------------------------------------------------------------------------

\newcommand{\hmwkTitle}{word2vec (49 Points)} % Assignment title
\newcommand{\hmwkClass}{CS\ 224n Assignment \#2} % Course/class
\newcommand{\hmwkAuthorName}{} % Your name -- fill this in
%\newcommand{\hmwkAuthorName}{Abigail See, Sahil Chopra}

\newcommand{\ifans}[1]{\ifanswers \color{red} \vspace{5mm} \textbf{Solution: } #1 \color{black} \vspace{5mm} \fi}

% Chris' notes
\definecolor{CMpurple}{rgb}{0.6,0.18,0.64}
\newcommand\cm[1]{\textcolor{CMpurple}{\small\textsf{\bfseries CM\@: #1}}}
\newcommand\cmm[1]{\marginpar{\small\raggedright\textcolor{CMpurple}{\textsf{\bfseries CM\@: #1}}}}

%----------------------------------------------------------------------------------------
% TITLE PAGE
%----------------------------------------------------------------------------------------
\title{
\vspace{-1in}
\textmd{\textbf{\hmwkClass:\ \hmwkTitle} \\ \hmwkAuthorName}\\
}
\author{}
%\date{\textit{\small Updated \today\ at \currenttime}} % Insert date here if you want it to appear below your name
\date{}

\setcounter{section}{0} % one-indexing
\begin{document}

\maketitle
\vspace{-.7in}

\begin{center}
\large{\textbf{Due on} Tuesday Jan.\ 24, 2023 by \textbf{4:30pm (before class)}}
\end{center}

\section{Written: Understanding word2vec (31 points)}
Recall that the key insight behind {\tt word2vec} is that \textit{`a word is known by the company it keeps'}. Concretely, consider a `center' word $c$ surrounded before and after by a context of a certain length. We term words in this contextual window `outside words' ($O$). For example, in Figure~\ref{fig:word2vec}, the context window length is 2, the center word $c$ is `banking', and the outside words are `turning', `into', `crises', and `as':

\begin{figure}[h]
\centering
\includegraphics[width=0.6\textwidth]{word2vec.png}
\caption{The word2vec skip-gram prediction model with window size 2}
\label{fig:word2vec}
\end{figure}

Skip-gram {\tt word2vec} aims to learn the probability distribution $P(O|C)$.
Specifically, given a specific word $o$ and a specific word $c$, we want to predict $P(O=o|C=c)$: the probability that word $o$ is an `outside' word for $c$ (i.e., that it falls within the contextual window of $c$).
We model this probability by taking the softmax function over a series of vector dot-products: % The word "softmax" is spelled out here because many students will have forgotten what softmax is and why the loss function is called naive softmax; trim this if it reads as too wordy.
\begin{equation}
P(O=o \mid C=c) = \frac{\exp(\bm u_{o}^\top \bm v_c)}{\sum_{w \in \text{Vocab}} \exp(\bm u_{w}^\top \bm v_c)}
\label{word2vec_condprob}
\end{equation}

For each word, we learn vectors $\bm u$ and $\bm v$, where $\bm u_o$ is the `outside' vector representing outside word $o$, and $\bm v_c$ is the `center' vector representing center word $c$.
We store these parameters in two matrices, $\bm U$ and $\bm V$.
The columns of $\bm U$ are all the `outside' vectors $\bm u_{w}$;
the columns of $\bm V$ are all of the `center' vectors $\bm v_{w}$.
Both $\bm U$ and $\bm V$ contain a vector for every $w \in \text{Vocabulary}$.\footnote{Assume that every word in our vocabulary is matched to an integer number $k$. Bolded lowercase letters represent vectors. $\bm u_{k}$ is both the $k^{th}$ column of $\bm U$ and the `outside' word vector for the word indexed by $k$. $\bm v_k$ is both the $k^{th}$ column of $\bm V$ and the `center' word vector for the word indexed by $k$. \textbf{In order to simplify notation we shall interchangeably use $k$ to refer to word $k$ and the index of word $k$.}}\newline

%We can think of the probability distribution $P(O|C)$ as a prediction function that we can approximate via supervised learning. For any training example, we will have a single $o$ and $c$. We will then compute a value $P(O=o|C=c)$ and report the loss.
Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:

\begin{equation}
\bm J_{\text{naive-softmax}}(\bm v_c, o, \bm U) = -\log P(O=o| C=c).
\label{naive-softmax}
\end{equation}

We can view this loss as the cross-entropy\footnote{The \textbf{cross-entropy loss} between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_i p_i \log(q_i)$.} between the true distribution $\bm y$ and the predicted distribution $\hat{\bm y}$, for a particular center word $c$ and a particular outside word $o$.
Here, both $\bm y$ and $\hat{\bm y}$ are vectors with length equal to the number of words in the vocabulary.
Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an `outside word' for the given $c$.
The true empirical distribution $\bm y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else, for this particular example of center word $c$ and outside word $o$.\footnote{Note that the true conditional probability distribution of context words for the entire training dataset would not be one-hot.}
The predicted distribution $\hat{\bm y}$ is the probability distribution $P(O|C=c)$ given by our model in Equation~(\ref{word2vec_condprob}). \newline

\textbf{Note:} Throughout this homework, when computing derivatives, please use the method reviewed during the lecture (i.e., no Taylor series approximations).
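
For concreteness (this is not part of any graded question), Equations~\ref{word2vec_condprob} and \ref{naive-softmax} can be prototyped in a few lines of \texttt{numpy}. The sketch below is only illustrative: the names are hypothetical, it follows the column convention above (\texttt{U} has shape $d \times |\text{Vocab}|$ with the outside vectors as columns), and the starter code may organize its parameters differently.

\begin{minted}{python}
import numpy as np

def naive_softmax_loss(U, v_c, o):
    """Illustrative sketch of the softmax probability and naive-softmax loss above;
    not the reference solution.

    U   : (d, |Vocab|) array whose columns are the outside vectors u_w (assumed layout)
    v_c : (d,) center word vector
    o   : integer index of the true outside word
    """
    scores = U.T @ v_c                    # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()                # shift for numerical stability; softmax is unchanged
    y_hat = np.exp(scores) / np.exp(scores).sum()   # predicted distribution P(O | C = c)
    loss = -np.log(y_hat[o])              # naive-softmax loss: -log P(O = o | C = c)
    return loss, y_hat
\end{minted}
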
\clearpage
\begin{enumerate}[label=(\alph*)]
% Question 1-A
\item (2 points)
Prove that the naive-softmax loss (Equation \ref{naive-softmax}) is the same as the cross-entropy loss between $\bm y$ and $\hat{\bm y}$, i.e.\ (note that $\bm y, \hat{\bm y}$ are vectors and $\hat{\bm y}_o$ is a scalar):

\begin{equation}
-\sum_{w \in \text{Vocab}} \bm y_w \log(\hat{\bm y}_w) = - \log (\hat{\bm y}_o).
\end{equation}

Your answer should be one line. You may describe your answer in words.
\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-B
\item (7 points)
\begin{enumerate}[label=(\roman*)]
\item
Compute the partial derivative of $\bm J_{\text{naive-softmax}}(\bm v_c, o, \bm U)$ with respect to $\bm v_c$. \ul{Please write your answer in terms of $\bm y$, $\hat{\bm y}$, and $\bm U$, and show your work to receive full credit}.
\begin{itemize}
\item \textbf{Note}: Your final answer for the partial derivative should follow the shape convention: the partial derivative of any function $f(x)$ with respect to $x$ should have the \textbf{same shape} as $x$.\footnote{This allows us to efficiently minimize a function using gradient descent without worrying about reshaping or dimension mismatching. While following the shape convention, we're guaranteed that $\theta:= \theta - \alpha\frac{\partial J(\theta)}{\partial \theta}$ is a well-defined update rule.}
\item Please provide your answer for the partial derivative in vectorized form. For example, when we ask you to write your answer in terms of $\bm y$, $\hat{\bm y}$, and $\bm U$, you may not refer to specific elements of these terms in your final answer (such as $\bm y_1$, $\bm y_2$, $\dots$).
\end{itemize}
\item
When is the gradient you computed equal to zero? \\
\textbf{Hint:} You may wish to review and use some introductory linear algebra concepts.
\item
The gradient you found is the difference between two terms. Provide an interpretation of how each of these terms improves the word vector when this gradient is subtracted from the word vector $\bm v_c$.
\item
In many downstream applications that use word embeddings, L2-normalized vectors (e.g.\ $\mathbf{u}/||\mathbf{u}||_2$, where $||\mathbf{u}||_2 = \sqrt{\sum_i u_i^2}$) are used instead of their raw forms (e.g.\ $\mathbf{u}$). Now, suppose you would like to classify phrases as being positive or negative. When would L2 normalization take away useful information for the downstream task? When would it not? \\
\textbf{Hint:} Consider the case where $\mathbf{u}_x = \alpha\mathbf{u}_y$ for some words $x \neq y$ and some scalar $\alpha$.

\end{enumerate}

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-C
\item (5 points) Compute the partial derivatives of $\bm J_{\text{naive-softmax}}(\bm v_c, o, \bm U)$ with respect to each of the `outside' word vectors, $\bm u_w$. There will be two cases: when $w=o$, the true `outside' word vector, and when $w \neq o$, for all other words. Please write your answer in terms of $\bm y$, $\hat{\bm y}$, and $\bm v_c$. In this subpart, you may use specific elements within these terms as well (such as $\bm y_1$, $\bm y_2$, $\dots$). Note that $\bm u_w$ is a vector while $\bm y_1, \bm y_2, \dots$ are scalars. Show your work to receive full credit.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-D
\item (1 point) Write down the partial derivative of $\bm J_{\text{naive-softmax}}(\bm v_c, o, \bm U)$ with respect to $\bm U$. Please break down your answer in terms of the column vectors $\frac{\partial \bm J(\bm v_c, o, \bm U)}{\partial \bm u_1}$, $\frac{\partial \bm J(\bm v_c, o, \bm U)}{\partial \bm u_2}$, $\cdots$, $\frac{\partial \bm J(\bm v_c, o, \bm U)}{\partial \bm u_{|\text{Vocab}|}}$.
No derivations are necessary; just give an answer in the form of a matrix.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-E
\item (2 points) The Leaky ReLU (Leaky Rectified Linear Unit) activation function is given by Equation \ref{LeakyReLU} and shown in Figure~\ref{fig:leaky_relu}:
\begin{equation}
\label{LeakyReLU}
f(x) = \max(\alpha x, x)
\end{equation}

\begin{figure}[h]
\centering
\includegraphics[width=0.3\textwidth]{leaky_relu_graph.png}
\caption{Leaky ReLU}
\label{fig:leaky_relu}
\end{figure}

Here $x$ is a scalar and $0<\alpha <1$. Please compute the derivative of $f(x)$ with respect to $x$. You may ignore the case where the derivative is not defined at 0.\footnote{If you're interested in how to handle the derivative at this point, you can read more about the notion of \href{https://en.wikipedia.org/wiki/Subderivative}{subderivatives}.}

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-F
\item (3 points) The sigmoid function is given by Equation \ref{Sigmoid Function}:

\begin{equation}
\label{Sigmoid Function}
\sigma (x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}
\end{equation}

Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a scalar. Please write your answer in terms of $\sigma(x)$. Show your work to receive full credit.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-G
\item (6 points) Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$, and to their outside vectors as $\bm u_{w_1}, \bm u_{w_2}, \dots, \bm u_{w_K}$.\footnote{Note: In the notation for parts (g) and (h), we are using words, not word indices, as subscripts for the outside word vectors.} For this question, assume that the $K$ negative samples are distinct. In other words, $i\neq j$ implies $w_i\neq w_j$ for $i,j\in\{1,\dots,K\}$.
Note that $o\notin\{w_1, \dots, w_K\}$.
For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

\begin{equation}
\bm J_{\text{neg-sample}}(\bm v_c, o, \bm U) = -\log(\sigma(\bm u_o^\top \bm v_c)) - \sum_{s=1}^K \log(\sigma(-\bm u_{w_s}^\top \bm v_c))
\end{equation}
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.\footnote{Note: The loss function here is the negative of what Mikolov et al.\ had in their original paper, because we are doing a minimization instead of a maximization in our assignment code. Ultimately, this is the same objective function.} (A short illustrative code sketch of this loss appears right after this list of questions.)

\begin{enumerate}[label=(\roman*)]
\item Please repeat parts (b) and (c), computing the partial derivatives of $\bm J_{\text{neg-sample}}$ with respect to $\bm v_c$, with respect to $\bm u_o$, and with respect to the $s^{th}$ negative sample $\bm u_{w_s}$. Please write your answers in terms of the vectors $\bm v_c$, $\bm u_o$, and $\bm u_{w_s}$, where $s \in [1, K]$. Show your work to receive full credit. \textbf{Note:} you should be able to use your solution to part (f) to help compute the necessary gradients here.

\item In lecture, we learned that an efficient implementation of backpropagation leverages the re-use of previously-computed partial derivatives. Which quantity could you reuse amongst the three partial derivatives calculated above to minimize duplicate computation? Write your answer in terms of \\ $\bm{U}_{o, \{w_1, \dots, w_K\}} = \begin{bmatrix} \bm{u}_o, -\bm{u}_{w_1}, \dots, -\bm{u}_{w_K} \end{bmatrix}$, a matrix with the outside vectors stacked as columns, and $\bm{1}$, a $(K + 1) \times 1$ vector of 1's.\footnote{Note: NumPy will automatically broadcast 1 to a vector of 1's if the computation requires it, so you generally don't have to construct $\bm{1}$ on your own during implementation.}
Additional terms and functions (other than $\bm{U}_{o, \{w_1, \dots, w_K\}}$ and $\bm{1}$) can be used in your solution.
\item Describe in one sentence why this loss function is much more efficient to compute than the naive-softmax loss.
\end{enumerate}

Caveat: So far we have looked at re-using quantities and approximating the softmax with sampling for faster gradient descent. Do note that some of these optimizations might not be necessary on modern GPUs and are, to some extent, artifacts of the limited compute resources available at the time these algorithms were developed.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-H
\item (2 points) Now we will repeat the previous exercise, but without the assumption that the $K$ sampled words are distinct. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and to their outside vectors as $\bm u_{w_1}, \dots, \bm u_{w_K}$. In this question, you may not assume that the words are distinct. In other words, $w_i=w_j$ may be true when $i\neq j$.
Note that $o\notin\{w_1, \dots, w_K\}$.
For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:

\begin{equation}
\bm J_{\text{neg-sample}}(\bm v_c, o, \bm U) = -\log(\sigma(\bm u_o^\top \bm v_c)) - \sum_{s=1}^K \log(\sigma(-\bm u_{w_s}^\top \bm v_c))
\end{equation}
for a sample $w_1, \ldots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.

Compute the partial derivative of $\bm J_{\text{neg-sample}}$ with respect to a negative sample $\bm u_{w_s}$. Please write your answer in terms of the vectors $\bm v_c$ and $\bm u_{w_s}$, where $s \in [1, K]$. Show your work to receive full credit. \textbf{Hint:} break the sum in the loss function into two sums: a sum over all sampled words equal to $w_s$ and a sum over all sampled words not equal to $w_s$. Notation-wise, you may write `equal' and `not equal' conditions below the summation symbols, as in Equation \ref{skip-gram}.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

% Question 1-I
\item (3 points) Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}$, $\ldots$, $w_{t-1}$, $w_{t}$, $w_{t+1}$, $\ldots$, $w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of {\tt word2vec}, the total loss for the context window is:
\begin{equation}
\label{skip-gram}
\bm J_{\textrm{skip-gram}}(\bm v_c, w_{t-m},\ldots, w_{t+m}, \bm U) = \sum_{\substack{-m\le j \le m \\ j\ne 0}} \bm J(\bm v_c, w_{t+j}, \bm U)
\end{equation}

Here, $\bm J(\bm v_c, w_{t+j}, \bm U)$ represents an arbitrary loss term for the center word $c=w_t$ and outside word $w_{t+j}$. $\bm J(\bm v_c, w_{t+j}, \bm U)$ could be $\bm J_{\text{naive-softmax}}(\bm v_c, w_{t+j}, \bm U)$ or $\bm J_{\text{neg-sample}}(\bm v_c, w_{t+j}, \bm U)$, depending on your implementation.

Write down three partial derivatives:
\begin{enumerate}[label=(\roman*)]
\item ${\frac{\partial \bm J_{\textrm{skip-gram}}(\bm v_c, w_{t-m},\ldots, w_{t+m}, \bm U)} {\partial \bm U}}$
\item ${\frac{\partial \bm J_{\textrm{skip-gram}}(\bm v_c, w_{t-m},\ldots, w_{t+m}, \bm U)} {\partial \bm v_c}}$
\item ${\frac{\partial \bm J_{\textrm{skip-gram}}(\bm v_c, w_{t-m},\ldots, w_{t+m}, \bm U)} {\partial \bm v_w}}$ when $w \ne c$
\end{enumerate}
Write your answers in terms of ${\frac{\partial \bm J(\bm v_c, w_{t+j}, \bm U)}{\partial \bm U}}$ and ${\frac{\partial \bm J(\bm v_c, w_{t+j}, \bm U)}{\partial \bm v_c}}$. This is very simple -- each solution should be one line.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

\textit{\textbf{Once you're done:} Given that you computed the derivatives of $\bm J(\bm v_c, w_{t+j}, \bm U)$ with respect to all the model parameters $\bm U$ and $\bm V$ in parts (a) to (c), you have now computed the derivatives of the full loss function $\bm J_{\text{skip-gram}}$ with respect to all parameters. You're ready to implement \texttt{word2vec}!} % we could remove this line, but it is here to make sure students understand why we did all this.

\end{enumerate}
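
Before moving on to the coding section, here is the illustrative sketch of the negative-sampling loss referred to in part (g). It is only a rough, unvectorized prototype under assumed shapes, with hypothetical names (not the reference solution, and not necessarily how the starter code organizes its parameters): \texttt{u\_o} is $\bm u_o$, \texttt{v\_c} is $\bm v_c$, and \texttt{U\_neg} stacks the $K$ negative outside vectors $\bm u_{w_1}, \dots, \bm u_{w_K}$ as rows.

\begin{minted}{python}
import numpy as np

def sigmoid(x):
    # element-wise logistic function sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(u_o, v_c, U_neg):
    """Sketch of the negative-sampling loss from part (g); not the reference solution.

    u_o   : (d,) outside vector of the true outside word o
    v_c   : (d,) center word vector
    U_neg : (K, d) array whose rows are the K negative-sample outside vectors (assumed layout)
    """
    pos = -np.log(sigmoid(u_o @ v_c))             # -log sigma(u_o^T v_c)
    neg = -np.log(sigmoid(-(U_neg @ v_c))).sum()  # -sum_s log sigma(-u_{w_s}^T v_c)
    return pos + neg
\end{minted}
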
\section{Coding: Implementing word2vec (18 points)}
In this part you will implement the word2vec model and train your own word vectors with stochastic gradient descent (SGD). Before you begin, first run the following commands within the assignment directory in order to create the appropriate conda virtual environment. This guarantees that you have all the necessary packages to complete the assignment. \textbf{Windows users} may wish to install the Windows Subsystem for Linux\footnote{\url{https://techcommunity.microsoft.com/t5/windows-11/how-to-install-the-linux-windows-subsystem-in-windows-11/m-p/2701207}}. Also note that you probably want to finish the previous math section before writing the code, since you will be asked to implement the math functions in Python. You'll probably want to implement and test each part of this section in order, since the questions are cumulative.

\begin{minted}{bash}
conda env create -f env.yml
conda activate a2
\end{minted}

Once you are done with the assignment you can deactivate this environment by running:
\begin{minted}{bash}
conda deactivate
\end{minted}

For each of the methods you need to implement, we included in the code comments approximately how many lines of code our solution has. These numbers are there to guide you; you don't have to stick to them, and you can write shorter or longer code as you wish. However, if your implementation is significantly longer than ours, that is a signal that there are some \texttt{numpy} methods you could utilize to make your code both shorter and faster. \texttt{for} loops in Python take a long time to complete when used over large arrays, so we expect you to utilize \texttt{numpy} methods. We will be checking the efficiency of your code. You will be able to see the results of the autograder when you submit your code to \texttt{Gradescope}; we recommend submitting early and often.
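
As a small illustration of what we mean by vectorization (with hypothetical names, unrelated to the starter code), compare an explicit Python loop with a single \texttt{numpy} matrix-vector product for computing every dot product $\bm u_w^\top \bm v_c$ over a toy vocabulary:

\begin{minted}{python}
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((50_000, 100))   # toy matrix: one 100-dimensional vector per "word"
v_c = rng.standard_normal(100)           # toy center vector

# Slow: an explicit Python loop over the vocabulary
scores_loop = np.array([U[w] @ v_c for w in range(U.shape[0])])

# Fast: one vectorized matrix-vector product computes the same scores
scores_vec = U @ v_c

assert np.allclose(scores_loop, scores_vec)
\end{minted}
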
Note: If you are using Windows and have trouble running the .sh scripts used in this part, we recommend trying \href{https://github.com/bmatzelle/gow}{Gow} or manually running the commands in the scripts.

\begin{enumerate}[label=(\alph*)]
% Question 2-A
\item (12 points) We will start by implementing methods in \texttt{word2vec.py}. You can test a particular method by running \texttt{python word2vec.py m} where \texttt{m} is the method you would like to test. For example, you can test the sigmoid method by running \texttt{python word2vec.py sigmoid}.

\begin{enumerate}[label=(\roman*)]
\item Implement the \texttt{sigmoid} method, which takes in a vector and applies the sigmoid function to it.
\item Implement the softmax loss and gradient in the \texttt{naiveSoftmaxLossAndGradient} method.
\item Implement the negative sampling loss and gradient in the \texttt{negSamplingLossAndGradient} method.
\item Implement the skip-gram model in the \texttt{skipgram} method.
\end{enumerate}

When you are done, test your entire implementation by running \texttt{python word2vec.py}.

% Question 2-B
\item (4 points) Complete the implementation of your SGD optimizer in the \texttt{sgd} method of \texttt{sgd.py}. Test your implementation by running \texttt{python sgd.py}.

% Question 2-C
\item (2 points) Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the datasets first. To do this, run \texttt{sh get\_datasets.sh}. There is no additional code to write for this part; just run \texttt{python run.py}.

\emph{Note: The training process may take a long time depending on the efficiency of your implementation and the compute power of your machine \textbf{(an efficient implementation takes one to two hours)}. Plan accordingly!}

After 40,000 iterations, the script will finish and a visualization of your word vectors will appear. It will also be saved as \texttt{word\_vectors.png} in your project directory. \textbf{Include the plot in your homework write-up.} In at most three sentences, briefly explain what you see in the plot. This may include, but is not limited to, observations on clusters and on words that you expect to cluster but do not.

\begin{shaded}
\begin{answer}

\end{answer}
\end{shaded}

\end{enumerate}

\section{Submission Instructions}
You will submit this assignment on Gradescope as two submissions -- one for ``Assignment 2 [coding]'' and another for ``Assignment 2 [written]'':
\begin{enumerate}
\item Run the \texttt{collect\_submission.sh} script to produce your \texttt{assignment2.zip} file.
\item Upload your \texttt{assignment2.zip} file to Gradescope under ``Assignment 2 [coding]''.
\item Upload your written solutions to Gradescope under ``Assignment 2 [written]''.
\end{enumerate}

\end{document}