Path: blob/main/course/zh-CN/chapter7/section6_tf.ipynb
Training a causal language model from scratch (TensorFlow)
Install the Transformers, Datasets, and Evaluate libraries to run this notebook.
In [ ]:
You will need to set up git. Adapt your email and name in the following cell.
In [ ]:
You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.
In [ ]:
In [ ]:
In [ ]:
False True
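The `False True` output above is consistent with a simple keyword check applied to two import lines. A minimal sketch, assuming a helper named `any_keyword_in_string` and a filter list of data science library names. Both are illustrative, since the cell that produced the output is not shown on this page:

```python
def any_keyword_in_string(string, keywords):
    """Return True if any of the keywords occurs as a substring of `string`."""
    return any(keyword in string for keyword in keywords)


# Hypothetical filter list and example strings, chosen for illustration only.
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
example_1 = "import numpy as np"
example_2 = "import pandas as pd"

print(
    any_keyword_in_string(example_1, filters),
    any_keyword_in_string(example_2, filters),
)
```

The first string mentions none of the listed libraries and the second mentions `pandas`, so the two calls return `False` and `True` respectively.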
In [ ]:
In [ ]:
3.26% of data after filtering.
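A percentage like the one above can be obtained by counting how many samples survive a keyword filter. The sketch below is one possible way to do this in plain Python. The function name `filter_by_keywords`, the toy samples, and the keyword list are assumptions made for illustration and do not come from the original cell:

```python
from collections import defaultdict


def filter_by_keywords(samples, keywords):
    """Keep samples whose "content" mentions any keyword; also return the kept ratio."""
    kept = defaultdict(list)
    total = 0
    n_kept = 0
    for sample in samples:
        total += 1
        if any(keyword in sample["content"] for keyword in keywords):
            n_kept += 1
            # Store the sample column by column, like a columnar dataset.
            for column, value in sample.items():
                kept[column].append(value)
    ratio = n_kept / total if total else 0.0
    return dict(kept), ratio


# Toy stand-in for a stream of source files (hypothetical data).
toy_samples = [
    {"path": "a.py", "content": "import os"},
    {"path": "b.py", "content": "import pandas as pd"},
    {"path": "c.py", "content": "print('hello')"},
    {"path": "d.py", "content": "import sys"},
]
filtered, ratio = filter_by_keywords(toy_samples, ["pandas", "sklearn"])
print(f"{ratio:.2%} of data after filtering.")
```

Iterating sample by sample means the full corpus never has to be held in memory, which matters when only a few percent of a large dataset is kept.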
In [ ]:
DatasetDict({
train: Dataset({
features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
num_rows: 606720
})
valid: Dataset({
features: ['repo_name', 'path', 'copies', 'size', 'content', 'license'],
num_rows: 3322
})
})
In [ ]:
'REPO_NAME: kmike/scikit-learn'
'PATH: sklearn/utils/__init__.py'
'COPIES: 3'
'SIZE: 10094'
'''CONTENT: """
The :mod:`sklearn.utils` module includes various utilites.
"""
from collections import Sequence
import numpy as np
from scipy.sparse import issparse
import warnings
from .murmurhash import murm
LICENSE: bsd-3-clause'''
In [ ]:
Input IDs length: 34
Input chunk lengths: [128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 117, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 41]
Chunk mapping: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
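The lengths and mapping above show two documents being split into windows of at most 128 tokens, with each window remembering which document it came from. The sketch below reproduces that bookkeeping in plain Python, without a tokenizer. The function name `chunk_with_mapping` is hypothetical, and the document lengths (2549 and 1705 tokens) are inferred from the printed chunk lengths rather than taken from the original cell. With a Hugging Face tokenizer, similar information is returned when `return_overflowing_tokens=True` and `return_length=True` are passed.

```python
def chunk_with_mapping(sequences, context_length):
    """Split each token sequence into windows of at most `context_length` tokens.

    Returns the chunks and, for every chunk, the index of its source sequence.
    """
    chunks, mapping = [], []
    for doc_index, ids in enumerate(sequences):
        for start in range(0, len(ids), context_length):
            chunks.append(ids[start:start + context_length])
            mapping.append(doc_index)
    return chunks, mapping


context_length = 128
# Stand-in token IDs; only the lengths matter for this illustration.
docs = [list(range(2549)), list(range(1705))]

chunks, mapping = chunk_with_mapping(docs, context_length)
lengths = [len(chunk) for chunk in chunks]
# Optionally drop the shorter trailing chunks so every example has equal length.
full_chunks = [chunk for chunk in chunks if len(chunk) == context_length]

print(f"Input IDs length: {len(chunks)}")
print(f"Input chunk lengths: {lengths}")
print(f"Chunk mapping: {mapping}")
```

Keeping only full-length chunks discards the trailing fragment of each document (117 and 41 tokens here) but avoids padding altogether.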
In [ ]:
DatasetDict({
train: Dataset({
features: ['input_ids'],
num_rows: 16702061
})
valid: Dataset({
features: ['input_ids'],
num_rows: 93164
})
})
In [ ]:
In [ ]:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
transformer (TFGPT2MainLayer multiple                  124242432
=================================================================
Total params: 124,242,432
Trainable params: 124,242,432
Non-trainable params: 0
_________________________________________________________________
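The total in the summary can be checked by hand from the shape of a GPT-2 style network. The sketch below is a back-of-the-envelope count, assuming a vocabulary of 50,000 tokens, 1,024 positions, a hidden size of 768, and 12 layers. None of these values is shown on this page; they are assumptions that happen to reproduce the reported total, and the function name is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count the parameters of a GPT-2 style transformer (embeddings tied to the output)."""
    wte = vocab_size * n_embd             # token embedding matrix
    wpe = n_positions * n_embd            # position embedding matrix
    layer_norm = 2 * n_embd               # scale and bias
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd   # fused query/key/value projection
    attn_proj = n_embd * n_embd + n_embd          # attention output projection
    mlp_in = n_embd * 4 * n_embd + 4 * n_embd     # feed-forward expansion
    mlp_out = 4 * n_embd * n_embd + n_embd        # feed-forward contraction
    per_block = 2 * layer_norm + attn_qkv + attn_proj + mlp_in + mlp_out
    # One final layer norm follows the last block.
    return wte + wpe + n_layer * per_block + layer_norm


# Assumed hyperparameters (not shown in the page).
total = gpt2_param_count(vocab_size=50_000, n_positions=1024, n_embd=768, n_layer=12)
print(f"Total params: {total:,}")
```

Under these assumptions the token embeddings account for 38.4 million parameters, so the vocabulary size is the main lever on model size besides depth and width.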
In [ ]:
In [ ]:
input_ids shape: (5, 128)
attention_mask shape: (5, 128)
labels shape: (5, 128)
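The three shapes above are identical because, for causal language modeling, the labels are a copy of the inputs; the one-token shift between inputs and targets is typically applied inside the model when the loss is computed. The sketch below is a simplified plain-Python stand-in for a language modeling collator, not the library's implementation. The function names, `pad_token_id=0`, and the toy batch are assumptions made for illustration.

```python
def collate_causal_lm(examples, pad_token_id=0, ignore_index=-100):
    """Pad a list of token-ID lists and build labels as a copy of the inputs."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(list(ids) + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Padded positions get `ignore_index` so they do not contribute to the loss.
        batch["labels"].append(list(ids) + [ignore_index] * n_pad)
    return batch


def shape(matrix):
    """Return (rows, columns) of a rectangular list of lists."""
    return (len(matrix), len(matrix[0]))


# Five toy examples of 128 tokens each (hypothetical data).
examples = [list(range(128)) for _ in range(5)]
batch = collate_causal_lm(examples)
for key, value in batch.items():
    print(f"{key} shape: {shape(value)}")
```

Because every example here already has 128 tokens, no padding is added and the labels equal the input IDs exactly.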
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create scatter plot with x, y
plt.scatter(x, y)
# create scatter
In [ ]:
# create some data
x = np.random.randn(100)
y = np.random.randn(100)
# create dataframe from x and y
df = pd.DataFrame({'x': x, 'y': y})
df.insert(0,'x', x)
for
In [ ]:
# dataframe with profession, income and name
df = pd.DataFrame({'profession': x, 'income':y, 'name': z})
# calculate the mean income per profession
profession = df.groupby(['profession']).mean()
# compute the
In [ ]:
# import random forest regressor from scikit-learn
from sklearn.ensemble import RandomForestRegressor
# fit random forest model with 300 estimators on X, y:
rf = RandomForestRegressor(n_estimators=300, random_state=random_state, max_depth=3)
rf.fit(X, y)
rf