Logistic Regression Classifier for Text
Obtaining the IMDb movie review dataset
Preprocessing the movie dataset into a more convenient format
Install pyprind by uncommenting the next code cell.
Collecting pyprind
Using cached PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
WARNING: Error parsing requirements for soupsieve: [Errno 2] No such file or directory: '/Users/sebastian/miniforge3/lib/python3.10/site-packages/soupsieve-2.3.2.post1.dist-info/METADATA'
DEPRECATION: omegaconf 2.0.6 has a non-standard dependency specifier PyYAML>=5.1.*. pip 23.3 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of omegaconf or contact the author to suggest that they release a version with a conforming dependency specifiers. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3
Shuffling the DataFrame:
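A minimal sketch of one way to shuffle the rows reproducibly. The tiny `df` below and its column names `review` and `sentiment` are stand-ins assumed for illustration, not the assembled dataset itself, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the assembled review DataFrame.
df = pd.DataFrame({
    "review": ["great film", "terrible plot", "loved it", "boring"],
    "sentiment": [1, 0, 1, 0],
})

# Fixed seed (arbitrary choice) so the shuffle is reproducible.
rng = np.random.RandomState(0)

# Reorder the rows by a random permutation of the index,
# then renumber the index from 0 so it no longer reflects the old order.
shuffled = df.reindex(rng.permutation(df.index)).reset_index(drop=True)
```

Shuffling matters here if the rows were appended class by class, since an unshuffled frame would otherwise give ordered splits.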
Optional: Saving the assembled data as a CSV file:
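As a rough sketch, the save and reload round trip could look like the following. The file name `movie_data.csv`, the temporary directory, and the small `df` are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the assembled review DataFrame.
df = pd.DataFrame({
    "review": ["Great film, loved it", "Terrible plot"],
    "sentiment": [1, 0],
})

# Write to a temporary directory here; in practice any writable path works.
csv_path = os.path.join(tempfile.mkdtemp(), "movie_data.csv")

# index=False keeps the row index out of the file.
df.to_csv(csv_path, index=False, encoding="utf-8")

# Reload to confirm the file can be read back.
restored = pd.read_csv(csv_path, encoding="utf-8")
```

Saving once means the raw review files do not have to be parsed again in later sessions.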
Training a logistic regression model for document classification
Strip HTML and punctuation to speed up the GridSearch later:
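One possible way to do this with regular expressions is sketched below. The function name `strip_html_and_punctuation` is hypothetical, and the patterns are a simple assumption rather than the notebook's own preprocessing code; for example, this version does not try to preserve emoticons.

```python
import re


def strip_html_and_punctuation(text):
    """Remove HTML tags, lowercase, and collapse non-word characters to spaces."""
    # Replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]*>", " ", text)
    # Lowercase, then turn runs of punctuation, underscores, and whitespace into one space.
    text = re.sub(r"[\W_]+", " ", text.lower())
    return text.strip()


cleaned = strip_html_and_punctuation("<br />This movie was <b>great</b>!!!")
```

Cleaning the text once, before the grid search, avoids repeating the same regex work for every parameter combination and cross-validation fold.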
Important Note about n_jobs
Please note that it is highly recommended to use n_jobs=-1 (instead of n_jobs=1) in the previous code example to utilize all available cores on your machine and speed up the grid search. However, some Windows users reported issues when running the previous code with n_jobs=-1; these were related to pickling the tokenizer and tokenizer_porter functions for multiprocessing on Windows. One workaround is to fall back to n_jobs=1. Another is to replace those two functions, [tokenizer, tokenizer_porter], with [str.split]. Note, however, that the simple str.split does not support stemming.
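The str.split workaround can be sketched as follows. This is a minimal illustration under assumptions: the toy `texts` and `labels`, the step names `vect` and `clf`, and the small parameter grid are invented here and are not the notebook's actual grid. It uses n_jobs=1 so it runs anywhere; on a machine without the pickling issue, n_jobs=-1 could be substituted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (assumption): 1 = positive review, 0 = negative review.
texts = [
    "a great film", "great acting and a great story",
    "wonderful and moving", "a wonderful great cast",
    "a terrible film", "terrible acting and a boring story",
    "boring and slow", "a boring terrible script",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vect", TfidfVectorizer(lowercase=False)),
    ("clf", LogisticRegression(solver="liblinear")),
])

# str.split is a built-in, so it avoids pickling user-defined tokenizer functions.
# It only splits on whitespace and performs no stemming.
param_grid = {
    "vect__tokenizer": [str.split],
    "clf__C": [1.0, 10.0],
}

gs = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2, n_jobs=1)
gs.fit(texts, labels)
prediction = gs.predict(["a great film"])
```

scikit-learn may emit a warning that `token_pattern` is unused when a tokenizer is supplied; that warning does not affect the result.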