Create Yelp Reviews data for Sentiment Analysis and Word Embeddings
Imports & Settings
About the Data
The data consists of several files with information on the business, the user, the review and other aspects that Yelp provides to encourage data science innovation.
We will use around six million reviews produced over the 2010-2019 period to extract text features. In addition, we will use other information submitted with the review about the user.
Getting the Data
You can download the data from Yelp in JSON format after accepting the license. The 2020 version contains 4.7GB (compressed) and around 10.5GB (uncompressed) of text data.
After download, extract the following two of the five .json files into ./yelp/json:
yelp_academic_dataset_user.json
yelp_academic_dataset_reviews.json
Rename both files by stripping out the yelp_academic_dataset_ prefix, so that ./yelp/json contains user.json and reviews.json.
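If you prefer not to rename the files by hand, the step above can be sketched with pathlib. The helper name strip_prefix and the example path are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path

# Prefix that Yelp puts in front of every file name in the download.
PREFIX = 'yelp_academic_dataset_'


def strip_prefix(json_dir, prefix=PREFIX):
    """Rename every <prefix><name>.json in json_dir to <name>.json.

    Returns the new file names so the result is easy to check.
    """
    renamed = []
    for path in sorted(Path(json_dir).glob(f'{prefix}*.json')):
        # Drop the prefix from the file name, keep the directory.
        target = path.with_name(path.name[len(prefix):])
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Example call (assumes the two files were extracted to ./yelp/json):
# strip_prefix('yelp/json')
```

Files without the prefix are left untouched, so running the helper a second time changes nothing.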
Parse json and store as parquet files
Convert json to faster parquet format:
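The conversion could look roughly like the sketch below. It assumes the Yelp files are line-delimited JSON (one record per line), that pandas is installed together with a parquet engine such as pyarrow, and that the files live in ./yelp/json as described above. The function names are hypothetical, and the notebook's own code may differ.

```python
from pathlib import Path

import pandas as pd


def read_json_lines(json_path):
    """Load a line-delimited JSON file into a DataFrame."""
    # lines=True tells pandas that every line is a separate JSON record.
    return pd.read_json(json_path, lines=True)


def json_to_parquet(json_path, parquet_path):
    """Convert one line-delimited JSON file to parquet."""
    df = read_json_lines(json_path)
    # to_parquet needs a parquet engine (pyarrow or fastparquet).
    df.to_parquet(parquet_path)
    return df.shape


# Example call (assumed layout: ./yelp/json/user.json -> ./yelp/user.parquet):
# yelp_dir = Path('yelp')
# for fname in ['user', 'reviews']:
#     json_to_parquet(yelp_dir / 'json' / f'{fname}.json',
#                     yelp_dir / f'{fname}.parquet')
```

Reading a whole file at once keeps the sketch short, but the review file is large, so this may need a machine with plenty of memory.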
Now you can remove the json files.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-892254c45114> in <module>
----> 1 merge_files(remove=True)

<ipython-input-8-159c5a2caa96> in merge_files(remove)
     18     if remove:
     19         for fname in ['user', 'reviews']:
---> 20             f = yelp_dir / fname + '.parquet'
     21             if f.exists():
     22                 f.unlink()

TypeError: unsupported operand type(s) for +: 'PosixPath' and 'str'
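The error comes from operator precedence: / binds more tightly than +, so yelp_dir / fname + '.parquet' is evaluated as (yelp_dir / fname) + '.parquet', and a Path cannot be added to a str. One possible fix is to build the complete file name before joining it to the directory. The sketch below only covers the cleanup loop visible in the traceback; the function name remove_parquet_files is hypothetical, and the rest of merge_files is not shown in the source.

```python
from pathlib import Path


def remove_parquet_files(yelp_dir, names=('user', 'reviews')):
    """Delete <name>.parquet for each name and return what was removed."""
    removed = []
    for fname in names:
        # Build the file name as one string first, then join it to the
        # directory, so no Path + str addition takes place.
        f = Path(yelp_dir) / f'{fname}.parquet'
        if f.exists():
            f.unlink()
            removed.append(f.name)
    return removed


# Example call (assumes the parquet files were written to ./yelp):
# remove_parquet_files('yelp')
```

Inside merge_files, the same change amounts to replacing line 20 with f = yelp_dir / f'{fname}.parquet'.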