Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
GRAAL-Research
GitHub Repository: GRAAL-Research/deepparse
Path: blob/main/docs/source/examples/parse_addresses.rst
871 views
.. role:: hidden
    :class: hidden-section

Parse Addresses
***************

.. code-block:: python

    import pandas as pd

    from deepparse import download_from_public_repository
    from deepparse.dataset_container import PickleDatasetContainer
    from deepparse.parser import AddressParser

Here is an example on how to parse multiple addresses. First, let's download the train and test data from the public repository.

.. code-block:: python

    saving_dir = "./data"
    file_extension = "p"
    test_dataset_name = "predict"
    download_from_public_repository(test_dataset_name, saving_dir, file_extension=file_extension)

Now let's load the dataset using one of our dataset container

.. code-block:: python

    addresses_to_parse = PickleDatasetContainer("./data/predict.p", is_training_container=False)

Let's use the ``BPEmb`` model on a GPU.

.. code-block:: python

    address_parser = AddressParser(model_type="bpemb", device=0)

.. code-block:: python

    parsed_addresses = address_parser(test_data[0:300])

    # Print one of the parsed address
    print(parsed_addresses[0])


When parsing addresses, some data quality tests are applied to the dataset.
First, it validates that no addresses to parse are empty.
Second, it validates that no addresses are whitespace-only.
The next two lines are rising a ``DataError``.

.. code-block:: python

    address_parser("")  # Raise an error
    address_parser(" ")  # Raise an error

We can also put our parsed address into a Pandas ``DataFrame`` for analysis. You can choose the fields to use or use the
default one.

.. code-block:: python

    fields = ['StreetNumber', 'StreetName', 'Municipality', 'Province', 'PostalCode']
    parsed_address_data_frame = pd.DataFrame([parsed_address.to_dict(fields=fields) for parsed_address in parsed_addresses],
                                             columns=fields)