Lesson 8: Optional topics
Authors:
Yilber Fabian Bautista
Sean Tulin
Objectives:
By the end of this lecture you will be able to:
Handle exceptions using try/except/else/finally statements
Create, modify, save, and interact with HDF5 files
Fit curves to a given set of data points using the curve_fit function in the scipy.optimize library
Build histograms using the np.histogram function
Syntax Errors and Exceptions
Up to now you have probably encountered many error types when writing code. Here we will learn how to handle them in Python.
Syntax Errors
These are perhaps the most common type of error. For instance, when using loops:
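A minimal sketch of such a cell (the notebook's original example may differ):

```python
for i in range(3)   # the colon at the end of this line is missing
    print(i)
# running this cell raises a SyntaxError pointing at the end of the for line
```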
We will get a SyntaxError pointing at the end of the for line, which indicates we forgot the colon (:) symbol there. These types of errors are usually easy to fix in the code.
Exceptions
Additional errors can be raised during execution even if the code is syntactically correct. Errors detected during execution are called exceptions. Some of the most common exceptions are ZeroDivisionError, NameError, and TypeError, illustrated below.
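Minimal sketches of these three exceptions (assumed examples, with the error each line raises shown as a comment):

```python
1 / 0          # ZeroDivisionError: division by zero
print(spam)    # NameError: name 'spam' is not defined
'2' + 2        # TypeError: can only concatenate str (not "int") to str
```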
Try/except statements for handling exceptions
In Python we can handle exceptions in a very easy way, through try/except statements (see for instance here). Let us see this with a simple example:
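A minimal sketch of such an example (not necessarily the notebook's original):

```python
try:
    x = 1 / 0
except ZeroDivisionError:
    print("You cannot divide by zero!")
```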
The try statement works as follows:
First, the try clause (the statement(s) between the try and except keywords) is executed.
If no exception (error in the execution) occurs, the except clause is skipped and execution of the try statement is finished.
If an exception occurs during execution of the try clause, the rest of the clause is skipped. Then, if its type matches the exception named after the except keyword, the except clause is executed, and execution continues after the try/except block.
If an exception occurs which does not match the exception named in the except clause, it is passed on to outer try statements; if no handler is found, it is an unhandled exception and execution stops with an error message.
Let us imagine we want to divide a number by each of the elements of a given list. If the list has zero entries, we would get error messages and our code would stop. However, we can use try/except statements to skip over the zero divisions, as in the following example:
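A sketch, assuming a list that contains some zero entries:

```python
numbers = [4, 0, 2, 0, 5]
for n in numbers:
    try:
        print(10 / n)
    except ZeroDivisionError:
        print("Skipping a division by zero")
```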
A try statement can have more than one except clause, as follows:
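For instance (an illustrative sketch):

```python
try:
    value = int("not a number")
except ZeroDivisionError:
    print("Division by zero")
except ValueError:
    print("Could not convert the string to an integer")
```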
or an except clause may even name multiple exceptions as a parenthesized tuple, for example:
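An illustrative sketch (the variable user_input is hypothetical):

```python
user_input = "0"
try:
    result = 10 / int(user_input)
except (ZeroDivisionError, ValueError, TypeError) as err:
    print("Something went wrong:", err)
```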
else and finally statements
Further clauses are supported when specified after the main try/except block: the else and finally statements.
else statement
The else clause, which, when present, must follow all except clauses, is useful for code that must be executed if the try clause does not raise an exception. Example:
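A minimal sketch:

```python
try:
    result = 10 / 2
except ZeroDivisionError:
    print("Division by zero")
else:
    # runs only if the try clause raised no exception
    print("The division succeeded, result =", result)
```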
Using the else clause is preferable to adding additional code to the try clause, because it avoids accidentally catching an exception that was NOT raised by the code being protected by the try/except statement.
finally statement
This optional clause corresponds to a clean-up action that must be executed under all circumstances; that is, it runs whether or not the try statement produces an exception.
Let us see a specific example:
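A minimal sketch:

```python
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Division by zero")
finally:
    # this clean-up line runs whether or not an exception occurred
    print("Done with the division attempt")
```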
To learn more about error handling see here.
HDF5 files
See documentation here.
HDF stands for “Hierarchical Data Format”. An HDF5 file contains two kinds of objects: datasets, which are array-like collections of data, and groups, which are folder-like containers that hold datasets and other groups. Schematically, an HDF5 file looks as follows

Figure taken from here. Groups and datasets can be further annotated with metadata contained in associated attributes. We will expand on this below.
The fundamental thing to keep in mind when dealing with HDF5 files is: groups are used like dictionaries, whereas datasets work like NumPy arrays.
Installation
See Quick start guide for installation with conda or pip
Opening HDF5 files
Let us suppose someone handed us an HDF5 file and we want to use it. We first need to import the h5py library in Python to manipulate such a file. This is done through the command
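That is:

```python
import h5py
```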
To open our existing HDF5 file we have two options:
The syntax for the first one is:
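A sketch (the file name myfile.hdf5 is the one used later in this section):

```python
f = h5py.File('myfile.hdf5', 'r')
# ... work with f ...
f.close()
```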
Here f is an instance of the file, which is then closed using the close() method when the HDF5 file is no longer needed.
The second option is via a with block:
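A sketch of the equivalent with-block version:

```python
with h5py.File('myfile.hdf5', 'r') as f:
    # ... work with f ...
    pass
```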
To close the HDF5 file in the second method we simply need to leave the with block.
The keyword argument 'r' stands for read, which means we can only read but not modify the HDF5 file. In the next table we can see other keyword arguments (file modes) commonly used when dealing with data files. We will use some of them below.
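As a quick reference, the common file modes listed in the h5py documentation are, roughly:

```python
# 'r'  : read-only, the file must exist
# 'r+' : read/write, the file must exist
# 'w'  : create the file, overwriting it if it already exists
# 'w-' : create the file, fail if it already exists (same as 'x')
# 'a'  : read/write if the file exists, create it otherwise
```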
The with statement may seem a bit cryptic, and indeed it is shorthand for several steps carried out behind the scenes. A generic with statement has the syntax (see here for more details):
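Schematically (EXPR, VAR and BLOCK are placeholders, matching the notation used below):

```python
with EXPR as VAR:
    BLOCK
```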
and is translated into the following lines of code:
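A sketch following the PEP 343 specification, with the same placeholders:

```python
import sys

mgr = (EXPR)
exit = type(mgr).__exit__
value = type(mgr).__enter__(mgr)
exc = True
try:
    try:
        VAR = value   # only if an "as VAR" clause was given
        BLOCK
    except:
        exc = False
        if not exit(mgr, *sys.exc_info()):
            raise
finally:
    if exc:
        exit(mgr, None, None, None)
```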
where (mgr, exit, value, exc) are internal variables that are not accessible to the user.
For our specific example, EXPR = h5py.File('myfile.hdf5', 'r'), VAR = f, and finally BLOCK = do something. The with statement effectively calls the __enter__ method and stores the result in a variable called value. If VAR is specified, then value is assigned to VAR, and all operations specified in BLOCK are carried out, provided no exceptions appear. Otherwise, the with statement handles these exceptions, as shown by the use of try/except statements. Finally, once the file is no longer needed, it is closed using the __exit__ method.
In order to make the discussion more precise, we work explicitly with the following example:
We are handed the mytestfile.hdf5 file and we want to access all the information contained inside it. We start by opening the file using the h5py library:
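That is:

```python
import h5py
f = h5py.File('mytestfile.hdf5', 'r')
```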
Keys of an HDF5 file
As for dictionaries, we can ask for the set of keys contained in the file by using the keys() method. In our example:
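A sketch of the corresponding cell:

```python
list(f.keys())
```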
The output lists the keys 'mydataset' and 'mygroup'. This means our HDF5 file has two objects labeled by these keys; in principle, each can correspond to either a dataset or a group of our HDF5 file.
Accessing a group or data set
To know explicitly the nature of the two keys in our example, we need to index our file f with the specific key, using a dictionary-like syntax:
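A sketch of the corresponding cell (the comment shows the typical h5py repr for this dataset):

```python
f['mydataset']
# e.g. <HDF5 dataset "mydataset": shape (20,), type "<i4">
```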
The output shows that "mydataset" indeed corresponds to a dataset of the HDF5 file. It contains a one-dimensional array with 20 entries, whose elements are integers.
We can do the same for the remaining key
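A sketch of the corresponding cell:

```python
f['mygroup']
# e.g. <HDF5 group "/mygroup" (1 members)>
```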
The output shows that mygroup corresponds instead to a group of the HDF5 file, and hosts another element which can be either a dataset or a group. Recall groups are like dictionaries, and therefore the logic to access the keys of the elements contained in the group is the same, that is, using the keys() method. In our example we have
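A sketch of the corresponding cell:

```python
list(f['mygroup'].keys())
```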
with output listing the single key 'newdataset'.
Accessing the information in a dataset
Now that we know how to ask for the keys in an HDF5 file, and access the groups and datasets inside it, we want to be able to use the information stored in the datasets themselves. Recall datasets are like NumPy arrays, and therefore we can use all of the NumPy machinery on them. Let us see that with the dataset stored under the 'mydataset' key in our previous example.
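A sketch of the corresponding cell:

```python
dset = f['mydataset']
dset[:]   # read the whole dataset into a NumPy array
```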
We can for instance slice the array,
or even plot it
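A sketch covering both operations (matplotlib is assumed for the plot):

```python
import matplotlib.pyplot as plt

print(dset[0:5])   # slice the first five entries
plt.plot(dset[:])  # plot the full dataset
plt.show()
```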
Similarly, one can now ask how to access the dataset contained in the group 'mygroup'. For that we have several options:
By assigning a new variable to f['mygroup'], and then accessing 'newdataset' using its key:
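A sketch (the intermediate variable name grp is an assumption):

```python
grp = f['mygroup']        # the group behaves like a dictionary
dset = grp['newdataset']  # the dataset stored inside the group
dset
```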
The output shows that dset is a scalar (shape () dataset), and therefore to access its value we use the syntax
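That is:

```python
dset[()]
```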
which will produce as output the float number 50.0
To avoid too many variable assignments we can simply use repeated indexing,
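A sketch:

```python
f['mygroup']['newdataset'][()]
```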
which will have the same effect.
To avoid too many square brackets we can simply use a path-like argument:
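A sketch:

```python
f['mygroup/newdataset'][()]
```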
where we have already accessed the scalar element, getting as output the number 50.0. Note that these internal HDF5 paths always use forward slashes, independently of the operating system.
The discussion above shows how simple it is to navigate through an HDF5 file by using its keys.
Attributes of groups and data sets
The last feature to explore in our HDF5 file is the metadata stored in attributes of the file itself, its groups, and its datasets. For that we use the .attrs proxy. See documentation. Let us see for instance whether our HDF5 file has attribute keys:
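A sketch of the corresponding cell:

```python
list(f.attrs.keys())
```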
the output is an empty list, meaning that the file itself does not have any attributes.
We can ask the same for the groups and datasets
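A sketch:

```python
list(f['mygroup'].attrs.keys())
```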
in this case, the group 'mygroup' has an attribute stored under the key 'new_atribute'. We can access its content in the usual indexing-like form
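A sketch:

```python
f['mygroup'].attrs['new_atribute']
```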
namely, the attribute simply stores the integer value 10. You can check that the datasets do not have any attributes by using the same logic:
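A sketch covering both datasets:

```python
list(f['mydataset'].attrs.keys())
list(f['mygroup']['newdataset'].attrs.keys())
```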
both of which produce as output an empty list.
Attributes have the following properties as specified in the documentation
They may be created from any scalar or NumPy array
Each attribute should be small (generally < 64k)
There is no partial I/O (i.e. slicing); the entire attribute must be read.
Closing the HDF5 file
Once we finish using the file we can simply close it using the close() method.
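That is:

```python
f.close()
```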
Exercise 1
Repeat all of the above discussion, but open the file using a with block. Hint: use the 'r' argument, and recall that indentation is important. print statements are also useful.
Creating an HDF5 file
We might now wonder how the mytestfile.hdf5 file was created. We simply used the following code:
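A sketch reconstructed from the step-by-step description below (the exact call creating the scalar dataset, and the attribute value, are inferred from that description):

```python
import numpy as np
import h5py

with h5py.File('mytestfile.hdf5', 'w') as g:
    # one-dimensional integer dataset with 20 entries
    dset = g.create_dataset("mydataset", (20,), data=np.arange(0, 100, 5), dtype='i')
    # a new group
    grp2 = g.create_group('mygroup')
    # a scalar dataset stored inside the group
    grp2.create_dataset('newdataset', data=50.0)
    # an attribute attached to the group
    grp2.attrs['new_atribute'] = 1.0
```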
which we now proceed to explain in detail:
The first step is of course to create the data file with the given name, in this case mytestfile.hdf5, and use the 'w' (write) keyword, as indicated in the table above. We create the file under the alias g. Up to here, a new HDF5 file will be created in the same location as our notebook. (To specify a particular location we use the path for it, for example: with h5py.File('Documents/my_files/Explore/.../mytestfile.hdf5', 'w') as g.) The newly created file will be an empty file, which we then proceed to fill with groups and datasets.
In the first indented line, dset = g.create_dataset("mydataset", (20,), data=np.arange(0,100,5), dtype='i'), we have created a new dataset whose key is "mydataset". It contains a one-dimensional array with 20 entries, given by the 20 elements in np.arange(0,100,5). We further specified the data type using the dtype keyword.
In the second line, grp2 = g.create_group('mygroup'), we have simply added a new group, as clearly indicated by the syntax.
In the next line we have added a new dataset to our existing group, where now the data corresponds to a scalar object.
Finally, we assigned the attribute 'new_atribute', with value 1.0, to the existing group.
And that's it!
Exercise 2
Create your own HDF5 file including groups, datasets and attributes.
Modifying an existing HDF5 file
In Lesson 5 we learned how to modify CSV files using the pandas library. Existing HDF5 files can also be modified. Let us see how this works with a specific example.
First, let us create a new example HDF5 file, which will be the object to manipulate:
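A sketch; the contents of the original test file are not shown, so the dataset name and values below are illustrative:

```python
import numpy as np
import h5py

with h5py.File('test_file.hdf5', 'w') as f:
    f.create_dataset('data1', data=np.arange(10))
```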
We now open the created 'test_file.hdf5' file using the r+ keyword, which stands for read and write, as shown explicitly in our table above.
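A sketch of such a modification (the new group, dataset, and attribute names are illustrative):

```python
with h5py.File('test_file.hdf5', 'r+') as f:
    grp = f.create_group('new_group')
    grp.create_dataset('data2', data=np.linspace(0, 1, 5))
    grp.attrs['units'] = 'arbitrary'
```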
In the previous block of code we have modified the existing 'test_file.hdf5' file. Finally we check that our original 'test_file.hdf5' file has saved all the changes made above by opening it using the r mode only.
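A sketch of the check, using the illustrative names from the previous blocks:

```python
with h5py.File('test_file.hdf5', 'r') as f:
    print(list(f.keys()))
    print(list(f['new_group'].keys()))
    print(dict(f['new_group'].attrs))
```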
which indeed contains all of the changes made in the previous block of code.
One has to be careful when modifying HDF5 files: trying to create a dataset under an already existing key will clash with the existing dataset (h5py raises an error rather than silently overwriting), and creating an HDF5 file in 'w' mode with the name of an already existing file will overwrite that file.
Notice we have manipulated HDF5 files both with and without with blocks, to show the equivalence of the different ways of handling data files.
Why use HDF5 files?
They allow saving large amounts of data in a single file, using a hierarchical structure.
They integrate nicely with NumPy and pandas.
The hierarchical structure allows searching for a specific dataset in the file.
They allow attaching metadata to every element in the hierarchical structure, making them ideal for generating self-explanatory files.
Data is read from the hard drive only when it is needed.
Example
Let us now gain some experience using real HDF5 files, in the context of Dark Matter (DM).
The Auriga project provided us with the data file 'halo_1_DMO.hdf5', which contains a set of simulated DM particles placed between 4 and 15 kpc in a dark-matter-only halo. In the dataset, we have access to the particles' 'Coordinates' (x, y, z in kpc), 'Mass' (in solar masses), and 'Velocities' (vx, vy, vz in km/s). In this example, and in Exercise 3 below, we aim to learn some properties of the DM halo described by this data file.
Let us start by loading the data file
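That is (the variable name halo is used in the rest of the example):

```python
import h5py
halo = h5py.File('halo_1_DMO.hdf5', 'r')
```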
We can discover the keys in the file, if unknown, by using the keys() method repeatedly, as broadly specified above. For instance, we can access the particles' Cartesian positions as follows:
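A sketch; the exact group layout inside the file is not shown here, so the path to 'Coordinates' is an assumption to adjust after inspecting the keys:

```python
import numpy as np

print(list(halo.keys()))                   # inspect the available keys first
positions = np.array(halo['Coordinates'])  # assumed path; shape (N, 3): x, y, z in kpc
x = positions[:, 0]
y = positions[:, 1]
z = positions[:, 2]
```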
Here positions is an array whose entries correspond to the 3-coordinate positions of each particle (effectively a set of one-dimensional arrays stacked along a given axis, see numpy.stack). The notation positions[:,0] works as follows: in the first entry we select all of the elements in the array (the :), and in the second entry we ask only for the 0-th component of each such element. Analogously for positions[:,1] and positions[:,2].
It is useful to do some data visualization, for instance by plotting the DM positions.
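A sketch of such a plot (the point size passed through s is illustrative):

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, s=0.01)
plt.xlabel('x [kpc]')
plt.ylabel('y [kpc]')
plt.show()
```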
Here we have used the s = ... keyword argument to vary the size of the points in the scatter plot.
Density profile from simulations
We are interested in knowing the DM density distribution, assuming a spherically symmetric halo. In practice we will calculate the density from the summed mass of particles within, for instance, logarithmically spaced spherical shells. Since all DM particles have the same mass $m_p$ (as you can easily check in the file), the DM density in between two given shells separated by a distance $\Delta r$ can be computed from
$$\rho = \frac{N\, m_p}{V_{\rm shell}}, \qquad V_{\rm shell} = \frac{4\pi}{3}\left(r_{i+1}^3 - r_i^3\right),$$
where $N$ is the total number of particles contained in between the two concentric shells of radii $r_i$ and $r_{i+1} = r_i + \Delta r$.
To compute $N$, we first need to compute all the radial positions of our particles from the norm of the position vector, $r = \sqrt{x^2 + y^2 + z^2}$. Then, we can count how many particles fall in given shell bins, using a histogram.
Recall datasets behave like NumPy arrays, so we can use all of the NumPy machinery:
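A sketch:

```python
# radial position of every particle
r = np.sqrt(x**2 + y**2 + z**2)
```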
Let us now compute the number of particles between two shells. First we need to define the radial bins (shell boundaries) to be used. As mentioned, we will use logarithmically spaced shells. The smallest and largest shell radii will be the minimum and maximum values of the radial positions of our DM particles, respectively.
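A sketch, using np.logspace with 100 edges (consistent with len(r_bins) = 100 quoted below):

```python
r_bins = np.logspace(np.log10(r.min()), np.log10(r.max()), 100)
```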
Each bin, r_bins[i+1] - r_bins[i], corresponds to the distance between two adjacent spherical shells. There are len(r_bins) - 1 bins in total.
To do the histogram we use the histogram function in the numpy library
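A sketch, using the variable names referred to below:

```python
# N: counts per radial bin; r_bin_h: the bin edges (identical to r_bins here)
N, r_bin_h = np.histogram(r, bins=r_bins)
```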
Here, N gives the number of DM particles in each radial bin (notice that r_bin_h = r_bins, since np.histogram returns the bin edges we passed in). However, notice that len(N) = 99 whereas len(r_bins) = 100: as mentioned above, the number of bins is len(r_bins) - 1. We therefore need to compute the bin centers, which are 99 instead of 100. They are simply given by
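A sketch, taking the bin centers as arithmetic midpoints of adjacent edges (the original may have used geometric centers for the logarithmic bins):

```python
r_bin_c = 0.5 * (r_bins[1:] + r_bins[:-1])
```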
Let us further visualize our results by plotting our counts as a function of the bin centers:
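A sketch (logarithmic axes are assumed, given the logarithmic binning):

```python
plt.loglog(r_bin_c, N, '.')
plt.xlabel('r [kpc]')
plt.ylabel('counts per shell')
plt.show()
```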
The last ingredient needed to compute the DM density distribution is the particle mass, which we extract from our halo file as follows:
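A sketch, assuming the masses sit under the 'Mass' key mentioned above; since all particles share the same mass, we keep a single value:

```python
masses = np.array(halo['Mass'])
m_p = masses[0]
```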
We now have all the elements needed to compute the DM mass density
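A sketch implementing the shell-density formula above:

```python
V_shell = 4.0 / 3.0 * np.pi * (r_bins[1:]**3 - r_bins[:-1]**3)
rho = N * m_p / V_shell
```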
In a plot, the DM density distribution as a function of the bin centers is
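A sketch:

```python
plt.loglog(r_bin_c, rho, '.')
plt.xlabel('r [kpc]')
plt.ylabel(r'$\rho(r)$ [$M_\odot$ / kpc$^3$]')
plt.show()
```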
Curve fitting
The next task is to obtain an analytic expression, $\rho(r)$, that best describes the behaviour of the DM density as a function of the radial position observed in the curve above. For that, we use curve fitting.
As the name indicates, curve fitting is the process of finding the curve that best describes a series of data points. We will use the scipy.optimize.curve_fit function from the scipy library. The curve fitting process can be split into the following steps:
Collect the data
Define the objective function (the function expected to describe the data well; it has free parameters that are fixed by the fit)
Do the fit
Compute the goodness of the fit
1. Collect the data
We did step 1 in our previous analysis and have the data (rho, r_bin_c) ready to be used.
2. Objective function
Spherical DM density distributions are known to be well fitted by a universal function known as the NFW (Navarro, Frenk and White) density profile. See for instance here. It has the analytic form
$$\rho_{\rm NFW}(r) = \frac{\rho_s}{\dfrac{r}{r_s}\left(1 + \dfrac{r}{r_s}\right)^{2}},$$
where $\rho_s$ and $r_s$ are two halo-dependent parameters. Notice that $\rho_{\rm NFW}(r) \to \infty$ as $r \to 0$. This is known as a cuspy profile and is in discrepancy with observations of dwarf galaxies (see for instance this review).
Let us define our objective function
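A sketch (the function name NFW is an assumption):

```python
def NFW(r, rho_s, r_s):
    return rho_s / ((r / r_s) * (1.0 + r / r_s)**2)
```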
3. Doing the fit
This is a very simple step: we just need to import the curve_fit function from scipy.optimize. The syntax for using the function is the following:
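Schematically:

```python
from scipy.optimize import curve_fit

# generic call (shown schematically):
# popt, pcov = curve_fit(f, xdata, ydata, p0=initial_guess, sigma=uncertainties)
```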
where f is the objective function, p0 is an initial guess for the free parameters (if not specified, the default sets all parameters to 1), and sigma are the uncertainties in ydata. Our histogram naturally has uncertainties in the counts given by the Poisson errors, $\sigma_N = \sqrt{N}$. This induces an uncertainty in $\rho$ via $\sigma_\rho = \sqrt{N}\, m_p / V_{\rm shell} = \rho/\sqrt{N}$. Let us first compute those uncertainties, since they will be used in the fitting process:
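A sketch (it assumes every radial bin contains at least one particle, so that sigma_rho is nonzero):

```python
sigma_rho = np.sqrt(N) * m_p / V_shell   # equivalently rho / np.sqrt(N)
```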
The curve_fit function will return the values of the parameters that best fit the curve, together with the covariance matrix. That is, it returns the parameters $\theta$ that minimize the $\chi^2$ function, defined by
$$\chi^2 = \sum_{i=1}^{n} \frac{\left(y_i - f(x_i;\theta)\right)^2}{\sigma_i^2},$$
where $f(x_i;\theta)$ is the objective function evaluated at the position $x_i$ with parameters $\theta$, $y_i$ are the data points, whose uncertainties are denoted by $\sigma_i$, and $n$ is the number of data points to be fitted in our curve.
It is well known that typical values are $\chi^2/\mathrm{dof} \approx 1$, where dof is the number of degrees of freedom, defined by dof = (number of data points) - (number of fitted parameters). Then, $\chi^2/\mathrm{dof} \gg 1$ signals a bad fit, whereas $\chi^2/\mathrm{dof} \ll 1$ indicates that the error bars in the data are wrongly (over-)estimated.
Without further ado, let us proceed to do the fitting
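A sketch; the initial guesses for (rho_s, r_s) are illustrative and should be adapted to the data:

```python
popt, pcov = curve_fit(NFW, r_bin_c, rho, p0=[1e7, 20], sigma=sigma_rho)
print(popt)
```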
Let us plot our findings and compare them to the data
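A sketch:

```python
plt.errorbar(r_bin_c, rho, yerr=sigma_rho, fmt='.', label='data')
plt.loglog(r_bin_c, NFW(r_bin_c, *popt), label='NFW fit')
plt.xlabel('r [kpc]')
plt.ylabel(r'$\rho(r)$')
plt.legend()
plt.show()
```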
Goodness of the fit
The goodness of the fit is given by the $\chi^2$ function. As already indicated, for a good fit we expect to have $\chi^2/\mathrm{dof} \approx 1$. Let us explicitly check this:
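A sketch of such a check (the helper name chi_squared is an assumption):

```python
def chi_squared(y_data, y_model, sigma):
    return np.sum(((y_data - y_model) / sigma)**2)
```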
applied to our example we have
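A sketch:

```python
chi2 = chi_squared(rho, NFW(r_bin_c, *popt), sigma_rho)
print(chi2)
```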
Now let us compare it to the number of degrees of freedom
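A sketch:

```python
dof = len(rho) - len(popt)   # data points minus fitted parameters
print(chi2 / dof)
```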
which, as expected, is a number of order one.
Exercise 3: Velocity distribution function for particles in the solar vicinity
In this exercise you will obtain the velocity distribution of DM particles in the solar vicinity, and fit it to a Maxwell-Boltzmann distribution function. The sun is positioned at a galactocentric radius $r_\odot$. Let us, as usual, split this exercise into several steps:
1. First we want to find all the particles whose radial position lies in a chosen interval around $r_\odot$. Define a function find_index(dat) that takes as input a set of data and returns a one-dimensional NumPy array whose elements are the indexes of the elements in dat that satisfy this radial condition.
2. Define the list index, using the find_index function applied to our particles' radial positions r, defined in the example above. It will contain the indexes of the particles in the solar vicinity.
3. Count the number of particles in the solar vicinity, and compare the result to the total number of particles. Hint: you might find the function len() useful.
4. Now that we know which particles are in the solar vicinity, we can ask for the velocities of those particles. Extract the particle velocities from the dataset halo defined above, and select the velocities of the particles lying in the solar vicinity. Hint: loop over the entries of the list index defined in step 2.
5. Compute the speeds of the particles in the solar vicinity. Recall the speed of particle $i$ is given by the norm of the three-vector $\vec{v}_i$, i.e. $|\vec{v}_i| = \sqrt{v_{x,i}^2 + v_{y,i}^2 + v_{z,i}^2}$. Hint: see the example above for computing the norm of $\vec{r}$; the np.dot function might be helpful.
6. Using the np.histogram function, obtain the number of particles N_v in the velocity bins v_bins = np.linspace(0, v_max, 50), where v_max is the largest speed found in step 5. Plot the velocity distribution (i.e. the N_v counts as a function of the speed bin centers).
7. Normalize the velocity distribution in such a way that $\int f(v)\,dv = 1$. Since all our bins have the same size $\Delta v$, the integral reduces to the sum $\sum_i f(v_i)\,\Delta v = 1$. From the linspace grid, find $\Delta v$ and compute the sum.
8. At this step we have available the data (f_v, v_bins), where f_v is normalized as indicated in the previous step. Now we can compute the Poisson errors from our histogram; they are proportional to $\sqrt{N_v}$ (normalized in the same way as f_v), where N_v are the velocity counts from the histogram.
9. Now that we have the data (f_v, v_bins, sigma_v), we can do an error bar plot for visualization. The plot should look similar to the following image.
10. Fit a Maxwell-Boltzmann function of the form $f(v) \propto v^2\, e^{-v^2/v_0^2}$ to the velocity distribution; here $v_0$ is the parameter to be fitted. Hint: use p0=200 as the initial guess for the fitted parameter.
11. Plot your findings and compute the goodness of the fit.