<img src="https://trac.sdss.org/chrome/site/sdss.png" width=300 height=300>
I was searching for an astronomy dataset on Kaggle and came across this project (https://www.kaggle.com/lucidlenn/data-analysis-and-classification-using-xgboost). I'd like to give credit to Lennart Grosser for a couple of the plots that I use in this notebook.
Getting the data
I used the CasJobs website, which offers a SQL-based interface for querying the databases that contain the SDSS data. CasJobs lets you save the rows returned by a query in CSV format.
For more information about how to get data from the SDSS, see their Data Access Guide:
http://www.sdss.org/dr14/data_access/
I used the following query to retrieve the data:
SELECT TOP 100000
p.objid,
p.ra,
p.dec,
p.cModelMag_u,
p.cModelMag_g,
p.cModelMag_r,
p.cModelMag_i,
p.cModelMag_z,
p.modelMag_u,
p.modelMag_g,
p.modelMag_r,
p.modelMag_i,
p.modelMag_z,
p.psfMag_u,
p.psfMag_g,
p.psfMag_r,
p.psfMag_i,
p.psfMag_z,
p.petroMag_u,
p.petroMag_g,
p.petroMag_r,
p.petroMag_i,
p.petroMag_z,
s.specobjid,
s.class,
s.z as redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE
p.cModelMagErr_u BETWEEN 0 AND 0.5
AND g BETWEEN 0 AND 20
Import the data
Define the columns that we'll use and import the data from the csv file. The cModelMag filter columns were chosen based on information at http://www.sdss3.org/dr8/algorithms/magnitudes.php#cmodel. Specifically, under the section titled 'Which Magnitude Should I Use?', we have the following: "...the cmodel magnitude is now an adequate proxy to use as a universal magnitude for all types of objects".
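A minimal sketch of this import step, assuming the CasJobs export keeps the column names from the query (including the `redshift` alias). The helper name `load_sdss` and the two sample rows are made up for illustration; in practice you would pass the path of your own CSV export.

```python
import io

import pandas as pd

# cModel magnitudes in the five SDSS bands, plus position and redshift.
# Assumption: the export preserves the column names used in the query.
FEATURE_COLUMNS = [
    "ra", "dec",
    "cModelMag_u", "cModelMag_g", "cModelMag_r", "cModelMag_i", "cModelMag_z",
    "redshift",
]
TARGET_COLUMN = "class"


def load_sdss(source):
    """Read a CSV export and keep only the columns used for modelling."""
    return pd.read_csv(source, usecols=FEATURE_COLUMNS + [TARGET_COLUMN])


# Tiny in-memory stand-in for the real export (values are made up).
sample_csv = io.StringIO(
    "objid,ra,dec,cModelMag_u,cModelMag_g,cModelMag_r,cModelMag_i,cModelMag_z,class,redshift\n"
    "1,180.0,0.5,19.1,17.9,17.2,16.8,16.6,GALAXY,0.08\n"
    "2,181.0,0.6,18.5,17.3,16.9,16.7,16.6,STAR,0.0001\n"
)
df = load_sdss(sample_csv)
print(df.head())
```

Passing `usecols` drops the identifier and the other magnitude families at read time, so they never reach the model by accident.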
Get a count of each type of class we are working to predict.
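One way to get these counts with pandas is sketched here. The helper `class_balance` is my own name, and the ten labels are made up to keep the example self-contained; the real counts come from the `class` column of the imported data.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Count each class and express it as a fraction of all rows."""
    counts = labels.value_counts()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})


# Made-up labels, only to show the shape of the result.
labels = pd.Series(["GALAXY"] * 5 + ["STAR"] * 4 + ["QSO"] * 1, name="class")
balance = class_balance(labels)
print(balance)
```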
We see that our data is not balanced between the classes.
Looking at the above plot of right ascension and declination for each of our class types, we see that there is no clear separation in the positions of our objects. We can conclude that these features will not add any real predictive value to our model, so we can feel confident in dropping them from our dataset.
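Dropping the two position columns can be sketched as follows. The helper name `drop_position` and the two rows are hypothetical; the column names `ra` and `dec` are the ones from the query.

```python
import pandas as pd


def drop_position(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the sky-position columns."""
    return df.drop(columns=["ra", "dec"])


# Two made-up rows, only to show the effect on the column set.
df = pd.DataFrame({
    "ra": [180.0, 181.0],
    "dec": [0.5, 0.6],
    "cModelMag_r": [17.2, 16.9],
    "redshift": [0.08, 0.0001],
    "class": ["GALAXY", "STAR"],
})
features = drop_position(df)
print(list(features.columns))  # ['cModelMag_r', 'redshift', 'class']
```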
Now our dataset consists only of the magnitudes in the five filter bands and the redshift.
Redshift
Looking at the redshift for each class of object, we see that the farther away an object is, the more redshifted it appears. From these graphs, it looks like redshift will be an important feature for classifying observations.
The plots above confirm our intuition that the farther an object is from the observer, the more redshifted it will be. Redshift should be a good predictor, but unfortunately there is some overlap between the redshifts of some galaxies and some QSOs.
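To put numbers on the overlap, a per-class summary of the redshift column is a quick check. This is only a sketch: `redshift_summary` is a hypothetical helper and the eight redshift values are invented, chosen so that the galaxy and QSO ranges overlap the way the plots suggest.

```python
import pandas as pd


def redshift_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Min, median and max redshift for each class."""
    return df.groupby("class")["redshift"].agg(["min", "median", "max"])


# Invented values for illustration only.
df = pd.DataFrame({
    "class": ["STAR", "STAR", "GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "QSO"],
    "redshift": [0.0001, -0.0002, 0.05, 0.12, 0.45, 0.30, 1.2, 2.1],
})
summary = redshift_summary(df)
print(summary)

# The ranges overlap when the largest galaxy redshift exceeds the smallest QSO redshift.
overlap = summary.loc["GALAXY", "max"] > summary.loc["QSO", "min"]
print("galaxy/QSO ranges overlap:", overlap)
```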
Feature Correlation Plot:
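The matrix behind such a plot can be computed with pandas alone. The sketch below uses synthetic data (two magnitudes built from a shared random base, plus an independent redshift), so the numbers say nothing about the real SDSS features; `feature_correlation` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def feature_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between all numeric columns."""
    return df.select_dtypes("number").corr()


# Synthetic data: u and g share a common base, so they correlate strongly.
rng = np.random.default_rng(0)
base = rng.normal(18.0, 1.0, size=200)
df = pd.DataFrame({
    "cModelMag_u": base + rng.normal(0.0, 0.1, size=200),
    "cModelMag_g": base + rng.normal(0.0, 0.1, size=200),
    "redshift": rng.uniform(0.0, 2.0, size=200),
    "class": ["GALAXY"] * 200,  # non-numeric, ignored by the helper
})
corr = feature_correlation(df)
print(corr.round(2))
```

The resulting square DataFrame can be handed to any heatmap function to draw the plot.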
We can see some correlation among the cModel magnitudes for the u, g, and r filters, as well as between those for the i and z filters. There also seems to be some correlation between the u, g, and r cModel magnitudes and redshift.
Split the data into test and training subsets
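A sketch of the split with scikit-learn follows. The 25% test fraction, the `random_state`, and the use of `stratify` are my assumptions; stratifying is one common way to keep the class proportions the same in both subsets when the classes are unbalanced. The arrays are placeholders for the real features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 40 rows, 2 features, unbalanced labels.
X = np.arange(80).reshape(40, 2)
y = np.array(["GALAXY"] * 20 + ["STAR"] * 12 + ["QSO"] * 8)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # assumption: hold out a quarter of the rows
    stratify=y,       # keep class proportions equal in both subsets
    random_state=42,  # fixed seed so the split is reproducible
)

classes, test_counts = np.unique(y_test, return_counts=True)
print(dict(zip(classes, test_counts)))
```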
Encode the targets. XGBoost cannot use string classes directly.
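One way to do this is scikit-learn's `LabelEncoder`, sketched here on four sample labels. It assigns integers in sorted order of the class names and can map predictions back to strings afterwards.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(["GALAXY", "QSO", "STAR", "GALAXY"])

# Classes are sorted alphabetically: GALAXY -> 0, QSO -> 1, STAR -> 2.
print(list(encoder.classes_))
print(list(y_encoded))

# Map integer predictions back to the original class names.
print(list(encoder.inverse_transform([2, 0])))
```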
Define a basic XGBoost classifier
Fitting an XGBoost model with default parameters
Above we see the default parameters used by the classifier.
Make predictions and look at the accuracy
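Accuracy is simply the fraction of test rows whose predicted class matches the true class. The sketch below uses two short hand-written arrays in place of the real test labels and the output of `model.predict`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hand-written stand-ins for the encoded test labels and the predictions.
y_test = np.array([0, 0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 0, 1, 2, 2, 0, 0, 2])

# 7 of the 8 predictions match, so the accuracy is 7/8.
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.875
```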
Use sklearn's RandomizedSearchCV to try to tune some parameters
We achieve a slight improvement by searching over a range of values to tune the model's hyperparameters.
The Confusion Matrix
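A confusion matrix counts, for each true class (rows), how often each class was predicted (columns). The sketch below uses eight hand-written labels rather than the real predictions, with one galaxy and one QSO deliberately swapped.

```python
from sklearn.metrics import confusion_matrix

labels = ["GALAXY", "QSO", "STAR"]

# Hand-written example: one galaxy predicted as QSO and one QSO as galaxy.
y_true = ["GALAXY", "GALAXY", "GALAXY", "QSO", "QSO", "STAR", "STAR", "STAR"]
y_pred = ["GALAXY", "GALAXY", "QSO", "QSO", "GALAXY", "STAR", "STAR", "STAR"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Off-diagonal entries are the misclassifications, so the galaxy/QSO confusion shows up in the top-left 2x2 block.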
The results are in line with what we saw in the redshift plots for the different classes of objects. Galaxies and QSOs are the classes that are hardest to tell apart.
We have used the XGBoost classifier to predict, with 98 percent accuracy, the class (type) of astronomical object for a given SDSS observation. Further hyperparameter tuning might achieve a slight increase in accuracy, but further research into the features available in the data would probably be the best way to improve the model's ability to differentiate galaxies from QSOs.