Last modified: 11 Dec 2023

URL: https://cxc.cfa.harvard.edu/sherpa/threads/pha_regroup/

Changing the grouping scheme of a data set within Sherpa

Sherpa Threads (CIAO 4.16 Sherpa)


Overview

Synopsis:

In order to use Gaussian statistics to fit a model to a data set, it is often necessary to "group" the data—i.e., combine channels until you have enough counts—before use. It is possible to set and change the grouping of a file after it has been read into Sherpa by using the commands: set_grouping, group, group_counts, group_snr, group_adapt, group_adapt_snr, group_bins, and group_width.

This thread shows how you can use these functions when fitting PHA data.

Last Update: 11 Dec 2023 - reviewed for CIAO 4.16; updated to point out that the default behaviour of commands like group_counts has changed to simplify filtering then grouping (that is, in most cases the tabStops argument is not needed anymore).


Contents


Getting Started

Loading the data

This thread uses the same data set as used in the Introduction to Fitting PHA Spectra thread. We load the data set twice—using load_pha—so that we can easily see the effect of changing the grouping scheme (the screen output has been omitted for clarity):

sherpa> load_pha(1, "3c273.pi")
...
sherpa> load_pha(2, "3c273.pi")
...

We note that the data in 3c273.pi was created so that each group contained at least 15 counts—by using the NUM_CTS grouptype option of dmgroup—as can be seen by using the dmhistory tool.

sherpa> !dmhistory 3c273.pi dmgroup | tr ' ' "\012"
dmgroup
infile="3c273.pi"
outfile="./3c273.tmp"
grouptype="NUM_CTS"
grouptypeval="15"
binspec=""
xcolumn="channel"
ycolumn="counts"
tabspec=""
tabcolumn=""
stopspec=""
stopcolumn=""
errcolumn=""
clobber="no"
verbose="0" 
maxlength="0" 

The command was preceded by "!" to tell Sherpa to execute it as a shell command, and "| tr ' ' "\012"" uses the tr utility to add a new-line character between parameter values (to make the screen output easier to read).

To see if a PHA dataset has been grouped then you can check the grouped field returned by get_data:

sherpa> get_data(1).grouped
True
sherpa> get_data(2).grouped
True

Grouping the data

We use the group_counts function to group the second data set by 30 counts per group, which is double that of the first data set.

sherpa> group_counts(2, 30)
dataset 2: 0.00146:14.9504 Energy (keV) (unchanged)

The plot styling is chanbed to use logarithmic scaling on both axes, set_xlog and set_ylog, and the plot edges (since each bin covers a range of energies) are also drawn thanks to get_plot_prefs:

sherpa> set_xlog()
sherpa> set_ylog()
sherpa> prefs = get_plot_prefs("data")
sherpa> prefs["linestyle"] = "solid"
sherpa> prefs["marker"] = "none"

The two data sets are then plotted:

	  
sherpa> plot("data", 1, "data", 2)

We now use Matplotlib functions to tweak the plot: in this case to hide the X-axis label from the top plot and the title of the second plot (as they overlap with the default plot size) and to set the X-axis range for both plots:

sherpa> axes = plt.gcf().axes
sherpa> axes[0].set_xlabel('')
sherpa> axes[1].set_title('')
sherpa> axes[0].set_xlim(0.1, 20)
sherpa> axes[1].set_xlim(0.1, 20)
sherpa> axes[0].text(10, 0.006, "bin by 15", ha="center")
sherpa> axes[1].text(10, 0.006, "bin by 30", ha="center")

The resulting plot (Figure 1) shows how the data looks before and after re-grouping.

Figure 1: Data grouped by 15 and 30 counts per group

[Plot of data set 1 grouped by 15 counts per bin, and data set 2 grouped by 30 counts per bin]
[Print media version: Plot of data set 1 grouped by 15 counts per bin, and data set 2 grouped by 30 counts per bin]

Figure 1: Data grouped by 15 and 30 counts per group

The top plot shows the data as read in to Sherpa—which is binned by 15 counts per group—and the bottom plot shows the data set after group_counts has been called to re-group the data to 30 counts per group.

In this thread, we use the group_counts command as an example. The other available grouping commands are:

  • group_bins — this function divides the channels into the specified number ("num") of bins.
  • group_width — this function divides the channels such that there are "num" bins in each group.
  • group_snr — this function allows the user to adaptively group PHA spectral data by signal-to-noise ratio, i.e., group bins of data until each group exceeds at least the minimum specified signal-to-noise ratio.
  • group_adapt — this function allows the user to adaptively group PHA spectral data by counts, i.e., group bins of data until the number of counts in each group exceeds the minimum number of counts specified in the 'min' argument, keeping bright features ungrouped while grouping low signal-to-noise regions.
  • group_adapt_snr — this function allows the user to adaptively group PHA spectral data by signal-to-noise ratio, i.e., group bins of data until each group exceeds at least the minimum specified signal-to-noise ratio.
[NOTE]
Note

The group_counts and related functions can also be used on data that was not grouped before being read into Sherpa; this can be useful for two reasons:

  1. You can use the functions to find the best grouping scheme for your data without having to re-run the dmgroup tool and re-load the data into Sherpa.
  2. You can fit the un-grouped data with the Cash statistic and then use the functions to make it easier to compare the fit to the data in plots.

Fitting the data

Once the data has been re-grouped you can use it just like any other data set. Here we repeat the fit made in the Introduction to fitting PHA spectra to see what difference the alternate grouping scheme makes.

sherpa> notice_id(2, 0.3, 6.0)
dataset 2: 0.00146:14.9504 -> 0.00146:6.5408 Energy (keV)

sherpa> subtract(2)
sherpa> set_source(2, xsphabs.abs1 * powlaw1d.p1)
sherpa> abs1.nh = 0.07
sherpa> freeze(abs1)
sherpa> guess(2, p1)
sherpa> fit(2)
Dataset               = 2
Method                = levmar
Statistic             = chi2gehrels
Initial fit statistic = 997.017
Final fit statistic   = 27.5501 at function evaluation 19
Data points           = 22
Degrees of freedom    = 20
Probability [Q-value] = 0.120485
Reduced statistic     = 1.3775
Change in statistic   = 969.467
   p1.gamma       2.15315      +/- 0.0724515   
   p1.ampl        0.000230433  +/- 1.3657e-05  

sherpa> show_fit()
Optimization Method: LevMar
name     = levmar
ftol     = 1.1920928955078125e-07
xtol     = 1.1920928955078125e-07
gtol     = 1.1920928955078125e-07
maxfev   = None
epsfcn   = 1.1920928955078125e-07
factor   = 100.0
numcores = 1
verbose  = 0

Statistic: Chi2Gehrels
Chi Squared with Gehrels variance.

    The variance is estimated from the number of counts in each bin,
    but unlike `Chi2DataVar`, the Gaussian approximation is not
    used. This makes it more-suitable for use with low-count data.

    The standard deviation for each bin is calculated using the
    approximation from [1]_:

        sigma(i,S) = 1 + sqrt(N(i,s) + 0.75)

    where the higher-order terms have been dropped. This is accurate
    to approximately one percent. For data where the background has
    not been subtracted then the error term is:

        sigma(i) = sigma(i,S)

    whereas with background subtraction,

        sigma(i)^2 = sigma(i,S)^2 + [A(S)/A(B)]^2 sigma(i,B)^2

    A(B) is the off-source "area", which could be
    the size of the region from which the background is extracted, or
    the length of a background time segment, or a product of the two,
    etc.; and A(S) is the on-source "area". These terms may be defined
    for a particular type of data: for example, PHA data sets A(B) to
    `BACKSCAL * EXPOSURE` from the background data set and A(S) to
    `BACKSCAL * EXPOSURE` from the source data set.

    See Also
    --------
    Chi2DataVar, Chi2ModVar, Chi2XspecVar

    Notes
    -----
    The accuracy of the error term when the background has been
    subtracted has not been determined. A preferable approach to
    background subtraction is to model the background as well as the
    source signal.

    References
    ----------

    .. [1] "Confidence limits for small numbers of events in
           astrophysical data", Gehrels, N. 1986, ApJ, vol 303,
           p. 336-346.
           http://adsabs.harvard.edu/abs/1986ApJ...303..336G


	   
Fit:Dataset               = 2
Method                = levmar
Statistic             = chi2gehrels
Initial fit statistic = 997.017
Final fit statistic   = 27.5501 at function evaluation 19
Data points           = 22
Degrees of freedom    = 20
Probability [Q-value] = 0.120485
Reduced statistic     = 1.3775
Change in statistic   = 969.467
   p1.gamma       2.15315      +/- 0.0724515   
   p1.ampl        0.000230433  +/- 1.3657e-05  

  
sherpa> covar(2)
Dataset               = 2
Confidence Method     = covariance
Iterative Fit Method  = None
Fitting Method        = levmar
Statistic             = chi2gehrels
covariance 1-sigma (68.2689%) bounds:
   Param            Best-Fit  Lower Bound  Upper Bound
   -----            --------  -----------  -----------
   p1.gamma          2.15315   -0.0758287    0.0758287
   p1.ampl       0.000230433  -1.3896e-05   1.3896e-05

which can be compared to the original results:

sherpa> notice_id(1, 0.3, 6.0)
dataset 1: 0.00146:14.9504 -> 0.2482:6.57 Energy (keV)

sherpa> subtract(1)
sherpa> set_source(1, get_source())
sherpa> reset(p1)
sherpa> guess(p1)
sherpa> fit(1)
Dataset               = 1
Method                = levmar
Statistic             = chi2gehrels
Initial fit statistic = 710.713
Final fit statistic   = 29.3673 at function evaluation 19
Data points           = 43
Degrees of freedom    = 41
Probability [Q-value] = 0.9124
Reduced statistic     = 0.716276
Change in statistic   = 681.345
   p1.gamma       2.13925      +/- 0.078812    
   p1.ampl        0.000220978  +/- 1.44103e-05 

sherpa> covar(1)
Dataset               = 1
Confidence Method     = covariance
Iterative Fit Method  = None
Fitting Method        = levmar
Statistic             = chi2gehrels
covariance 1-sigma (68.2689%) bounds:
   Param            Best-Fit  Lower Bound  Upper Bound
   -----            --------  -----------  -----------
   p1.gamma          2.13925   -0.0815616    0.0815616
   p1.ampl       0.000220978 -1.46038e-05  1.46038e-05

The following shows both fits (the extra commands are used to make the two plots have the same X-axis, and to add plot labels):

sherpa> plot("fit", 1, "fit", 2)

The following commands change the plot (e.g. the label formatting) and add labels:

sherpa> import matplotlib.ticker as mtick
sherpa> tformatter = mtick.FormatStrFormatter("%g")
sherpa> fig = plt.gcf()

sherpa> plt.sca(fig.axes[0])
sherpa> plt.xlabel("")

sherpa> plt.gca().set(xticklabels=[])
sherpa> plt.gca().yaxis.set_major_formatter(tformatter)
sherpa> plt.tick_params(which="both",direction="in",bottom=True,top=True,right=True,left=True,labelbottom=False)

sherpa> plt.sca(fig.axes[1])
sherpa> plt.title("")

sherpa> plt.gca().xaxis.set_major_formatter(tformatter)
sherpa> plt.gca().yaxis.set_major_formatter(tformatter)
sherpa> plt.tick_params(which="both",direction="in",bottom=True,top=True,right=True,left=True,labelbottom=True)

sherpa> fig.subplots_adjust(hspace=0) 

sherpa> fig.axes[0].get_shared_x_axes().joined(fig.axes[0], fig.axes[1]) 
sherpa> fig.axes[1].autoscale()  
sherpa> plt.xlim(0.1, 10)

sherpa> plt.sca(fig.axes[0])
sherpa> plt.text(0.5, 0.0006, "15 counts per group", fontsize=8)

sherpa> plt.sca(fig.axes[1])
sherpa> plt.text(0.5, 0.0006, "30 counts per group", fontsize=8)

which creates this plot (Figure 2).

Figure 2: Fit to both data sets

[Plot of fits to data set 1 grouped by 15 counts per bin, and data set 2 grouped by 30 counts per bin]
[Print media version: Plot of fits to data set 1 grouped by 15 counts per bin, and data set 2 grouped by 30 counts per bin]

Figure 2: Fit to both data sets

This plot shows the fits to the data when grouped by 15 counts per group (top plot) and 30 counts per group (bottom plot).

Note that many of the commands are used to tweak the plot dispay, and are not always needed or wanted.


Grouping within a filter

If we look at both datasets we can see that even though we were interested in the 0.3–6 keV range, the grouped and filtered ranges are different:

sherpa> get_filter()
'0.248199999332:6.570000171661'
sherpa> get_filter(2)
'0.001460000058:6.540800094604'

This has happened because the grouping has been applied to all the channels in the PHA file, which will often result in the requested boundary (such as 0.3 keV) to fall within a bin. For bright sources this does not make much of a difference, but it can greatly enhance the data that gets included in a fit, particularly if the source is hard or soft.

[NOTE]
Note

Setting the plot linestyle option to "solid" - e.g. with get_plot_prefs as shown above - for PHA data can be useful as it makes it easier to see where the bin edges are!

One solution is to change the filter from notice to ignore, since this would exclude those bins that cover the end points. An example would be:

sherpa> ignore_id(2, hi=0.3)
dataset 2: 0.00146:6.5408 -> 0.3066:6.5408 Energy (keV)

sherpa> ignore_id(2, lo=6)
dataset 2: 0.3066:6.5408 -> 0.3066:5.3728 Energy (keV)

sherpa> get_filter(2)
'0.306600004435:5.372799873352'

However, this also leads to a loss of information. In this case channels that contain information we are interested in. The recommended solution is to filter then group, and this has been made easier to do in CIAO 4.16.

The first step is to ungroup the dataset and apply the requested filter (we first clear out any filter on the dataset to ensure we are setting the right range):

sherpa> ungroup(2)
dataset 2: 0.3066:5.3728 Energy (keV) (unchanged)

sherpa> notice_id(2)
dataset 2: 0.3066:5.3728 -> 0.00146:14.9504 Energy (keV)

sherpa> notice_id(2, 0.3, 6)
dataset 2: 0.00146:14.9504 -> 0.292:6.0006 Energy (keV)

sherpa> plot_data(2, xlog=False, ylog=False, linestyle="none")

Figure 3: The ungrouped and filtered data set

[The data runs from 0.3 to 6 keV with most of the counts at 2 keV and below.]
[Print media version: The data runs from 0.3 to 6 keV with most of the counts at 2 keV and below.]

Figure 3: The ungrouped and filtered data set

Note that many plot options can be over-ridden when creating a plot: in this case the axes are drawn with a linear scale and no line is drawn showing the bin edges.

If we now run group_counts then the grouping will only be done to the data with the noticed range:

sherpa> group_counts(2, 25)
dataset 2: 0.292:6.0006 Energy (keV) (unchanged)
[NOTE]
Changes in CIAO 4.16

Prior to CIAO 4.16 you had to use the tabStops argument to the grouping calls (like group_counts) to achieve this. It is now the default behaviour, but existing code using tabStops will still work.

Setting tabStops="nofilter" will replicate the behaviour of the group commands in CIAO 4.15 and earlier.

We can plot this, using a linear scale for the X-axis, with:

sherpa> plot_fit(2, xlog=True)
sherpa> plt.axvline(0.3, color='k')
sherpa> plt.axvline(6, color='k')

Figure 4: The grouped data

[The data seems to fit between the 0.3 to 6 keV rabge.]
[Print media version: The data seems to fit between the 0.3 to 6 keV rabge.]

Figure 4: The grouped data

The choice of 25 for the binning scheme was to ensure that the last bin was incomplete (i.e. contains less that 25 counts). We can look for such bins as they have a "quality" value of 2:

sherpa> q2 = get_quality(2)
sherpa> print(q2[q2 > 0])
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

sherpa> d2 = get_data(2)
sherpa> print(d2.channel[q2 > 0])
[395. 396. 397. 398. 399. 400. 401. 402. 403. 404. 405. 406. 407. 408.
 409. 410. 411.]

One option is to use the data as is, so with one bin that is "incomplete".

An alternative is to exclude this bin, which can be done with the ignore_bad function. Unfortunately this requires re-creating the filter, as shown below and can lead to some strange behavior (e.g. get_filter may fail):

sherpa> ignore_bad(2)
WARNING: filtering grouped data with quality flags, previous filters deleted
dataset 2: 0.292:6.0006 -> 0.00146:14.9504 Energy (keV)

sherpa> notice_id(2, 0.3, 6)
sherpa> plot_fit(2)

Figure 5: The grouped data

[The data seems to fit between the 0.3 to 6 keV range.]
[Print media version: The data seems to fit between the 0.3 to 6 keV range.]

Figure 5: The grouped data


Scripting It

The file fit.py is a Python script which performs the primary commands used above; it can be executed by typing %run -i fit.py on the Sherpa command line.

The Sherpa script command may be used to save everything typed on the command line in a Sherpa session:

sherpa> script(filename="sherpa.log", clobber=False)

(Note that restoring a Sherpa session from such a file could be problematic since it may include syntax errors, unwanted fitting trials, et cetera.)


Summary

This thread shows how you can use the grouping functionality in Sherpa to change the grouping scheme of a PHA file once it has been read into Sherpa. This allows you to see how sensitive the fit results are to the grouping scheme by changing the number of counts per group or using a different method for grouping the data.


History

14 Dec 2004 updated for CIAO 3.2: script version and path
17 Jun 2005 updated information in Get Started on loading the script
21 Dec 2005 reviewed for CIAO 3.3: no changes
01 Dec 2006 reviewed for CIAO 3.4: no changes
14 Dec 2008 updated for CIAO 4.1: sherpa_utils.sl script replaced by new Sherpa 4.1 grouping functionality
29 Apr 2009 new script command is available with CIAO 4.1.2
17 Dec 2009 updated for CIAO 4.2: new group_bins and group_width commands
13 Jul 2010 updated for CIAO 4.2 Sherpa v2: removal of S-Lang version of thread.
15 Dec 2010 updated for CIAO 4.3: new set_xlog and set_ylog commands are available for setting the axis scale of plots to logarithmic
15 Dec 2011 reviewed for CIAO 4.4: added a warning about filtering/grouping source and background data sets
13 Dec 2012 updated for CIAO 4.5: noted that grouping a data set in Sherpa no longer clears the existing data filter; removed an outdated warning about filtering/grouping source and background data sets, as the associated bug has been fixed.
03 Dec 2013 reviewed for CIAO 4.6: no changes
02 Dec 2015 reviewed for CIAO 4.8: no content change.
09 Nov 2016 reviewed for CIAO 4.9: updated screen outputs and moved closing notes to admonition block in "Grouping the data" section; no new content.
12 Apr 2018 reviewed for CIAO 4.10: no content change.
06 Dec 2018 reviewed for CIAO 4.11: screen output revised.
12 Dec 2019 reviewed for CIAO 4.12: replaced ChIPS with matplotlib syntax.
15 Dec 2020 reviewed for CIAO 4.13: plots updated to the latest scheme; noticed energy range has been updated to 0.3 to 6 keV; added a section on using tab stops to group the data.
03 Mar 2022 reviewed for CIAO 4.14; typos fixed.
02 Dec 2022 reviewed for CIAO 4.15; updated output. Note of ignore_bad bugginess that makes it not necessarily interact well with other functions even though the filter is applied.
11 Dec 2023 reviewed for CIAO 4.16; updated to point out that the default behaviour of commands like group_counts has changed to simplify filtering then grouping (that is, in most cases the tabStops argument is not needed anymore).