AGASC-Gaia Cross-match

All files are in the /data/aca/analysis/agasc1p8 directory.

The main data products of the AGASC-Gaia cross-match are the following (files are big, so it might not be a good idea to just click on them in the browser):

agasc1p8rc11.h5. The updated AGASC catalog with positions and mag_ACA estimates based on Gaia photometry.
offset_lookup_1p8rc11.h5. An update of the expected offsets in the presence of spoilers.

Other interesting data products are (see below for more details):

agasc-gaia-x-match.h5. The cross-match between AGASC and Gaia.
agasc-gaia-x-match-all.h5. All the candidate matches that were considered.
agasc-gaia-x-match-difficult.h5. Stars that were "difficult" because two or more AGASC stars were matched to the same Gaia star.
agasc1p8rc11-duplicates.fits. Stars that might be duplicates in the AGASC catalog.

Datasets

This section describes what is done in agasc_gaia.datasets. All of this is done before any cross-match with Gaia (even though some of the issues were only found after a cross-match).

We have used a few datasets other than Gaia and AGASC. All these are available through agasc_gaia.datasets:

Tycho2 (get_tycho).
TDSC (get_tdsc).
GSC 1.1 (get_gsc11).
GSC 2.3 (get_gsc23).
Gaia "minimum bias" (get_gaia_min_bias_train_test). This is just a random sample of Gaia stars.

There is a function to produce an expanded AGASC catalog, which includes all the AGASC columns plus some extra information that could be of use later on:

Agasc summary (get_agasc_summary)

The initial intention was to use off-the-shelf cross-matches with Gaia, because Gaia DR3 already includes cross-matches with GSC2.3 and with Tycho2-TDSC. The AGASC catalog includes Tycho2 and GSC1.1 IDs, and in principle it should be possible to match these to Tycho2-TDSC and GSC2.3 respectively. This proved to be problematic.

AGASC issues

Repeated entry. The AGASC catalog has one repeated ID (AGASC ID 154534513). We kept the more recent of the two entries.

Potential duplicates (get_potential_duplicates). Looking at cross-matches with Gaia, it became apparent that there might be duplicate stars in the AGASC catalog. This can happen when the union of two catalogs is taken without accounting for stars present in both catalogs with slightly different positions and magnitudes. On the other hand, a number of known binaries/multiples have the same exact position in AGASC, and they could have similar magnitudes. In general it is difficult to know for sure given the complex update history of the catalog.

We added a check for duplicates in the AGASC summary. The criteria for marking duplicates are:

two stars that are very close in position (< 1 arcsec)
and from two different catalogs (the older catalog is marked as possible duplicate)
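
The two criteria above can be sketched as follows. This is only an illustration, not the code in get_potential_duplicates: the function names, the brute-force pair loop and the `catalog_rank` argument (a hypothetical integer that is larger for more recent catalogs) are assumptions made for the example.

```python
import numpy as np


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula). Inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.deg2rad, (ra1, dec1, ra2, dec2))
    a = (
        np.sin((dec2 - dec1) / 2) ** 2
        + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2
    )
    return np.rad2deg(2 * np.arcsin(np.sqrt(a))) * 3600


def flag_potential_duplicates(ra, dec, catalog_rank, max_sep_arcsec=1.0):
    """Flag the entry from the older catalog in every close pair.

    catalog_rank is a hypothetical ordering: larger means more recent.
    Pairs coming from the same catalog are never flagged.
    """
    n = len(ra)
    flagged = np.zeros(n, dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if catalog_rank[i] == catalog_rank[j]:
                continue
            sep = angular_separation_arcsec(ra[i], dec[i], ra[j], dec[j])
            if sep < max_sep_arcsec:
                older = i if catalog_rank[i] < catalog_rank[j] else j
                flagged[older] = True
    return flagged


# Made-up positions: stars 0 and 1 are ~0.34 arcsec apart, from different catalogs.
ra = np.array([10.0, 10.0001, 10.0, 50.0])
dec = np.array([20.0, 20.0, 20.001, 20.0])
catalog_rank = np.array([1, 2, 2, 1])
flags = flag_potential_duplicates(ra, dec, catalog_rank)
```

The pair loop is quadratic in the number of stars, so a real implementation would need a spatial index or a sorted search.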

Tycho2 Issues

Tycho2-TDSC IDs. The IDs in Tycho2 and in TDSC are NOT the same, and the TDSC paper did not include a mapping from one ID set to the other. In order to do the cross-match AGASC-Tycho2-Tycho2TDSC-Gaia, we needed to match Tycho2 stars with Tycho2-TDSC stars. This is done in get_tycho2_tdsc, and is based on a radial cross-match using the probabilities in tycho2_tdsc_separation_prob and tycho2_tdsc_mag_diff_prob. This cross-match is not relevant for the final result, because we ended up doing a cross-match between AGASC and Gaia directly.

Missing Tycho2-TDSC IDs. A few stars were removed from the Tycho2-TDSC cross-match:

2673-3845-1, 2673-3845-2, 2673-3845-3
6461-1120-1, 6461-1120-2, 6461-1120-3

Wrong proper motion. When we did the direct AGASC-Gaia cross-match, we noticed that some AGASC stars had no Gaia counterpart. In a number of these cases, it could be attributed to wrong proper motions in Tycho2. For this reason, we decided to not use the RAmdeg and DEmdeg fields but use RA(ICRS) and DE(ICRS) instead. RAmdeg and DEmdeg are the ICRS position at epoch 2000, and therefore can be affected by wrong proper motions.

The positions of Tycho2 stars in the AGASC are given by RAmdeg and DEmdeg, so these positions are changed when producing the AGASC summary.

GSC Issues

Issues with the GSC catalogs are irrelevant in the end, because we ended up doing a cross-match between AGASC and Gaia DR3 directly, without using GSC.

The first issue was with GSC catalogs. We noticed that, after matching AGASC-GSC1.1-GSC2.3, there were some stars with large differences in AGASC and GSC2.3 positions. These matches were removed from the downstream dataset.

We still do not know the origin of the discrepancy.

Cross-match

Algorithms related to the cross-matches are kept in agasc_gaia.gaia_queries (functions to query Gaia) and agasc_gaia.cross_match (functions that take the results from Gaia queries and produce a list of matches).

We did two separate cross-matches between AGASC and Gaia:

Direct cross-match. Stars in AGASC were directly compared to Gaia stars. See below.
Indirect cross-match:

match AGASC to the intermediate datasets (Tycho2-TDSC, GSC2.3) using the Tycho2 and GSC1.1 IDs in AGASC (done in the AGASC summary),
take an off-the-shelf cross-match between the intermediate datasets and Gaia,
and take the union of the results (cross_match.get_agasc_tycho_gsc_gaia_x_match).

The final result uses the direct cross-match, but we derived insights from comparing the two.

Direct cross-match

The cross-match happens in two steps:

Get preliminary matches between AGASC and Gaia (in gaia_queries.run_full_preliminary_cross_match), the union of:

all Gaia stars within 15 arcsec of each AGASC star.
all Gaia stars within 15 arcsec of each AGASC star's position shifted to a 2016 observation date, according to the AGASC proper motion.
all Gaia stars with high proper motion ((abs(pm_dec) > 300) | (abs(pm_ra) > 300)) within 1200 arcsec of each AGASC star.

Compute the match probability of all AGASC-Gaia preliminary matches. The match probability is calculated assuming the Gaia proper motion and the AGASC proper motion. Then:
- If Gaia has any proper motion information, use that.
- If Gaia has no proper motion information, and AGASC has proper motion in both RA and dec, take the best p-match value of the two.
- Otherwise, take the value using the Gaia proper motion set to zero.
For each AGASC star, select the Gaia star with the largest match probability (cross_match.get_agasc_gaia_x_match). Choosing this probability is one of the big questions.
Keep only the matches above a probability threshold. The idea is to remove random chance matches. Examples of random chance matches include: an AGASC star with a wrong position (or an artifact in AGASC) that is matched to a random Gaia star nearby, or an AGASC star that has no counterpart because Gaia is not complete and gets matched with the second best. Choosing this threshold is one of the big questions.
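
The last two steps (pick the most probable Gaia candidate for each AGASC star, then apply a threshold) can be sketched as below. This is a minimal illustration and not the code in cross_match.get_agasc_gaia_x_match: the function name, the argument layout, the IDs and the threshold value are all hypothetical.

```python
import numpy as np


def select_best_matches(agasc_id, gaia_id, p_match, threshold):
    """Return {agasc_id: (gaia_id, p_match)} for the best candidate of each
    AGASC star, keeping only candidates with p_match above the threshold."""
    agasc_id = np.asarray(agasc_id)
    gaia_id = np.asarray(gaia_id)
    p_match = np.asarray(p_match, dtype=float)

    # Visit candidates from most to least probable; the first candidate seen
    # for a given AGASC star is its best one.
    best = {}
    for k in np.argsort(-p_match, kind="stable"):
        a = int(agasc_id[k])
        if a not in best:
            best[a] = (int(gaia_id[k]), float(p_match[k]))

    return {a: m for a, m in best.items() if m[1] > threshold}


# Made-up candidate list: AGASC star 3 only has a very unlikely candidate.
matches = select_best_matches(
    agasc_id=[1, 1, 2, 3],
    gaia_id=[10, 11, 12, 13],
    p_match=[0.2, 0.9, 0.6, 0.01],
    threshold=0.05,
)
```

With these inputs, star 3 is left without a counterpart because its only candidate falls below the threshold.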

NOTE: pm_ra in the condition for high-PM above should have been pm_ra*cos(dec), but using pm_ra is conservative and faster.
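
The reason the simpler condition is conservative is that |cos(dec)| <= 1, so |pm_ra*cos(dec)| <= |pm_ra|: every star selected with the cos(dec) factor is also selected without it. The sketch below checks this on random numbers. The helper name and the random inputs are hypothetical, and the values are in whatever units the threshold of 300 is expressed in.

```python
import numpy as np


def high_pm_mask(pm_ra, pm_dec, dec_deg=None, threshold=300.0):
    """High proper motion selection.

    If dec_deg is given, pm_ra is multiplied by cos(dec) before the cut;
    otherwise pm_ra is used as is.
    """
    pm_ra = np.asarray(pm_ra, dtype=float)
    pm_dec = np.asarray(pm_dec, dtype=float)
    if dec_deg is not None:
        pm_ra = pm_ra * np.cos(np.deg2rad(dec_deg))
    return (np.abs(pm_dec) > threshold) | (np.abs(pm_ra) > threshold)


rng = np.random.default_rng(0)
dec = rng.uniform(-89.0, 89.0, 10_000)
pm_ra = rng.normal(0.0, 300.0, 10_000)
pm_dec = rng.normal(0.0, 300.0, 10_000)

loose = high_pm_mask(pm_ra, pm_dec)                # condition as used
strict = high_pm_mask(pm_ra, pm_dec, dec_deg=dec)  # with the cos(dec) factor
```

Every entry selected by `strict` is also selected by `loose`, so the simpler cut can only add candidates, never drop them.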

How to determine p_match

This can be done in an iterative process.

First Iteration

The first iteration was the indirect AGASC-GSC-Tycho-Gaia cross-match. This produced a set of matches that are close in position. The vast majority of these matches are clear.

Second Iteration

The second iteration was the result of looking at magnitude outliers (which prompted us to do the direct cross-match). We used a reasonable guess for the probability distribution (cross_match.agasc_gaia_match_probability_prelim). This used a Gaussian distribution in both magnitude and position.

Third Iteration

The third iteration was the result of comparing the two cross-matching algorithms. The difficult cases were two stars in Gaia with a single AGASC star in between, which actually corresponds to a blend of the two Gaia stars. What to do in these cases was another question: Should we not update it or should we match it to one of the Gaia stars? The decision was to match it to one of the stars, and in this case we need a better estimate of probability. The matching probability in this case has fatter tails to include these.

NOTE: An alternative would have been to use a Gaussian distribution with a width given by a combination of the uncertainties in AGASC and Gaia. I think what we have is good enough.

How to determine the cut on p_match

To determine the cut on p_match we calculated the p-value of each match pair (see notes/04-Understanding-p-value.ipynb). Roughly speaking, the p-value is the probability that we find a star with a more extreme p_match, assuming the match probability distribution is correct.

If the match probability used to define p-value describes reality correctly, the distribution of p-value should be a uniform distribution in (0,1).
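
A toy simulation illustrates this. It assumes a much simpler model than the real p_match: only the positional offset is used, with an isotropic 2-D Gaussian of hypothetical width `sigma`, so that "more extreme" simply means "larger separation". True counterparts then give a flat p-value distribution, while chance matches spread uniformly over a 15 arcsec cone pile up near 0.

```python
import numpy as np


def positional_p_value(separation, sigma):
    """P(separation larger than observed) for a 2-D Gaussian offset model."""
    return np.exp(-np.asarray(separation) ** 2 / (2 * sigma**2))


rng = np.random.default_rng(42)
sigma = 0.2  # hypothetical positional scatter, arcsec
n = 100_000

# True counterparts: offsets drawn from the same model used for the p-value.
dx, dy = rng.normal(0.0, sigma, size=(2, n))
p_true = positional_p_value(np.hypot(dx, dy), sigma)

# Chance matches: uniformly distributed over a 15 arcsec search cone.
sep_chance = 15.0 * np.sqrt(rng.uniform(size=n))
p_chance = positional_p_value(sep_chance, sigma)

# Fraction of true matches in each of 10 equal p-value bins (expected ~0.1).
counts, _ = np.histogram(p_true, bins=10, range=(0, 1))
frac_true = counts / n

# Fraction of chance matches in the lowest bin (expected close to 1).
frac_chance_low = float(np.mean(p_chance < 0.1))
```

In this toy model a cut at small p-value removes nearly all chance matches at the cost of a known, small fraction of true ones.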

We chose the cut value to exclude a spike that occurs around 0. This spike can be caused by:

random chance matches:

an AGASC star with a wrong position (or an artifact in AGASC) that is matched to a random Gaia star nearby,
or an AGASC star that has no counterpart because Gaia is not complete, and gets matched with the second best

discrepancy between p_match and reality (e.g.: the uncertainty in AGASC is underestimated, non-gaussian tails)

Difficult Stars

"Difficult" stars are AGASC stars that can be matched to the same Gaia star as other AGASC star(s). These stars are grouped with all the stars it could be confused with. The matches in each group are recomputed to guarantee there are no two repeated AGASC or Gaia IDs. The process is to:

Select the match with the highest probability.
Remove the corresponding AGASC and Gaia IDs from the candidate matches.
Repeat until there are no candidate matches left.
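
The three steps above amount to a greedy assignment, sketched below. The function name and the tuple layout of the candidates are assumptions made for the example, and the IDs are made up.

```python
def resolve_group(candidates):
    """Greedy one-to-one assignment within a group of difficult stars.

    candidates: iterable of (agasc_id, gaia_id, p_match) tuples.
    Returns the selected matches, most probable first.
    """
    remaining = sorted(candidates, key=lambda c: c[2], reverse=True)
    matches = []
    while remaining:
        # 1. Select the match with the highest probability.
        agasc_id, gaia_id, p_match = remaining[0]
        matches.append((agasc_id, gaia_id, p_match))
        # 2. Remove every candidate that reuses either ID.
        remaining = [
            c for c in remaining if c[0] != agasc_id and c[1] != gaia_id
        ]
        # 3. Repeat until no candidates are left.
    return matches


# Two AGASC stars (1, 2) competing for two Gaia stars (100, 200).
group = [(1, 100, 0.9), (2, 100, 0.8), (2, 200, 0.3), (1, 200, 0.5)]
resolved = resolve_group(group)
```

A greedy pass does not necessarily maximize the total probability of the group, but it guarantees that no AGASC or Gaia ID appears twice.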

For each of these groups, we defined "latest_pos_cat" as the POS_CATID value with the highest precedence from the AGASC entries in the group. The precedence is, in decreasing order: 5, 6, 4, 3, 2, 1. The catalog precedence IS NOT considered when recomputing the matches.
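
As a small illustration of the precedence rule, the helper below (a hypothetical name, not a function from the package) picks latest_pos_cat from the POS_CATID values of a group:

```python
# Precedence in decreasing order, as listed above.
POS_CATID_PRECEDENCE = (5, 6, 4, 3, 2, 1)


def latest_pos_cat(pos_catids):
    """Return the POS_CATID with the highest precedence in a group.

    Raises ValueError if a value is not in the precedence list.
    """
    return min(pos_catids, key=POS_CATID_PRECEDENCE.index)
```

For example, a group with POS_CATID values 3, 6 and 4 gets latest_pos_cat = 6, and a group with values 5 and 6 gets 5.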

The entries with a POS_CATID different from latest_pos_cat could be considered duplicates, although this is not guaranteed to be the case.

For example, AGASC 102499594 and 102499593 are two stars in Tycho2 that form a binary system (?). AGASC 102499594 is a star in GSC2.3 that lies within 0.5 arcsec from them, and has a magnitude consistent with it being a blend of the other two. Based on the catalog precedence mentioned above, 102499594 appears to be a duplicate. Gaia IDs 3151414218873077760 and 3151414218874713600 are two resolved stars in Gaia that are matched to AGASC 102499594 and 102499593 respectively. AGASC 102499594 is matched to the nearest star other than these, which happens to be 13 arcsec away, and this match is marked as background based on p-value.

ACA Magnitude Model

The ACA magnitude model is fit in agasc_gaia.gaia_model.get_gaia_model. The model is implemented as two classes:

MissingValueFiller with fit and fill_missing_values methods.
GaiaModel with fit, predict and uncertainty methods.

The reason for separating the missing-value-filling and the magnitude model is that they can be used on different datasets: To estimate the missing Gaia values, one does not need the star to be observed by ACA. We first fit the missing value filler, using the training sample from the minimum bias Gaia dataset. Then we fit the magnitude model using the observed stars training set.

The ad-hoc model

A preliminary model selection was done in notes/05-agasc-gaia-model-select-1.ipynb, where we compared several models, including simple linear models, a random forest and an ad-hoc model. The ad-hoc model was chosen.

The main features of the ad-hoc model are:

It is based on the principal components in the space spanned by the Gaia magnitudes (mag_G, mag_Rp, mag_Bp).
It is linear in the first principal component and polynomial in the second principal component.
It includes a multiplicative instrument response factor to account for a systematic bias of the magnitude determination.

The instrument response factor is 1 for magnitudes below a given threshold, and is monotonically increasing above the threshold. It is implemented as a cubic spline interpolation between linear functions below/above the magnitude threshold in gaia_model.Broken.
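
One way to build a function with this shape is sketched below: a cubic Hermite segment that joins the constant 1 to a straight line, matching value and slope at both ends. This is an assumption about the general idea, not the code in gaia_model.Broken. The parameter names (`threshold`, `slope`, `width`) and all numerical values are hypothetical.

```python
import numpy as np


def broken_response(mag, threshold, slope, width):
    """Factor equal to 1 below `threshold`, linear with `slope` above
    `threshold + width`, and a cubic Hermite blend in between."""
    mag = np.asarray(mag, dtype=float)
    t = np.clip((mag - threshold) / width, 0.0, 1.0)

    # End points of the blend: value 1 with slope 0 at the threshold, and the
    # value and slope of the line 1 + slope * (mag - threshold) at the far end.
    y0, m0 = 1.0, 0.0
    y1, m1 = 1.0 + slope * width, slope

    # Cubic Hermite basis functions on t in [0, 1].
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    blend = h00 * y0 + h10 * width * m0 + h01 * y1 + h11 * width * m1

    linear_above = 1.0 + slope * (mag - threshold)
    return np.where(mag >= threshold + width, linear_above, blend)


mags = np.array([8.0, 10.0, 10.5, 11.0, 12.0])
factor = broken_response(mags, threshold=10.0, slope=0.1, width=1.0)
```

With a positive slope the blend is monotonically increasing, so the factor never drops below 1.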

The Simple Color Model

After some discussion, the magnitude model was simplified to guarantee scaling. That is, considering mag_ACA as a function of Gaia magnitudes, the model should satisfy: \[\text{mag}_{ACA}(\text{mag}_{G}+\delta, \text{mag}_{Rp}+\delta, \text{mag}_{Bp}+\delta) = \text{mag}_{ACA}(\text{mag}_{G}, \text{mag}_{Rp}, \text{mag}_{Bp}) + \delta\].

This model:

Includes the same multiplicative instrument response factor from the ad-hoc model.
Is of the form: \[ \begin{aligned} \text{mag}_{ACA} &= \text{mag}_{G} + \text{Pol}(\text{color}, N=2) \\ \text{color} &= \text{mag}_{Bp} - \text{mag}_{Rp} \end{aligned} \]
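
The scaling property follows from the form of the model: the color is a difference of magnitudes, so a common shift cancels in it and only enters through mag_G. The sketch below checks this numerically. The function name and the polynomial coefficients are hypothetical placeholders (not the fitted values), and the instrument response factor is left out.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def simple_color_mag_aca(mag_g, mag_bp, mag_rp, coeffs):
    """mag_G plus a polynomial in color = mag_Bp - mag_Rp.

    coeffs are (c0, c1, c2) for c0 + c1 * color + c2 * color**2.
    """
    color = np.asarray(mag_bp, dtype=float) - np.asarray(mag_rp, dtype=float)
    return np.asarray(mag_g, dtype=float) + P.polyval(color, coeffs)


coeffs = (0.1, -0.3, 0.05)  # placeholder values, not the fitted model
mag_g = np.array([8.0, 9.5, 10.2])
mag_bp = np.array([8.4, 10.1, 10.9])
mag_rp = np.array([7.5, 8.9, 9.6])
delta = 0.7

base = simple_color_mag_aca(mag_g, mag_bp, mag_rp, coeffs)
shifted = simple_color_mag_aca(
    mag_g + delta, mag_bp + delta, mag_rp + delta, coeffs
)
```

Here `shifted - base` equals `delta` for every star, whatever the coefficients are.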

A comparison between the ad-hoc model and the simple color model is in notes/05-agasc-gaia-model-select-2.ipynb.

Even though the ad-hoc model performs better for bright stars, the simple color model was chosen because it guarantees the scaling expected from physical principles. The largest deviation in ACA magnitude is \(\approx 0.05\), which is sufficient for operational purposes.

Variance Model

To estimate the uncertainty in the magnitude model, we implemented a variance model with several additive contributions to the variance:

base uncertainty. Determined for bright stars, below the magnitude threshold, and applies to all stars.
instrument uncertainty. Zero below the magnitude threshold, monotonically increasing above threshold. Implemented the same way as the instrument response factor, a cubic spline interpolation between linear functions below/above the magnitude threshold.
variability. Gaia's Median Absolute Deviation (MAD) for G FoV transits (vari_summary.mad_mag_g_fov)
missing magnitudes. The mean variance increase in mag_ACA for stars that have complete Gaia magnitude information (mag_G, mag_Rp, mag_Bp), when estimating mag_ACA assuming some Gaia magnitudes are missing (and filling their values using the other Gaia magnitudes). Interpolated at the resulting mag_ACA value.

The final uncertainty is then given by \[ \sigma = \sqrt{\text{var}_{base} + \text{var}_{instrument} + \text{var}_{variable} + \text{var}_{missing}} \]

Outlier Analysis

In order to do outlier analysis, we produced a few reports. Unless otherwise stated, the reports were produced in the 06-cross_match_comparison_direct-indirect.ipynb notebook:

Observed mag_ACA outliers and the corresponding AGASC supplement. Report of stars whose observed ACA magnitude differs significantly from the magnitude estimated from Gaia. 04-agasc-gaia-x-match-performance.ipynb.
Proper motion outliers - worst p-values. These are stars with large proper motion difference between 1p7 and 1p8. These include a number of Tycho2 stars. 04-agasc-gaia-x-match-performance.ipynb
Proper motion outliers - large PM differences. These are stars with large proper motion difference between 1p7 and 1p8. These include a number of Tycho2 stars. 04-agasc-gaia-x-match-performance.ipynb
Proper motion outliers - random. These are stars with large proper motion difference between 1p7 and 1p8. These include a number of Tycho2 stars. 04-agasc-gaia-x-match-performance.ipynb
Differences between direct and indirect methods - Observed. All observed stars which differ between the two cross-matching algorithms.
Differences between direct and indirect methods - Candidates. A sample of candidate stars which differ between the two cross-matching algorithms. This is the justification to ditch the indirect method.
Differences between direct and indirect methods - Large separation. A sample of stars with angular separation between the direct and indirect matches larger than 3 arcsec.
ACA magnitude outliers. Stars with large mag_ACA difference between 1p7 and 1p8.
Difficult stars
Candidates. Candidate guide or acq stars with 0.01 < p-value < 0.022. These are the worst matches currently in 1p8 RC.
Candidates. Candidate guide or acq stars with 0.003 < p-value < 0.005. These are not in 1p8 RC.

NOTE: Here is a Legend detailing the meaning of the various fields in the reports.

ASPQ1 Update

ASPQ1 is a short-integer spoiler code used in star selection. It is an estimate, in units of 50 milliarcsec, of the worst centroid offset caused by any star within 80 arcsec. The offsets over a grid of brightness difference dm and radial separation dr are calculated once and stored in data/offset_lookup_1p8rc11.h5. The value for each star is found by interpolating the stored grid values. A comparison of offsets in 1p7 and 1p8 is in 07.1-agasc-update-aspq.ipynb.

AGASC Update

Only stars with CLASS 0, 2 or 6 are updated. These correspond, respectively, to: star; blend or member of incorrectly resolved blend; and known multiple system.

Based on this cross-match, the AGASC catalog will be updated in the following fields. The fields are updated only for the stars that have updated magnitude, unless otherwise stated:

XREF_ID1, XREF_ID5. The ID of the Gaia counterpart (a 64-bit integer). The most significant 32 bits of the Gaia ID are stored in XREF_ID1, and the least significant 32 bits are stored in XREF_ID5. The default Gaia ID value is -9999, which translates to the default values XREF_ID1=-1 and XREF_ID5=-9999. The full Gaia ID can be reconstructed as follows:
```
# Cast to 64 bits first: a 32-bit column cannot hold the product with 2**32.
agasc["XREF_ID1"].astype(np.int64) * (1 << 32) + agasc["XREF_ID5"].astype(np.uint32)
```
RSV4. Long integer which is the star number within the region in the AGASC Version 1.0 (= GSC1.1). This is not a unique identifier. Default value of -9999. In AGASC version 1p7 and earlier, this was stored in XREF_ID1.

MAG, MAG_ERR, MAG_BAND, MAG_CATID. Gaia magnitude and its error.

MAG_BAND = 23, 24 or 25, depending on whether it is the G, Rp or Bp magnitude.
MAG_CATID = 7

MAG_ACA, MAG_ACA_ERR. Estimated ACA magnitude and its error.
```
MAG_ACA_ERR = mag_aca_err_gaia * 100
```

COLOR1, COLOR2, COLOR1_ERR, COLOR2_ERR, C1_CATID, C2_CATID. Estimated B-R color and its error.

COLOR1 = bp_mag - rp_mag
COLOR1_ERR = sqrt(bp_mag_error**2 + rp_mag_error**2)
COLOR2 = bp_mag - rp_mag
COLOR2_ERR = sqrt(bp_mag_error**2 + rp_mag_error**2)
C1_CATID = 7
C2_CATID = 7

RA, DEC, EPOCH, POS_CATID, POS_ERR. Position and its error.

RA = gaia_ra
DEC = gaia_dec
EPOCH = 2016.0
POS_CATID = 7
POS_ERR = sqrt((gaia_ra_error/cos(gaia_dec))**2 + gaia_dec_error**2) * 1000

PM_RA, PM_DEC, PM_CATID. Proper motion and its source catalog.
```
PM_RA = gaia_pm_ra
PM_DEC = gaia_pm_dec
PM_CATID = 7
```
ASPQ1. Updated for all entries in catalog

ASPQ2. Boolean cast as int:

pm = np.sqrt(
    (agasc_gaia["pm_ra"] / np.cos(np.deg2rad(agasc_gaia["DEC"]))) ** 2
    + agasc_gaia["pm_dec"] ** 2
)
ASPQ2 = np.where(pm[sel] > 500, 1, 0)

VAR. Updated only for stars that are flagged as variables in Gaia.

VAR[(mad_mag_g_fov < 0.2)] = 5
VAR[(mad_mag_g_fov >= 0.2) & (mad_mag_g_fov < 2)] = 3
VAR[(mad_mag_g_fov >= 2)] = 4
VAR_CATID = CAT_ID
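
The split of the Gaia ID into XREF_ID1 and XREF_ID5 described above can be checked with a round trip. The sketch below assumes both fields are stored as signed 32-bit integers. The function names are hypothetical and the arrays stand in for the catalog columns.

```python
import numpy as np


def split_gaia_id(gaia_id):
    """Split 64-bit Gaia IDs into (high, low) signed 32-bit halves."""
    gaia_id = np.atleast_1d(np.asarray(gaia_id, dtype=np.int64))
    # Arithmetic shift keeps the sign, so -9999 gives a high word of -1.
    high = (gaia_id >> 32).astype(np.int32)
    # Keep the low 32 bits and reinterpret them as a signed integer.
    low = (gaia_id & 0xFFFFFFFF).astype(np.uint32).view(np.int32)
    return high, low


def join_gaia_id(xref_id1, xref_id5):
    """Rebuild the 64-bit Gaia ID from the two 32-bit halves."""
    high = np.asarray(xref_id1).astype(np.int64)
    low = np.asarray(xref_id5).astype(np.uint32)
    return high * (1 << 32) + low


# One of the Gaia IDs quoted above and the default value.
ids = np.array([3151414218873077760, -9999], dtype=np.int64)
xref_id1, xref_id5 = split_gaia_id(ids)
rebuilt = join_gaia_id(xref_id1, xref_id5)
```

The default value -9999 splits into XREF_ID1=-1 and XREF_ID5=-9999, and both IDs are recovered exactly by the reconstruction.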