sqlsql-serverrandommatch

How do I get matched pairs of results (1 male and 1 female) with the same age at random?


I have a database/table with this structure:

Year Age Gender OrderID
2012 18 M 4268
2021 75 M 7569
2015 56 F 5381
2018 29 M 2876
2014 33 F 3749

What I am trying to acheive is that I want 400 records/table rows pulled at random to form a smaller sample, but I need 200 male and 200 female records. On top of this I need each male record to give a female record with the same Age value, so I essentially end up with 200 pairs of results - each pair having a male and female of the same Age.

I have already produced and tried the following code:

DROP TABLE IF EXISTS #SampleTableM
DROP TABLE IF EXISTS #SampleTableF

SELECT TOP (200) [Year],[Age],[Gender],[OrderID]
INTO #SampleTableM
  FROM [database.name]
  WHERE Age <=90 AND Sex = 'M'
  ORDER BY NEWID()

SELECT TOP (200) [Year],[Age],[Gender],[OrderID]
INTO #SampleTableF
  FROM [database.name]
  WHERE Age <=90 AND Sex = 'F'
  ORDER BY NEWID()

SELECT * FROM #SampleTableM
UNION
SELECT * FROM #SampleTableF;

However, this just gets me 200 random Male results and 200 random Female results without each result being matched to one of the opposite Gender with the same age.


Solution

  • Based on the comments, I first selected random 200 male sample and then matched the age for female gender corresponding to the collected 200 male samples.Unsure how your output should look like as you did not share the expected output, you can adjust the columns in output as required.

    Here is an example

        WITH male_sample AS (
        SELECT TOP (200) [Year], [Age], [Gender], [OrderID]
        FROM test
        WHERE Age <= 90 AND Gender = 'M'
        ),
    female_sample AS (
        SELECT TOP (200) m.[Year] AS male_year, m.[Age] AS age, m.[Gender] AS male_gender, m.[OrderID] AS male_OrderID,
               f.[Year] AS female_year, f.[Gender] AS female_gender, f.[OrderID] AS female_OrderID
        FROM male_sample m
        INNER JOIN test f ON m.Age = f.Age
        WHERE f.Gender = 'F'
    )
    
    SELECT *
    FROM female_sample;
    

    Fiddle

    Male_Year Age Male_Gender Male_OrderID Female_Year Female_Gender Female_OrderID
    2012 18 M 4268 2013 F 4269
    2021 75 M 4269 2020 F 4270
    2018 29 M 4271 2019 F 4272
    2016 56 M 4273 2015 F 4270
    2014 33 M 4274 2014 F 4272
    2022 40 M 4001 2017 F 4002
    2023 50 M 5001 2011 F 5002
    2024 60 M 6001 2010 F 6002

    EDIT : As per comment we can get the matching pairs in subsequent rows using UNION ALL and then order by age, gender. There may or not be equal distribution of age for each gender but atleast similar age will appear together.

    Note : Visually this might still look uneven for when the number of male and female are more than 2 for an age.

    WITH male_sample AS (
        SELECT TOP (200) [Year], [Age], [Gender], [OrderID]
        FROM test
        WHERE Age <= 90 AND Gender = 'M'
        ),
    female_sample AS (
        SELECT TOP (200) m.[Year] AS male_year, m.[Age] AS age, m.[Gender] AS male_gender, m.[OrderID] AS male_OrderID,
               f.[Year] AS female_year, f.[Gender] AS female_gender, f.[OrderID] AS female_OrderID
        FROM male_sample m
        INNER JOIN test f ON m.Age = f.Age
        WHERE f.Gender = 'F'
    ),
    combined_sample AS (
        SELECT [male_year] AS [Year], [Age], [male_gender] AS [Gender], [male_OrderID] AS [OrderID]
        FROM female_sample
        UNION ALL
        SELECT [female_year] AS [Year], [Age], [female_gender] AS [Gender], [female_OrderID] AS [OrderID]
        FROM female_sample
    )
    SELECT [Year], [Age], [Gender], [OrderID]
    FROM combined_sample
    ORDER BY age, Gender  ;
    

    Fiddle

    Output

    Year Age Gender OrderID
    2013 18 F 4269
    2012 18 M 4268
    2019 29 F 4272
    2018 29 M 4271
    2014 33 F 4272
    2014 33 M 4274
    2017 40 F 4002
    2022 40 M 4001
    2011 50 F 5002
    2023 50 M 5001
    2015 56 F 4270
    2016 56 M 4273
    2010 60 F 6002
    2024 60 M 6001
    2020 75 F 4270
    2021 75 M 4269