I have a database/table with this structure:
Year | Age | Gender | OrderID |
---|---|---|---|
2012 | 18 | M | 4268 |
2021 | 75 | M | 7569 |
2015 | 56 | F | 5381 |
2018 | 29 | M | 2876 |
2014 | 33 | F | 3749 |
What I am trying to acheive is that I want 400 records/table rows pulled at random to form a smaller sample, but I need 200 male and 200 female records. On top of this I need each male record to give a female record with the same Age value, so I essentially end up with 200 pairs of results - each pair having a male and female of the same Age.
I have already produced and tried the following code:
DROP TABLE IF EXISTS #SampleTableM
DROP TABLE IF EXISTS #SampleTableF
SELECT TOP (200) [Year],[Age],[Gender],[OrderID]
INTO #SampleTableM
FROM [database.name]
WHERE Age <=90 AND Sex = 'M'
ORDER BY NEWID()
SELECT TOP (200) [Year],[Age],[Gender],[OrderID]
INTO #SampleTableF
FROM [database.name]
WHERE Age <=90 AND Sex = 'F'
ORDER BY NEWID()
SELECT * FROM #SampleTableM
UNION
SELECT * FROM #SampleTableF;
However, this just gets me 200 random Male results and 200 random Female results without each result being matched to one of the opposite Gender with the same age.
Based on the comments, I first selected random 200 male sample and then matched the age for female gender corresponding to the collected 200 male samples.Unsure how your output should look like as you did not share the expected output, you can adjust the columns in output as required.
Here is an example
WITH male_sample AS (
SELECT TOP (200) [Year], [Age], [Gender], [OrderID]
FROM test
WHERE Age <= 90 AND Gender = 'M'
),
female_sample AS (
SELECT TOP (200) m.[Year] AS male_year, m.[Age] AS age, m.[Gender] AS male_gender, m.[OrderID] AS male_OrderID,
f.[Year] AS female_year, f.[Gender] AS female_gender, f.[OrderID] AS female_OrderID
FROM male_sample m
INNER JOIN test f ON m.Age = f.Age
WHERE f.Gender = 'F'
)
SELECT *
FROM female_sample;
Male_Year | Age | Male_Gender | Male_OrderID | Female_Year | Female_Gender | Female_OrderID |
---|---|---|---|---|---|---|
2012 | 18 | M | 4268 | 2013 | F | 4269 |
2021 | 75 | M | 4269 | 2020 | F | 4270 |
2018 | 29 | M | 4271 | 2019 | F | 4272 |
2016 | 56 | M | 4273 | 2015 | F | 4270 |
2014 | 33 | M | 4274 | 2014 | F | 4272 |
2022 | 40 | M | 4001 | 2017 | F | 4002 |
2023 | 50 | M | 5001 | 2011 | F | 5002 |
2024 | 60 | M | 6001 | 2010 | F | 6002 |
EDIT : As per comment we can get the matching pairs in subsequent rows using UNION ALL and then order by age, gender. There may or not be equal distribution of age for each gender but atleast similar age will appear together.
Note : Visually this might still look uneven for when the number of male and female are more than 2 for an age.
WITH male_sample AS (
SELECT TOP (200) [Year], [Age], [Gender], [OrderID]
FROM test
WHERE Age <= 90 AND Gender = 'M'
),
female_sample AS (
SELECT TOP (200) m.[Year] AS male_year, m.[Age] AS age, m.[Gender] AS male_gender, m.[OrderID] AS male_OrderID,
f.[Year] AS female_year, f.[Gender] AS female_gender, f.[OrderID] AS female_OrderID
FROM male_sample m
INNER JOIN test f ON m.Age = f.Age
WHERE f.Gender = 'F'
),
combined_sample AS (
SELECT [male_year] AS [Year], [Age], [male_gender] AS [Gender], [male_OrderID] AS [OrderID]
FROM female_sample
UNION ALL
SELECT [female_year] AS [Year], [Age], [female_gender] AS [Gender], [female_OrderID] AS [OrderID]
FROM female_sample
)
SELECT [Year], [Age], [Gender], [OrderID]
FROM combined_sample
ORDER BY age, Gender ;
Output
Year | Age | Gender | OrderID |
---|---|---|---|
2013 | 18 | F | 4269 |
2012 | 18 | M | 4268 |
2019 | 29 | F | 4272 |
2018 | 29 | M | 4271 |
2014 | 33 | F | 4272 |
2014 | 33 | M | 4274 |
2017 | 40 | F | 4002 |
2022 | 40 | M | 4001 |
2011 | 50 | F | 5002 |
2023 | 50 | M | 5001 |
2015 | 56 | F | 4270 |
2016 | 56 | M | 4273 |
2010 | 60 | F | 6002 |
2024 | 60 | M | 6001 |
2020 | 75 | F | 4270 |
2021 | 75 | M | 4269 |