I'm looking to make a list of toneless pinyin combinations/permutations.
import pandas as pd
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]
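# Strip the tone digits; regex=True is required because recent pandas treats the pattern literally by default
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)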
The current format of the data is:
| pinyin | character |
|--------|-----------|
| cang | 仓 |
| cang | 藏 |
| cao | 操 |
| cao | 曹 |
| cao | 草 |
The expected result would be a list like:
cangcang
cangcao
caocang
caocao
I can dedupe and clean the list myself. I'm just trying to include every combination of two pinyin, in every order.
You can `drop_duplicates`, and then use an outer addition to get all combinations.
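import numpy as np
import pandas as pd
# df is the DataFrame from the question (named `data` there)
s = df['pinyin'].drop_duplicates().to_numpy()
# Outer addition concatenates every ordered pair; ravel flattens the 2-D grid
pd.Series(np.add.outer(s, s).ravel())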
#0 cangcang
#1 cangcao
#2 caocang
#3 caocao
#dtype: object
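To make the outer addition less opaque, here is a minimal standalone sketch. It assumes `s` is an object-dtype array of Python strings; the two sample values are hypothetical stand-ins for the de-duplicated column. On object arrays, `np.add` applies Python's `+` element by element, which for strings means concatenation.

```python
import numpy as np

# Hypothetical sample standing in for the de-duplicated pinyin column.
s = np.array(["cang", "cao"], dtype=object)

# grid[i, j] holds s[i] + s[j], so every ordered pair appears exactly once.
grid = np.add.outer(s, s)

# ravel flattens the 2-D grid row by row into a 1-D sequence of pairs.
pairs = grid.ravel().tolist()
print(grid.shape)
print(pairs)
```

Because the grid is flattened row by row, the pairs come out grouped by their first word.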
If you also want the original single words in the result, concatenate `s` with the output of the outer addition.
pd.Series(s.tolist() + np.add.outer(s, s).ravel().tolist())
#0 cang
#1 cao
#2 cangcang
#3 cangcao
#4 caocang
#5 caocao
#dtype: object
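If the list is small and you would rather stay in plain Python, roughly the same result can be sketched with `itertools.product`. This is an alternative to the NumPy approach, not part of it, and the `words` list is a hypothetical de-duplicated input.

```python
from itertools import product

# Hypothetical de-duplicated input, in the same order as the pinyin column.
words = ["cang", "cao"]

# product(words, repeat=2) yields every ordered pair, including a word with itself.
pairs = ["".join(pair) for pair in product(words, repeat=2)]

# Put the single words in front of the two-word combinations.
result = words + pairs
print(result)
```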
If you also want to keep the two individual words as separate columns, you can do something similar with a merge instead of dropping down to NumPy. Use `drop_duplicates` again and assign a temporary key so that the merge pairs every row with every other row, then add the strings.
s = df[['pinyin']].drop_duplicates().assign(key=1)
res = s.merge(s, on='key').drop(columns='key')
res['combined'] = res['pinyin_x'] + res['pinyin_y']
# pinyin_x pinyin_y combined
#0 cang cang cangcang
#1 cang cao cangcao
#2 cao cang caocang
#3 cao cao caocao
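On pandas 1.2 or later, the temporary key should not be needed, because `merge` accepts `how='cross'`. The sketch below assumes that version and uses a hypothetical sample frame in place of the question's data.

```python
import pandas as pd

# Hypothetical sample mirroring the question's pinyin column.
df = pd.DataFrame({"pinyin": ["cang", "cang", "cao", "cao", "cao"]})

s = df[["pinyin"]].drop_duplicates()

# how="cross" builds the Cartesian product directly (requires pandas >= 1.2).
res = s.merge(s, how="cross")

# Overlapping column names receive the default _x / _y suffixes.
res["combined"] = res["pinyin_x"] + res["pinyin_y"]
print(res)
```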