
All Python Permutations and Combinations of Pinyin (Mandarin Romanization)


I'm looking to make a list of toneless pinyin combinations/permutations.

import pandas as pd
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]
# Strip the tone digits from each syllable; regex=True is required in pandas >= 2.0
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)

The current format of the data is:

| pinyin | character |
|--------|-----------|
| cang   | 仓        |
| cang   | 藏        |
| cao    | 操        |
| cao    | 曹        |
| cao    | 草        |

The expected result would be a list like:

cangcang
cangcao
caocang
caocao

I can dedupe and clean the list myself. I just need every pair of two pinyin syllables, in both orders.


Solution

  • Call drop_duplicates on the pinyin column, then use an outer addition (np.add.outer) to build every ordered pair. Here df is the DataFrame that the question calls data.

    import numpy as np
    import pandas as pd
    
    s = df['pinyin'].drop_duplicates().to_numpy()
    pd.Series(np.add.outer(s, s).ravel())
    
    #0    cangcang
    #1     cangcao
    #2     caocang
    #3      caocao
    #dtype: object
    
    To include the original single syllables as well, prepend `s` to the flattened result.
    
    pd.Series(s.tolist() + np.add.outer(s, s).ravel().tolist())
    #0        cang
    #1         cao
    #2    cangcang
    #3     cangcao
    #4     caocang
    #5      caocao
    #dtype: object
    

    If you also want to keep the two component syllables as separate columns, a merge does the same job without dropping down to NumPy. Call drop_duplicates again, assign a temporary key so that the merge yields the full cross product, then concatenate the two string columns.

    s = df[['pinyin']].drop_duplicates().assign(key=1)
    res = s.merge(s, on='key').drop(columns='key')
    res['combined'] = res['pinyin_x'] + res['pinyin_y']
    
    #  pinyin_x pinyin_y  combined
    #0     cang     cang  cangcang
    #1     cang      cao   cangcao
    #2      cao     cang   caocang
    #3      cao      cao    caocao
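
On pandas 1.2 or later, the temporary key can probably be skipped, since merge accepts how="cross" and builds the Cartesian product directly. A minimal sketch, assuming a small hand-built df that mirrors the question's table (the sample frame is illustrative, not the real file):

```python
import pandas as pd

# Hypothetical sample data mirroring the table in the question
df = pd.DataFrame({
    "pinyin": ["cang", "cang", "cao", "cao", "cao"],
    "character": ["仓", "藏", "操", "曹", "草"],
})

# Keep one row per distinct syllable
s = df[["pinyin"]].drop_duplicates()

# how="cross" pairs every row of the left frame with every row of the right,
# so no temporary key column is needed (requires pandas >= 1.2)
res = s.merge(s, how="cross")

# Overlapping column names get the default _x / _y suffixes
res["combined"] = res["pinyin_x"] + res["pinyin_y"]

combined = res["combined"].tolist()
print(combined)
# ['cangcang', 'cangcao', 'caocang', 'caocao']
```

The result has the same pinyin_x, pinyin_y and combined columns as the key-based merge, so either version can be used interchangeably.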