
Issue converting github.com/*/raw/* URLs to raw.githubusercontent.com URLs using AWK


Given the following example URLs:

urls.txt

https://github.com/2RDLive/Pi-Hole/raw/master/Blacklist.txt
https://github.com/34730/asd/raw/master/adaway-export
https://github.com/568475513/secret_domain/raw/master/filter.txt
https://github.com/BlackJack8/iOSAdblockList/raw/master/Regular%20Hosts.txt
https://github.com/CipherOps/MiscHostsFiles/raw/master/MiscAdTrackingHostBlock.txt
https://github.com/DK-255/Pi-hole-list-1/raw/main/Ads-Blocklist
https://github.com/DRSDavidSoft/additional-hosts/raw/master/domains/blacklist/adservers-and-trackers.txt
https://github.com/DRSDavidSoft/additional-hosts/raw/master/domains/blacklist/unwanted-iranian.txt
https://github.com/DandelionSprout/adfilt/raw/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt
https://github.com/DavidTai780/AdGuard-Home-Private-Rules/raw/master/hosts.txt
https://github.com/DivineEngine/Profiles/raw/master/Quantumult/Filter/Guard/Advertising.list
https://github.com/Hariharann8175/Indicators-of-Compromise-IOC-/raw/master/Ransomware%20URL's
https://github.com/JumbomanXDA/host/raw/main/hosts
https://github.com/Kees1958/W3C_annual_most_used_survey_blocklist/raw/master/EU_US%2Bmost_used_ad_and_tracking_networks
https://github.com/KurzGedanke/kurzBlock/raw/master/kurzBlock.txt
https://github.com/MajkiIT/polish-ads-filter/raw/master/polish-adblock-filters/adblock.txt
https://github.com/MitaZ/Better_Filter/raw/master/Quantumult_X/Filter.list
https://github.com/MrWaste/Ad-BlockList-2019-08-31/raw/master/Pi-Hole%20BackUps/Black%20List/All%20Server%20Black%20List
https://github.com/Neo23x0/signature-base/raw/master/iocs/c2-iocs.txt
https://github.com/Pentanium/ABClientFilters/raw/master/ko/korean.txt
https://github.com/Phentora/AdguardPersonalList/raw/master/blocklist.txt
https://github.com/ShadowWhisperer/BlockLists/raw/master/Lists/Malware
https://github.com/SlashArash/adblockfa/raw/master/adblockfa.txt
https://github.com/SukkaW/Surge/raw/master/List/domainset/reject_sukka.conf
https://github.com/Th3M3/blocklists/raw/master/tracking%26ads.list
https://github.com/TonyRL/blocklist/raw/master/hosts
https://github.com/UnbendableStraw/samsungnosnooping/raw/master/README.md
https://github.com/UnluckyLuke/BlockUnderRadarJunk/raw/master/blockunderradarjunk-list.txt
https://github.com/VernonStow/Filterlist/raw/master/Filterlist.txt
https://github.com/What-Zit-Tooya/Ad-Block/raw/main/Main-Blocklist/Ad-Block-HOSTS.txt
https://github.com/XionKzn/PiHole-Lists/raw/master/PiHole/Blocklist_HOSTS.txt
https://github.com/YanFung/Ads/raw/master/Mobile
https://github.com/Yuki2718/adblock/raw/master/adguard/tracking-plus.txt
https://github.com/Yuki2718/adblock/raw/master/japanese/jp-filters.txt
https://github.com/ZYX2019/host-block-list/raw/master/Custom.txt
https://github.com/abc45628/hosts/raw/master/hosts
https://github.com/aleclee/DNS-Blacklists/raw/master/AdHosts.txt
https://github.com/angelics/pfbng/raw/master/ads/ads-domain-list.txt
https://github.com/blocklistproject/Lists/raw/master/ransomware.txt
https://github.com/cchevy/macedonian-pi-hole-blocklist/raw/master/hosts.txt
https://github.com/craiu/mobiletrackers/raw/master/list.txt
https://github.com/curutpilek12/adguard-custom-list/raw/main/custom
https://github.com/damengzhu/banad/raw/main/jiekouAD.txt
https://github.com/deletescape/noads/raw/master/lists/add-switzerland.txt
https://github.com/doadin/Pi-Hole-Blocklist/raw/main/block.list
https://github.com/dreammjow/MyFilters/raw/main/src/filters.txt
https://github.com/durablenapkin/block/raw/master/streaming.txt
https://github.com/eEIi0A5L/adblock_filter/archive/master.zip
https://github.com/easylist-thailand/easylist-thailand/raw/master/subscription/easylist-thailand.txt
https://github.com/fandagroupofficial/hosts/raw/main/pihole/ads
https://github.com/fandagroupofficial/hosts/raw/main/pihole/log
https://github.com/fandagroupofficial/hosts/raw/main/pihole/trackers
https://github.com/faralai/Pihole-Rules/raw/master/Fara-Popups_Head
https://github.com/faralai/Pihole-Rules/raw/master/Fara-Xiaomi-info
https://github.com/farrokhi/adblock-iran/raw/master/filter.txt
https://github.com/fskreuz/blocklists/raw/dev/domains.txt
https://github.com/ftpmorph/ftprivacy/raw/master/regex-blocklists/smartphone-and-general-ads-analytics-regex-blocklist-ftprivacy.txt
https://github.com/hell-sh/Evil-Domains/raw/master/evil-domains.txt
https://github.com/hosts-file/BulgarianHostsFile/raw/master/bhf.txt
https://github.com/igorskyflyer/ad-void/raw/main/AdVoid.Core.txt
https://github.com/jackrabbit335/UsefulLinuxShellScripts/raw/master/Hosts%20%26%20sourcelist/blacklist.txt
https://github.com/jakdev121/AMS2/raw/master/pi_indo_ads.txt
https://github.com/jakejarvis/ios-trackers/raw/master/blocklist.txt
https://github.com/jasirfayas/jBlocklist/raw/master/domains.lst
https://github.com/javabean/dnsmasq-antispy/raw/master/dnsmasq.ghostery_bugs.conf
https://github.com/javabean/dnsmasq-antispy/raw/master/dnsmasq.zz-extra-servers-manual.conf
https://github.com/jdlingyu/ad-wars/raw/master/hosts
https://github.com/jlonborg/piblacklist/raw/main/blacklist.txt
https://github.com/joaopinto14/PiHole/raw/main/adverts
https://github.com/kang49/kang49regexblacklistproject/raw/main/blacklist
https://github.com/lesong/Surge/raw/main/rule/BanProgramAD.list
https://github.com/lhie1/Rules/raw/master/Auto/REJECT.conf
https://github.com/mayesidevel/PiHoleLists/raw/master/MiscBlocklist
https://github.com/meinhimmel/hosts/raw/master/hosts
https://github.com/mhhakim/pihole-blocklist/raw/master/custom-blocklist.txt
https://github.com/migueldemoura/ublock-umatrix-rulesets/raw/master/Hosts/ads-tracking
https://github.com/minoplhy/filters/raw/main/Resources/blocked.txt
https://github.com/monojp/hosts_merge/raw/master/hosts_blacklist.txt
https://github.com/mtbnunu/ad-blocklist/raw/master/kr-list.txt
https://github.com/mtxadmin/ublock/raw/master/hosts/_telemetry
https://github.com/mullvad/dns-adblock/raw/main/lists/doh/adblock/custom
https://github.com/muxcc/AdsBlockLists/raw/master/aumm.hosts
https://github.com/nimasaj/uBOPa/raw/master/uBOPa.txt
https://github.com/notracking/hosts-blocklists/raw/master/dnscrypt-proxy/dnscrypt-proxy.blacklist.txt
https://github.com/npljy/npljy.github.io/raw/main/blocks/dns.txt
https://github.com/npljy/npljy.github.io/raw/main/blocks/filter.txt
https://github.com/olegwukr/polish-privacy-filters/raw/master/adblock.txt
https://github.com/parseword/nolovia/raw/master/skel/hosts-government-malware.txt
https://github.com/parseword/nolovia/raw/master/skel/hosts-nolovia.txt
https://github.com/pathforwardit/BlockList/raw/main/DomainList
https://github.com/pirat28/IHateTracker/raw/master/iHateTracker.txt
https://github.com/sa-ki13/jmsf/raw/master/japanese_mobile_site_dns_filter.txt
https://github.com/saurane/Turkish-Blocklist/raw/master/Blocklist/domains.txt
https://github.com/scomper/surge-list/raw/master/reject.list
https://github.com/sirsunknight/QuantumultX/raw/master/Filter/Radical-Advertising
https://github.com/smed79/blacklist/raw/master/hosts.txt
https://github.com/soteria-nou/domain-list/archive/master.zip
https://github.com/stamparm/maltrail/raw/master/trails/static/suspicious/pua.txt
https://github.com/sutchan/dnsmasq_ads_filter/raw/main/dnsmasq-ads-filter-list.txt
https://github.com/svetlyobg/svet-custom-domains/raw/master/ads-domains
https://github.com/tomzuu/blacklist-named/raw/master/ad.sites.conf
https://github.com/tomzuu/blacklist-named/raw/master/phishing.sites.conf
https://github.com/tomzuu/blacklist-named/raw/master/pushing.sites.conf
https://github.com/uBlockOrigin/uAssets/raw/master/filters/badware.txt
https://github.com/uBlockOrigin/uAssets/raw/master/filters/filters.txt
https://github.com/uBlockOrigin/uAssets/raw/master/filters/privacy.txt
https://github.com/unchartedsky/adguard-kr/raw/master/adguard-kr.txt
https://github.com/unflac/adFILTER/raw/master/filter.txt
https://github.com/vokins/ad/raw/main/ad.list
https://github.com/willianreis89/ADsBlock/raw/master/list.txt
https://github.com/wrysunny/ad_list/raw/master/adlist.txt
https://github.com/xOS/Config/raw/Her/Surge/RuleSet/Advertising.list
https://github.com/xinggsf/Adblock-Plus-Rule/raw/master/rule.txt
https://github.com/xlimit91/xlimit91-block-list/raw/master/blacklist.txt
https://github.com/xylagbx/ADBLOCK/raw/master/BLOCK/customadblockdomain.txt
https://github.com/ziozzang/adguard/raw/master/filter.txt
https://github.com/zznidar/BAR/raw/master/BAR-list

I'm using this command:

awk 'BEGIN{FS=OFS="/"}{if ($6~/^raw$/){$3="raw.githubusercontent.com"; for(i=0;i<=NF;++i) if (i!=6) {printf("%s%s",$i,(i==NF)?"\n":OFS)}}}' urls.txt

To produce this desired output:

https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt
https://raw.githubusercontent.com/34730/asd/master/adaway-export
https://raw.githubusercontent.com/568475513/secret_domain/master/filter.txt
https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/master/Regular%20Hosts.txt
...

But it yields this output:

https://raw.githubusercontent.com/2RDLive/Pi-Hole/raw/master/Blacklist.txt/https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt
https://raw.githubusercontent.com/34730/asd/raw/master/adaway-export/https://raw.githubusercontent.com/34730/asd/master/adaway-export
https://raw.githubusercontent.com/568475513/secret_domain/raw/master/filter.txt/https://raw.githubusercontent.com/568475513/secret_domain/master/filter.txt
https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/raw/master/Regular%20Hosts.txt/https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/master/Regular%20Hosts.txt
...

Why is it printing a semblance of the original URL before the correct output?


Here is the above code formatted legibly with gawk -o-:

BEGIN {
    FS = OFS = "/"
}

{
    if ($6 ~ /^raw$/) {
        $3 = "raw.githubusercontent.com"
        for (i = 0; i <= NF; ++i) {
            if (i != 6) {
                printf "%s%s", $i, (i == NF) ? "\n" : OFS
            }
        }
    }
}

Solution

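  • Your only real problem is that awk fields, string positions, and arrays populated by split() all start at 1, not 0, so your loop should start at 1, not 0. As written, i is 0 on the first pass through your loop, so printf "%s%s", $i, ... prints $0, the whole record (already carrying the new $3 but still containing raw), followed by OFS. The remaining passes then print the fields you actually wanted, which is why each output line is the modified original URL, a /, and then the correct URL.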

    Having said that, I think what you want is the following with a couple of other things tidied up:

    $ cat tst.awk
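    BEGIN { FS=OFS="/" }
    # If field 6 is exactly "raw", turn it into RS, then delete "/" RS from $0.
    sub(/^raw$/,RS,$6) && sub(OFS RS,"") {
        $3 = "raw.githubusercontent.com"    # replace the host name
        print
    }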
    

    $ awk -f tst.awk urls.txt
    https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt
    https://raw.githubusercontent.com/34730/asd/master/adaway-export
    https://raw.githubusercontent.com/568475513/secret_domain/master/filter.txt
    https://raw.githubusercontent.com/BlackJack8/iOSAdblockList/master/Regular%20Hosts.txt
    https://raw.githubusercontent.com/CipherOps/MiscHostsFiles/master/MiscAdTrackingHostBlock.txt
    https://raw.githubusercontent.com/DK-255/Pi-hole-list-1/main/Ads-Blocklist
    https://raw.githubusercontent.com/DRSDavidSoft/additional-hosts/master/domains/blacklist/adservers-and-trackers.txt
    https://raw.githubusercontent.com/DRSDavidSoft/additional-hosts/master/domains/blacklist/unwanted-iranian.txt
    https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate%20versions%20Anti-Malware%20List/AntiMalwareHosts.txt
    https://raw.githubusercontent.com/DavidTai780/AdGuard-Home-Private-Rules/master/hosts.txt
    https://raw.githubusercontent.com/DivineEngine/Profiles/master/Quantumult/Filter/Guard/Advertising.list
    https://raw.githubusercontent.com/Hariharann8175/Indicators-of-Compromise-IOC-/master/Ransomware%20URL's
    https://raw.githubusercontent.com/JumbomanXDA/host/main/hosts
    https://raw.githubusercontent.com/Kees1958/W3C_annual_most_used_survey_blocklist/master/EU_US%2Bmost_used_ad_and_tracking_networks
    https://raw.githubusercontent.com/KurzGedanke/kurzBlock/master/kurzBlock.txt
    https://raw.githubusercontent.com/MajkiIT/polish-ads-filter/master/polish-adblock-filters/adblock.txt
    https://raw.githubusercontent.com/MitaZ/Better_Filter/master/Quantumult_X/Filter.list
    https://raw.githubusercontent.com/MrWaste/Ad-BlockList-2019-08-31/master/Pi-Hole%20BackUps/Black%20List/All%20Server%20Black%20List
    https://raw.githubusercontent.com/Neo23x0/signature-base/master/iocs/c2-iocs.txt
    https://raw.githubusercontent.com/Pentanium/ABClientFilters/master/ko/korean.txt
    https://raw.githubusercontent.com/Phentora/AdguardPersonalList/master/blocklist.txt
    https://raw.githubusercontent.com/ShadowWhisperer/BlockLists/master/Lists/Malware
    https://raw.githubusercontent.com/SlashArash/adblockfa/master/adblockfa.txt
    https://raw.githubusercontent.com/SukkaW/Surge/master/List/domainset/reject_sukka.conf
    https://raw.githubusercontent.com/Th3M3/blocklists/master/tracking%26ads.list
    https://raw.githubusercontent.com/TonyRL/blocklist/master/hosts
    https://raw.githubusercontent.com/UnbendableStraw/samsungnosnooping/master/README.md
    https://raw.githubusercontent.com/UnluckyLuke/BlockUnderRadarJunk/master/blockunderradarjunk-list.txt
    https://raw.githubusercontent.com/VernonStow/Filterlist/master/Filterlist.txt
    https://raw.githubusercontent.com/What-Zit-Tooya/Ad-Block/main/Main-Blocklist/Ad-Block-HOSTS.txt
    https://raw.githubusercontent.com/XionKzn/PiHole-Lists/master/PiHole/Blocklist_HOSTS.txt
    https://raw.githubusercontent.com/YanFung/Ads/master/Mobile
    https://raw.githubusercontent.com/Yuki2718/adblock/master/adguard/tracking-plus.txt
    https://raw.githubusercontent.com/Yuki2718/adblock/master/japanese/jp-filters.txt
    https://raw.githubusercontent.com/ZYX2019/host-block-list/master/Custom.txt
    https://raw.githubusercontent.com/abc45628/hosts/master/hosts
    https://raw.githubusercontent.com/aleclee/DNS-Blacklists/master/AdHosts.txt
    https://raw.githubusercontent.com/angelics/pfbng/master/ads/ads-domain-list.txt
    https://raw.githubusercontent.com/blocklistproject/Lists/master/ransomware.txt
    https://raw.githubusercontent.com/cchevy/macedonian-pi-hole-blocklist/master/hosts.txt
    https://raw.githubusercontent.com/craiu/mobiletrackers/master/list.txt
    https://raw.githubusercontent.com/curutpilek12/adguard-custom-list/main/custom
    https://raw.githubusercontent.com/damengzhu/banad/main/jiekouAD.txt
    https://raw.githubusercontent.com/deletescape/noads/master/lists/add-switzerland.txt
    https://raw.githubusercontent.com/doadin/Pi-Hole-Blocklist/main/block.list
    https://raw.githubusercontent.com/dreammjow/MyFilters/main/src/filters.txt
    https://raw.githubusercontent.com/durablenapkin/block/master/streaming.txt
    https://raw.githubusercontent.com/easylist-thailand/easylist-thailand/master/subscription/easylist-thailand.txt
    https://raw.githubusercontent.com/fandagroupofficial/hosts/main/pihole/ads
    https://raw.githubusercontent.com/fandagroupofficial/hosts/main/pihole/log
    https://raw.githubusercontent.com/fandagroupofficial/hosts/main/pihole/trackers
    https://raw.githubusercontent.com/faralai/Pihole-Rules/master/Fara-Popups_Head
    https://raw.githubusercontent.com/faralai/Pihole-Rules/master/Fara-Xiaomi-info
    https://raw.githubusercontent.com/farrokhi/adblock-iran/master/filter.txt
    https://raw.githubusercontent.com/fskreuz/blocklists/dev/domains.txt
    https://raw.githubusercontent.com/ftpmorph/ftprivacy/master/regex-blocklists/smartphone-and-general-ads-analytics-regex-blocklist-ftprivacy.txt
    https://raw.githubusercontent.com/hell-sh/Evil-Domains/master/evil-domains.txt
    https://raw.githubusercontent.com/hosts-file/BulgarianHostsFile/master/bhf.txt
    https://raw.githubusercontent.com/igorskyflyer/ad-void/main/AdVoid.Core.txt
    https://raw.githubusercontent.com/jackrabbit335/UsefulLinuxShellScripts/master/Hosts%20%26%20sourcelist/blacklist.txt
    https://raw.githubusercontent.com/jakdev121/AMS2/master/pi_indo_ads.txt
    https://raw.githubusercontent.com/jakejarvis/ios-trackers/master/blocklist.txt
    https://raw.githubusercontent.com/jasirfayas/jBlocklist/master/domains.lst
    https://raw.githubusercontent.com/javabean/dnsmasq-antispy/master/dnsmasq.ghostery_bugs.conf
    https://raw.githubusercontent.com/javabean/dnsmasq-antispy/master/dnsmasq.zz-extra-servers-manual.conf
    https://raw.githubusercontent.com/jdlingyu/ad-wars/master/hosts
    https://raw.githubusercontent.com/jlonborg/piblacklist/main/blacklist.txt
    https://raw.githubusercontent.com/joaopinto14/PiHole/main/adverts
    https://raw.githubusercontent.com/kang49/kang49regexblacklistproject/main/blacklist
    https://raw.githubusercontent.com/lesong/Surge/main/rule/BanProgramAD.list
    https://raw.githubusercontent.com/lhie1/Rules/master/Auto/REJECT.conf
    https://raw.githubusercontent.com/mayesidevel/PiHoleLists/master/MiscBlocklist
    https://raw.githubusercontent.com/meinhimmel/hosts/master/hosts
    https://raw.githubusercontent.com/mhhakim/pihole-blocklist/master/custom-blocklist.txt
    https://raw.githubusercontent.com/migueldemoura/ublock-umatrix-rulesets/master/Hosts/ads-tracking
    https://raw.githubusercontent.com/minoplhy/filters/main/Resources/blocked.txt
    https://raw.githubusercontent.com/monojp/hosts_merge/master/hosts_blacklist.txt
    https://raw.githubusercontent.com/mtbnunu/ad-blocklist/master/kr-list.txt
    https://raw.githubusercontent.com/mtxadmin/ublock/master/hosts/_telemetry
    https://raw.githubusercontent.com/mullvad/dns-adblock/main/lists/doh/adblock/custom
    https://raw.githubusercontent.com/muxcc/AdsBlockLists/master/aumm.hosts
    https://raw.githubusercontent.com/nimasaj/uBOPa/master/uBOPa.txt
    https://raw.githubusercontent.com/notracking/hosts-blocklists/master/dnscrypt-proxy/dnscrypt-proxy.blacklist.txt
    https://raw.githubusercontent.com/npljy/npljy.github.io/main/blocks/dns.txt
    https://raw.githubusercontent.com/npljy/npljy.github.io/main/blocks/filter.txt
    https://raw.githubusercontent.com/olegwukr/polish-privacy-filters/master/adblock.txt
    https://raw.githubusercontent.com/parseword/nolovia/master/skel/hosts-government-malware.txt
    https://raw.githubusercontent.com/parseword/nolovia/master/skel/hosts-nolovia.txt
    https://raw.githubusercontent.com/pathforwardit/BlockList/main/DomainList
    https://raw.githubusercontent.com/pirat28/IHateTracker/master/iHateTracker.txt
    https://raw.githubusercontent.com/sa-ki13/jmsf/master/japanese_mobile_site_dns_filter.txt
    https://raw.githubusercontent.com/saurane/Turkish-Blocklist/master/Blocklist/domains.txt
    https://raw.githubusercontent.com/scomper/surge-list/master/reject.list
    https://raw.githubusercontent.com/sirsunknight/QuantumultX/master/Filter/Radical-Advertising
    https://raw.githubusercontent.com/smed79/blacklist/master/hosts.txt
    https://raw.githubusercontent.com/stamparm/maltrail/master/trails/static/suspicious/pua.txt
    https://raw.githubusercontent.com/sutchan/dnsmasq_ads_filter/main/dnsmasq-ads-filter-list.txt
    https://raw.githubusercontent.com/svetlyobg/svet-custom-domains/master/ads-domains
    https://raw.githubusercontent.com/tomzuu/blacklist-named/master/ad.sites.conf
    https://raw.githubusercontent.com/tomzuu/blacklist-named/master/phishing.sites.conf
    https://raw.githubusercontent.com/tomzuu/blacklist-named/master/pushing.sites.conf
    https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/badware.txt
    https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/filters.txt
    https://raw.githubusercontent.com/uBlockOrigin/uAssets/master/filters/privacy.txt
    https://raw.githubusercontent.com/unchartedsky/adguard-kr/master/adguard-kr.txt
    https://raw.githubusercontent.com/unflac/adFILTER/master/filter.txt
    https://raw.githubusercontent.com/vokins/ad/main/ad.list
    https://raw.githubusercontent.com/willianreis89/ADsBlock/master/list.txt
    https://raw.githubusercontent.com/wrysunny/ad_list/master/adlist.txt
    https://raw.githubusercontent.com/xOS/Config/Her/Surge/RuleSet/Advertising.list
    https://raw.githubusercontent.com/xinggsf/Adblock-Plus-Rule/master/rule.txt
    https://raw.githubusercontent.com/xlimit91/xlimit91-block-list/master/blacklist.txt
    https://raw.githubusercontent.com/xylagbx/ADBLOCK/master/BLOCK/customadblockdomain.txt
    https://raw.githubusercontent.com/ziozzang/adguard/master/filter.txt
    https://raw.githubusercontent.com/zznidar/BAR/master/BAR-list
    

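    The only slightly tricky part in that is sub(/^raw$/,RS,$6) && sub(OFS RS,""), which is how you remove a mid-record field in awk. First, replace the field's contents with a string that cannot already be present in the record. RS is a safe choice because awk has already split the input on it, and we can use it directly here because it is a plain string (\n) rather than a regexp. That changes raw to \n in the 6th field, so the record now contains /\n/. The second sub() then deletes /\n from the record, removing the 6th field together with the / that preceded it. If the 6th field is not exactly raw, the first sub() returns 0 and the action is skipped, so lines such as the archive/master.zip URLs produce no output.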
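
To see the field-removal idiom in isolation, the sketch below applies the same two sub() calls to a single URL and prints the record after each step. The function name show_steps and the <NL> marker (used only to make the embedded newline visible) are illustrative choices, not part of the original answer, and the default RS of "\n" is assumed.

```shell
# show_steps: hypothetical helper that prints the record after each sub() call.
show_steps() {
    awk 'BEGIN { FS = OFS = "/" }
    {
        # Step 1: replace field 6 with RS ("\n" by default). Assigning to a
        # field rebuilds $0 with OFS, so $0 now contains "/\n/".
        sub(/^raw$/, RS, $6)
        shown = $0
        gsub(RS, "<NL>", shown)    # display copy only; $0 is left untouched
        print "step 1: " shown

        # Step 2: delete OFS followed by RS from $0. This drops field 6 and
        # the separator in front of it, and awk re-splits $0 afterwards.
        sub(OFS RS, "")
        print "step 2: " $0 " (NF=" NF ")"
    }'
}

printf '%s\n' 'https://github.com/vokins/ad/raw/main/ad.list' | show_steps
```

For that input the two lines printed are "step 1: https://github.com/vokins/ad/<NL>/main/ad.list" and "step 2: https://github.com/vokins/ad/main/ad.list (NF=7)", so the record has gone from 8 fields to 7 without any loop over the fields.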
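
Returning to the off-by-one noted at the start of this answer: the following is a minimal sketch of the question's own loop with the start index changed from 0 to 1 and nothing else altered in its logic. The wrapper function fix_loop is a hypothetical name, and two sample URLs are fed in with printf so the example does not depend on urls.txt.

```shell
# fix_loop: hypothetical wrapper around the original one-liner, loop fixed.
fix_loop() {
    awk 'BEGIN { FS = OFS = "/" }
    $6 == "raw" {
        $3 = "raw.githubusercontent.com"
        # Fields are numbered from 1; $0 is the whole record, so start at 1.
        for (i = 1; i <= NF; ++i)
            if (i != 6)
                printf "%s%s", $i, (i == NF ? "\n" : OFS)
    }'
}

printf '%s\n' \
    'https://github.com/2RDLive/Pi-Hole/raw/master/Blacklist.txt' \
    'https://github.com/eEIi0A5L/adblock_filter/archive/master.zip' |
fix_loop
```

This prints only https://raw.githubusercontent.com/2RDLive/Pi-Hole/master/Blacklist.txt, because the archive/master.zip line fails the $6 == "raw" test. The loop version works, but it has to special-case the last field to emit the newline, which is one reason the sub()-based script is the tidier choice.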