I get lots of "Cannot resolve destination host" and "Cannot connect to destination host" errors in my mobile Unity3D app, which runs on iOS and Android.
Even though it worked a few seconds earlier, it starts getting these errors out of nowhere.
I try to reconnect to the servers. I have several backup servers, but all of them start to give these errors.
I use https, http, and plain IP addresses to connect to the server. I also have a retry function that tries 100 times with a 0.1s cooldown to reach any of these addresses.
What could be the problem?
UPD1:
About "a few seconds earlier": the app makes several consecutive requests (settings, payment settings, available content). It can receive the settings without a problem but then start getting errors on the payment settings (just a JSON file).
UPD2:
Yes, iOS and Android. I could not reproduce it on my local PC or on my phone. It's a rather random error, maybe dependent on the player's country.
UPD3: Several backup servers are needed because there is a problem with connectivity around the world. I use Bunny CDN, but it has no servers in specific locations. I also use backup servers because there are times when some of them might have outages.
30s is rather unrealistic for my app. It's a game, and players usually won't wait that long.
About DDOS - Thanks! I think it might be it. Could you please provide any info where I can find more specifics? Like what hardware can do this prevention, ISP, or mobile phone itself.
Otherwise, I greatly appreciate your comments. Please post this as an answer and I will mark it as the solution, if you don't mind.
I'll summarize everything here so it's in one place and easier to follow. Keep in mind, however, that the actual discussion would be longer.
Because your code works at some points but not at others, there can be multiple reasons, and it is fairly difficult to tell without seeing more diagnostic info or the actual code. Still, here is some advice.
There is nothing a developer hates more than a bug with attitude: one that is hard to reproduce and thus hard to fix, since it can't be debugged on every occasion. In your case, check whether the unreachability happens after a certain amount of time or a certain number of calls. Also note how long it takes before you are allowed to connect again after being refused (more on this later). Compare the times. If they are consistent, the problem might be server-side protection (more on this later). If they are random, the list of possible reasons expands.
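As a rough illustration of "compare the times", here is a minimal sketch in Python (used only for brevity; the same logic ports to C#). It assumes you log each request as a hypothetical `(timestamp_seconds, ok)` pair; the function names and the 20% tolerance are my own assumptions, not anything from your code.

```python
from statistics import mean, pstdev

def outage_durations(events):
    """events: chronological list of (timestamp_seconds, ok) tuples.
    Returns how long each outage lasted: first failure until the next success."""
    durations = []
    outage_start = None
    for timestamp, ok in events:
        if not ok and outage_start is None:
            outage_start = timestamp          # first failure of a new outage
        elif ok and outage_start is not None:
            durations.append(timestamp - outage_start)
            outage_start = None               # outage is over
    return durations

def looks_consistent(durations, tolerance=0.2):
    """True if the outage durations vary by at most `tolerance` relative to their mean."""
    if len(durations) < 2:
        return False                          # not enough data to compare
    average = mean(durations)
    return average > 0 and pstdev(durations) / average <= tolerance
```

If the durations cluster tightly (for example, always close to one minute), a fixed-length block by some protection layer becomes a more plausible explanation than random network trouble.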
Test the setup with different services or other hosts (e.g. www.google.com).
Some service providers (Unity, Apple, Steam, Oracle etc.) offer a status page that's either public or accessible in a private dashboard in your private account. Most also provide a schedule (or at least notifications) of outages or maintenance that could cause the downtime. Last but not least, you can also contact support and ask them to confirm.
If you want to see whether the error comes from your code, building a minimal reproducible example (MRE) is one of the best methods. An MRE also helps you determine whether there are conflicts between third-party packages, libs or utilities: add one dependency at a time to an empty project until the problem shows up. As for the different environments, it is not uncommon for something to work in the Editor but not in a build, or to work on a PC but not on a mobile device or in a WebGL build, since the sandboxing rules and available system APIs differ.
Does it happen on both iOS and Android with the same frequency or does one have it more often? This can help you determine if it is either a problem with how Unity creates or deploys the build (because yes, it did have several issues in the past) or it might be some issue with an older phone or certain versions of an OS or even limitations.
Does this happen only in the retry code or also in the "normal" connect code? Similar to the previous point, check whether it happens only in some cases. The additional info you provided states that this happens only during payment calls, while others work.
When it comes to code, there is no better method than debugging to find the causes of issues (which is why it's called de-bugging). Attach a debugger, step through the code and see if you can pin down the problem. Note that because execution is suspended while debugging, you might end up with false positives (such as expired sessions or timeouts). An alternative is to print some metadata, debug info or the values of params and vars. Depending on the library and functions you use to connect to the servers, some might give more information on top of the dry "Cannot resolve destination host" and "Cannot connect to destination host" messages.
While on paper all of these sound good, they might be detrimental to your app. All three can be interpreted by the network as flood attacks and your connections can be denied. (more on this later)
While debugging with Wireshark might be useful, it can prove a little complicated. Even simple tests like ping, tracert, telnet or curl against the addresses you use in your code could help you find the causes.
Test like this to rule out possible issues on your home network as well as to test the route to the servers. Compare the previous points (when it works and when it doesn't) and switch between WiFi and Cellular.
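The two error messages map to two different stages: name resolution (DNS) and opening the connection. The sketch below is a rough Python equivalent of such a manual test, using only the standard `socket` module; the function name `diagnose` and its parameters are hypothetical, and the `resolve` / `connect` arguments exist only so the steps can be swapped out when testing.

```python
import socket

def diagnose(host, port=443, timeout=5.0,
             resolve=socket.getaddrinfo, connect=socket.create_connection):
    """Report whether `host` fails at the DNS stage or at the TCP connect stage."""
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Roughly corresponds to "Cannot resolve destination host".
        return f"resolve failed: {exc}"
    address = infos[0][4][:2]                 # (ip, port) of the first result
    try:
        connection = connect(address, timeout=timeout)
    except OSError as exc:
        # Roughly corresponds to "Cannot connect to destination host".
        return f"resolved to {address[0]} but connect failed: {exc}"
    connection.close()
    return f"ok: {address[0]}"
```

Running something like `diagnose("your-server.example")` on WiFi and then on cellular, at a moment when the app works and at a moment when it doesn't, shows which stage breaks. A resolve failure points at DNS, while a connect failure after a successful resolve points at routing, firewalls or the server itself.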
Networks have protections against flooding (the "simple" version of an attack using many attempted connections or requests), DOS (denial of service - the common attack on small servers) and DDOS (distributed DOS, the "nuclear" version of flooding, coming from multiple devices - not necessarily IPs - instead of just one). These protections can come from your ISP (home network or cellular), from the host that the server uses or even from the service provider itself (usually all of them). Some are stricter than others and can therefore block requests.
You mentioned you do up to 100 retries with a 0.1s delay. This can easily be interpreted as an attack.
Either reduce the number of retries or increase the delay (both are recommended) and try to find a balance.
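One common way to find that balance is exponential backoff with jitter: few retries, with a delay that grows after each failure. The sketch below is in Python only for brevity, and every number in it (5 retries, 0.5s base, 8s cap) is an assumption to tune for your game, not a recommended value. `request` stands for whatever call your app makes.

```python
import random
import time

def backoff_delays(max_retries=5, base=0.5, cap=8.0, jitter=random.random):
    """Yield one wait time per retry: exponential growth capped at `cap`,
    scaled by a random factor in [0, 1) so clients don't retry in lockstep."""
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield ceiling * jitter()

def call_with_retry(request, sleep=time.sleep, max_retries=5, **backoff_options):
    """Call `request()`; on a network error wait, then retry up to `max_retries` times."""
    delays = list(backoff_delays(max_retries, **backoff_options))
    for attempt in range(max_retries + 1):
        try:
            return request()
        except OSError:
            if attempt == max_retries:
                raise                          # out of retries: surface the error
            sleep(delays[attempt])
```

Compared with 100 attempts at a fixed 0.1s, this sends at most 6 requests, spaced further and further apart, which is far less likely to look like a flood.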
If you can't modify the delays and you say you depend on them, there might also be a design problem in your game. In general, while connections can be persistent, you shouldn't need to call services non-stop or very often. One alternative is to queue the calls. For example, if the objective of the game is to gather coins and convert them into $$, change the app workflow so that instead of doing it in real time on every coin pickup, you do it at the end of a level, every 50 coins or every 5 minutes.
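The queueing idea can be sketched as follows, again in Python for brevity. The class name `BatchedReporter`, the `send` callback and the thresholds (50 items, 300 seconds) are hypothetical and simply mirror the coin example above.

```python
class BatchedReporter:
    """Collect events locally and send them in one request per batch."""

    def __init__(self, send, max_items=50, max_age=300.0):
        self.send = send              # callable that uploads a list of events
        self.max_items = max_items    # flush after this many events...
        self.max_age = max_age        # ...or once the oldest is this old (seconds)
        self.pending = []
        self.oldest = None

    def add(self, item, now):
        """Queue one event; `now` is the current time in seconds."""
        if not self.pending:
            self.oldest = now
        self.pending.append(item)
        if len(self.pending) >= self.max_items or now - self.oldest >= self.max_age:
            self.flush()

    def flush(self):
        """Send everything queued so far, e.g. at the end of a level."""
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
```

Note that this simple version only checks the age when a new event arrives, so you would still call `flush()` explicitly at natural points such as the end of a level or when the app is paused.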
Last but not least, talk to the service provider where you call those web-services and ask for a solution from their Technical Client Support.
Does the payment call always result in errors, or only sometimes? Make sure you call it properly and that there are no limitations; check the docs or ask support for help. We can't help without seeing the code.
Maybe random, maybe dependent on the country, but don't assume. The worst approach when making apps (including games) is making assumptions. Work only with facts. They could be true, but if they're not, you'd be wasting time with possible solutions for a problem that doesn't exist and in the end the solutions would not only end up being worthless, but add unnecessary complications or overhead to your code (not to mention delays before you deploy it live).
Well, I find this use case very strange. First of all, international data centers need to be linked so that data gets replicated; in other words, if a server, zone or data center fails, it should not be your problem to check that. Secondly, if they do fail, it should not be you who writes the code to handle it; the provider should have some routing logic in their mesh network. All of this should be transparent to you and not need any code. Thirdly, the proper way to offer public access to clients is via clusters, so when an app tries to connect, it should automatically be redirected either to the closest cluster or to the closest online cluster (if the very closest one is unavailable because it reached its maximum number of connections or is offline).
Moreover, instead of using multiple connections all the time (that's what the way you described it sounds like), maybe connect to a server and if that connection (or the current request on that connection) fails, only then connect to a different backup server. There is no reason to hold 50 connections up "just in case you need them later".
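A minimal sketch of this "backup only on failure" approach, in Python for brevity. The function name `fetch_with_failover`, the server list and the `fetch` callback are hypothetical placeholders for your own request code.

```python
def fetch_with_failover(servers, fetch):
    """Try each server in order and return (server, response) from the first
    one that answers. A backup is contacted only after the previous one failed."""
    errors = []
    for server in servers:
        try:
            return server, fetch(server)
        except OSError as exc:                 # network-level failure
            errors.append((server, exc))       # remember it for diagnostics
    raise ConnectionError(f"all servers failed: {errors}")
```

With this shape, a healthy primary server means exactly one connection per request, and the collected `errors` list tells you which servers failed and why when none of them respond.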
Search online for DDOS or DOS (not the same thing, but closely related). As for what software detects this, that's irrelevant. It can be anything from a WAF to an XDR, but knowing what they are will not help you. There are obviously a lot of apps that use content from the cloud or IAP, so it's clear that the general approach you use is valid. The only likely causes are using the code incorrectly or bad app design.