I'd like to download some (I mean a lot of) video ads from YouTube (not the videos, but the ads that are played before them) for research purposes.
Any suggestions for tools or ways this can be achieved? I don't mind doing some programming myself, but I currently have no idea how. Also, YouTube is not a requirement; video ads from other video sites work for me too.
When I found your question, I thought it would be fun to play with, as I recently developed a site (http://savedeo.com) that allows you to download videos from many sites, including YouTube.
I looked only at YouTube's ad system; this will most likely not work for other systems. The good thing about YouTube is that you can get all the necessary information directly from a video page, so it's easy to crawl really fast (I downloaded almost 22M video pages in a single day on a very small server). The part you are looking for is ;ytplayer.config = {(.*?)}; which contains an inline JSON object (easy to deal with). You will not need anything else.
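As a rough sketch of that extraction step, the pattern above can be applied with Python's standard library. The function name `extract_player_config` is made up for this example, and the code assumes the page still embeds the object in exactly this form.

```python
import json
import re

# Pattern taken from the answer; DOTALL lets the object span several lines.
CONFIG_RE = re.compile(r";ytplayer\.config = (\{.*?\});", re.DOTALL)


def extract_player_config(html):
    """Return the ytplayer.config object as a dict, or None if it is absent."""
    match = CONFIG_RE.search(html)
    if match is None:
        return None
    # The captured group is plain JSON, so the json module can parse it.
    return json.loads(match.group(1))
```

The non-greedy match stops at the first `};`, so a string value containing that sequence would truncate the object; in that case `json.loads` raises an error rather than returning wrong data.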
If you don't want to parse it from the HTML, you can get just the JSON object directly by adding the parameter &spf=prefetch to the end of any YouTube video link, e.g. https://www.youtube.com/watch?v=bbEoRnaOIbs&spf=prefetch
Not every YouTube video shows ads (from my statistics, only 18% of videos actually do). You can verify whether ads are enabled for a video just by looking for the ad_tag key inside the JSON object.
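A minimal sketch of that check follows. The answer does not say how deeply the key is nested, so this example searches the whole object recursively; `find_key` and `has_ads` are hypothetical helper names.

```python
def find_key(node, key):
    """Depth-first search for `key` in nested dicts and lists.

    Returns the first non-null value found, or None if the key is missing.
    """
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def has_ads(config):
    """True if an ad_tag key appears anywhere in the parsed JSON object."""
    return find_key(config, "ad_tag") is not None
```

The same `find_key` helper could be reused for any other key in the object.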
YouTube uses three different ad systems (AdSense, ADX, and GDFP) to serve ads from two different sources:
- a video uploaded to YouTube and used as an ad (mostly only part of the video is shown)
- a video from an external source (3rd party ad server)
The starting point for all of them is the same. Locate the dynamic_allocation_ad_tag key inside the JSON object. It contains a URL leading to a DoubleClick server. This URL will not work until you replace the part sz=WIDTHxHEIGHT; with real size values, e.g. sz=480x70,480x360,480x361;
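The placeholder substitution can be sketched like this. The default sizes are the ones from the example above; the function name `fill_ad_sizes` is an assumption of this sketch.

```python
def fill_ad_sizes(url, sizes=("480x70", "480x360", "480x361")):
    """Replace the sz=WIDTHxHEIGHT; placeholder with a comma-separated size list."""
    placeholder = "sz=WIDTHxHEIGHT;"
    if placeholder not in url:
        # Fail loudly instead of requesting a URL that is known not to work.
        raise ValueError("size placeholder not found in ad tag URL")
    return url.replace(placeholder, "sz=" + ",".join(sizes) + ";")
```

Called on a made-up tag such as `http://ad.example/tag;sz=WIDTHxHEIGHT;kvid=abc`, it returns the same URL with `sz=480x70,480x360,480x361;` in place of the placeholder.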
You also want to locate three other keys in the same JSON object: tpas_partner_id, tpas_video_id, and video_id (the video_id from the URL), as these will be used for the 3rd party ad system.
Now you can hit the DoubleClick URL, which will return an XML file containing information about the ad that will be served for this video. The whole file is interesting and full of important information (so you should probably store it with the video). Look for these three keys: AdSystem, AdTitle, and Description.
If the ad is served from the AdSense system (either AdSense or ADX), this XML contains all the information for the ad, including the duration and the direct link to the ad. The link is exactly what you are looking for, and you can find it under the key MediaFile.
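Pulling those fields out of the response could look roughly like the sketch below. It assumes the four keys are un-namespaced XML elements located somewhere under the root; the exact nesting is not given in this answer, so the code searches the whole tree. `summarize_ad` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def summarize_ad(xml_text):
    """Collect AdSystem, AdTitle, Description and all MediaFile links."""
    root = ET.fromstring(xml_text)

    def text_of(tag):
        # Find the first descendant element with this tag, wherever it is.
        node = root.find(".//" + tag)
        if node is None or node.text is None:
            return None
        return node.text.strip()

    return {
        "system": text_of("AdSystem"),
        "title": text_of("AdTitle"),
        "description": text_of("Description"),
        # An ad may offer several renditions, so keep every link.
        "media_files": [
            m.text.strip() for m in root.iter("MediaFile")
            if m.text and m.text.strip()
        ],
    }
```

Storing the raw `xml_text` next to this summary keeps the rest of the ad data available for later.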
The link mostly looks like this: http://www.youtube.com/get_video?video_id=LCeDi-d5CRg&ts=1391921207&t=CyJEI0XYwJVJEYE5CVhqY-DF3KQ&gad=1
and it redirects you to the real file in MP4 format. If the ad system is ADX, you will get a direct link, mostly to an FLV file, e.g. http://playtime.tubemogul.com/ad_promoted_videos/4799351_dhxsYlMYHmLMmxL0oBem_1390593897.flv
If the ad is served from the 3rd party system, GDFP, you have to call a different server. For 3rd party ads YouTube uses the FreeWheel service. To obtain the ad data, you have to prepare an XML request, which looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<adRequest profile="{profile}" networkId="10613" version="1">
  <capabilities>
    <expectMultipleCreativeRenditions />
    <supportsAdUnitInMultipleSlots />
    <supportsSlotCallback />
    <supportNullCreative />
    <supportAdBundle />
    <supportsFallbackAds />
    <autoEventTracking />
    <requiresRendererManifest />
    <requiresVideoCallbackUrl />
  </capabilities>
  <visitor caller="AS3-5.6.0-r9954-1305270957">
    <httpHeaders>
      <httpHeader value="https://www.youtube.com/watch?v={video_id}" name="referer" />
      <httpHeader value="12,0,0,38" name="x-flash-version" />
    </httpHeaders>
  </visitor>
  <keyValues>
    <keyValue key="_fw_distributorvideoassetid" value="{video_id}" />
    <keyValue key="_fw_yt_type" value="short" />
    <keyValue key="_fwu:10613:lang" value="eng" />
  </keyValues>
  <siteSection pageViewRandom="{random}" customId="youtube_watch" siteSectionNetworkId="{tpas_partner_id}">
    <videoPlayer>
      <videoAsset autoPlay="true" duration="318" videoPlayRandom="{random}" customId="{tpas_video_id}" videoAssetNetworkId="{tpas_partner_id}">
        <adSlots height="390" defaultSlotProfile="{profile}" width="699" compatibleDimensions="2560,1440">
          <temporalAdSlot height="390" adUnit="preroll" timePosition="0" customId="0_1" width="699" />
          <temporalAdSlot height="390" adUnit="overlay" timePosition="0" customId="0_2" width="699" />
        </adSlots>
      </videoAsset>
      <adSlots>
        <nonTemporalAdSlot height="60" customId="0_5" width="300" acceptCompanion="true" />
        <nonTemporalAdSlot height="250" customId="0_6" width="300" acceptCompanion="true" />
      </adSlots>
    </videoPlayer>
    <adSlots />
  </siteSection>
</adRequest>
You probably spotted multiple variables in {}. You have to replace them with custom data, mostly with the data you've obtained from the JSON object.
For {profile}, the values are 10613:10613_youtube_as3_player and 10613:youtube2. Now send this XML file as a POST request to https://2975c.v.fwmrm.net/ad/p/1? (don't forget to send it with the content type application/xml).
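Preparing that POST request might look like the sketch below, using only the standard library. Two things here are assumptions of the example, not facts from this answer: `{random}` is filled with an arbitrary integer, and each occurrence gets its own value. The name `build_ad_request` is made up, and `template` is the XML request text with its `{...}` placeholders intact.

```python
import random
import urllib.request
from xml.sax.saxutils import escape

FREEWHEEL_URL = "https://2975c.v.fwmrm.net/ad/p/1?"


def build_ad_request(template, profile, video_id, tpas_partner_id, tpas_video_id):
    """Fill the placeholders and return a ready-to-send POST request object."""
    values = {
        "{profile}": profile,
        "{video_id}": video_id,
        "{tpas_partner_id}": tpas_partner_id,
        "{tpas_video_id}": tpas_video_id,
    }
    body = template
    for placeholder, value in values.items():
        # Escape so a stray quote or ampersand cannot break an attribute.
        body = body.replace(placeholder, escape(str(value), {'"': "&quot;"}))
    # Assumption: {random} only needs to be some integer per occurrence.
    while "{random}" in body:
        body = body.replace("{random}", str(random.randint(0, 10**9)), 1)
    return urllib.request.Request(
        FREEWHEEL_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
```

The returned object can then be passed to `urllib.request.urlopen` to actually send it. Plain string replacement is used instead of `str.format` so that any other braces in the template are left alone.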
The response contains another XML file with all the necessary data for the ad, including direct links in various formats and dimensions. You can find them under the key asset. Again, you should probably store the whole file with the video, as it contains additional data for the ad.
That's it, happy hunting.