I am a big newbie in pyspark. have organized a RDD with the following code:
rdd1 = labRDD.map(lambda kv: (kv[0].split("/")[-1].split('.')[0], kv[1]))
rdd2 = rdd1.flatMapValues(lambda v: v.split('\r\n'))
rdd3 = rdd2.map(lambda kv: (kv[0], kv[0].split('_')[0], kv[1].split()[0], int(kv[1].split()[1])))
The result is ('town','shop','month','revenue') :
[('anger', 'anger', 'JAN', 13),
('marseille', 'marseille_1', 'FEB', 12),
('marseille', 'marseille_2', 'MAR', 14),
('paris', 'paris_1', 'APR', 15),...]
I am forced not to use dataframe, thus I need RDD results. I have to calculate :
Thanks in advance :)
I've found the answer to the two first ones :)
Total revenue per city per year
annual_city_rev = rdd3.map(lambda t:(t[1], t[3])).reduceByKey(lambda x,y:int(x)+int(y))
annual_city_rev.collect()
Total revenue per store per year
annual_store_revenue = rdd3.map(lambda t:(t[0], t[3])).reduceByKey(lambda x,y: int(x)+int(y))
annual_store_revenue.collect()