Data Insights into Provisioning Orders Using R and Python 使用R和Python数说Provisioning Orders

数据源来自于Oracle DB导出的2份数据文件。其中,orderList是order流水记录,二orderRequests是每个order的详细XML格式request payload。
首先,借助Python将2份文件读入Pandas Data Frame。然后遍历orderRequest利用xml.etree.ElementTree解析order包含的具体服务类型和license,并添加到orderList作为新的列。最终汇总得到19633行119维的数据集。数据清洗的部分此处先忽略。

  • pandas.DataFrame.diff
    在分析change order中license的变化时,需要比较同一客户的service license前后变化。如果用SQL,得在做完group by service之后再重新join查询得到license,然后再写DB function实现一个group里面的前后license值变化。简直如同噩梦,果断放弃。这一功能恰好可以直接在分组上使用Pandas的diff方法,只需对差值稍作转换即可得到想要的license变化类型,可谓简洁明了。
    大体的Python代码如下。
    #根据diff差值解释license是Upsell,Downsell还是Unchanged
    def licenseUpsellDownsellUnit(x):

if x > 0:
return ‘Upsell’
elif x 1:
group_df = group.sort_values(by=’ServiceOrderID’).reset_index()
group_df[serviceName+’.License Change’] = group_df[serviceName+’.TOTAL LICENSE VOLUME’].dropna().diff().apply(licenseUpsellDownsellUnit)
group = group_df.set_index(keys=’index’)
else:
group[serviceName+’.License Change’] = ‘Only One Order Found’
return group
#按照产品(这里是SiteURL)分类
def licenseUpsellDownsellAll(df_service, serviceName):
groups = df_service[[‘ServiceOrderID’,’TransactionType’,serviceTotalLicense,’OrderRecieveDate’]].groupby(df_service[‘SiteURL’])
licenseChangeGroups = groups.apply(licenseUpsellDownsellByGroup)
return licenseChangeGroups

  • 等到图形化license change数据的时候,我犯了一个小错误。
    如果我告诉你下面2张图我的本意实际想要传递同样的信息,你是不是会会对图1表示抓狂?!
    其实图1看起来凌乱,实则包含一个很重要的信息,就是同一产品的licens change频率,这点可以从曲线的长度“直观”体现。折线每次上升或者下降即表明一次Upsell与Downsell,所以折线的一段越长,代表保持的时间越长,反之,越短则表示license很快变化。而放弃曲线改为散点的图2就丢失了这部分信息。当然,图2的纵轴可以直接看到license change的百分比。所以嘛,并不是信息量越大的可视化才叫好,相反,可视化的目标不就正是希望同图表代替数据来说话吗,越少烧脑,才能更好传达信息,不是吗?
    图1
    Plot9CMRUpsellDownsell-2
    图2
    Plot9CMRUpsellDownsell-1
  • 往R dyplyr方法内传入列名变量
    这点上Python可以碾压R了。在Python里面,把DataFrame的列名作为函数入参传入是很正常的做法,但是R里面却没那么直观了。
    比如我想定义一个对order里面的服务按周分组的function,当我把serviceName传入的时候,正常使用dyplyr里面的n()和sum()来计数是不行的。这时要改用SE function, 即summarise变成summarise_。同时,由于summarise_里面既有特殊变量serviceName也有一般的totalCount和serviceBookingRatio,参考vignette(“nse”)文档使用interp。
    #Use as.name if you have a character string that gives a variable name
    interp(~ mean(var), var = as.name(“mpg”))
    #> ~mean(mpg)
    #or supply the quoted name directly
    interp(~ mean(var), var = quote(mpg))

#将Data.Frame的列名serviceName作为入参传入
plotServiceBookingTrendByWeek<- function(instanceName,serviceName) { aggregatedServices % filter(InstanceName==instanceName, TransactionType==’Provide’|TransactionType==’Trransfer’|TransactionType==’Renewal’,ProductModule!=”) %>% group_by(year,week,ProductModule) %>% summarise_(totalCount = ~n(), serviceBookingRatio = interp(~ sum(service,na.rm=TRUE)/totalCount, service = as.name(serviceName)))
… ..

  • 绘制order在世界地图上的分布图
    最简单应该是调用GoogleVis封装好的gvisGeoChart,可惜它自身的iso2c识别针对小国家不太理想,比如新加坡和香港在生成的图里面竟然没有显示。
    最终,通过引入rworldmap和countrycode这2个库成功绘制。

#Count of unique Webex Sites per each Country
webexAnnuityOrdersAggrbyCountry % filter(InstanceName==’SITE’, CountryCode != ”) %>% group_by(CountryCode) %>% summarise(WebExSitesCount = n_distinct(SiteURL)) %>% arrange(desc(WebExSitesCount))

#使用GoogleVis封装好的gvisGeoChart
webexAnnuityOrdersAggrbyCountry$CountryName <- countrycode(webexAnnuityOrdersAggrbyCountry$CountryCode,’iso2c’, ‘country.name’)
Geo=gvisGeoChart(webexAnnuityOrdersAggrbyCountry, locationvar=”CountryName”,
colorvar=”WebExSitesCount”,
options=list(projection=”kavrayskiy-vii”))
plot(Geo)

#使用rworldmap和countrycode绘制
annuityOrders_map_poly has both world map and annuityOrdersAggrbyCountry data merged together
webeAnnuityOrders_map <- joinCountryData2Map(webexAnnuityOrdersAggrbyCountry, joinCode = “ISO2”, nameJoinColumn = “CountryCode”)
webexAnnuityOrders_map_poly <- ggplot2::fortify(webeAnnuityOrders_map)
webexAnnuityOrders_map_poly <- merge(webexAnnuityOrders_map_poly, webeAnnuityOrders_map@data, by.x=”id”, by.y=”ADMIN”, all.x=TRUE)

#Ashmore and Cartier Islands作为澳大利亚的附属岛屿可以被忽略不计
webexAnnuityOrders_map_poly$WebExSitesCount[webexAnnuityOrders_map_poly$id==’Ashmore and Cartier Islands’] <- NA
webeAnnuityOrders_map@data$WebExSitesCount[webeAnnuityOrders_map@data$ADMIN==’Ashmore and Cartier Islands’] <- NA
webeAnnuityOrders_map@data[‘Text’] <- paste(webeAnnuityOrders_map@data$ADMIN,webeAnnuityOrders_map@data$WebExSitesCount,sep = ‘:’) ggplot() + coord_map(xlim = c(-180, 180), ylim = c(-60, 75)) + geom_polygon(data = webexAnnuityOrders_map_poly, aes(long, lat, group=group,fill=WebExSitesCount), color=’white’, size = 0.3) + scale_fill_gradient2(trans= ‘log’, breaks=c(0,10,100,200,4000), labels=c(‘4000′)) + xlab(NULL) + ylab(NULL) + guides(fill = guide_legend( title=’Number of Annunity WebEx Sites Around the World \n(June 1,2016 – June 1,2017)’, keywidth = 0.7, keyheight = 0.7, reverse=F, title.position=”top”)) + theme( plot.background = element_blank() ,panel.grid.major = element_blank() ,panel.grid.minor = element_blank() ,panel.border = element_blank() ,axis.ticks = element_blank() ,axis.text.x = element_blank() ,axis.text.y = element_blank() ,legend.position = “bottom” ,legend.direction = “horizontal” ) + geom_text(data=subset(webeAnnuityOrders_map@data,WebExSitesCount>0), aes(x=LON, y=LAT,label=Text), size=2.0)
最终图如下。
Plot1WorldMapofWebexSites-1

  •  Sankey绘制
    Sankey又叫RiverFlow,是一种反应不同类型转化的有效可视化形式。考查change order里面同一service的不同type的变化情况可以使用。

MCSessionSwappedCount % filter(TransactionType==’Amend’|TransactionType==’Transfer’,WEBEX.CONFERENCING.Session.Swap==’Swapped’) %>% group_by(WEBEX.CONFERENCING.FORMER.SESSION.ID,WEBEX.CONFERENCING.SESSION.ID) %>% summarise(count=n())
#这里的一个trick就是前后的名字必须不同,否则绘制会报错。简单的处理方法就是认为添加空格制造不同。
MCSessionSwappedCount$WEBEX.CONFERENCING.FORMER.SESSION.ID[MCSessionSwappedCount$WEBEX.CONFERENCING.FORMER.SESSION.ID==3] <- ‘Pro Meeting ‘
… …

MCSessionSwappedCount$WEBEX.CONFERENCING.SESSION.ID[MCSessionSwappedCount$WEBEX.CONFERENCING.SESSION.ID==3] <- ‘Pro Meeting’
… …

sk_MC <- gvisSankey(MCSessionSwappedCount, from=”WEBEX.CONFERENCING.FORMER.SESSION.ID”, to=”WEBEX.CONFERENCING.SESSION.ID”, weight=”count”)

print(sk_MC, tag = ‘chart’)

Sankey

  • Treemap绘制
    Treemap尤其适合展示多层级关系。这里我用它来显示提交order的合作商及其对应的系统账号的分布。
    treemap(subset(webexAnnuityOrdersAggrbyPartnerAccount,partnerAccountName!=’Direct’),
    index=c(“partnerName”,”partnerAccountName”),

vSize = “CCWOrderCount”,
title=”Partner Orders”,
type=”index”,
title.legend=”number of Partner orders”,
fontsize.labels=10,
align.labels = list(c(“center”, “center”), c(“right”, “bottom”)),
overlap.labels=1,
lowerbound.cex.labels=0,
inflate.labels=FALSE,
aspRatio=1.4)

Plot2PartnerTreeMap-1

当然,完整的分析还涉及到很多metric的考察,由于涉及到商业信息,就不在一一赘述了。完整的Rmd代码上传至我的Github账号。

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s