Avoid partial in multiprocessing in python

I have been using pool.map in the multiprocessing package for simple parallel jobs because of its simplicity and ease of use. However, this simplicity comes at a cost: the function f(x) to be parallelized can take only one argument. If f(x, D) needs auxiliary data D, there are a few workarounds:

1. Combine the main argument and the auxiliary data into a tuple (x, D) and pass the tuple as a single argument, i.e., f((x, D)).
2. Use functools.partial to wrap f together with the auxiliary data: g=partial(f, D=D).
3. Leave D out of the argument list and let f read D as a global variable.

It turns out that #3 is the most efficient. I had been using #2 and did not realize the difference until one day my f needed a big auxiliary data set D. In both #1 and #2, Python pickles the arguments and sends them to the workers. When D is large, the pickling takes a lot of time and the cost of data transfer is huge.
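To make the difference concrete, here is a minimal sketch of workarounds #2 and #3 side by side. The names f_partial, f_global and run are made up for illustration, D is a small stand-in for real auxiliary data, and the sketch assumes a platform where the "fork" start method is available (Linux, for example), so that workers inherit D from the parent process instead of receiving a pickled copy.

```python
import multiprocessing as mp
from functools import partial

# Hypothetical auxiliary data; in practice this would be something large.
D = list(range(100000))

def f_partial(x, D):
    """Workaround #2: D is bound with partial and pickled along with the tasks."""
    return x + D[x]

def f_global(x):
    """Workaround #3: D is read as a module-level global inside the worker."""
    return x + D[x]

def run(xs, use_global=True, processes=2):
    # Assumption: "fork" is available, so workers inherit the parent's memory.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes) as pool:
        if use_global:
            return pool.map(f_global, xs)
        return pool.map(partial(f_partial, D=D), xs)

if __name__ == "__main__":
    print(run(range(5)))
```

Both variants return the same results; only the amount of data that has to be pickled and shipped to the workers differs.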

Lesson learnt: sometimes the naive approach might be the best approach.

Downsampling large data for visualization

This code is inspired by Bokeh's datashader. I have a time series with millions of data points that I would like to visualize in a browser. But: 1. It is slow to transmit and plot so many points in JS. 2. Even if the browser is powerful enough to draw all those points, they will certainly lie on top of each other, since a computer screen has at most a few thousand pixels in one direction. The idea of datashader is to aggregate the data so that I plot at most one point per pixel. This way I make full use of the screen without losing any information visually. However, datashader is overkill for my application and not flexible enough for my situation, so I wrote the following code to do some simple downsampling of time series.


# To deal with time series, first need to convert pandas timestamp to int64
# df['time']=df.time.values.astype(np.int64)/1e6

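import numpy as np
import pandas as pd

def sampling1d(dataframe, x, y, width, xmin=None, xmax=None):
    """Aggregate column y against column x into at most `width` bins."""
    df = dataframe[[x, y]]
    # Optionally restrict the data to the visible x range.
    if xmin is not None:
        df = df[df[x] >= xmin]
    if xmax is not None:
        df = df[df[x] <= xmax]
    # One bin per horizontal pixel.
    bin_edges = np.linspace(df[x].min(), df[x].max(), width + 1)
    bins = np.searchsorted(bin_edges, df[x])
    # The smallest x gets index 0; fold it into the first bin.
    bins[bins == 0] = 1
    agg = df.groupby(bins)
    # Keep the last x of each bin plus the mean, min and max of y.
    df2 = pd.DataFrame()
    df2[x] = agg[x].max()
    df2[y + '_mean'] = agg[y].mean()
    df2[y + '_min'] = agg[y].min()
    df2[y + '_max'] = agg[y].max()
    return df2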
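The binning step is the least obvious part, so here is a small standalone sketch of it on made-up numbers (six points, two bins). It shows why index 0 has to be folded into bin 1: with its default side, np.searchsorted puts the minimum value before the first edge.

```python
import numpy as np

# Made-up x values, binned into width = 2 pixels.
x = np.array([0.0, 0.1, 2.5, 5.0, 9.9, 10.0])
width = 2

bin_edges = np.linspace(x.min(), x.max(), width + 1)  # edges at 0, 5, 10
bins = np.searchsorted(bin_edges, x)
raw_bins = bins.copy()   # [0, 1, 1, 1, 2, 2]: only the minimum lands in 0

bins[bins == 0] = 1      # fold the minimum into the first bin
print(raw_bins, bins)    # bins is now [1, 1, 1, 1, 2, 2]
```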

Here is a version for downsampling big data in 2D:


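def downsample2d(x, y, logx=False, logy=False, width=500, height=500, weights=None):
    # An int means "number of bins"; an array means "bin edges".
    # N bins need N+1 edges, so the log-spaced arrays get one extra point.
    if logx:
        binx = np.logspace(np.log10(np.min(x)), np.log10(np.max(x)), width + 1)
    else:
        binx = width
    if logy:
        biny = np.logspace(np.log10(np.min(y)), np.log10(np.max(y)), height + 1)
    else:
        biny = height

    # Count the points per cell and keep only the non-empty cells.
    z, binx2, biny2 = np.histogram2d(x, y, bins=[binx, biny])
    xi, yi = z.nonzero()
    # Convert bin edges to bin centers.
    binx2 = (binx2[:-1] + binx2[1:]) / 2
    biny2 = (biny2[:-1] + biny2[1:]) / 2

    if weights is not None:
        # Sum of weights divided by count gives the mean weight per cell.
        z2, _, _ = np.histogram2d(x, y, bins=[binx, biny], weights=weights)
        return binx2[xi], biny2[yi], z2[xi, yi] / z[xi, yi]

    return binx2[xi], biny2[yi]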
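The weighted branch relies on a small trick: one histogram of counts and one histogram of summed weights, divided cell by cell. Here is a minimal sketch on four made-up points and a 2x2 grid; the explicit range argument is only there to make the bin edges easy to read.

```python
import numpy as np

# Four made-up points: two fall in the lower-left cell, two in the upper-right.
x = np.array([0.1, 0.2, 0.9, 0.8])
y = np.array([0.1, 0.2, 0.9, 0.7])
w = np.array([1.0, 3.0, 10.0, 20.0])

grid = dict(bins=[2, 2], range=[[0, 1], [0, 1]])
counts, xedges, yedges = np.histogram2d(x, y, **grid)
sums, _, _ = np.histogram2d(x, y, weights=w, **grid)

# Only non-empty cells are kept, so the division never hits a zero count.
xi, yi = counts.nonzero()
means = sums[xi, yi] / counts[xi, yi]
print(means)  # mean weight per occupied cell: 2.0 and 15.0
```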

Gephi streaming from python igraph

It is a nightmare to do visualization in python igraph, at least for me. After hours of tweaking cairo and pycairo and fighting distorted node labels, I found an alternative route: push graphs to Gephi from igraph. And what's cool about it is that I can update my graph dynamically!

  1. Download the streaming plugin for Gephi
  2. Start the master server in Gephi
  3. Run the following python code.

import igraph as ig
import igraph.remote.gephi as igg

# Create graph
g = ig.Graph([(0,1), (0,2), (2,3), (3,4), (4,2), (2,5), (5,0), (6,3), (5,6)])
g.vs["name"] = ["Alice", "Bob", "Claire", "Dennis", "Esther", "Frank", "George"]
g.vs["age"] = [25, 31, 18, 47, 22, 23, 50]
g.vs["gender"] = ["f", "m", "f", "m", "f", "m", "m"]
g.es["is_formal"] = [False, False, True, True, True, False, True, False, False]

# Send to Gephi
gephi = igg.GephiConnection()
streamer = igg.GephiGraphStreamer()
streamer.post(g, gephi)

# Update graph
api = igg.GephiGraphStreamingAPIFormat()
event = api.get_add_node_event("1", dict(label="eggs"))
streamer.send_event(event, gephi)

Finally get cairo to work with igraph

I have an anaconda distribution of python, so I tried

conda install cairo

conda install pycairo

But the latter throws a "cannot find pixman" error even after I conda install pixman successfully. So I gave up on this route and used brew instead:

brew install cairo

brew install py2cairo

This way cairo is installed in the brew directory. To use it with anaconda python, add the brew site-packages directory to sys.path:

import sys

sys.path.append("/usr/local/lib/python2.7/site-packages")

Then it works!

P.S. To compile pycairo manually, remember to add cairo to the pkg-config path; I had a hard time getting configure to find cairo. This is not necessary if you use brew.

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:/opt/X11/lib/pkgconfig

pkg-config --cflags-only-I cairo

error igraph_attributes.h: No such file or directory when installing igraph

It took me a lot of time to pip install python-igraph on a remote Ubuntu machine. The error I got was "igraph_attributes.h: No such file or directory", but that is not the real problem.

The real problem is that pip tried to compile the C core of igraph and failed because the libxml2 library was missing. What I really needed was the following:

sudo apt-get install libxml2-dev

ipython notebook server on a remote machine

Goal: run an ipython notebook server on a remote machine and access it from a local browser.

How to (shamelessly copied from someone's blog):

1. On the remote machine:

ipython notebook --no-browser --port=7777

2. On the local machine, my remote machine can only be accessed via a login node, so I need to use a multi-hop ssh tunnel. In order not to type the following every time, save it into a file.

host1=username@login_node.com
host2=username@dest.ination.com
ssh -L 7777:localhost:7777 $host1 ssh -L 7777:localhost:7777 -N $host2

If you don’t need to go through a login node, it is a little easier:

ssh -N -f -L localhost:7777:localhost:7777 username@dest.ination.com

error: ‘NAN’ undeclared when installing igraph

I got this strange error when installing igraph:

plfit/gss.c: In function ‘gss’:
plfit/gss.c:92: error: ‘NAN’ undeclared (first use in this function)
plfit/gss.c:92: error: (Each undeclared identifier is reported only once
plfit/gss.c:92: error: for each function it appears in.)
plfit/gss.c:93: error: ‘INFINITY’ undeclared (first use in this function)

It turns out to be a compiler standard problem. Adding the flag CFLAGS='-std=gnu99' to make solves the problem.

easy_install does not work after distribute upgrade

I tried to upgrade matplotlib, which asked me to upgrade distribute. I upgraded distribute, and then easy_install stopped working. It was solved by the following:

1. Check your /usr/bin and /usr/local/bin for easy_install installations and remove any old script:

sudo rm /usr/bin/easy_install*

sudo rm /usr/local/bin/easy_install*

2. Download and run distribute:

curl -O http://python-distribute.org/distribute_setup.py

sudo python distribute_setup.py

sudo rm distribute_setup.py

Copy from