Python workshop¶

bioinformatics@age.mpg.de¶

http://bioinformatics.age.mpg.de/¶

Pre workshop instructions¶

Install docker: https://www.docker.com/products/docker-desktop

Pull the workshop image

docker pull mpgagebioinformatics/python_workshop

Run the image and enter the container:

mdkir -p ~/python_workshop
docker run -p 8787:8787 -p 8888:8888 -v ~/python_workshop/:/home/mpiage --name python_workshop -it mpgagebioinformatics/python_workshop:latest

Start jupyterhub inside the container

module load jupyterhub
jupyter notebook --ip=0.0.0.0

Access jupyter. You will be shown a message like this:
```
Copy/paste this URL into your browser when you connect for the first time,
 to login with a token:
     http://b4b294d48b2b:8888/?token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478&token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478
```
Replace the "b4b294d48b2b" string (ie. bettween the http:// and :8888/) with "localhost" (eg. http://localhost:8888/?token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478&token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478 ) and paste it into your web browser.

If you are not able to install docker you can still run the notebook by install Python3 - https://www.python.org - and jupyter - https://jupyter.org. Afterwards make sure you have installed all required packages:

Package            Version
------------------ -------
autograd           1.2    
backcall           0.1.0  
bleach             2.1.3  
Bottleneck         1.2.1  
cycler             0.10.0 
decorator          4.3.0  
entrypoints        0.2.3  
future             0.17.1 
html5lib           1.0.1  
ipykernel          4.8.2  
ipython            6.4.0  
ipython-genutils   0.2.0  
ipywidgets         7.2.1  
jedi               0.12.0 
Jinja2             2.10   
jsonschema         2.6.0  
jupyter            1.0.0  
jupyter-client     5.2.3  
jupyter-console    5.2.0  
jupyter-core       4.4.0  
kiwisolver         1.0.1  
lifelines          0.19.4 
MarkupSafe         1.0    
matplotlib         3.0.2  
matplotlib-venn    0.11.5 
mistune            0.8.3  
nbconvert          5.3.1  
nbformat           4.4.0  
notebook           5.5.0  
numpy              1.16.1 
pandas             0.24.1 
pandocfilters      1.4.2  
parso              0.2.1  
pexpect            4.6.0  
pickleshare        0.7.4  
pip                9.0.3  
prompt-toolkit     1.0.15 
ptyprocess         0.6.0  
Pygments           2.2.0  
pyparsing          2.3.1  
python-dateutil    2.7.3  
pytz               2018.9 
pyzmq              17.0.0 
qtconsole          4.3.1  
scikit-learn       0.20.2 
scipy              1.2.1  
seaborn            0.9.0  
Send2Trash         1.5.0  
setuptools         39.0.1 
simplegeneric      0.8.1  
six                1.11.0 
sklearn            0.0    
terminado          0.8.1  
testpath           0.3.1  
tornado            5.0.2  
traitlets          4.3.2  
wcwidth            0.1.7  
webencodings       0.5.1  
widgetsnbextension 3.2.1  
xlrd               1.2.0

Check your packages with pip3 list and install with pip3 install <package_name>==<version> --user.

You can then download the notebook start you juoyter with jupyter notebook --ip=0.0.0.0 and download the notebook from: https://github.com/mpg-age-bioinformatics/presentations-tutorials/blob/gh-pages/presentations/modules/python_workshop/python_workshop.ipynb.

Why Python?¶

Easy to learn

It's free and well documented.

Popular (easy to get help on internet forums), big community.

Fast !?

Scripts are portable (can run on windows, mac os or linux).

A lot of libraries for many applications ready to use.

Basic syntax¶

Python is a “high level” script language, witch means it is as closest as possible to natural human language.

It is possible to run commands interactively or in scripts.

On the terminal, type python:

Print is a function, usually in python all the functions are called by name_function()

But before python 3, print function can be called without parenthesis ()

Inside the parenthesis, for some functions, there are parameters, called arguments

Inside a script:

First line points to the python interpreter on the operational system

And now to run (on unix based system):

Using # on a script, you can write comments and they will be ignored for the execution of the program

Blank lines are ignored in a script

It is possible to write multiple statements on a single line, separated by ;

From now, in this workshop, we will use only iterative python or Jupyter notebook.

# First print on Jupyter

print("Hello, Python!")

Hello, Python!

Variables¶

A variable is something which can change!!

It is a way of referring to a memory location by a computer program.

It stores values, has a name (identifier) and data type.

While the program is running, the variable can be accessed, and sometimes can be changed.

Python is not “strongly-typed” language. It means that the type of data storage on variables can be changed (other languages like C the variable should be declared with a data type).

Variables and identifiers¶

Some people mistakes variables and identifiers.

But identifiers are names of variables, they have name AND other features, like a value and data type.

In addition, identifiers are not used only for variables, but also for functions, modules, packages, etc.

A valid identifier is a non-empty sequence of characters of any length with:

a) The start character can be the underscore "_" or a capital or lower case letter.

b) The letters following the start character can be anything which is permitted as a start character plus the digits.

c) Just a warning for Windows-spoilt users: Identifiers are case-sensitive!

d) Python keywords are not allowed as identifier names! Ex: and, as, assert, break, class, continue, def, del, elif, else, except, for, if, in, is, lambda, not, or, pass, return, try, with.

# Some example of variables

i = 30
J = 32.1

some_string= "string"

Basic Data Types summary in Python¶

Numerics:¶

int: integers, ex: 610, 9580

long: long integers of non-limited length (only python 2.x) <- Python 3 int is unlimited

Floating-point, ex: 42.11, 2.5415e-12

Complex, ex: x = 2 + 4i

Sequences:¶

str: Strings (sequence of characters), ex: “ABCD”, “Hello world!”, “C_x-aer”

list

tuple

Boolean:¶

True or False

Mapping:¶

dict (dictionary)

Numbers¶

int (integers): positive or negative numbers without decimal point.

a_number = 35
type(a_number)

int

Note that there is no “ ”. If you include quotation marks:

It is not a integer (int) but a string (str) type.

a_number = "35" 
type(a_number)

str

long (long integers): unlimited size integer to store very big numbers with a L in the and.

Ex: 202520222271131711312111411287282652828918918181050001012021L (only Python 2.x)

a=202520222271131711312111411287282652828918918181050001012021
type(a)

int

a

202520222271131711312111411287282652828918918181050001012021

float (floating point real values): real numbers with decimal points (ex: 23.567) and also can be in scientific notation (1.54e2 – it is the same of 1.54 x 100 and the same of 154)

a_float_number=23.567
type(a_float_number)

float

another_float_number = 1.54e20
type(another_float_number)

float

another_float_number

1.54e+20

another_float_number

1.54e+20

complex (complex numbers): not used much in Python. They are of the form a + bJ (a and b are floats) and J an imaginary number

a_complexcomplex=23.567+0j

type(a_complexcomplex)

complex

Number type conversions¶

a_float_number

23.567

int(a_float_number)

23

a_float_number

23.567

# modifications on-fly during workshop

a_int_number = int(a_float_number)

a_int_number

23

complex(a_float_number)

(23.567+0j)

str(a_float_number)

'23.567'

Hands-on:¶

Create a variable numeric type float
Convert it to a type inter

Numerical operations¶

Unlike for strings, for numeric data types the operators +,*,-,/ are arithmetic

x=55
y=30

x+y

85

x-y

25

x*y

1650

x/y  # why 1 ??

1.8333333333333333

float(x)/float(y)

1.8333333333333333

Some extra operators:¶

% (Modulus) return the remainder of a division
** (Exponent) return result of exponential calculation

x%y

25

x**y

16251022246560461184530336180516518652439117431640625

Hands-on¶

Create 3 variables with values 4, 12.4 and 30.
Calculate the sum of all and store it into a new variable
Multiple the result by 5 and assign it to another variable

Strings¶

Strings are marked by quotes

Wrapped with single-quote ('):

'This is a string with single quotes'

'This is a string with single quotes'

Wrapped with double-quote ("):

"This is a string with double quotes"

'This is a string with double quotes'

Wrapped with three characters, using either single-quote or double-quote:

'''A String in triple quotes can extend over multiple lines, and can contain "double" quotes.'''

'A String in triple quotes can extend over multiple lines, and can contain "double" quotes.'

A string in Python consists of a series or sequence of characters - letters, numbers, and special characters.

Strings can be indexed. The first character of a string has the index 0.

str_1= "A string consists of characters"

str_1

'A string consists of characters'

str_1[0]

'A'

str_1[3]

't'

len(str_1)

31

Last character:

str_1[-1]

's'

It is possible because the index can be also counted from the right, using negative values:

“STRING” [-6],[-5],[-4],[-3],[-2],[-1]

In addition to the normal way, from the left:

“STRING” [0],[1],[2],[3],[4],[5]

Strings are “immutable”¶

Like in Java, Python strings cannot be changed.

# a wrong example
#s = "Some things are immutable!"
#s[-1] = "."

Operations with strings¶

Concatenation: using operator + is possible to concatenate 2 or more strings:

Attention to the absence of space¶

"Hello" + "World"  #  <- *Attention to the absence of space*

'HelloWorld'

Repetition: using operator * is possible to repeat n times one string:

"HelloWorld" * 3

'HelloWorldHelloWorldHelloWorld'

Indexing: as mentioned before, is possible to recover one specific position of the string by index:

"HelloWorld"[0]

'H'

Slicing: it is possible to recover substring of strings with slicing notation:

"HelloWorld"[2:4]  # Note that [2:4] will get the 3nd  and 4rd  position, not the 5th

'll'

Size

len("HelloWorld")

10

type(len("HelloWorld")) # This is a int number

int

Split: It is possible to use a substring (or character) to split a string:

s = "Some things are immutable!"

print(s.split())  # Default is space character

['Some', 'things', 'are', 'immutable!']

print(s.split("a"))

['Some things ', 're immut', 'ble!']

s2 = "You can split some strings, and you can decide how to do it!"

print(s2.split(","))

['You can split some strings', ' and you can decide how to do it!']

print(s2.split(",")[0])

You can split some strings

Hands-on¶

Print the 7th letter of the string "RafaelCuadrat"
Split the names on string "Jorge,Rafael,Franziska,Daniel"
Create a string using the 2 first and 2 last characters from the string "I love python"
Add "ing" to the end of the string created.

Escape sequences¶

If you want to use some special characters with the literal meaning inside a string, you need to use backslash \ before.

Ex: “The double quotation mark is \”” -> If you use only ” without \ it will close the string without show ” inside the string

Ex2: “I like to use \” -> You need to use a backslash before a backslash to display \ inside a string

Another scape sequences using backslash include:
- New line: \newline
- tabular space: \t <- very important dealing with tabular files

# a wrong example
# print("This is quotes mark " in a string")

print("This is quotes mark \" in a string")

This is quotes mark " in a string

print("This is tabular t in a string")

This is tabular t in a string

print("This is tabular \t in a string")

This is tabular 	 in a string

Lists¶

It is a versatile data type in python and can be written as comma-separated values (items) limited by [ ].

Items in the list does not need to be of the same type!

list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5 ]
list3 = ["a", "b", "c", "d"]

Similar to string, index of list also starts with 0 and the list can be sliced, concatenated, etc.

list1

['physics', 'chemistry', 1997, 2000]

list1[-2]

1997

list2[0:3]

[1, 2, 3]

list3[2]

'c'

Unlike strings, it is possible to update a list by changing an element:

list1[2]

1997

list1[2] = 2001 #replacing the item on position [2]  by the value 2001

list1[2]

2001

list1

['physics', 'chemistry', 2001, 2000]

del list1[2] # deleting  item on position [2] from the list (2001)

list1

['physics', 'chemistry', 2000]

List operations¶

The same as string for + (concatenation) and * (repetition) and most of operators. Ex: len(list)

More examples:

list1= ['physics', 'chemistry', 1997, 2000, 2004, 1999,2000]

len(list1)

7

2001 in list1

False

2000 in list1

True

list1.count(2000)

2

list1=[ str(s) for s in list1 ]

list1.sort()

list1

['1997', '1999', '2000', '2000', '2004', 'chemistry', 'physics']

Hans-on¶

From the list provided:

1 - create a new list with the lengh of each item

2 - Calculate the number of occurrences of the string "google"

list_hands_on = [11111, "python","perl","bash","R","google","microsoft","apple","google"]

Tuple¶

Tuple is very similar to list, but immutable.

Instead [ ] it uses ( )

tup1 = ('bla','ble','blu',1,30)
tup2 = (1,2,3,4,5,6) 
tup3=(50)

Dictionary¶

In the dictionary, for every item there is a key and a value, separated by colon (:)

The items are separated by comma (,)

Keys are unique in a dictionary, but the values may be not.

You can access the values by the key

dict1 = {'Name': 'John', 'Age': 25, 'Class': 'First'}

dict1['Name']

'John'

dict1['Age']

25

Comparison and logical operators in Python¶

Logical:¶

 *and* -  both if both operators are true, the result is true
 *or*  - on of the operators is true, the result is true
 *not*  -  reverse the logical operator

Comparison:¶

  ==  check if two values are equal. Ex: a == b.
  !=  check if two values are NOT equal.
  >   check if the left is greater than the right. Ex: a>b. 
  <   check if the left is less than right
  >=  bigger or equal
  <=  smaller or equal

Decision making¶

For decision making it is possible to use just if for one single condition, if and elif for more than one condition and else for anything else (and every elif).

var1=100

if var1>90:
    print("Variable  is greater than 90")

Variable  is greater than 90

if var1>90:
    print("Variable is greater than 90")
else:
    print("Variable can be equal or smaller than 90")

Variable is greater than 90

var1=90

if var1>90:
    print("Variable is greater than 90")
else:
    print("Variable can be equal or smaller than 90")

Variable can be equal or smaller than 90

if var1>90: 
    print("Variable is greater than 90")
elif var1<90:
    print("Variable is smaller than 90")
else:  #(could also be elif var1 == 90:)
    print("Variable is 90")

Variable is 90

Identation¶

Attention for the indentation in Python:¶

No braces or reserved words to indicate blocks of code for class, functions or flow control. It is done by indentation.

Loops¶

Sometimes you need to repeat the same operations, so it is possible to use a loop, with a structure of repetition based in a decision or just a fixed number of times.

For loop has the ability to iterate over the items of any sequence, such as a list or a string.

for letter in 'Python bla': 
    print('Current Letter :', letter)

Current Letter : P
Current Letter : y
Current Letter : t
Current Letter : h
Current Letter : o
Current Letter : n
Current Letter :  
Current Letter : b
Current Letter : l
Current Letter : a

fruits = ["apple", "mango", "banana"]
for fruit in fruits:
    print('Current fruit :', fruit)

Current fruit : apple
Current fruit : mango
Current fruit : banana

fruits = ["apple", "mango", "banana"]
for fruit in fruits:
    for letter in fruit:
        print('Current fruit :', fruit, 'Current letter :', letter)

Current fruit : apple Current letter : a
Current fruit : apple Current letter : p
Current fruit : apple Current letter : p
Current fruit : apple Current letter : l
Current fruit : apple Current letter : e
Current fruit : mango Current letter : m
Current fruit : mango Current letter : a
Current fruit : mango Current letter : n
Current fruit : mango Current letter : g
Current fruit : mango Current letter : o
Current fruit : banana Current letter : b
Current fruit : banana Current letter : a
Current fruit : banana Current letter : n
Current fruit : banana Current letter : a
Current fruit : banana Current letter : n
Current fruit : banana Current letter : a

While loop statement in Python programming language repeatedly executes a target statement as long as a given condition is true.

count = 0 
while (count < 9):
    print("The count is:", count)
    count = count+1
print("Bye!")

The count is: 0
The count is: 1
The count is: 2
The count is: 3
The count is: 4
The count is: 5
The count is: 6
The count is: 7
The count is: 8
Bye!

The infinite loop: you need to take care when using while loops to not create an endless loop (unless you need one).

var == 1 
while var == 1:
    print "Endless loop"

To stop it you need to use CTRL+C

Hands on:¶

Create a list of words on the string "I'm learning python for bioinformatics"
Create a loop and print all words from the list, but just if there is "n" in the word

I/O in Python¶

A program (or script) needs to deal with an input and to generate an output. In Python there are many ways to input data and we will see the most basic ones first.

The input([prompt]) function reads one line from standard input and returns it as a string (removing the trailing newline).

str_1 = input("Enter your input: ")
print("Received input is : ", str_1)

Enter your input: jorge
Received input is :  jorge

The input([prompt]) function is equivalent to raw_input, except that it assumes the input is a valid Python expression and returns the evaluated result to you.

# str_2 = input("Enter your input: ");
# print "Received input is : ", str_2

Before read or write a file, you have to open it with the open() function.
This function create a file object:

afile = open("example.txt", "w+")

The first argument is the file name, and the second is the “access mode”: It determines if the file will be only read, if can be write, append, etc. The default is r (read).
In the example, w+ is for both writing and reading, overwriting the existing file if the file exists.
Example of other modes are:

‘w’ – Write mode which is used to edit and write new information to the file (any existing files with the same name will be erased when this mode is activated)

‘a’ – Appending mode, which is used to add new data to the end of the file; that is new information is automatically amended to the end

‘r+’ – Special read and write mode, which is used to handle both actions when working with a file

afile #afile is an object

<_io.TextIOWrapper name='example.txt' mode='w+' encoding='UTF-8'>

Writing text on the new file

afile.write( "Python is amazing.\nYeah its great!!\n")  # \n is new line

36

Closing the file:

afile.close()

Open, read and close a file:

fo = open("example.txt", "r+")

str_1 = fo.read()

print(str_1.endswith)

<built-in method endswith of str object at 0x7f414dfc4138>

str_1.splitlines()[1]

'Yeah its great!!'

fo = open("example.txt", "r+")
str_2 = fo.read();
print(str_2)
fo.close()

Python is amazing.
Yeah its great!!

fo.close()

### include fasta file reading and manupulations

Functions¶

A function is a block of organized, reusable code that is used to perform a single, related action. Functions provide better modularity for your application and a high degree of code reusing.

As you already know, Python gives you many built-in functions like print(), etc. but you can also create your own functions. These functions are called user-defined functions.

You can define functions to provide the required functionality.

Function blocks begin with the keyword def followed by the function name and parentheses ( ):

def a_new_function():
    print("A new function")
    return

a_new_function()

A new function

Any input parameters or arguments should be placed within these parentheses. You can also define parameters inside these parentheses.

The first statement of a function can be an optional statement - the documentation string of the function or docstring.

The code block within every function starts with a colon (:) and is indented.

The statement return [expression] exits a function, optionally passing back an expression to the caller. A return statement with no arguments is the same as return None.

def a_new_function(a_str):
    print(a_str)
    return

a_new_function("another string")

another string

It is possible to use arguments as keyword arguments. It allows you to skip arguments or place out of order because python will identify by the keyword

# Function definition is here
def printinfo( name, age ):
    "This prints a passed info into this function"
    print("Name: ", name)
    print("Age ", age)
    return;

# Now you can call printinfo function - no keyword, but same order 
printinfo("miki",50)

Name:  miki
Age  50

printinfo(50,"miki")

Name:  50
Age  miki

# Now you can call printinfo function - using the keywords 
printinfo(age=50,name="miki")

Name:  miki
Age  50

How to get help:

help(printinfo)

Help on function printinfo in module __main__:

printinfo(name, age)
    This prints a passed info into this function

It is possible to use default arguments:

def printinfo( name, age = 35 ): # default argument age is 35
    "This prints a passed info into this function"
    print("Name: ", name)
    print("Age ", age)
    return;

printinfo( name="miki" ) # no age argument, it will be the default

Name:  miki
Age  35

The statement return [expression] exits a function, optionally passing back an expression to the caller. A return statement with no arguments is the same as return None.

def sum_( arg1, arg2 ):
   # Add both the parameters and return them."
    total = arg1 + arg2
    print("Inside the function : ", total)
    return total;

sum_( 10, 20 )

Inside the function :  30

30

Global vs. Local variables:¶

Variables that are defined inside a function body have a local scope, and those defined outside have a global scope.

total = 0 # Global variable.

def sum_2( arg1, arg2 ):
    # Add both the parameters and return them."
    total = arg1 + arg2; # Local variable.
    print("Inside the function local total : ", total)
    return

sum_2( 10, 20 );
print("Outside the function global total : ", total )

Inside the function local total :  30
Outside the function global total :  0

Modules¶

A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use.

A module is a Python object with arbitrarily named attributes that you can bind and reference.

Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.

The Python code for a module named aname normally resides in a file named aname.py. Here's an example of a simple module, support.py

def print_func( name ):
    print("Hello : ", name)
    return

# Creating a module on fly - just illustrative, never do it, use a text editor or IDE

new_mod=open("support.py","w+")
new_mod.write("def print_func( name ):\n\tprint(\"Hello : \", name)\n\treturn")
new_mod.close()

To use a module you need first to import:¶

# Import module support
import support

# Now you can call defined function that module as follows
support.print_func("Zara")

Hello :  Zara

You can also use an alias for the module:¶

# Import module support with an alias “sup”
import support as sup

sup.print_func("Zara")

Hello :  Zara

Hands-on¶

Write a Python function that takes a list of words and returns the length of the longest one.

Questions ???¶

Second Day workshop¶

Numerical operations (Numpy)

Dataframes (pandas)

Plots (matplotlib and seaborn)

Numerical operations with Numpy¶

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays (including many numerical operations)

We will not cover arrays in detail, but we will see how to use numpy to do numerical operations

NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality.

import numpy as np

a = np.arange(15).reshape(3,5)

a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

a = np.array([2,3,4])
b = np.array([2.1,3.2,4.5])

Operations with Numpy¶

c=a+b
c

array([4.1, 6.2, 8.5])

d=a-b
d

array([-0.1, -0.2, -0.5])

d**2

array([0.01, 0.04, 0.25])

a*b # Element wise product

array([ 4.2,  9.6, 18. ])

a.dot(b) # Matrix product

31.8

e=np.arange(12).reshape(3,4)
e

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

e.sum(axis=0)

array([12, 15, 18, 21])

e.sum(axis=1)

array([ 6, 22, 38])

e.min(axis=0)

array([0, 1, 2, 3])

e.min(axis=1)

array([0, 4, 8])

e.max(axis=1)

array([ 3,  7, 11])

e.cumsum(axis=1)

array([[ 0,  1,  3,  6],
       [ 4,  9, 15, 22],
       [ 8, 17, 27, 38]])

e.T

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

e[2,3]

11

e[-1]

array([ 8,  9, 10, 11])

Hands-on¶

with the 2 provided arrays:

1 - reshape both to 3 rows and 4 columns

2 - Transpose both

3 - calculate the sum of the cumsum of both arrays

anewarray=np.random.rand(4,3)
anotherarray=np.random.rand(4,3)

Introduction to data frames with Pandas¶

What is a Pandas data frame?¶

It is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Arithmetic operations align on both row and column labels.

In general, we import a tabular or CSV file to a pandas dataframe

The are some options, for example, we can use different separator character, we can read a file with or without a header

import pandas as pd


df=pd.read_csv("new_AVS.tsv")

df.head()

#checking the first rows of dataframe - It shows 5 rows by default, but you can pass argument inside ()
df.head()

df.tail() # Last 5 rows

By default, function read_csv() would use "," as separator to read the table.

We can change it for any carachter, for example tabular "\t":

df=pd.read_csv("new_AVS.tsv",sep="\t")

df.head()

It's also possible to read excel tables using read_excel("file.xls")

df_cog=pd.read_excel("CAMERA_COG_RPKG_sig.xlsx")

This table stores the normalized abundance of each COG (ortholog genes) on each sample¶

df_cog.head(10)

Working with the provided files¶

It was provided 2 files from metagenomics analysis:¶

new_avs.csv and metatable.csv

AVS is Average Genome Size calculated based on gene markers for each metagenomic sample

Metatable is metadata about each sample

df=pd.read_csv("new_AVS.tsv",sep="\t")

This file is a parsed output from MicrobeCensus software:¶

https://github.com/snayfach/MicrobeCensus

MicrobeCensus is a fast and easy to use pipeline for estimating the average genome size (AGS) of a microbial community from metagenomic data.

df.head()

The samples used on this study are from GOS - baltic sea.

Checking the column names:¶

df.columns

Index(['metagenome:', 'reads_sampled:', 'trimmed_length:', 'min_quality:',
       'mean_quality:', 'filter_dups:', 'max_unknown:', 'average_genome_size:',
       'total_bases:', 'genome_equivalents:'],
      dtype='object')

l=list(df.columns)

l

['metagenome:',
 'reads_sampled:',
 'trimmed_length:',
 'min_quality:',
 'mean_quality:',
 'filter_dups:',
 'max_unknown:',
 'average_genome_size:',
 'total_bases:',
 'genome_equivalents:']

Subseting a dataframe:¶

For us, its only important now 2 columns¶

new_df=df[["metagenome:","average_genome_size:","genome_equivalents:"]]

new_df.head()

Rename columns (all):¶

new_df.columns=["Sample","AGS","GE"]

new_df.head()

Rename specific columns:¶

new_df.rename(columns= {"Sample":"sample"},inplace=False).head() #Needs inplace = True to really change it

Working with index on dataframes:¶

this will set index and display the new_df with new index, but no alterations to original dataframe:

new_df.set_index("Sample").head() #Just show the dataframe with the index on "Sample", 
                                  # no alterations to original df

#Original still unchanged
new_df.head()

new_df.set_index("Sample",inplace=True) # inplace=True modifies the dataframe

# Now its changed
new_df.head()

General statistic on dataframe:¶

new_df.describe()

Get the min value from GE:¶

new_df["GE"].min()

1.77748993023

Get the min value from all columns:¶

new_df.min()

AGS    790976.120718
GE          1.777490
dtype: float64

new_df.max()

AGS    2.235292e+06
GE     9.400874e+01
dtype: float64

new_df.dtypes

AGS    float64
GE     float64
dtype: object

type(new_df)

pandas.core.frame.DataFrame

Transpose dataframe:¶

new_df.T

#some hands on here

Sort by values in one column (in this case, GE):¶

new_df.sort_values(by="GE").head()

By default it will sort ascending values, but you can specify to be descending:¶

new_df.sort_values(by="GE",ascending=False).head()

Selecting one column as Pandas Series:¶

new_df["AGS"].head()

Sample
CAM_SMPL_003398    1.444173e+06
CAM_SMPL_003374    1.195763e+06
CAM_SMPL_003389    1.185496e+06
CAM_SMPL_003418    1.720945e+06
CAM_SMPL_003391    1.461729e+06
Name: AGS, dtype: float64

Selecting one column as new Pandas DataFrame:¶

new_df[["AGS"]].head()

Slicing by rows (positional):¶

new_df.iloc[0:3]

Slicing by rows (by index name):¶

new_df.loc["CAM_SMPL_003398":"CAM_SMPL_003418"]

Conditional selection:¶

new_df[(new_df["GE"]<40) & (new_df["AGS"]>1000000)][["AGS"]].head()

new_df[new_df["GE"]>40] # Select the rows where GE is bigger than 40

Operations with the values:¶

new_df.max()-new_df.min()

AGS    1.444315e+06
GE     9.223125e+01
dtype: float64

Create a new column based on results of operations:¶

new_df["GE - AGS"]=new_df["GE"]-new_df["AGS"]

/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

new_df.head()

Delete a column:¶

del new_df["GE - AGS"]

new_df.head()

new_df["AGS - GE"]=new_df["AGS"]-new_df["GE"]

/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

More about Pandas in: http://pandas.pydata.org/¶

Cheat Sheet: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf ¶

Hands on:¶

read a dataframe (any!! From internet, from your computer...)

print the columns

store the column names into a list

print the max value from the last column

print the max value from the last row

Plots with matplotlib¶

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

https://matplotlib.org/¶

import matplotlib.pyplot as plt

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

https://seaborn.pydata.org/¶

import seaborn as sns

Starting with plots - the ugly and lazy way¶

#function plot direct from pandas dataframe
new_df["GE"].plot()
plt.show()

#Rotate the name of samples 

new_df["GE"].plot()


plt.xticks(rotation=90)

plt.show()

This plot is useless, we dont have continuos data, we need a bar plot in this case¶

new_df["GE"].plot(kind="bar")  # kind can be also pie, scatter, etc ...
plt.xticks(rotation=90)
plt.show()

We need to control the size and style of figure, and also axis titles, etc¶

sns.set_style("white")

#Define a figure and its size
plt.figure(figsize=(15,8))



#Bar plot 

# range(len(new_df["AGS"])) -> define the number of bars to be plotted and the position on X axis 
x=range(len(new_df["AGS"]))

# new_df["AGS"] -> values for axis Y
y=new_df["AGS"]

plt.bar(x,y,color="b")

#axis labels
plt.ylabel("AGS ",fontsize="16")
plt.xlabel("Samples",fontsize="16")

plt.title("PLOT FOR WORKSHOP",fontsize="22")
#xticks -> define the text to be ploted on axis X -> here we use index (the name of samples)


### error .... index values not matching the positions of xticks

plt.xticks(range(0,len(new_df.index),2), new_df.index,rotation="vertical",fontsize="12")
plt.yticks(fontsize="22")

## Save the figure in pdf

plt.savefig("my_first_plot.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')

#Necessary to display the plot here
plt.show()

More about seaborn styles in: https://seaborn.pydata.org/generated/seaborn.set_style.html ¶

The values on Y are in bases, hard to read, so we can convert to Mb¶

new_df["AGS (mb)"]=new_df["AGS"]/1000000

/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

#Seaborn style
sns.set_style("white")

#Define a figure and its size
plt.figure(figsize=(15,8))


x=range(len(new_df["AGS (mb)"]))

y=new_df["AGS (mb)"]


#Bar plot 
plt.bar(x,y,color="b")

#axis labels
plt.ylabel("AGS (Mb)",fontsize="16")
plt.xlabel("Samples",fontsize="16")


plt.xticks(range(len(new_df.index)), new_df.index,rotation="vertical")

plt.show()

Metadata from samples¶

df2=pd.read_csv("metatable.csv", encoding = "latin")

df2.head()

new_df.head()

new_df.reset_index(inplace=True)

new_df.head()

# checking size of both dataframes, if they have same number of samples

len(new_df)

63

len(df2)

63

#mergeing 2 df
merged=pd.merge(new_df,df2,left_on="Sample",right_on="SAMPLE_ACC")

merged.head(15)

l1=list(merged.columns)

#removing columns - (axis 1) -  where ALL the values are NaN (null, missing)
merged.dropna(how="all",axis=1,inplace=True)

merged.head()

l2=list(merged.columns)

[x for x in l1 if x not in l2]

['SAMPLE_DESCRIPTION',
 'COMMENTS',
 'TAXON_ID',
 'DESCRIPTION',
 'ALTITUDE',
 'SITE_DEPTH',
 'SITE_DESCRIPTION',
 'HOST_TAXON_ID',
 'HOST_DESCRIPTION',
 'HOST_ORGANISM']

Sort values by filter size:¶

merged=merged.sort_values("FILTER_MIN")

merged.head(10)

merged.reset_index(inplace=True,drop=True)

merged.head(10)

Creating 3 groups (g1, g2, g3) for each filter_min size¶

merged["FILTER_MIN"].drop_duplicates()

0     0.1
21    0.8
42    3.0
Name: FILTER_MIN, dtype: float64

g1_ = merged[merged["FILTER_MIN"]==0.1]
g2_ = merged[merged["FILTER_MIN"]==0.8]
g3_ = merged[merged["FILTER_MIN"]==3.0]

# Group by and calculating the mean of each group

list(merged.groupby("FILTER_MIN")["AGS (mb)"].mean())

[1.1259436419745714, 1.5636385480638095, 1.502763909280381]

Hands-on:¶

plot a barplot from with GE columns only for g1 (FILTER_MIN = 0.1).
Add the legend of axis x (xticks) with 45 degrees rotation, using name of samples

Creating more complicated bar plots:¶

ploting 3 groups (g1, g2 and g3), different colors, and a line defining the average value for each group

sns.set_style("white")

plt.figure(figsize=(15,10))

#Here we are ploting on the same figure, 3 bar plots, ranging from the original index obtained from the merged dataframe

plt.bar(g1_.index,g1_["AGS (mb)"],color="b") # -> color b=blue, each bar plot we use different colors

plt.bar(g2_.index,g2_["AGS (mb)"],color="c")

plt.bar(g3_.index,g3_["AGS (mb)"],color="r")

plt.ylabel("AGS (mb)",size="16") 
plt.xticks(range(len(merged)), merged["Sample"], size='small',rotation="vertical")
plt.legend(merged["FILTER_MIN"].drop_duplicates())

#here we plot 3 horizontal lines (hlines) with the mean of AGS values for each group
plt.hlines(g1_["AGS (mb)"].astype(float).mean(),g1_.index[0],g1_.index[-1])
plt.hlines(g2_["AGS (mb)"].astype(float).mean(),g2_.index[0],g2_.index[-1])
plt.hlines(g3_["AGS (mb)"].astype(float).mean(),g3_.index[0],g3_.index[-1])
plt.show()

Starting with seaborn, using hue and crating more fancy plots¶

sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged.index
y="AGS (mb)"
sns.barplot(x=x,y=y,hue="FILTER_MIN",data=merged)
plt.xticks(range(len(x)), merged["Sample"], size='small',rotation="vertical")

plt.show()

Hands-on:¶

add the mean values line to the previous plot

Barplot with error bars, violinplot and boxplots¶

sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged["FILTER_MIN"]
y=merged["AGS (mb)"]
sns.barplot(x,y)
plt.show()

sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged["FILTER_MIN"]
y=merged["AGS (mb)"]
sns.violinplot(x,y)
plt.show()

sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged["FILTER_MIN"]
y=merged["AGS (mb)"]
sns.boxplot(x,y)
plt.show()

Hands on:¶

create the same violin plots, but adding dots on the plot
create violin plots for sample volume, instead filter min

Exploring data, checking correlations¶

#correlation between 2 variables

merged[["AGS (mb)","temperature - (C)"]].corr()

Do you think AGS (average genome size) are correlated with temperature?¶

#Doing a scatter plot

sns.set_style("white")
x=merged["AGS (mb)"]
y=merged["temperature - (C)"]
plt.figure(figsize=(15,10))
plt.scatter(x,y)
plt.ylabel("temperature - (C)",size="16")
plt.xlabel("AGS (mb)",size="16")
plt.show()

sns.jointplot("temperature - (C)", "AGS (mb)", data=merged, kind="reg",
                  xlim=(0, 20), ylim=(0, 3), color="r", size=7)
plt.show()

/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

sns.jointplot("LATITUDE", "AGS (mb)", data=merged, kind="reg",
                  xlim=(0, 20), ylim=(0, 3), color="r", size=7)
plt.show()

/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

Hands on:¶

plot correlation plot with other variables

Heatmaps¶

Working with genes abundance on samples

df_cog.index=df_cog["Unnamed: 0"].tolist()
df_cog=df_cog.drop(["Unnamed: 0"], axis=1)

# Remove rows only with 0
df_2=df_cog[(df_cog.T != 0).any()]

df_2.columns

Index(['CAM_SMPL_003410', 'CAM_SMPL_003375', 'CAM_SMPL_003388',
       'CAM_SMPL_003380', 'CAM_SMPL_003361', 'CAM_SMPL_003395',
       'CAM_SMPL_003378', 'CAM_SMPL_003365', 'CAM_SMPL_003373',
       'CAM_SMPL_003384', 'CAM_SMPL_003406', 'CAM_SMPL_003417',
       'CAM_SMPL_003372', 'CAM_SMPL_003374', 'CAM_SMPL_003364',
       'CAM_SMPL_003362', 'CAM_SMPL_003409', 'CAM_SMPL_003418',
       'CAM_SMPL_003400', 'CAM_SMPL_003405', 'CAM_SMPL_003370',
       'CAM_SMPL_003392', 'CAM_SMPL_003382', 'CAM_SMPL_003394',
       'CAM_SMPL_003358', 'CAM_SMPL_003367', 'CAM_SMPL_003396',
       'CAM_SMPL_003391', 'CAM_SMPL_003360', 'CAM_SMPL_003381',
       'CAM_SMPL_003404', 'CAM_SMPL_003387', 'CAM_SMPL_003376',
       'CAM_SMPL_003379', 'CAM_SMPL_003398', 'CAM_SMPL_003399',
       'CAM_SMPL_003412', 'CAM_SMPL_003403', 'CAM_SMPL_003356',
       'CAM_SMPL_003414', 'CAM_SMPL_003368', 'CAM_SMPL_003401',
       'CAM_SMPL_003390', 'CAM_SMPL_003385', 'CAM_SMPL_003397',
       'CAM_SMPL_003357', 'CAM_SMPL_003408', 'CAM_SMPL_003359',
       'CAM_SMPL_003415', 'CAM_SMPL_003407', 'CAM_SMPL_003366',
       'CAM_SMPL_003413', 'CAM_SMPL_003386', 'CAM_SMPL_003369',
       'CAM_SMPL_003402', 'CAM_SMPL_003383', 'CAM_SMPL_003377',
       'CAM_SMPL_003371', 'CAM_SMPL_003363', 'CAM_SMPL_003389',
       'CAM_SMPL_003393', 'CAM_SMPL_003411', 'CAM_SMPL_003416', 'g1-mean',
       'g2-mean', 'g3-mean', 'g1-sd', 'g2-sd', 'g3-sd', 'ANOVA F',
       'ANOVA pvalue', 'pvalue adjusted', 'log2(FC)_g1/g2', 'log2(FC)_g1/g3',
       'log2(FC)_g2/g3'],
      dtype='object')

len(df_cog) # number of cogs (gene ortologs groups) on the original table

363

len(df_2) # number of cogs after removing those with 0 abundance in all the samples

164

# New dataframe only with the RPKG for all the samples

heat=df_2[[x for x in df_2.columns if "CAM_SMPL"in x ]]

heat.head()

##test.ix[:, test.columns != 'compliance']

sns.clustermap(heat,figsize=(12,25),cbar_kws={'label': 'COGs RPKG'},yticklabels=False)
#plt.savefig("../clustermap_cogs.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')
plt.show()

# customizing colors

import matplotlib.cm as cm
from  matplotlib.colors import LinearSegmentedColormap

c = ["pink","lightcoral","red","darkred","black"]
v = [0,.25,.5,.75,1.]
l = list(zip(v,c))
cmap=LinearSegmentedColormap.from_list('rg',l, N=1024)

sns.clustermap(heat,figsize=(12,25),cbar_kws={'label': 'COGs RPKG'},yticklabels=False,cmap=cmap)
#plt.savefig("../clustermap_cogs.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')
plt.show()

heat_log= heat.apply(lambda x: np.log10(x+0.001))

c = ["darkred","red","lightcoral","white","palegreen","green","darkgreen"]
v = [0,.15,.4,.5,0.6,.9,1.]
l = list(zip(v,c))
cmap=LinearSegmentedColormap.from_list('rg',l, N=1024)

sns.clustermap(heat_log,figsize=(12,25),cbar_kws={'label': 'log2(COGs RPKG)'},yticklabels=False,cmap=cmap)
#plt.savefig("../clustermap_cogs.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')
plt.show()

Hands-on:¶

Plot the same heatmaps, but only for genes where pvalue adjusted < 0.01
plot a heatmap for z-score

Getting upregulated COGs in the comparisons of different filtrations methods groups¶

a=set(df_2[df_2["log2(FC)_g1/g3"]>0].index)
b=set(df_2[df_2["log2(FC)_g1/g2"]>0].index)
c=set(df_2[df_2["log2(FC)_g2/g3"]>0].index)

To install matplotlib_venn:¶

pip install --user matplotlib-venn

more info about the package in https://pypi.org/project/matplotlib-venn/

from matplotlib_venn import venn3,venn3_unweighted

plt.figure(figsize=(10,10))


vd = venn3_unweighted([a, b,c], ('up g1/g3', 'up g1/g2','up g2/g3'))

plt.show()

plt.figure(figsize=(10,10))


vd = venn3([a, b,c], ('up g1/g3', 'up g1/g2','up g2/g3'))

plt.show()

Hands-on¶

Plot a heatmap for log2(FC) up regulated genes for the 3 comparions (g1/g2, g2/g3 and g1/g3)

PCA - Principal component analysis¶

Very nice expanation about PCA: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Package sklearn - more info at http://scikit-learn.org/stable/index.html ¶

from sklearn import preprocessing
from sklearn.decomposition import PCA
from itertools import cycle

heat.head(2)

len(heat)

164

# we have 164 COGs, that means, we have 164 variables to be transformed to 2 (or more) dimensions

df_pca=heat.T.reset_index()

df_pca.head()

merged.head()

df_pca=pd.merge(merged[["Sample","FILTER_MIN"]],df_pca,left_on="Sample",right_on="index").drop("index",axis=1)

df_pca.set_index(["Sample","FILTER_MIN"],inplace=True)

df_pca.head()

pca = PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

#scaling the values
df_pca_scaled = preprocessing.scale(df_pca)

projected=pca.fit_transform(df_pca_scaled)

print(pca.explained_variance_ratio_)

[0.42767701 0.09641663]

tmp=pd.DataFrame(projected)
tmp.rename(columns={0: 'Component 1',1: 'Component 2'}, inplace=True)
tmp.head()

final_pca=pd.merge(df_pca.reset_index(),tmp,left_index=True,right_index=True)

# in the final_pca dataframe, in adition to all the COGs, 
# we have new columns (0 and 1), with the PCA results (2 components)

final_pca.head()

sns.set_style("white")
color_gen = cycle(('blue', 'green', 'red'))
plt.figure(figsize=(10,10))
for lab in set(final_pca["FILTER_MIN"]):
    plt.scatter(final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 1'], 
                final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 2'], 
                c=next(color_gen),
                label=lab)

    
    
plt.xlabel('component 1  - ' + str(pca.explained_variance_ratio_[0]*100)+ " % " )
plt.ylabel('component 2  - '+ str(pca.explained_variance_ratio_[1]*100)+ "  % " )
plt.legend(loc='best')
plt.show()

#pca.components_

components = pd.DataFrame(pca.components_, columns = df_pca.columns, index=[1, 2])

# Principal axes in feature space, representing the directions of maximum variance in the data. 
components

import math

def get_important_features(transformed_features, components_, columns):
    """
    This function will return the most "important" 
    features so we can determine which have the most
    effect on multi-dimensional scaling
    """
    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    # Sort each column by it's length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    return important_features,xvector,yvector

important_features,xvector,yvector=get_important_features(projected, pca.components_, df_pca.columns.values)

def draw_vectors(transformed_features, components_, columns):
    """
    This funtion will project your *original* features
    onto your principal component feature-space, so that you can
    visualize how "important" each one was in the
    multi-dimensional scaling
    """

    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])
    
    # Sort each column by it's length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    
    ax = plt.axes()
    
    tmp=pd.DataFrame({"xvector":xvector,"columns":columns} )
    tmp2=pd.DataFrame({"yvector":yvector,"columns":columns} )

    
    azip=zip(xvector,columns)
    azip2=zip(yvector,columns) 
    col_array=[]
    
    # Use an arrow to project each original feature as a
    # labeled vector on your principal component axes
    for i in range(num_columns):
        
        if columns[i]==tmp[tmp["xvector"]==max(xvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)
            col_array.append(columns[i])
        if columns[i]==tmp[tmp["xvector"]==min(xvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)
            col_array.append(columns[i])
        if columns[i]==tmp2[tmp2["yvector"]==min(yvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75) 
            col_array.append(columns[i])
        if columns[i]==tmp2[tmp2["yvector"]==min(yvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75) 
            col_array.append(columns[i])
    
    return ax,col_array

color_gen = cycle(('red', 'green', 'blue'))
plt.figure(figsize=(10,10))
for lab in set(final_pca["FILTER_MIN"]):
    plt.scatter(final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 1'], 
                final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 2'], 
                c=next(color_gen),
                label=lab,alpha=0.8)

ax,cols=draw_vectors(projected, pca.components_, df_pca.columns.values)    
    
plt.xlabel('Component 1  - ' + str(pca.explained_variance_ratio_[0]*100)+ " % " )
plt.ylabel('Component 2  - '+ str(pca.explained_variance_ratio_[1]*100)+ "  % " )
plt.legend(loc='best')
plt.show()

/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/matplotlib/figure.py:98: MatplotlibDeprecationWarning: 
Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  "Adding an axes using the same arguments as a previous axes "

Hands-on:¶

Plot the same PCA, but fixing the text on arrows (keeping only COG number, removing the descriptions)
Add a vertical line and a horizontal line on zero values

Plotting the 4 COGs with extreme vectors on PCA¶

sns.set_style("white")
plt.figure(figsize=(15,10))
tmp=final_pca[final_pca['Sample'].isin(g1_["Sample"])]
tmp2=final_pca[final_pca['Sample'].isin(g2_["Sample"])]
tmp3=final_pca[final_pca['Sample'].isin(g3_["Sample"])]
for c in cols:
    plt.figure(figsize=(15,10))
    plt.bar(tmp.index,tmp[c],color="b")
    plt.bar(tmp2.index,tmp2[c],color="g")
    plt.bar(tmp3.index,tmp3[c],color="r")
    plt.ylabel(c,size="12") 
    plt.xticks(range(len(final_pca)), final_pca["Sample"], size='small',rotation="vertical")
    plt.legend(final_pca["FILTER_MIN"].drop_duplicates())

    #here we plot 3 horizontal lines (hlines) with the mean of AGS values for each group
    plt.hlines(tmp[c].astype(float).mean(),g1_.index[0],g1_.index[-1])
    plt.hlines(tmp2[c].astype(float).mean(),g2_.index[0],g2_.index[-1])
    plt.hlines(tmp3[c].astype(float).mean(),g3_.index[0],g3_.index[-1])
    plt.show()

<Figure size 1080x720 with 0 Axes>

sns.set_style("white")

for c in cols:
    plt.figure(figsize=(15,10))
    x=final_pca["FILTER_MIN"]
    y=final_pca[c]
    sns.violinplot(x,y)
plt.show()

## Kmeans

## clusterization (kmeans with euc distance)
## calculate distance btw centroid of clusters and each gene
## plot bar plots for the genes with shorter and bigger distance from one cluster

### a bit more about K-means: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

f1 = final_pca["Component 1"].values
f2 = final_pca["Component 2"].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=15)
plt.show()

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# k means determine k
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X)
    kmeanModel.fit(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

from sklearn.cluster import KMeans

# Number of clusters
k=3
kmeans = KMeans(n_clusters=k)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
C = kmeans.cluster_centers_

print(C)

[[-8.89103651 -1.33884732]
 [ 9.92669966 -3.51000995]
 [ 2.028681    4.17284112]]

from copy import deepcopy

# colors = ['#FF0000', 'g', '#FF7D40', '#66CDAA', '#473C8B', '#9400D3','#8B8386','#FFC0CB','#B0171F']
# fig, ax = plt.subplots()
# for i in range(k):
#         points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
#         ax.scatter(points[:, 0], points[:, 1], s=15, c=colors[i])
# ax.scatter(C[:, 0], C[:, 1], marker='*', s=100, c='#050505')
# plt.show()

kmeans.labels_

array([1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 0,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)

final_pca.head()

from sklearn.metrics import pairwise_distances_argmin_min

closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)

closest

array([60, 13, 34])

final_pca.loc[[34]]

final_pca.loc[[13]]

final_pca.loc[[60]]

Hands-on:¶

Try the kmeans clusterization with different number of cluters (4 and 5)

Lifespan analysis¶

lifelines package: http://lifelines.readthedocs.io/en/latest/¶

from lifelines.datasets import load_waltons
df = load_waltons() # returns a Pandas DataFrame

print(df.head())


T = df['T']
E = df['E']

      T  E    group
0   6.0  1  miR-137
1  13.0  1  miR-137
2  13.0  1  miR-137
3  13.0  1  miR-137
4  19.0  1  miR-137

T is an array of durations, E is a either boolean or binary array representing whether the “death” was observed (alternatively an individual can be censored).

from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)  # or, more succiently, kmf.fit(T, E)

<lifelines.KaplanMeierFitter: fitted with 163 observations, 7 censored>

kmf.survival_function_
kmf.median_
kmf.plot()
plt.show()

groups = df['group']
ix = (groups == 'miR-137')

kmf.fit(T[~ix], E[~ix], label='control')
ax = kmf.plot()

kmf.fit(T[ix], E[ix], label='miR-137')
kmf.plot(ax=ax)
plt.show()

from lifelines.statistics import logrank_test

results = logrank_test(T[ix], T[~ix], E[ix], E[~ix], alpha=.99)

results.print_summary()

<lifelines.StatisticalResult>
               t_0 = -1
 null_distribution = chi squared
degrees_of_freedom = 1
             alpha = 0.99

---
 test_statistic      p  -log2(p)
         122.25 <0.005     91.99

from lifelines.datasets import load_regression_dataset
regression_dataset = load_regression_dataset()

regression_dataset.head()

from lifelines import CoxPHFitter

# Using Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()

<lifelines.CoxPHFitter: fitted with 200 observations, 11 censored>
      duration col = 'T'
         event col = 'E'
number of subjects = 200
  number of events = 189
    log-likelihood = -807.62
  time fit was run = 2019-02-26 15:53:53 UTC

---
      coef  exp(coef)  se(coef)    z      p  -log2(p)  lower 0.95  upper 0.95
var1  0.22       1.25      0.07 2.99 <0.005      8.49        0.08        0.37
var2  0.05       1.05      0.08 0.61   0.54      0.89       -0.11        0.21
var3  0.22       1.24      0.08 2.88 <0.005      7.97        0.07        0.37
---
Concordance = 0.58
Log-likelihood ratio test = 15.54 on 3 df, -log2(p)=9.47

cph.plot()
plt.show()

	metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
0	CAM_SMPL_003398\t14100\t400\t-5\t-5\tFalse\t10...
1	CAM_SMPL_003374\t35265\t450\t-5\t-5\tFalse\t10...
2	CAM_SMPL_003389\t12694\t400\t-5\t-5\tFalse\t10...
3	CAM_SMPL_003418\t13981\t450\t-5\t-5\tFalse\t10...
4	CAM_SMPL_003391\t26779\t400\t-5\t-5\tFalse\t10...

	metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
0	CAM_SMPL_003398\t14100\t400\t-5\t-5\tFalse\t10...
1	CAM_SMPL_003374\t35265\t450\t-5\t-5\tFalse\t10...
2	CAM_SMPL_003389\t12694\t400\t-5\t-5\tFalse\t10...
3	CAM_SMPL_003418\t13981\t450\t-5\t-5\tFalse\t10...
4	CAM_SMPL_003391\t26779\t400\t-5\t-5\tFalse\t10...

	metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
58	CAM_SMPL_003359\t46837\t450\t-5\t-5\tFalse\t10...
59	CAM_SMPL_003382\t31822\t400\t-5\t-5\tFalse\t10...
60	CAM_SMPL_003357\t55946\t450\t-5\t-5\tFalse\t10...
61	CAM_SMPL_003368\t15148\t450\t-5\t-5\tFalse\t10...
62	CAM_SMPL_003378\t67961\t450\t-5\t-5\tFalse\t10...

	Unnamed: 0	CAM_SMPL_003410	CAM_SMPL_003375	CAM_SMPL_003388	CAM_SMPL_003380	CAM_SMPL_003361	CAM_SMPL_003395	CAM_SMPL_003378	CAM_SMPL_003365	CAM_SMPL_003373	...	g3-mean	g1-sd	g2-sd	g3-sd	ANOVA F	ANOVA pvalue	pvalue adjusted	log2(FC)_g1/g2	log2(FC)_g1/g3	log2(FC)_g2/g3
0	COG0006 - Xaa-Pro aminopeptidase [Amino acid t...	1.545723	4.167033	1.534017	0.000000	1.724899	0.000000	2.217853	6.629477	0.000000	...	0.838736	1.323263	1.241376	1.042022	21.810674	7.604322e-08	0.000291	0.450025	1.961559	1.511534
1	COG0018 - Arginyl-tRNA synthetase [Translation...	1.024882	3.036821	2.855767	4.149942	0.000000	0.000000	3.104112	3.991667	0.000000	...	0.933072	1.318947	1.371911	1.309937	15.510043	3.718949e-06	0.014236	0.463642	1.779949	1.316308
2	COG0020 - Undecaprenyl pyrophosphate synthase ...	0.000000	3.220692	0.973403	0.000000	1.542692	0.000000	1.481556	2.195459	0.000000	...	0.388321	0.790581	0.971956	0.640390	17.853436	8.246011e-07	0.003157	0.751614	2.277260	1.525646
3	COG0021 - Transketolase [Carbohydrate transpor...	0.659364	2.388176	1.485892	0.000000	0.000000	0.000000	4.185504	5.248721	1.673285	...	0.714487	1.139996	1.044342	1.109277	29.397518	1.260633e-09	0.000005	0.738438	2.211978	1.473539
4	COG0028 - Acetolactate synthase large subunit ...	1.799150	6.434617	4.867390	0.000000	3.255559	0.000000	4.231023	9.012939	2.817330	...	2.285751	2.220290	2.264018	2.401486	15.491232	3.765361e-06	0.014414	0.330109	1.428212	1.098103
5	COG0035 - Uracil phosphoribosyltransferase [Nu...	0.000000	0.745211	0.205852	0.000000	0.000000	0.000000	2.261908	0.623539	0.000000	...	0.186456	0.538982	0.359889	0.359018	15.691544	3.300370e-06	0.012634	0.927917	2.299039	1.371122
6	COG0040 - ATP phosphoribosyltransferase [Amino...	0.637089	0.948030	1.106626	0.000000	0.000000	1.385314	1.789779	1.529666	0.000000	...	0.196886	0.611950	0.536970	0.408410	25.869401	7.914384e-09	0.000030	0.855852	2.785937	1.930085
7	COG0049 - Ribosomal protein S7 [Translation, r...	0.000000	0.000000	0.875543	0.000000	0.000000	0.000000	1.927435	1.336095	0.000000	...	0.242650	0.538830	0.441765	0.537685	14.139673	9.306025e-06	0.035623	0.757113	2.144314	1.387201
8	COG0050 - Translation elongation factor EF-Tu,...	6.583426	4.057610	4.395697	2.263685	4.002741	0.000000	4.193771	6.215216	0.000000	...	2.142647	1.355634	1.279844	2.319426	14.538509	7.105056e-06	0.027198	0.142089	1.167283	1.025195
9	COG0064 - Asp-tRNAAsn/Glu-tRNAGln amidotransfe...	3.050589	3.351769	1.429165	1.218677	0.000000	0.000000	2.973363	3.349802	0.000000	...	1.206369	1.109038	1.057338	1.454888	15.511164	3.716203e-06	0.014226	0.531574	1.452333	0.920759

	AGS	GE
count	6.300000e+01	63.000000
mean	1.397449e+06	20.054522
std	3.212700e+05	17.076021
min	7.909761e+05	1.777490
25%	1.182685e+06	6.281441
50%	1.412405e+06	16.458093
75%	1.635010e+06	30.137062
max	2.235292e+06	94.008744

	metagenome:	reads_sampled:	trimmed_length:	min_quality:	mean_quality:	filter_dups:	max_unknown:	average_genome_size:	total_bases:	genome_equivalents:
0	CAM_SMPL_003398	14100	400	-5	-5	False	100	1.444173e+06	9390688	6.502469
1	CAM_SMPL_003374	35265	450	-5	-5	False	100	1.195763e+06	27953336	23.376988
2	CAM_SMPL_003389	12694	400	-5	-5	False	100	1.185496e+06	9484191	8.000187
3	CAM_SMPL_003418	13981	450	-5	-5	False	100	1.720945e+06	11153362	6.480953
4	CAM_SMPL_003391	26779	400	-5	-5	False	100	1.461729e+06	17524363	11.988789

	AGS	GE
Sample
CAM_SMPL_003398	1.444173e+06	6.502469
CAM_SMPL_003374	1.195763e+06	23.376988
CAM_SMPL_003389	1.185496e+06	8.000187
CAM_SMPL_003418	1.720945e+06	6.480953
CAM_SMPL_003391	1.461729e+06	11.988789

	AGS	GE
Sample
CAM_SMPL_003367	1.082166e+06	1.777490
CAM_SMPL_003364	1.480442e+06	1.863965
CAM_SMPL_003358	1.295514e+06	2.374381
CAM_SMPL_003386	1.151237e+06	3.005024
CAM_SMPL_003373	9.909482e+05	3.064752

	AGS	GE
Sample
CAM_SMPL_003393	8.948887e+05	94.008744
CAM_SMPL_003406	1.458303e+06	53.784971
CAM_SMPL_003384	9.546334e+05	53.653646
CAM_SMPL_003390	8.312135e+05	48.673980
CAM_SMPL_003394	1.268083e+06	46.231671

	SAMPLE_ACC	Sample Type	SAMPLE_VOLUME	VOLUME_UNIT	FILTER_MIN	FILTER_MAX	SAMPLE_DESCRIPTION	SAMPLE_NAME	COMMENTS	TAXON_ID	...	MATERIAL_ID	sample depth - (m)	Ammonium - (umol/kg)	Silicate - (umol/kg)	Total phosphorus	salinity - (psu)	temperature - (C)	nitrite - (µmol/l)	phosphate - (umol/kg)	nitrate(NO3) - (umol/l)
0	CAM_SMPL_003356	metagenome	190	L	0.1	0.8	NaN	GS682_0.1	NaN	NaN	...	3785	24	0.263	7.079	0.395	8.00	9.47	0.033	0.208	0.212
1	CAM_SMPL_003357	metagenome	190	L	0.8	3.0	NaN	GS682_0.8	NaN	NaN	...	3786	24	0.263	7.079	0.395	8.00	9.47	0.033	0.208	0.212
2	CAM_SMPL_003358	metagenome	190	L	3.0	200.0	NaN	GS682_3.0	NaN	NaN	...	3787	24	0.263	7.079	0.395	8.00	9.47	0.033	0.208	0.212
3	CAM_SMPL_003359	metagenome	180	L	0.1	0.8	NaN	GS683_0.1	NaN	NaN	...	3788	1	0.188	7.258	0.458	13.72	18.15	NaN	0.073	0.336
4	CAM_SMPL_003360	metagenome	180	L	0.8	3.0	NaN	GS683_0.8	NaN	NaN	...	3789	1	0.188	7.258	0.458	13.72	18.15	NaN	0.073	0.336

	Sample	AGS	GE	AGS - GE	AGS (mb)	SAMPLE_ACC	Sample Type	SAMPLE_VOLUME	VOLUME_UNIT	FILTER_MIN	...	MATERIAL_ID	sample depth - (m)	Ammonium - (umol/kg)	Silicate - (umol/kg)	Total phosphorus	salinity - (psu)	temperature - (C)	nitrite - (µmol/l)	phosphate - (umol/kg)	nitrate(NO3) - (umol/l)
62	CAM_SMPL_003378	1.109780e+06	41.574616	1.109738e+06	1.109780	CAM_SMPL_003378	metagenome	180	L	0.1	...	3807	1	0.237	7.113	0.201	4.79	15.31	NaN	0.042	0.215
51	CAM_SMPL_003396	9.275912e+05	32.594154	9.275586e+05	0.927591	CAM_SMPL_003396	metagenome	200	L	0.1	...	3825	16	0.390	9.919	0.189	4.51	10.30	NaN	0.030	0.344
25	CAM_SMPL_003377	1.412405e+06	19.604380	1.412385e+06	1.412405	CAM_SMPL_003377	metagenome	200	L	0.1	...	3806	72	0.831	16.290	1.172	34.35	6.35	NaN	0.913	8.538
61	CAM_SMPL_003368	1.026569e+06	12.120798	1.026557e+06	1.026569	CAM_SMPL_003368	metagenome	200	L	0.1	...	3797	20	0.801	5.109	0.360	31.50	15.00	0.035	0.107	1.011
44	CAM_SMPL_003371	1.456099e+06	18.592277	1.456081e+06	1.456099	CAM_SMPL_003371	metagenome	200	L	0.1	...	3800	1	0.832	3.635	0.382	23.20	18.85	NaN	0.029	0.348
50	CAM_SMPL_003362	1.235364e+06	16.458093	1.235347e+06	1.235364	CAM_SMPL_003362	metagenome	200	L	0.1	...	3791	15	0.303	6.115	0.495	16.45	17.00	NaN	0.136	0.304
18	CAM_SMPL_003413	1.414618e+06	11.659474	1.414606e+06	1.414618	CAM_SMPL_003413	metagenome	200	L	0.1	...	3842	1	0.328	2.895	0.335	32.16	17.43	NaN	NaN	0.365
16	CAM_SMPL_003365	1.278115e+06	27.447482	1.278088e+06	1.278115	CAM_SMPL_003365	metagenome	200	L	0.1	...	3794	1	0.250	1.417	0.258	20.22	18.63	NaN	NaN	0.285
36	CAM_SMPL_003387	7.909761e+05	32.882668	7.909432e+05	0.790976	CAM_SMPL_003387	metagenome	200	L	0.1	...	3816	10	0.498	17.074	0.153	2.90	9.10	0.073	0.041	2.867
27	CAM_SMPL_003399	1.107213e+06	37.547570	1.107175e+06	1.107213	CAM_SMPL_003399	metagenome	150	L	0.1	...	3828	9	0.309	4.540	0.283	5.75	15.31	NaN	0.089	0.187

		COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism].	COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis].	COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism].	COG0021 - Transketolase [Carbohydrate transport and metabolism].	COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism].	COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism].	COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism].	COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis].	COG0050 - Translation elongation factor EF-Tu, a GTPase [Translation, ribosomal structure and biogenesis].	COG0064 - Asp-tRNAAsn/Glu-tRNAGln amidotransferase B subunit [Translation, ribosomal structure and biogenesis].	...	COG4270 - Uncharacterized membrane protein [Function unknown].	COG4567 - Two-component response regulator, ActR/RegA family, consists of REC and Fis-type HTH domains [Signal transduction mechanisms, Transcription].	COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms].	COG4581 - Superfamily II RNA helicase [Replication, recombination and repair].	COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport].	COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism].	COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only].	COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription].	COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis].	COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown].
Sample	FILTER_MIN
CAM_SMPL_003378	0.1	2.217853	3.104112	1.481556	4.185504	4.231023	2.261908	1.789779	1.927435	4.193771	2.973363	...	0.485059	1.100603	0.00000	1.987316	0.053097	1.655937	2.973762	0.167036	2.066227	1.372959
CAM_SMPL_003396	0.1	2.196063	2.807988	2.328216	2.349802	4.251768	1.305508	1.702963	0.884343	4.409935	3.170220	...	0.247442	0.396387	0.00000	1.750147	0.000000	0.447497	1.955898	0.000000	1.745357	0.617174
CAM_SMPL_003377	0.1	5.705442	4.945343	0.564016	3.891734	12.182144	0.975872	1.960872	0.555654	5.793683	5.142208	...	0.000000	1.231687	0.30636	1.153844	0.333392	4.756444	3.005515	0.000000	4.137977	0.000000
CAM_SMPL_003368	0.1	2.968671	1.863295	1.602269	1.672952	6.762408	0.000000	1.227721	0.759965	1.260330	0.868506	...	0.000000	3.062030	0.00000	0.327392	0.000000	2.792441	2.953601	0.000000	3.896846	0.000000
CAM_SMPL_003371	0.1	4.956612	3.899876	2.604266	4.769148	8.260903	1.841765	1.767899	0.655986	7.253030	3.634597	...	0.000000	2.231739	0.00000	1.705530	0.000000	4.657529	1.095331	0.697874	6.642156	1.049109

	Component 1	Component 2
0	11.114657	-5.224373
1	7.793254	-6.231991
2	10.776208	1.192819
3	0.830403	-2.127450
4	11.567918	1.184126

	COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism].	COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis].	COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism].	COG0021 - Transketolase [Carbohydrate transport and metabolism].	COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism].	COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism].	COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism].	COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis].	COG0050 - Translation elongation factor EF-Tu, a GTPase [Translation, ribosomal structure and biogenesis].	COG0064 - Asp-tRNAAsn/Glu-tRNAGln amidotransferase B subunit [Translation, ribosomal structure and biogenesis].	...	COG4270 - Uncharacterized membrane protein [Function unknown].	COG4567 - Two-component response regulator, ActR/RegA family, consists of REC and Fis-type HTH domains [Signal transduction mechanisms, Transcription].	COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms].	COG4581 - Superfamily II RNA helicase [Replication, recombination and repair].	COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport].	COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism].	COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only].	COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription].	COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis].	COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown].
1	0.082970	0.084591	0.082658	0.098450	0.086372	0.068815	0.097695	0.066942	0.074851	0.078895	...	0.051625	0.064301	0.019433	0.079009	0.011210	0.072247	0.068873	0.015234	0.087356	0.078763
2	0.046882	0.012706	-0.004711	0.001974	0.060956	-0.074238	-0.023298	-0.070354	0.075204	-0.023305	...	-0.072290	-0.043273	0.173289	-0.058145	0.157041	0.055173	0.084453	0.170450	0.060258	-0.099727

	var1	var2	var3	T	E
0	0.595170	1.143472	1.571079	14.785652	1
1	0.209325	0.184677	0.356980	7.335846	1
2	0.693919	0.071893	0.557960	5.269797	1
3	0.443804	1.364646	0.374221	11.684092	1
4	1.613324	0.125566	1.921325	7.639492	1

Sample	CAM_SMPL_003398	CAM_SMPL_003374	CAM_SMPL_003389	CAM_SMPL_003418	CAM_SMPL_003391	CAM_SMPL_003416	CAM_SMPL_003372	CAM_SMPL_003356	CAM_SMPL_003406	CAM_SMPL_003384	...	CAM_SMPL_003400	CAM_SMPL_003388	CAM_SMPL_003394	CAM_SMPL_003358	CAM_SMPL_003407	CAM_SMPL_003359	CAM_SMPL_003382	CAM_SMPL_003357	CAM_SMPL_003368	CAM_SMPL_003378
AGS	1.444173e+06	1.195763e+06	1.185496e+06	1.720945e+06	1.461729e+06	1.179874e+06	1.851302e+06	1.232673e+06	1.458303e+06	954633.423109	...	1.709527e+06	1.537633e+06	1.268083e+06	1.295514e+06	2.152113e+06	1.142553e+06	1.278940e+06	1.419555e+06	1.026569e+06	1.109780e+06
GE	6.502469e+00	2.337699e+01	8.000187e+00	6.480953e+00	1.198879e+01	3.388096e+01	1.290436e+01	2.672666e+01	5.378497e+01	53.653646	...	1.860793e+01	2.890871e+01	4.623167e+01	2.374381e+00	5.126890e+00	2.573314e+01	1.675219e+01	3.168172e+01	1.212080e+01	4.157462e+01

Python workshop¶

bioinformatics@age.mpg.de¶

http://bioinformatics.age.mpg.de/¶

Pre workshop instructions¶

Why Python?¶

Basic syntax¶

Variables¶

Variables and identifiers¶

Basic Data Types summary in Python¶

Numerics:¶

Sequences:¶

Boolean:¶

Mapping:¶

Numbers¶

Number type conversions¶

Hands-on:¶

Numerical operations¶

Some extra operators:¶

Hands-on¶

Strings¶

Strings are “immutable”¶

Operations with strings¶

Attention to the absence of space¶

Hands-on¶

Escape sequences¶

Lists¶

List operations¶

Hans-on¶

Tuple¶

Dictionary¶

Comparison and logical operators in Python¶

Logical:¶

Comparison:¶

Decision making¶

Identation¶

Attention for the indentation in Python:¶

Loops¶

Hands on:¶

I/O in Python¶

Functions¶

Global vs. Local variables:¶

Modules¶

To use a module you need first to import:¶

You can also use an alias for the module:¶

Hands-on¶

Questions ???¶

Second Day workshop¶

Numerical operations with Numpy¶

Operations with Numpy¶

Hands-on¶

Introduction to data frames with Pandas¶

What is a Pandas data frame?¶

This table stores the normalized abundance of each COG (ortholog genes) on each sample¶

Working with the provided files¶

It was provided 2 files from metagenomics analysis:¶

This file is a parsed output from MicrobeCensus software:¶

Checking the column names:¶

Subseting a dataframe:¶

For us, its only important now 2 columns¶

Rename columns (all):¶

Rename specific columns:¶

Working with index on dataframes:¶

General statistic on dataframe:¶

Get the min value from GE:¶

Get the min value from all columns:¶

Transpose dataframe:¶

Sort by values in one column (in this case, GE):¶

By default it will sort ascending values, but you can specify to be descending:¶

Selecting one column as Pandas Series:¶

Selecting one column as new Pandas DataFrame:¶

Slicing by rows (positional):¶

Slicing by rows (by index name):¶

Conditional selection:¶

Operations with the values:¶

Create a new column based on results of operations:¶

Delete a column:¶

More about Pandas in: http://pandas.pydata.org/¶

Cheat Sheet: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf¶

Hands on:¶

Plots with matplotlib¶

Cheat Sheet: http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf ¶

More about seaborn styles in: https://seaborn.pydata.org/generated/seaborn.set_style.html ¶

Package sklearn - more info at http://scikit-learn.org/stable/index.html ¶