Python workshop

bioinformatics@age.mpg.de

http://bioinformatics.age.mpg.de/

 


Pre workshop instructions

  1. Install docker: https://www.docker.com/products/docker-desktop

  2. Pull the workshop image

    docker pull mpgagebioinformatics/python_workshop
  3. Run the image and enter the container:

    mdkir -p ~/python_workshop
    docker run -p 8787:8787 -p 8888:8888 -v ~/python_workshop/:/home/mpiage --name python_workshop -it mpgagebioinformatics/python_workshop:latest
  4. Start jupyterhub inside the container

    module load jupyterhub
    jupyter notebook --ip=0.0.0.0
  5. Access jupyter. You will be shown a message like this:

    Copy/paste this URL into your browser when you connect for the first time,
     to login with a token:
         http://b4b294d48b2b:8888/?token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478&token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478

    Replace the "b4b294d48b2b" string (ie. bettween the http:// and :8888/) with "localhost" (eg. http://localhost:8888/?token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478&token=01d1532b8624becc5c7b288cf0c600b7805e802349e81478 ) and paste it into your web browser.

If you are not able to install docker you can still run the notebook by install Python3 - https://www.python.org - and jupyter - https://jupyter.org. Afterwards make sure you have installed all required packages:

Package            Version
------------------ -------
autograd           1.2    
backcall           0.1.0  
bleach             2.1.3  
Bottleneck         1.2.1  
cycler             0.10.0 
decorator          4.3.0  
entrypoints        0.2.3  
future             0.17.1 
html5lib           1.0.1  
ipykernel          4.8.2  
ipython            6.4.0  
ipython-genutils   0.2.0  
ipywidgets         7.2.1  
jedi               0.12.0 
Jinja2             2.10   
jsonschema         2.6.0  
jupyter            1.0.0  
jupyter-client     5.2.3  
jupyter-console    5.2.0  
jupyter-core       4.4.0  
kiwisolver         1.0.1  
lifelines          0.19.4 
MarkupSafe         1.0    
matplotlib         3.0.2  
matplotlib-venn    0.11.5 
mistune            0.8.3  
nbconvert          5.3.1  
nbformat           4.4.0  
notebook           5.5.0  
numpy              1.16.1 
pandas             0.24.1 
pandocfilters      1.4.2  
parso              0.2.1  
pexpect            4.6.0  
pickleshare        0.7.4  
pip                9.0.3  
prompt-toolkit     1.0.15 
ptyprocess         0.6.0  
Pygments           2.2.0  
pyparsing          2.3.1  
python-dateutil    2.7.3  
pytz               2018.9 
pyzmq              17.0.0 
qtconsole          4.3.1  
scikit-learn       0.20.2 
scipy              1.2.1  
seaborn            0.9.0  
Send2Trash         1.5.0  
setuptools         39.0.1 
simplegeneric      0.8.1  
six                1.11.0 
sklearn            0.0    
terminado          0.8.1  
testpath           0.3.1  
tornado            5.0.2  
traitlets          4.3.2  
wcwidth            0.1.7  
webencodings       0.5.1  
widgetsnbextension 3.2.1  
xlrd               1.2.0

Check your packages with pip3 list and install with pip3 install <package_name>==<version> --user.

You can then download the notebook start you juoyter with jupyter notebook --ip=0.0.0.0 and download the notebook from: https://github.com/mpg-age-bioinformatics/presentations-tutorials/blob/gh-pages/presentations/modules/python_workshop/python_workshop.ipynb.

 


Why Python?

 

  • Easy to learn
  • It's free and well documented.
  • Popular (easy to get help on internet forums), big community.
  • Fast !?
  • Scripts are portable (can run on windows, mac os or linux).
  • A lot of libraries for many applications ready to use.

 


 

Basic syntax

 

  • Python is a “high level” script language, witch means it is as closest as possible to natural human language.
  • It is possible to run commands interactively or in scripts.
  • On the terminal, type python:

1

2

  • Print is a function, usually in python all the functions are called by name_function()
  • But before python 3, print function can be called without parenthesis ()
  • Inside the parenthesis, for some functions, there are parameters, called arguments
  • Inside a script:

3

  • First line points to the python interpreter on the operational system
  • And now to run (on unix based system):

4

  • Using # on a script, you can write comments and they will be ignored for the execution of the program
  • Blank lines are ignored in a script
  • It is possible to write multiple statements on a single line, separated by ;

6

  • From now, in this workshop, we will use only iterative python or Jupyter notebook.
In [1]:
# First print on Jupyter

print("Hello, Python!")
Hello, Python!

Variables

 

  • A variable is something which can change!!
  • It is a way of referring to a memory location by a computer program.
  • It stores values, has a name (identifier) and data type.
  • While the program is running, the variable can be accessed, and sometimes can be changed.
  • Python is not “strongly-typed” language. It means that the type of data storage on variables can be changed (other languages like C the variable should be declared with a data type).

 


Variables and identifiers

 

  • Some people mistakes variables and identifiers.
  • But identifiers are names of variables, they have name AND other features, like a value and data type.
  • In addition, identifiers are not used only for variables, but also for functions, modules, packages, etc.
  • A valid identifier is a non-empty sequence of characters of any length with:

    a) The start character can be the underscore "_" or a capital or lower case letter.

    b) The letters following the start character can be anything which is permitted as a start character plus the digits.

    c) Just a warning for Windows-spoilt users: Identifiers are case-sensitive!

    d) Python keywords are not allowed as identifier names! Ex: and, as, assert, break, class, continue, def, del, elif, else, except, for, if, in, is, lambda, not, or, pass, return, try, with.

 

In [2]:
# Some example of variables

i = 30
J = 32.1

some_string= "string"

 

Basic Data Types summary in Python

 

Numerics:

  1. int: integers, ex: 610, 9580
  1. long: long integers of non-limited length (only python 2.x) <- Python 3 int is unlimited
  1. Floating-point, ex: 42.11, 2.5415e-12
  1. Complex, ex: x = 2 + 4i

Sequences:

  1. str: Strings (sequence of characters), ex: “ABCD”, “Hello world!”, “C_x-aer”
  1. list
  1. tuple

Boolean:

  • True or False

Mapping:

  • dict (dictionary)

Numbers

  • int (integers): positive or negative numbers without decimal point.
In [3]:
a_number = 35
type(a_number)
Out[3]:
int
  • Note that there is no “ ”. If you include quotation marks:

    It is not a integer (int) but a string (str) type.

In [4]:
a_number = "35" 
type(a_number)
Out[4]:
str
  • long (long integers): unlimited size integer to store very big numbers with a L in the and.

    Ex: 202520222271131711312111411287282652828918918181050001012021L (only Python 2.x)

In [5]:
a=202520222271131711312111411287282652828918918181050001012021
type(a)
Out[5]:
int
In [6]:
a
Out[6]:
202520222271131711312111411287282652828918918181050001012021
  • float (floating point real values): real numbers with decimal points (ex: 23.567) and also can be in scientific notation (1.54e2 – it is the same of 1.54 x 100 and the same of 154)
In [7]:
a_float_number=23.567
type(a_float_number)
Out[7]:
float
In [8]:
another_float_number = 1.54e20
type(another_float_number)
Out[8]:
float
In [9]:
another_float_number
Out[9]:
1.54e+20
In [10]:
another_float_number
Out[10]:
1.54e+20
  • complex (complex numbers): not used much in Python. They are of the form a + bJ (a and b are floats) and J an imaginary number
In [11]:
a_complexcomplex=23.567+0j
In [12]:
type(a_complexcomplex)
Out[12]:
complex

Number type conversions

In [13]:
a_float_number
Out[13]:
23.567
In [14]:
int(a_float_number)
Out[14]:
23
In [15]:
a_float_number
Out[15]:
23.567
In [16]:
# modifications on-fly during workshop

a_int_number = int(a_float_number)
In [17]:
a_int_number
Out[17]:
23
In [18]:
complex(a_float_number)
Out[18]:
(23.567+0j)
In [19]:
str(a_float_number)
Out[19]:
'23.567'

Hands-on:

  • Create a variable numeric type float
  • Convert it to a type inter

Numerical operations

 

  • Unlike for strings, for numeric data types the operators +,*,-,/ are arithmetic
In [20]:
x=55
y=30
In [21]:
x+y
Out[21]:
85
In [22]:
x-y
Out[22]:
25
In [23]:
x*y
Out[23]:
1650
In [24]:
x/y  # why 1 ??
Out[24]:
1.8333333333333333
In [25]:
float(x)/float(y)
Out[25]:
1.8333333333333333

Some extra operators:

  • % (Modulus) return the remainder of a division
  • ** (Exponent) return result of exponential calculation
In [26]:
x%y
Out[26]:
25
In [27]:
x**y
Out[27]:
16251022246560461184530336180516518652439117431640625

Hands-on

  • Create 3 variables with values 4, 12.4 and 30.
  • Calculate the sum of all and store it into a new variable
  • Multiple the result by 5 and assign it to another variable

Strings

 

  • Strings are marked by quotes
  • Wrapped with single-quote ('):
In [28]:
'This is a string with single quotes'
Out[28]:
'This is a string with single quotes'
  • Wrapped with double-quote ("):
In [29]:
"This is a string with double quotes"
Out[29]:
'This is a string with double quotes'
  • Wrapped with three characters, using either single-quote or double-quote:
In [30]:
'''A String in triple quotes can extend over multiple lines, and can contain "double" quotes.'''
Out[30]:
'A String in triple quotes can extend over multiple lines, and can contain "double" quotes.'
  • A string in Python consists of a series or sequence of characters - letters, numbers, and special characters.
  • Strings can be indexed. The first character of a string has the index 0.
In [31]:
str_1= "A string consists of characters"
In [32]:
str_1
Out[32]:
'A string consists of characters'
In [33]:
str_1[0]
Out[33]:
'A'
In [34]:
str_1[3]
Out[34]:
't'
In [35]:
len(str_1)
Out[35]:
31
  • Last character:
In [36]:
str_1[-1]
Out[36]:
's'
  • It is possible because the index can be also counted from the right, using negative values:

    “STRING” [-6],[-5],[-4],[-3],[-2],[-1]

  • In addition to the normal way, from the left:

    “STRING” [0],[1],[2],[3],[4],[5]

Strings are “immutable”

 

  • Like in Java, Python strings cannot be changed.
In [37]:
# a wrong example
#s = "Some things are immutable!"
#s[-1] = "."

Operations with strings

 

  • Concatenation: using operator + is possible to concatenate 2 or more strings:

Attention to the absence of space

In [38]:
"Hello" + "World"  #  <- *Attention to the absence of space*
Out[38]:
'HelloWorld'
  • Repetition: using operator * is possible to repeat n times one string:
In [39]:
"HelloWorld" * 3
Out[39]:
'HelloWorldHelloWorldHelloWorld'
  • Indexing: as mentioned before, is possible to recover one specific position of the string by index:
In [40]:
"HelloWorld"[0] 
Out[40]:
'H'
  • Slicing: it is possible to recover substring of strings with slicing notation:
In [41]:
"HelloWorld"[2:4]  # Note that [2:4] will get the 3nd  and 4rd  position, not the 5th 
Out[41]:
'll'
  • Size
In [42]:
len("HelloWorld")
Out[42]:
10
In [43]:
type(len("HelloWorld")) # This is a int number
Out[43]:
int
  • Split: It is possible to use a substring (or character) to split a string:
In [44]:
s = "Some things are immutable!"
In [45]:
print(s.split())  # Default is space character
['Some', 'things', 'are', 'immutable!']
In [46]:
print(s.split("a"))
['Some things ', 're immut', 'ble!']
In [47]:
s2 = "You can split some strings, and you can decide how to do it!"
In [48]:
print(s2.split(","))
['You can split some strings', ' and you can decide how to do it!']
In [49]:
print(s2.split(",")[0])
You can split some strings

Hands-on

  • Print the 7th letter of the string "RafaelCuadrat"
  • Split the names on string "Jorge,Rafael,Franziska,Daniel"
  • Create a string using the 2 first and 2 last characters from the string "I love python"
  • Add "ing" to the end of the string created.

Escape sequences

 

  • If you want to use some special characters with the literal meaning inside a string, you need to use backslash \ before.

    Ex: “The double quotation mark is \”” -> If you use only ” without \ it will close the string without show ” inside the string

    Ex2: “I like to use \” -> You need to use a backslash before a backslash to display \ inside a string

 

  • Another scape sequences using backslash include:

    • New line: \newline

    • tabular space: \t <- very important dealing with tabular files

In [50]:
# a wrong example
# print("This is quotes mark " in a string")
In [51]:
print("This is quotes mark \" in a string")
This is quotes mark " in a string
In [52]:
print("This is tabular t in a string")
This is tabular t in a string
In [53]:
print("This is tabular \t in a string")
This is tabular 	 in a string

Lists

 

  • It is a versatile data type in python and can be written as comma-separated values (items) limited by [ ].
  • Items in the list does not need to be of the same type!
In [54]:
list1 = ['physics', 'chemistry', 1997, 2000]
list2 = [1, 2, 3, 4, 5 ]
list3 = ["a", "b", "c", "d"]
  • Similar to string, index of list also starts with 0 and the list can be sliced, concatenated, etc.
In [55]:
list1
Out[55]:
['physics', 'chemistry', 1997, 2000]
In [56]:
list1[-2]
Out[56]:
1997
In [57]:
list2[0:3]
Out[57]:
[1, 2, 3]
In [58]:
list3[2]
Out[58]:
'c'
  • Unlike strings, it is possible to update a list by changing an element:
In [59]:
list1[2]
Out[59]:
1997
In [60]:
list1[2] = 2001 #replacing the item on position [2]  by the value 2001
In [61]:
list1[2]
Out[61]:
2001
In [62]:
list1
Out[62]:
['physics', 'chemistry', 2001, 2000]
In [63]:
del list1[2] # deleting  item on position [2] from the list (2001)
In [64]:
list1
Out[64]:
['physics', 'chemistry', 2000]

List operations

 

  • The same as string for + (concatenation) and * (repetition) and most of operators. Ex: len(list)
  • More examples:
In [65]:
list1= ['physics', 'chemistry', 1997, 2000, 2004, 1999,2000]
In [66]:
len(list1)
Out[66]:
7
In [67]:
2001 in list1
Out[67]:
False
In [68]:
2000 in list1
Out[68]:
True
In [69]:
list1.count(2000)
Out[69]:
2
In [70]:
list1=[ str(s) for s in list1 ]
In [71]:
list1.sort()
In [72]:
list1
Out[72]:
['1997', '1999', '2000', '2000', '2004', 'chemistry', 'physics']

Hans-on

  • From the list provided:

1 - create a new list with the lengh of each item

2 - Calculate the number of occurrences of the string "google"

In [73]:
list_hands_on = [11111, "python","perl","bash","R","google","microsoft","apple","google"]

Tuple

 

  • Tuple is very similar to list, but immutable.
  • Instead [ ] it uses ( )
In [74]:
tup1 = ('bla','ble','blu',1,30)
tup2 = (1,2,3,4,5,6) 
tup3=(50)

Dictionary

 

  • In the dictionary, for every item there is a key and a value, separated by colon (:)
  • The items are separated by comma (,)
  • Keys are unique in a dictionary, but the values may be not.
  • You can access the values by the key
In [75]:
dict1 = {'Name': 'John', 'Age': 25, 'Class': 'First'}
In [76]:
dict1['Name']
Out[76]:
'John'
In [77]:
dict1['Age']
Out[77]:
25

Comparison and logical operators in Python

 

Logical:

 *and* -  both if both operators are true, the result is true
 *or*  - on of the operators is true, the result is true
 *not*  -  reverse the logical operator

Comparison:

  ==  check if two values are equal. Ex: a == b.
  !=  check if two values are NOT equal.
  >   check if the left is greater than the right. Ex: a>b. 
  <   check if the left is less than right
  >=  bigger or equal
  <=  smaller or equal

Decision making

 

7

  • For decision making it is possible to use just if for one single condition, if and elif for more than one condition and else for anything else (and every elif).
In [78]:
var1=100
In [79]:
if var1>90:
    print("Variable  is greater than 90")
Variable  is greater than 90
In [80]:
if var1>90:
    print("Variable is greater than 90")
else:
    print("Variable can be equal or smaller than 90")
Variable is greater than 90
In [81]:
var1=90
In [82]:
if var1>90:
    print("Variable is greater than 90")
else:
    print("Variable can be equal or smaller than 90")
Variable can be equal or smaller than 90
In [83]:
if var1>90: 
    print("Variable is greater than 90")
elif var1<90:
    print("Variable is smaller than 90")
else:  #(could also be elif var1 == 90:)
    print("Variable is 90")
    
Variable is 90

Identation

 

Attention for the indentation in Python:

  • No braces or reserved words to indicate blocks of code for class, functions or flow control. It is done by indentation.

8

Loops

  • Sometimes you need to repeat the same operations, so it is possible to use a loop, with a structure of repetition based in a decision or just a fixed number of times.

9

  • For loop has the ability to iterate over the items of any sequence, such as a list or a string. 10
In [84]:
for letter in 'Python bla': 
    print('Current Letter :', letter)
Current Letter : P
Current Letter : y
Current Letter : t
Current Letter : h
Current Letter : o
Current Letter : n
Current Letter :  
Current Letter : b
Current Letter : l
Current Letter : a
In [85]:
fruits = ["apple", "mango", "banana"]
for fruit in fruits:
    print('Current fruit :', fruit)
Current fruit : apple
Current fruit : mango
Current fruit : banana
In [86]:
fruits = ["apple", "mango", "banana"]
for fruit in fruits:
    for letter in fruit:
        print('Current fruit :', fruit, 'Current letter :', letter)
Current fruit : apple Current letter : a
Current fruit : apple Current letter : p
Current fruit : apple Current letter : p
Current fruit : apple Current letter : l
Current fruit : apple Current letter : e
Current fruit : mango Current letter : m
Current fruit : mango Current letter : a
Current fruit : mango Current letter : n
Current fruit : mango Current letter : g
Current fruit : mango Current letter : o
Current fruit : banana Current letter : b
Current fruit : banana Current letter : a
Current fruit : banana Current letter : n
Current fruit : banana Current letter : a
Current fruit : banana Current letter : n
Current fruit : banana Current letter : a
  • While loop statement in Python programming language repeatedly executes a target statement as long as a given condition is true.

11

In [87]:
count = 0 
while (count < 9):
    print("The count is:", count)
    count = count+1
print("Bye!")
The count is: 0
The count is: 1
The count is: 2
The count is: 3
The count is: 4
The count is: 5
The count is: 6
The count is: 7
The count is: 8
Bye!
  • The infinite loop: you need to take care when using while loops to not create an endless loop (unless you need one).  
var == 1 
while var == 1:
    print "Endless loop"

 

  • To stop it you need to use CTRL+C

Hands on:

  • Create a list of words on the string "I'm learning python for bioinformatics"
  • Create a loop and print all words from the list, but just if there is "n" in the word

I/O in Python

 

  • A program (or script) needs to deal with an input and to generate an output. In Python there are many ways to input data and we will see the most basic ones first.
  • The input([prompt]) function reads one line from standard input and returns it as a string (removing the trailing newline).
In [88]:
str_1 = input("Enter your input: ")
print("Received input is : ", str_1)
Enter your input: jorge
Received input is :  jorge
  • The input([prompt]) function is equivalent to raw_input, except that it assumes the input is a valid Python expression and returns the evaluated result to you.
In [89]:
# str_2 = input("Enter your input: ");
# print "Received input is : ", str_2
  • Before read or write a file, you have to open it with the open() function.
  • This function create a file object:
In [90]:
afile = open("example.txt", "w+")
  • The first argument is the file name, and the second is the “access mode”: It determines if the file will be only read, if can be write, append, etc. The default is r (read).
  • In the example, w+ is for both writing and reading, overwriting the existing file if the file exists.

  • Example of other modes are:

‘w’ – Write mode which is used to edit and write new information to the file (any existing files with the same name will be erased when this mode is activated)

‘a’ – Appending mode, which is used to add new data to the end of the file; that is new information is automatically amended to the end

‘r+’ – Special read and write mode, which is used to handle both actions when working with a file

In [91]:
afile #afile is an object 
Out[91]:
<_io.TextIOWrapper name='example.txt' mode='w+' encoding='UTF-8'>
  • Writing text on the new file
In [92]:
afile.write( "Python is amazing.\nYeah its great!!\n")  # \n is new line 
Out[92]:
36
  • Closing the file:
In [93]:
afile.close()
  • Open, read and close a file:
In [94]:
fo = open("example.txt", "r+")
In [95]:
str_1 = fo.read()
In [96]:
print(str_1.endswith)
<built-in method endswith of str object at 0x7f414dfc4138>
In [97]:
str_1.splitlines()[1]
Out[97]:
'Yeah its great!!'
In [98]:
fo = open("example.txt", "r+")
str_2 = fo.read();
print(str_2)
fo.close()
Python is amazing.
Yeah its great!!

In [99]:
fo.close()
In [100]:
### include fasta file reading and manupulations 

Functions

 

  • A function is a block of organized, reusable code that is used to perform a single, related action. Functions provide better modularity for your application and a high degree of code reusing.
  • As you already know, Python gives you many built-in functions like print(), etc. but you can also create your own functions. These functions are called user-defined functions.
  • You can define functions to provide the required functionality.
  • Function blocks begin with the keyword def followed by the function name and parentheses ( ):
In [101]:
def a_new_function():
    print("A new function")
    return
In [102]:
a_new_function()
A new function
  • Any input parameters or arguments should be placed within these parentheses. You can also define parameters inside these parentheses.
  • The first statement of a function can be an optional statement - the documentation string of the function or docstring.
  • The code block within every function starts with a colon (:) and is indented.
  • The statement return [expression] exits a function, optionally passing back an expression to the caller. A return statement with no arguments is the same as return None.
In [103]:
def a_new_function(a_str):
    print(a_str)
    return
In [104]:
a_new_function("another string")
another string
  • It is possible to use arguments as keyword arguments. It allows you to skip arguments or place out of order because python will identify by the keyword
In [105]:
# Function definition is here
def printinfo( name, age ):
    "This prints a passed info into this function"
    print("Name: ", name)
    print("Age ", age)
    return;
In [106]:
# Now you can call printinfo function - no keyword, but same order 
printinfo("miki",50)
Name:  miki
Age  50
In [107]:
printinfo(50,"miki")
Name:  50
Age  miki
In [108]:
# Now you can call printinfo function - using the keywords 
printinfo(age=50,name="miki")
Name:  miki
Age  50
  • How to get help:
In [109]:
help(printinfo)
Help on function printinfo in module __main__:

printinfo(name, age)
    This prints a passed info into this function

  • It is possible to use default arguments:
In [110]:
def printinfo( name, age = 35 ): # default argument age is 35
    "This prints a passed info into this function"
    print("Name: ", name)
    print("Age ", age)
    return;
In [111]:
printinfo( name="miki" ) # no age argument, it will be the default
Name:  miki
Age  35
  • The statement return [expression] exits a function, optionally passing back an expression to the caller. A return statement with no arguments is the same as return None.
In [112]:
def sum_( arg1, arg2 ):
   # Add both the parameters and return them."
    total = arg1 + arg2
    print("Inside the function : ", total)
    return total;
In [113]:
sum_( 10, 20 )
Inside the function :  30
Out[113]:
30

Global vs. Local variables:

  • Variables that are defined inside a function body have a local scope, and those defined outside have a global scope.
In [114]:
total = 0 # Global variable.
In [115]:
def sum_2( arg1, arg2 ):
    # Add both the parameters and return them."
    total = arg1 + arg2; # Local variable.
    print("Inside the function local total : ", total)
    return
In [116]:
sum_2( 10, 20 );
print("Outside the function global total : ", total )
Inside the function local total :  30
Outside the function global total :  0

Modules

 

  • A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use.
  • A module is a Python object with arbitrarily named attributes that you can bind and reference.
  • Simply, a module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.
  • The Python code for a module named aname normally resides in a file named aname.py. Here's an example of a simple module, support.py
In [117]:
def print_func( name ):
    print("Hello : ", name)
    return
In [118]:
# Creating a module on fly - just illustrative, never do it, use a text editor or IDE
In [119]:
new_mod=open("support.py","w+")
new_mod.write("def print_func( name ):\n\tprint(\"Hello : \", name)\n\treturn")
new_mod.close()

To use a module you need first to import:

In [120]:
# Import module support
import support

# Now you can call defined function that module as follows
support.print_func("Zara")
Hello :  Zara

You can also use an alias for the module:

In [121]:
# Import module support with an alias “sup”
import support as sup
In [122]:
sup.print_func("Zara")
Hello :  Zara

Hands-on

  • Write a Python function that takes a list of words and returns the length of the longest one.

Questions ???

12


   

Second Day workshop

 

  • Numerical operations (Numpy)
  • Dataframes (pandas)
  • Plots (matplotlib and seaborn)  

Numerical operations with Numpy

 

  • Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays (including many numerical operations)
  • We will not cover arrays in detail, but we will see how to use numpy to do numerical operations
  • NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality.
In [123]:
import numpy as np
In [124]:
a = np.arange(15).reshape(3,5)
In [125]:
a
Out[125]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
In [126]:
a = np.array([2,3,4])
b = np.array([2.1,3.2,4.5])

Operations with Numpy

In [127]:
c=a+b
c
Out[127]:
array([4.1, 6.2, 8.5])
In [128]:
d=a-b
d
Out[128]:
array([-0.1, -0.2, -0.5])
In [129]:
d**2
Out[129]:
array([0.01, 0.04, 0.25])
In [130]:
a*b # Element wise product
Out[130]:
array([ 4.2,  9.6, 18. ])
In [131]:
a.dot(b) # Matrix product
Out[131]:
31.8
In [132]:
e=np.arange(12).reshape(3,4)
e
Out[132]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [133]:
e.sum(axis=0)
Out[133]:
array([12, 15, 18, 21])
In [134]:
e.sum(axis=1)
Out[134]:
array([ 6, 22, 38])
In [135]:
e.min(axis=0)
Out[135]:
array([0, 1, 2, 3])
In [136]:
e.min(axis=1)
Out[136]:
array([0, 4, 8])
In [137]:
e.max(axis=1)
Out[137]:
array([ 3,  7, 11])
In [138]:
e.cumsum(axis=1)
Out[138]:
array([[ 0,  1,  3,  6],
       [ 4,  9, 15, 22],
       [ 8, 17, 27, 38]])
In [139]:
e.T
Out[139]:
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])
In [140]:
e[2,3]
Out[140]:
11
In [141]:
e[-1]
Out[141]:
array([ 8,  9, 10, 11])

Hands-on

  • with the 2 provided arrays:

1 - reshape both to 3 rows and 4 columns

2 - Transpose both

3 - calculate the sum of the cumsum of both arrays

In [142]:
anewarray=np.random.rand(4,3)
anotherarray=np.random.rand(4,3)

Introduction to data frames with Pandas

What is a Pandas data frame?

  • It is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
  • Arithmetic operations align on both row and column labels.

13

  • In general, we import a tabular or CSV file to a pandas dataframe
  • The are some options, for example, we can use different separator character, we can read a file with or without a header
In [143]:
import pandas as pd


df=pd.read_csv("new_AVS.tsv")
In [144]:
df.head()
Out[144]:
metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
0 CAM_SMPL_003398\t14100\t400\t-5\t-5\tFalse\t10...
1 CAM_SMPL_003374\t35265\t450\t-5\t-5\tFalse\t10...
2 CAM_SMPL_003389\t12694\t400\t-5\t-5\tFalse\t10...
3 CAM_SMPL_003418\t13981\t450\t-5\t-5\tFalse\t10...
4 CAM_SMPL_003391\t26779\t400\t-5\t-5\tFalse\t10...
In [145]:
#checking the first rows of dataframe - It shows 5 rows by default, but you can pass argument inside ()
df.head()
Out[145]:
metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
0 CAM_SMPL_003398\t14100\t400\t-5\t-5\tFalse\t10...
1 CAM_SMPL_003374\t35265\t450\t-5\t-5\tFalse\t10...
2 CAM_SMPL_003389\t12694\t400\t-5\t-5\tFalse\t10...
3 CAM_SMPL_003418\t13981\t450\t-5\t-5\tFalse\t10...
4 CAM_SMPL_003391\t26779\t400\t-5\t-5\tFalse\t10...
In [146]:
df.tail() # Last 5 rows 
Out[146]:
metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
58 CAM_SMPL_003359\t46837\t450\t-5\t-5\tFalse\t10...
59 CAM_SMPL_003382\t31822\t400\t-5\t-5\tFalse\t10...
60 CAM_SMPL_003357\t55946\t450\t-5\t-5\tFalse\t10...
61 CAM_SMPL_003368\t15148\t450\t-5\t-5\tFalse\t10...
62 CAM_SMPL_003378\t67961\t450\t-5\t-5\tFalse\t10...
  • By default, function read_csv() would use "," as separator to read the table.
  • We can change it for any carachter, for example tabular "\t":
In [147]:
df=pd.read_csv("new_AVS.tsv",sep="\t")
In [148]:
df.head()
Out[148]:
metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
0 CAM_SMPL_003398 14100 400 -5 -5 False 100 1.444173e+06 9390688 6.502469
1 CAM_SMPL_003374 35265 450 -5 -5 False 100 1.195763e+06 27953336 23.376988
2 CAM_SMPL_003389 12694 400 -5 -5 False 100 1.185496e+06 9484191 8.000187
3 CAM_SMPL_003418 13981 450 -5 -5 False 100 1.720945e+06 11153362 6.480953
4 CAM_SMPL_003391 26779 400 -5 -5 False 100 1.461729e+06 17524363 11.988789
  • It's also possible to read excel tables using read_excel("file.xls")
In [149]:
df_cog=pd.read_excel("CAMERA_COG_RPKG_sig.xlsx")

This table stores the normalized abundance of each COG (ortholog genes) on each sample

In [150]:
df_cog.head(10)
Out[150]:
Unnamed: 0 CAM_SMPL_003410 CAM_SMPL_003375 CAM_SMPL_003388 CAM_SMPL_003380 CAM_SMPL_003361 CAM_SMPL_003395 CAM_SMPL_003378 CAM_SMPL_003365 CAM_SMPL_003373 ... g3-mean g1-sd g2-sd g3-sd ANOVA F ANOVA pvalue pvalue adjusted log2(FC)_g1/g2 log2(FC)_g1/g3 log2(FC)_g2/g3
0 COG0006 - Xaa-Pro aminopeptidase [Amino acid t... 1.545723 4.167033 1.534017 0.000000 1.724899 0.000000 2.217853 6.629477 0.000000 ... 0.838736 1.323263 1.241376 1.042022 21.810674 7.604322e-08 0.000291 0.450025 1.961559 1.511534
1 COG0018 - Arginyl-tRNA synthetase [Translation... 1.024882 3.036821 2.855767 4.149942 0.000000 0.000000 3.104112 3.991667 0.000000 ... 0.933072 1.318947 1.371911 1.309937 15.510043 3.718949e-06 0.014236 0.463642 1.779949 1.316308
2 COG0020 - Undecaprenyl pyrophosphate synthase ... 0.000000 3.220692 0.973403 0.000000 1.542692 0.000000 1.481556 2.195459 0.000000 ... 0.388321 0.790581 0.971956 0.640390 17.853436 8.246011e-07 0.003157 0.751614 2.277260 1.525646
3 COG0021 - Transketolase [Carbohydrate transpor... 0.659364 2.388176 1.485892 0.000000 0.000000 0.000000 4.185504 5.248721 1.673285 ... 0.714487 1.139996 1.044342 1.109277 29.397518 1.260633e-09 0.000005 0.738438 2.211978 1.473539
4 COG0028 - Acetolactate synthase large subunit ... 1.799150 6.434617 4.867390 0.000000 3.255559 0.000000 4.231023 9.012939 2.817330 ... 2.285751 2.220290 2.264018 2.401486 15.491232 3.765361e-06 0.014414 0.330109 1.428212 1.098103
5 COG0035 - Uracil phosphoribosyltransferase [Nu... 0.000000 0.745211 0.205852 0.000000 0.000000 0.000000 2.261908 0.623539 0.000000 ... 0.186456 0.538982 0.359889 0.359018 15.691544 3.300370e-06 0.012634 0.927917 2.299039 1.371122
6 COG0040 - ATP phosphoribosyltransferase [Amino... 0.637089 0.948030 1.106626 0.000000 0.000000 1.385314 1.789779 1.529666 0.000000 ... 0.196886 0.611950 0.536970 0.408410 25.869401 7.914384e-09 0.000030 0.855852 2.785937 1.930085
7 COG0049 - Ribosomal protein S7 [Translation, r... 0.000000 0.000000 0.875543 0.000000 0.000000 0.000000 1.927435 1.336095 0.000000 ... 0.242650 0.538830 0.441765 0.537685 14.139673 9.306025e-06 0.035623 0.757113 2.144314 1.387201
8 COG0050 - Translation elongation factor EF-Tu,... 6.583426 4.057610 4.395697 2.263685 4.002741 0.000000 4.193771 6.215216 0.000000 ... 2.142647 1.355634 1.279844 2.319426 14.538509 7.105056e-06 0.027198 0.142089 1.167283 1.025195
9 COG0064 - Asp-tRNAAsn/Glu-tRNAGln amidotransfe... 3.050589 3.351769 1.429165 1.218677 0.000000 0.000000 2.973363 3.349802 0.000000 ... 1.206369 1.109038 1.057338 1.454888 15.511164 3.716203e-06 0.014226 0.531574 1.452333 0.920759

10 rows × 76 columns

Working with the provided files

It was provided 2 files from metagenomics analysis:

  • new_avs.csv and metatable.csv
  • AVS is Average Genome Size calculated based on gene markers for each metagenomic sample
  • Metatable is metadata about each sample
In [151]:
df=pd.read_csv("new_AVS.tsv",sep="\t") 

This file is a parsed output from MicrobeCensus software:

https://github.com/snayfach/MicrobeCensus

MicrobeCensus is a fast and easy to use pipeline for estimating the average genome size (AGS) of a microbial community from metagenomic data.

In [152]:
df.head()  
Out[152]:
metagenome: reads_sampled: trimmed_length: min_quality: mean_quality: filter_dups: max_unknown: average_genome_size: total_bases: genome_equivalents:
0 CAM_SMPL_003398 14100 400 -5 -5 False 100 1.444173e+06 9390688 6.502469
1 CAM_SMPL_003374 35265 450 -5 -5 False 100 1.195763e+06 27953336 23.376988
2 CAM_SMPL_003389 12694 400 -5 -5 False 100 1.185496e+06 9484191 8.000187
3 CAM_SMPL_003418 13981 450 -5 -5 False 100 1.720945e+06 11153362 6.480953
4 CAM_SMPL_003391 26779 400 -5 -5 False 100 1.461729e+06 17524363 11.988789
  • The samples used on this study are from GOS - baltic sea.

Checking the column names:

In [153]:
df.columns
Out[153]:
Index(['metagenome:', 'reads_sampled:', 'trimmed_length:', 'min_quality:',
       'mean_quality:', 'filter_dups:', 'max_unknown:', 'average_genome_size:',
       'total_bases:', 'genome_equivalents:'],
      dtype='object')
In [154]:
l=list(df.columns)
In [155]:
l
Out[155]:
['metagenome:',
 'reads_sampled:',
 'trimmed_length:',
 'min_quality:',
 'mean_quality:',
 'filter_dups:',
 'max_unknown:',
 'average_genome_size:',
 'total_bases:',
 'genome_equivalents:']

Subseting a dataframe:

For us, its only important now 2 columns

In [156]:
new_df=df[["metagenome:","average_genome_size:","genome_equivalents:"]]
In [157]:
new_df.head()
Out[157]:
metagenome: average_genome_size: genome_equivalents:
0 CAM_SMPL_003398 1.444173e+06 6.502469
1 CAM_SMPL_003374 1.195763e+06 23.376988
2 CAM_SMPL_003389 1.185496e+06 8.000187
3 CAM_SMPL_003418 1.720945e+06 6.480953
4 CAM_SMPL_003391 1.461729e+06 11.988789

Rename columns (all):

In [158]:
new_df.columns=["Sample","AGS","GE"]
In [159]:
new_df.head()
Out[159]:
Sample AGS GE
0 CAM_SMPL_003398 1.444173e+06 6.502469
1 CAM_SMPL_003374 1.195763e+06 23.376988
2 CAM_SMPL_003389 1.185496e+06 8.000187
3 CAM_SMPL_003418 1.720945e+06 6.480953
4 CAM_SMPL_003391 1.461729e+06 11.988789

Rename specific columns:

In [160]:
new_df.rename(columns= {"Sample":"sample"},inplace=False).head() #Needs inplace = True to really change it
Out[160]:
sample AGS GE
0 CAM_SMPL_003398 1.444173e+06 6.502469
1 CAM_SMPL_003374 1.195763e+06 23.376988
2 CAM_SMPL_003389 1.185496e+06 8.000187
3 CAM_SMPL_003418 1.720945e+06 6.480953
4 CAM_SMPL_003391 1.461729e+06 11.988789

Working with index on dataframes:

  • this will set index and display the new_df with new index, but no alterations to original dataframe:
In [161]:
new_df.set_index("Sample").head() #Just show the dataframe with the index on "Sample", 
                                  # no alterations to original df
Out[161]:
AGS GE
Sample
CAM_SMPL_003398 1.444173e+06 6.502469
CAM_SMPL_003374 1.195763e+06 23.376988
CAM_SMPL_003389 1.185496e+06 8.000187
CAM_SMPL_003418 1.720945e+06 6.480953
CAM_SMPL_003391 1.461729e+06 11.988789
In [162]:
#Original still unchanged
new_df.head()
Out[162]:
Sample AGS GE
0 CAM_SMPL_003398 1.444173e+06 6.502469
1 CAM_SMPL_003374 1.195763e+06 23.376988
2 CAM_SMPL_003389 1.185496e+06 8.000187
3 CAM_SMPL_003418 1.720945e+06 6.480953
4 CAM_SMPL_003391 1.461729e+06 11.988789
In [163]:
new_df.set_index("Sample",inplace=True) # inplace=True modifies the dataframe
In [164]:
# Now its changed
new_df.head()
Out[164]:
AGS GE
Sample
CAM_SMPL_003398 1.444173e+06 6.502469
CAM_SMPL_003374 1.195763e+06 23.376988
CAM_SMPL_003389 1.185496e+06 8.000187
CAM_SMPL_003418 1.720945e+06 6.480953
CAM_SMPL_003391 1.461729e+06 11.988789

General statistic on dataframe:

In [165]:
new_df.describe()
Out[165]:
AGS GE
count 6.300000e+01 63.000000
mean 1.397449e+06 20.054522
std 3.212700e+05 17.076021
min 7.909761e+05 1.777490
25% 1.182685e+06 6.281441
50% 1.412405e+06 16.458093
75% 1.635010e+06 30.137062
max 2.235292e+06 94.008744

Get the min value from GE:

In [166]:
new_df["GE"].min()
Out[166]:
1.77748993023

Get the min value from all columns:

In [167]:
new_df.min()
Out[167]:
AGS    790976.120718
GE          1.777490
dtype: float64
In [168]:
new_df.max()
Out[168]:
AGS    2.235292e+06
GE     9.400874e+01
dtype: float64
In [169]:
new_df.dtypes
Out[169]:
AGS    float64
GE     float64
dtype: object
In [170]:
type(new_df)
Out[170]:
pandas.core.frame.DataFrame

Transpose dataframe:

In [171]:
new_df.T
Out[171]:
Sample CAM_SMPL_003398 CAM_SMPL_003374 CAM_SMPL_003389 CAM_SMPL_003418 CAM_SMPL_003391 CAM_SMPL_003416 CAM_SMPL_003372 CAM_SMPL_003356 CAM_SMPL_003406 CAM_SMPL_003384 ... CAM_SMPL_003400 CAM_SMPL_003388 CAM_SMPL_003394 CAM_SMPL_003358 CAM_SMPL_003407 CAM_SMPL_003359 CAM_SMPL_003382 CAM_SMPL_003357 CAM_SMPL_003368 CAM_SMPL_003378
AGS 1.444173e+06 1.195763e+06 1.185496e+06 1.720945e+06 1.461729e+06 1.179874e+06 1.851302e+06 1.232673e+06 1.458303e+06 954633.423109 ... 1.709527e+06 1.537633e+06 1.268083e+06 1.295514e+06 2.152113e+06 1.142553e+06 1.278940e+06 1.419555e+06 1.026569e+06 1.109780e+06
GE 6.502469e+00 2.337699e+01 8.000187e+00 6.480953e+00 1.198879e+01 3.388096e+01 1.290436e+01 2.672666e+01 5.378497e+01 53.653646 ... 1.860793e+01 2.890871e+01 4.623167e+01 2.374381e+00 5.126890e+00 2.573314e+01 1.675219e+01 3.168172e+01 1.212080e+01 4.157462e+01

2 rows × 63 columns

In [172]:
#some hands on here

Sort by values in one column (in this case, GE):

In [173]:
new_df.sort_values(by="GE").head()
Out[173]:
AGS GE
Sample
CAM_SMPL_003367 1.082166e+06 1.777490
CAM_SMPL_003364 1.480442e+06 1.863965
CAM_SMPL_003358 1.295514e+06 2.374381
CAM_SMPL_003386 1.151237e+06 3.005024
CAM_SMPL_003373 9.909482e+05 3.064752

By default it will sort ascending values, but you can specify to be descending:

In [174]:
new_df.sort_values(by="GE",ascending=False).head()
Out[174]:
AGS GE
Sample
CAM_SMPL_003393 8.948887e+05 94.008744
CAM_SMPL_003406 1.458303e+06 53.784971
CAM_SMPL_003384 9.546334e+05 53.653646
CAM_SMPL_003390 8.312135e+05 48.673980
CAM_SMPL_003394 1.268083e+06 46.231671

Selecting one column as Pandas Series:

In [175]:
new_df["AGS"].head()
Out[175]:
Sample
CAM_SMPL_003398    1.444173e+06
CAM_SMPL_003374    1.195763e+06
CAM_SMPL_003389    1.185496e+06
CAM_SMPL_003418    1.720945e+06
CAM_SMPL_003391    1.461729e+06
Name: AGS, dtype: float64

Selecting one column as new Pandas DataFrame:

In [176]:
new_df[["AGS"]].head()
Out[176]:
AGS
Sample
CAM_SMPL_003398 1.444173e+06
CAM_SMPL_003374 1.195763e+06
CAM_SMPL_003389 1.185496e+06
CAM_SMPL_003418 1.720945e+06
CAM_SMPL_003391 1.461729e+06

Slicing by rows (positional):

In [177]:
new_df.iloc[0:3]
Out[177]:
AGS GE
Sample
CAM_SMPL_003398 1.444173e+06 6.502469
CAM_SMPL_003374 1.195763e+06 23.376988
CAM_SMPL_003389 1.185496e+06 8.000187

Slicing by rows (by index name):

In [178]:
new_df.loc["CAM_SMPL_003398":"CAM_SMPL_003418"]
Out[178]:
AGS GE
Sample
CAM_SMPL_003398 1.444173e+06 6.502469
CAM_SMPL_003374 1.195763e+06 23.376988
CAM_SMPL_003389 1.185496e+06 8.000187
CAM_SMPL_003418 1.720945e+06 6.480953

Conditional selection:

In [179]:
new_df[(new_df["GE"]<40) & (new_df["AGS"]>1000000)][["AGS"]].head()
Out[179]:
AGS
Sample
CAM_SMPL_003398 1.444173e+06
CAM_SMPL_003374 1.195763e+06
CAM_SMPL_003389 1.185496e+06
CAM_SMPL_003418 1.720945e+06
CAM_SMPL_003391 1.461729e+06
In [180]:
new_df[new_df["GE"]>40] # Select the rows where GE is bigger than 40
Out[180]:
AGS GE
Sample
CAM_SMPL_003406 1.458303e+06 53.784971
CAM_SMPL_003384 9.546334e+05 53.653646
CAM_SMPL_003393 8.948887e+05 94.008744
CAM_SMPL_003381 9.109262e+05 46.050540
CAM_SMPL_003390 8.312135e+05 48.673980
CAM_SMPL_003394 1.268083e+06 46.231671
CAM_SMPL_003378 1.109780e+06 41.574616

Operations with the values:

In [181]:
new_df.max()-new_df.min()
Out[181]:
AGS    1.444315e+06
GE     9.223125e+01
dtype: float64

Create a new column based on results of operations:

In [182]:
new_df["GE - AGS"]=new_df["GE"]-new_df["AGS"]
/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [183]:
new_df.head()
Out[183]:
AGS GE GE - AGS
Sample
CAM_SMPL_003398 1.444173e+06 6.502469 -1.444166e+06
CAM_SMPL_003374 1.195763e+06 23.376988 -1.195740e+06
CAM_SMPL_003389 1.185496e+06 8.000187 -1.185488e+06
CAM_SMPL_003418 1.720945e+06 6.480953 -1.720938e+06
CAM_SMPL_003391 1.461729e+06 11.988789 -1.461717e+06

Delete a column:

In [184]:
del new_df["GE - AGS"]
In [185]:
new_df.head()
Out[185]:
AGS GE
Sample
CAM_SMPL_003398 1.444173e+06 6.502469
CAM_SMPL_003374 1.195763e+06 23.376988
CAM_SMPL_003389 1.185496e+06 8.000187
CAM_SMPL_003418 1.720945e+06 6.480953
CAM_SMPL_003391 1.461729e+06 11.988789
In [186]:
new_df["AGS - GE"]=new_df["AGS"]-new_df["GE"]
/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Hands on:

  • read a dataframe (any!! From internet, from your computer...)
  • print the columns
  • store the column names into a list
  • print the max value from the last column
  • print the max value from the last row

Plots with matplotlib

  • Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

https://matplotlib.org/

In [187]:
import matplotlib.pyplot as plt
  • Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

https://seaborn.pydata.org/

In [188]:
import seaborn as sns

Starting with plots - the ugly and lazy way

In [189]:
#function plot direct from pandas dataframe
new_df["GE"].plot()
plt.show()
In [190]:
#Rotate the name of samples 

new_df["GE"].plot()


plt.xticks(rotation=90)

plt.show()

This plot is useless, we dont have continuos data, we need a bar plot in this case

In [191]:
new_df["GE"].plot(kind="bar")  # kind can be also pie, scatter, etc ...
plt.xticks(rotation=90)
plt.show()

We need to control the size and style of figure, and also axis titles, etc

In [192]:
sns.set_style("white")

#Define a figure and its size
plt.figure(figsize=(15,8))



#Bar plot 

# range(len(new_df["AGS"])) -> define the number of bars to be plotted and the position on X axis 
x=range(len(new_df["AGS"]))

# new_df["AGS"] -> values for axis Y
y=new_df["AGS"]

plt.bar(x,y,color="b")

#axis labels
plt.ylabel("AGS ",fontsize="16")
plt.xlabel("Samples",fontsize="16")

plt.title("PLOT FOR WORKSHOP",fontsize="22")
#xticks -> define the text to be ploted on axis X -> here we use index (the name of samples)


### error .... index values not matching the positions of xticks

plt.xticks(range(0,len(new_df.index),2), new_df.index,rotation="vertical",fontsize="12")
plt.yticks(fontsize="22")

## Save the figure in pdf

plt.savefig("my_first_plot.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')

#Necessary to display the plot here
plt.show()

The values on Y are in bases, hard to read, so we can convert to Mb

In [193]:
new_df["AGS (mb)"]=new_df["AGS"]/1000000
/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
In [194]:
#Seaborn style
sns.set_style("white")

#Define a figure and its size
plt.figure(figsize=(15,8))


x=range(len(new_df["AGS (mb)"]))

y=new_df["AGS (mb)"]


#Bar plot 
plt.bar(x,y,color="b")

#axis labels
plt.ylabel("AGS (Mb)",fontsize="16")
plt.xlabel("Samples",fontsize="16")


plt.xticks(range(len(new_df.index)), new_df.index,rotation="vertical")

plt.show()

Metadata from samples

In [195]:
df2=pd.read_csv("metatable.csv", encoding = "latin")
In [196]:
df2.head()
Out[196]:
SAMPLE_ACC Sample Type SAMPLE_VOLUME VOLUME_UNIT FILTER_MIN FILTER_MAX SAMPLE_DESCRIPTION SAMPLE_NAME COMMENTS TAXON_ID ... MATERIAL_ID sample depth - (m) Ammonium - (umol/kg) Silicate - (umol/kg) Total phosphorus salinity - (psu) temperature - (C) nitrite - (µmol/l) phosphate - (umol/kg) nitrate(NO3) - (umol/l)
0 CAM_SMPL_003356 metagenome 190 L 0.1 0.8 NaN GS682_0.1 NaN NaN ... 3785 24 0.263 7.079 0.395 8.00 9.47 0.033 0.208 0.212
1 CAM_SMPL_003357 metagenome 190 L 0.8 3.0 NaN GS682_0.8 NaN NaN ... 3786 24 0.263 7.079 0.395 8.00 9.47 0.033 0.208 0.212
2 CAM_SMPL_003358 metagenome 190 L 3.0 200.0 NaN GS682_3.0 NaN NaN ... 3787 24 0.263 7.079 0.395 8.00 9.47 0.033 0.208 0.212
3 CAM_SMPL_003359 metagenome 180 L 0.1 0.8 NaN GS683_0.1 NaN NaN ... 3788 1 0.188 7.258 0.458 13.72 18.15 NaN 0.073 0.336
4 CAM_SMPL_003360 metagenome 180 L 0.8 3.0 NaN GS683_0.8 NaN NaN ... 3789 1 0.188 7.258 0.458 13.72 18.15 NaN 0.073 0.336

5 rows × 41 columns

In [197]:
new_df.head()
Out[197]:
AGS GE AGS - GE AGS (mb)
Sample
CAM_SMPL_003398 1.444173e+06 6.502469 1.444166e+06 1.444173
CAM_SMPL_003374 1.195763e+06 23.376988 1.195740e+06 1.195763
CAM_SMPL_003389 1.185496e+06 8.000187 1.185488e+06 1.185496
CAM_SMPL_003418 1.720945e+06 6.480953 1.720938e+06 1.720945
CAM_SMPL_003391 1.461729e+06 11.988789 1.461717e+06 1.461729
In [198]:
new_df.reset_index(inplace=True)
In [199]:
new_df.head()
Out[199]:
Sample AGS GE AGS - GE AGS (mb)
0 CAM_SMPL_003398 1.444173e+06 6.502469 1.444166e+06 1.444173
1 CAM_SMPL_003374 1.195763e+06 23.376988 1.195740e+06 1.195763
2 CAM_SMPL_003389 1.185496e+06 8.000187 1.185488e+06 1.185496
3 CAM_SMPL_003418 1.720945e+06 6.480953 1.720938e+06 1.720945
4 CAM_SMPL_003391 1.461729e+06 11.988789 1.461717e+06 1.461729
In [200]:
# checking size of both dataframes, if they have same number of samples
In [201]:
len(new_df)
Out[201]:
63
In [202]:
len(df2)
Out[202]:
63
In [203]:
#mergeing 2 df
merged=pd.merge(new_df,df2,left_on="Sample",right_on="SAMPLE_ACC")
In [204]:
merged.head(15)
Out[204]:
Sample AGS GE AGS - GE AGS (mb) SAMPLE_ACC Sample Type SAMPLE_VOLUME VOLUME_UNIT FILTER_MIN ... MATERIAL_ID sample depth - (m) Ammonium - (umol/kg) Silicate - (umol/kg) Total phosphorus salinity - (psu) temperature - (C) nitrite - (µmol/l) phosphate - (umol/kg) nitrate(NO3) - (umol/l)
0 CAM_SMPL_003398 1.444173e+06 6.502469 1.444166e+06 1.444173 CAM_SMPL_003398 metagenome 200 L 3.0 ... 3827 16 0.390 9.919 0.189 4.51 10.30 NaN 0.030 0.344
1 CAM_SMPL_003374 1.195763e+06 23.376988 1.195740e+06 1.195763 CAM_SMPL_003374 metagenome 200 L 0.1 ... 3803 30 0.196 5.747 0.461 32.40 13.54 0.047 0.124 1.305
2 CAM_SMPL_003389 1.185496e+06 8.000187 1.185488e+06 1.185496 CAM_SMPL_003389 metagenome 200 L 3.0 ... 3818 10 0.498 17.074 0.153 2.90 9.10 0.073 0.041 2.867
3 CAM_SMPL_003418 1.720945e+06 6.480953 1.720938e+06 1.720945 CAM_SMPL_003418 metagenome 200 L 3.0 ... 3847 19 0.472 4.685 0.345 33.05 17.47 NaN 0.027 0.495
4 CAM_SMPL_003391 1.461729e+06 11.988789 1.461717e+06 1.461729 CAM_SMPL_003391 metagenome 200 L 0.8 ... 3820 1 0.239 16.979 0.162 0.00 7.89 NaN 0.041 2.321
5 CAM_SMPL_003416 1.179874e+06 33.880956 1.179840e+06 1.179874 CAM_SMPL_003416 metagenome 200 L 0.1 ... 3845 19 0.472 4.685 0.345 33.05 17.47 NaN 0.027 0.495
6 CAM_SMPL_003372 1.851302e+06 12.904355 1.851289e+06 1.851302 CAM_SMPL_003372 metagenome 200 L 0.8 ... 3801 1 0.832 3.635 0.382 23.20 18.85 NaN 0.029 0.348
7 CAM_SMPL_003356 1.232673e+06 26.726657 1.232646e+06 1.232673 CAM_SMPL_003356 metagenome 190 L 0.1 ... 3785 24 0.263 7.079 0.395 8.00 9.47 0.033 0.208 0.212
8 CAM_SMPL_003406 1.458303e+06 53.784971 1.458249e+06 1.458303 CAM_SMPL_003406 metagenome 180 L 0.8 ... 3835 1 0.440 7.912 0.421 6.89 19.39 NaN 0.130 0.355
9 CAM_SMPL_003384 9.546334e+05 53.653646 9.545798e+05 0.954633 CAM_SMPL_003384 metagenome 200 L 0.1 ... 3813 1 0.177 22.688 0.188 2.90 10.80 0.073 0.061 2.571
10 CAM_SMPL_003379 1.525316e+06 22.148538 1.525294e+06 1.525316 CAM_SMPL_003379 metagenome 180 L 0.8 ... 3808 1 0.237 7.113 0.201 4.79 15.31 NaN 0.042 0.215
11 CAM_SMPL_003402 1.283465e+06 38.315809 1.283427e+06 1.283465 CAM_SMPL_003402 metagenome 200 L 0.1 ... 3831 74 1.372 27.640 2.267 9.90 5.32 NaN 1.796 0.645
12 CAM_SMPL_003393 8.948887e+05 94.008744 8.947947e+05 0.894889 CAM_SMPL_003393 metagenome 200 L 0.1 ... 3822 1 0.288 9.270 0.208 3.46 14.90 NaN 0.048 0.136
13 CAM_SMPL_003410 1.853869e+06 7.070452 1.853862e+06 1.853869 CAM_SMPL_003410 metagenome 200 L 3.0 ... 3839 1 0.757 7.458 0.424 7.52 17.56 NaN 0.110 0.210
14 CAM_SMPL_003411 1.643691e+06 18.249304 1.643672e+06 1.643691 CAM_SMPL_003411 metagenome 200 L 0.8 ... 3840 72 0.831 16.290 1.172 34.35 6.35 NaN 0.913 8.538

15 rows × 46 columns

In [205]:
l1=list(merged.columns)
In [206]:
#removing columns - (axis 1) -  where ALL the values are NaN (null, missing)
merged.dropna(how="all",axis=1,inplace=True)
In [207]:
merged.head()
Out[207]:
Sample AGS GE AGS - GE AGS (mb) SAMPLE_ACC Sample Type SAMPLE_VOLUME VOLUME_UNIT FILTER_MIN ... MATERIAL_ID sample depth - (m) Ammonium - (umol/kg) Silicate - (umol/kg) Total phosphorus salinity - (psu) temperature - (C) nitrite - (µmol/l) phosphate - (umol/kg) nitrate(NO3) - (umol/l)
0 CAM_SMPL_003398 1.444173e+06 6.502469 1.444166e+06 1.444173 CAM_SMPL_003398 metagenome 200 L 3.0 ... 3827 16 0.390 9.919 0.189 4.51 10.30 NaN 0.030 0.344
1 CAM_SMPL_003374 1.195763e+06 23.376988 1.195740e+06 1.195763 CAM_SMPL_003374 metagenome 200 L 0.1 ... 3803 30 0.196 5.747 0.461 32.40 13.54 0.047 0.124 1.305
2 CAM_SMPL_003389 1.185496e+06 8.000187 1.185488e+06 1.185496 CAM_SMPL_003389 metagenome 200 L 3.0 ... 3818 10 0.498 17.074 0.153 2.90 9.10 0.073 0.041 2.867
3 CAM_SMPL_003418 1.720945e+06 6.480953 1.720938e+06 1.720945 CAM_SMPL_003418 metagenome 200 L 3.0 ... 3847 19 0.472 4.685 0.345 33.05 17.47 NaN 0.027 0.495
4 CAM_SMPL_003391 1.461729e+06 11.988789 1.461717e+06 1.461729 CAM_SMPL_003391 metagenome 200 L 0.8 ... 3820 1 0.239 16.979 0.162 0.00 7.89 NaN 0.041 2.321

5 rows × 36 columns

In [208]:
l2=list(merged.columns)
In [209]:
[x for x in l1 if x not in l2]
Out[209]:
['SAMPLE_DESCRIPTION',
 'COMMENTS',
 'TAXON_ID',
 'DESCRIPTION',
 'ALTITUDE',
 'SITE_DEPTH',
 'SITE_DESCRIPTION',
 'HOST_TAXON_ID',
 'HOST_DESCRIPTION',
 'HOST_ORGANISM']

Sort values by filter size:

In [210]:
merged=merged.sort_values("FILTER_MIN")
In [211]:
merged.head(10)
Out[211]:
Sample AGS GE AGS - GE AGS (mb) SAMPLE_ACC Sample Type SAMPLE_VOLUME VOLUME_UNIT FILTER_MIN ... MATERIAL_ID sample depth - (m) Ammonium - (umol/kg) Silicate - (umol/kg) Total phosphorus salinity - (psu) temperature - (C) nitrite - (µmol/l) phosphate - (umol/kg) nitrate(NO3) - (umol/l)
62 CAM_SMPL_003378 1.109780e+06 41.574616 1.109738e+06 1.109780 CAM_SMPL_003378 metagenome 180 L 0.1 ... 3807 1 0.237 7.113 0.201 4.79 15.31 NaN 0.042 0.215
51 CAM_SMPL_003396 9.275912e+05 32.594154 9.275586e+05 0.927591 CAM_SMPL_003396 metagenome 200 L 0.1 ... 3825 16 0.390 9.919 0.189 4.51 10.30 NaN 0.030 0.344
25 CAM_SMPL_003377 1.412405e+06 19.604380 1.412385e+06 1.412405 CAM_SMPL_003377 metagenome 200 L 0.1 ... 3806 72 0.831 16.290 1.172 34.35 6.35 NaN 0.913 8.538
61 CAM_SMPL_003368 1.026569e+06 12.120798 1.026557e+06 1.026569 CAM_SMPL_003368 metagenome 200 L 0.1 ... 3797 20 0.801 5.109 0.360 31.50 15.00 0.035 0.107 1.011
44 CAM_SMPL_003371 1.456099e+06 18.592277 1.456081e+06 1.456099 CAM_SMPL_003371 metagenome 200 L 0.1 ... 3800 1 0.832 3.635 0.382 23.20 18.85 NaN 0.029 0.348
50 CAM_SMPL_003362 1.235364e+06 16.458093 1.235347e+06 1.235364 CAM_SMPL_003362 metagenome 200 L 0.1 ... 3791 15 0.303 6.115 0.495 16.45 17.00 NaN 0.136 0.304
18 CAM_SMPL_003413 1.414618e+06 11.659474 1.414606e+06 1.414618 CAM_SMPL_003413 metagenome 200 L 0.1 ... 3842 1 0.328 2.895 0.335 32.16 17.43 NaN NaN 0.365
16 CAM_SMPL_003365 1.278115e+06 27.447482 1.278088e+06 1.278115 CAM_SMPL_003365 metagenome 200 L 0.1 ... 3794 1 0.250 1.417 0.258 20.22 18.63 NaN NaN 0.285
36 CAM_SMPL_003387 7.909761e+05 32.882668 7.909432e+05 0.790976 CAM_SMPL_003387 metagenome 200 L 0.1 ... 3816 10 0.498 17.074 0.153 2.90 9.10 0.073 0.041 2.867
27 CAM_SMPL_003399 1.107213e+06 37.547570 1.107175e+06 1.107213 CAM_SMPL_003399 metagenome 150 L 0.1 ... 3828 9 0.309 4.540 0.283 5.75 15.31 NaN 0.089 0.187

10 rows × 36 columns

In [212]:
merged.reset_index(inplace=True,drop=True)
In [213]:
merged.head(10)
Out[213]:
Sample AGS GE AGS - GE AGS (mb) SAMPLE_ACC Sample Type SAMPLE_VOLUME VOLUME_UNIT FILTER_MIN ... MATERIAL_ID sample depth - (m) Ammonium - (umol/kg) Silicate - (umol/kg) Total phosphorus salinity - (psu) temperature - (C) nitrite - (µmol/l) phosphate - (umol/kg) nitrate(NO3) - (umol/l)
0 CAM_SMPL_003378 1.109780e+06 41.574616 1.109738e+06 1.109780 CAM_SMPL_003378 metagenome 180 L 0.1 ... 3807 1 0.237 7.113 0.201 4.79 15.31 NaN 0.042 0.215
1 CAM_SMPL_003396 9.275912e+05 32.594154 9.275586e+05 0.927591 CAM_SMPL_003396 metagenome 200 L 0.1 ... 3825 16 0.390 9.919 0.189 4.51 10.30 NaN 0.030 0.344
2 CAM_SMPL_003377 1.412405e+06 19.604380 1.412385e+06 1.412405 CAM_SMPL_003377 metagenome 200 L 0.1 ... 3806 72 0.831 16.290 1.172 34.35 6.35 NaN 0.913 8.538
3 CAM_SMPL_003368 1.026569e+06 12.120798 1.026557e+06 1.026569 CAM_SMPL_003368 metagenome 200 L 0.1 ... 3797 20 0.801 5.109 0.360 31.50 15.00 0.035 0.107 1.011
4 CAM_SMPL_003371 1.456099e+06 18.592277 1.456081e+06 1.456099 CAM_SMPL_003371 metagenome 200 L 0.1 ... 3800 1 0.832 3.635 0.382 23.20 18.85 NaN 0.029 0.348
5 CAM_SMPL_003362 1.235364e+06 16.458093 1.235347e+06 1.235364 CAM_SMPL_003362 metagenome 200 L 0.1 ... 3791 15 0.303 6.115 0.495 16.45 17.00 NaN 0.136 0.304
6 CAM_SMPL_003413 1.414618e+06 11.659474 1.414606e+06 1.414618 CAM_SMPL_003413 metagenome 200 L 0.1 ... 3842 1 0.328 2.895 0.335 32.16 17.43 NaN NaN 0.365
7 CAM_SMPL_003365 1.278115e+06 27.447482 1.278088e+06 1.278115 CAM_SMPL_003365 metagenome 200 L 0.1 ... 3794 1 0.250 1.417 0.258 20.22 18.63 NaN NaN 0.285
8 CAM_SMPL_003387 7.909761e+05 32.882668 7.909432e+05 0.790976 CAM_SMPL_003387 metagenome 200 L 0.1 ... 3816 10 0.498 17.074 0.153 2.90 9.10 0.073 0.041 2.867
9 CAM_SMPL_003399 1.107213e+06 37.547570 1.107175e+06 1.107213 CAM_SMPL_003399 metagenome 150 L 0.1 ... 3828 9 0.309 4.540 0.283 5.75 15.31 NaN 0.089 0.187

10 rows × 36 columns

Creating 3 groups (g1, g2, g3) for each filter_min size

In [214]:
merged["FILTER_MIN"].drop_duplicates()
Out[214]:
0     0.1
21    0.8
42    3.0
Name: FILTER_MIN, dtype: float64
In [215]:
g1_ = merged[merged["FILTER_MIN"]==0.1]
g2_ = merged[merged["FILTER_MIN"]==0.8]
g3_ = merged[merged["FILTER_MIN"]==3.0]
In [216]:
# Group by and calculating the mean of each group

list(merged.groupby("FILTER_MIN")["AGS (mb)"].mean())
Out[216]:
[1.1259436419745714, 1.5636385480638095, 1.502763909280381]

Hands-on:

  • plot a barplot from with GE columns only for g1 (FILTER_MIN = 0.1).
  • Add the legend of axis x (xticks) with 45 degrees rotation, using name of samples

Creating more complicated bar plots:

  • ploting 3 groups (g1, g2 and g3), different colors, and a line defining the average value for each group
In [217]:
sns.set_style("white")

plt.figure(figsize=(15,10))

#Here we are ploting on the same figure, 3 bar plots, ranging from the original index obtained from the merged dataframe

plt.bar(g1_.index,g1_["AGS (mb)"],color="b") # -> color b=blue, each bar plot we use different colors

plt.bar(g2_.index,g2_["AGS (mb)"],color="c")

plt.bar(g3_.index,g3_["AGS (mb)"],color="r")

plt.ylabel("AGS (mb)",size="16") 
plt.xticks(range(len(merged)), merged["Sample"], size='small',rotation="vertical")
plt.legend(merged["FILTER_MIN"].drop_duplicates())

#here we plot 3 horizontal lines (hlines) with the mean of AGS values for each group
plt.hlines(g1_["AGS (mb)"].astype(float).mean(),g1_.index[0],g1_.index[-1])
plt.hlines(g2_["AGS (mb)"].astype(float).mean(),g2_.index[0],g2_.index[-1])
plt.hlines(g3_["AGS (mb)"].astype(float).mean(),g3_.index[0],g3_.index[-1])
plt.show()

Starting with seaborn, using hue and crating more fancy plots

In [218]:
sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged.index
y="AGS (mb)"
sns.barplot(x=x,y=y,hue="FILTER_MIN",data=merged)
plt.xticks(range(len(x)), merged["Sample"], size='small',rotation="vertical")

plt.show()

Hands-on:

  • add the mean values line to the previous plot

Barplot with error bars, violinplot and boxplots

In [219]:
sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged["FILTER_MIN"]
y=merged["AGS (mb)"]
sns.barplot(x,y)
plt.show()
In [220]:
sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged["FILTER_MIN"]
y=merged["AGS (mb)"]
sns.violinplot(x,y)
plt.show()
In [221]:
sns.set_style("white")

plt.figure(figsize=(15,10))
x=merged["FILTER_MIN"]
y=merged["AGS (mb)"]
sns.boxplot(x,y)
plt.show()

Hands on:

  • create the same violin plots, but adding dots on the plot
  • create violin plots for sample volume, instead filter min

Exploring data, checking correlations

In [222]:
#correlation between 2 variables

merged[["AGS (mb)","temperature - (C)"]].corr()
Out[222]:
AGS (mb) temperature - (C)
AGS (mb) 1.000000 0.135368
temperature - (C) 0.135368 1.000000

Do you think AGS (average genome size) are correlated with temperature?

In [223]:
#Doing a scatter plot

sns.set_style("white")
x=merged["AGS (mb)"]
y=merged["temperature - (C)"]
plt.figure(figsize=(15,10))
plt.scatter(x,y)
plt.ylabel("temperature - (C)",size="16")
plt.xlabel("AGS (mb)",size="16")
plt.show()
In [224]:
sns.jointplot("temperature - (C)", "AGS (mb)", data=merged, kind="reg",
                  xlim=(0, 20), ylim=(0, 3), color="r", size=7)
plt.show()
/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
In [225]:
sns.jointplot("LATITUDE", "AGS (mb)", data=merged, kind="reg",
                  xlim=(0, 20), ylim=(0, 3), color="r", size=7)
plt.show()
/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/seaborn/axisgrid.py:2262: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

Hands on:

  • plot correlation plot with other variables

Heatmaps

  • Working with genes abundance on samples
In [226]:
df_cog.index=df_cog["Unnamed: 0"].tolist()
df_cog=df_cog.drop(["Unnamed: 0"], axis=1)
In [227]:
# Remove rows only with 0
df_2=df_cog[(df_cog.T != 0).any()]
In [228]:
df_2.columns
Out[228]:
Index(['CAM_SMPL_003410', 'CAM_SMPL_003375', 'CAM_SMPL_003388',
       'CAM_SMPL_003380', 'CAM_SMPL_003361', 'CAM_SMPL_003395',
       'CAM_SMPL_003378', 'CAM_SMPL_003365', 'CAM_SMPL_003373',
       'CAM_SMPL_003384', 'CAM_SMPL_003406', 'CAM_SMPL_003417',
       'CAM_SMPL_003372', 'CAM_SMPL_003374', 'CAM_SMPL_003364',
       'CAM_SMPL_003362', 'CAM_SMPL_003409', 'CAM_SMPL_003418',
       'CAM_SMPL_003400', 'CAM_SMPL_003405', 'CAM_SMPL_003370',
       'CAM_SMPL_003392', 'CAM_SMPL_003382', 'CAM_SMPL_003394',
       'CAM_SMPL_003358', 'CAM_SMPL_003367', 'CAM_SMPL_003396',
       'CAM_SMPL_003391', 'CAM_SMPL_003360', 'CAM_SMPL_003381',
       'CAM_SMPL_003404', 'CAM_SMPL_003387', 'CAM_SMPL_003376',
       'CAM_SMPL_003379', 'CAM_SMPL_003398', 'CAM_SMPL_003399',
       'CAM_SMPL_003412', 'CAM_SMPL_003403', 'CAM_SMPL_003356',
       'CAM_SMPL_003414', 'CAM_SMPL_003368', 'CAM_SMPL_003401',
       'CAM_SMPL_003390', 'CAM_SMPL_003385', 'CAM_SMPL_003397',
       'CAM_SMPL_003357', 'CAM_SMPL_003408', 'CAM_SMPL_003359',
       'CAM_SMPL_003415', 'CAM_SMPL_003407', 'CAM_SMPL_003366',
       'CAM_SMPL_003413', 'CAM_SMPL_003386', 'CAM_SMPL_003369',
       'CAM_SMPL_003402', 'CAM_SMPL_003383', 'CAM_SMPL_003377',
       'CAM_SMPL_003371', 'CAM_SMPL_003363', 'CAM_SMPL_003389',
       'CAM_SMPL_003393', 'CAM_SMPL_003411', 'CAM_SMPL_003416', 'g1-mean',
       'g2-mean', 'g3-mean', 'g1-sd', 'g2-sd', 'g3-sd', 'ANOVA F',
       'ANOVA pvalue', 'pvalue adjusted', 'log2(FC)_g1/g2', 'log2(FC)_g1/g3',
       'log2(FC)_g2/g3'],
      dtype='object')
In [229]:
len(df_cog) # number of cogs (gene ortologs groups) on the original table
Out[229]:
363
In [230]:
len(df_2) # number of cogs after removing those with 0 abundance in all the samples
Out[230]:
164
In [231]:
# New dataframe only with the RPKG for all the samples

heat=df_2[[x for x in df_2.columns if "CAM_SMPL"in x ]]
In [232]:
heat.head()
Out[232]:
CAM_SMPL_003410 CAM_SMPL_003375 CAM_SMPL_003388 CAM_SMPL_003380 CAM_SMPL_003361 CAM_SMPL_003395 CAM_SMPL_003378 CAM_SMPL_003365 CAM_SMPL_003373 CAM_SMPL_003384 ... CAM_SMPL_003369 CAM_SMPL_003402 CAM_SMPL_003383 CAM_SMPL_003377 CAM_SMPL_003371 CAM_SMPL_003363 CAM_SMPL_003389 CAM_SMPL_003393 CAM_SMPL_003411 CAM_SMPL_003416
COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. 1.545723 4.167033 1.534017 0.000000 1.724899 0.0 2.217853 6.629477 0.000000 2.193711 ... 2.495555 1.735013 0.000000 5.705442 4.956612 3.911169 1.068351 2.793754 5.592646 4.276987
COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. 1.024882 3.036821 2.855767 4.149942 0.000000 0.0 3.104112 3.991667 0.000000 4.225001 ... 2.951984 3.135016 0.000000 4.945343 3.899876 3.066643 0.000000 4.223395 4.434279 2.736568
COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. 0.000000 3.220692 0.973403 0.000000 1.542692 0.0 1.481556 2.195459 0.000000 3.121637 ... 2.019257 2.167604 0.000000 0.564016 2.604266 0.713610 0.000000 2.373235 0.220067 0.886617
COG0021 - Transketolase [Carbohydrate transport and metabolism]. 0.659364 2.388176 1.485892 0.000000 0.000000 0.0 4.185504 5.248721 1.673285 2.988938 ... 2.865772 1.737866 0.000000 3.891734 4.769148 1.945160 0.420866 2.309424 3.629034 4.007391
COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. 1.799150 6.434617 4.867390 0.000000 3.255559 0.0 4.231023 9.012939 2.817330 5.372445 ... 10.164218 4.382438 1.437488 12.182144 8.260903 5.797622 1.759561 4.892757 7.395851 5.533674

5 rows × 63 columns

In [233]:
##test.ix[:, test.columns != 'compliance']
In [234]:
sns.clustermap(heat,figsize=(12,25),cbar_kws={'label': 'COGs RPKG'},yticklabels=False)
#plt.savefig("../clustermap_cogs.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')
plt.show()
In [235]:
# customizing colors 
In [236]:
import matplotlib.cm as cm
from  matplotlib.colors import LinearSegmentedColormap
In [237]:
c = ["pink","lightcoral","red","darkred","black"]
v = [0,.25,.5,.75,1.]
l = list(zip(v,c))
cmap=LinearSegmentedColormap.from_list('rg',l, N=1024)
In [238]:
sns.clustermap(heat,figsize=(12,25),cbar_kws={'label': 'COGs RPKG'},yticklabels=False,cmap=cmap)
#plt.savefig("../clustermap_cogs.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')
plt.show()
In [239]:
heat_log= heat.apply(lambda x: np.log10(x+0.001))
In [240]:
c = ["darkred","red","lightcoral","white","palegreen","green","darkgreen"]
v = [0,.15,.4,.5,0.6,.9,1.]
l = list(zip(v,c))
cmap=LinearSegmentedColormap.from_list('rg',l, N=1024)
In [241]:
sns.clustermap(heat_log,figsize=(12,25),cbar_kws={'label': 'log2(COGs RPKG)'},yticklabels=False,cmap=cmap)
#plt.savefig("../clustermap_cogs.pdf", format = 'pdf', dpi = 300, bbox_inches = 'tight')
plt.show()

Hands-on:

  • Plot the same heatmaps, but only for genes where pvalue adjusted < 0.01
  • plot a heatmap for z-score

Getting upregulated COGs in the comparisons of different filtrations methods groups

In [242]:
a=set(df_2[df_2["log2(FC)_g1/g3"]>0].index)
b=set(df_2[df_2["log2(FC)_g1/g2"]>0].index)
c=set(df_2[df_2["log2(FC)_g2/g3"]>0].index)

To install matplotlib_venn:

pip install --user matplotlib-venn

In [243]:
from matplotlib_venn import venn3,venn3_unweighted 
In [244]:
plt.figure(figsize=(10,10))


vd = venn3_unweighted([a, b,c], ('up g1/g3', 'up g1/g2','up g2/g3'))

plt.show()
In [245]:
plt.figure(figsize=(10,10))


vd = venn3([a, b,c], ('up g1/g3', 'up g1/g2','up g2/g3'))

plt.show()

Hands-on

  • Plot a heatmap for log2(FC) up regulated genes for the 3 comparions (g1/g2, g2/g3 and g1/g3)

Package sklearn - more info at http://scikit-learn.org/stable/index.html

In [246]:
from sklearn import preprocessing
from sklearn.decomposition import PCA
from itertools import cycle
In [247]:
heat.head(2)
Out[247]:
CAM_SMPL_003410 CAM_SMPL_003375 CAM_SMPL_003388 CAM_SMPL_003380 CAM_SMPL_003361 CAM_SMPL_003395 CAM_SMPL_003378 CAM_SMPL_003365 CAM_SMPL_003373 CAM_SMPL_003384 ... CAM_SMPL_003369 CAM_SMPL_003402 CAM_SMPL_003383 CAM_SMPL_003377 CAM_SMPL_003371 CAM_SMPL_003363 CAM_SMPL_003389 CAM_SMPL_003393 CAM_SMPL_003411 CAM_SMPL_003416
COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. 1.545723 4.167033 1.534017 0.000000 1.724899 0.0 2.217853 6.629477 0.0 2.193711 ... 2.495555 1.735013 0.0 5.705442 4.956612 3.911169 1.068351 2.793754 5.592646 4.276987
COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. 1.024882 3.036821 2.855767 4.149942 0.000000 0.0 3.104112 3.991667 0.0 4.225001 ... 2.951984 3.135016 0.0 4.945343 3.899876 3.066643 0.000000 4.223395 4.434279 2.736568

2 rows × 63 columns

In [248]:
len(heat)
Out[248]:
164
In [249]:
# we have 164 COGs, that means, we have 164 variables to be transformed to 2 (or more) dimensions
In [250]:
df_pca=heat.T.reset_index()
In [251]:
df_pca.head()
Out[251]:
index COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. COG0050 - Translation elongation factor EF-Tu, a GTPase [Translation, ribosomal structure and biogenesis]. ... COG4270 - Uncharacterized membrane protein [Function unknown]. COG4567 - Two-component response regulator, ActR/RegA family, consists of REC and Fis-type HTH domains [Signal transduction mechanisms, Transcription]. COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown].
0 CAM_SMPL_003410 1.545723 1.024882 0.000000 0.659364 1.799150 0.000000 0.637089 0.000000 6.583426 ... 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 1.486878 0.000000 2.201261 0.000000
1 CAM_SMPL_003375 4.167033 3.036821 3.220692 2.388176 6.434617 0.745211 0.948030 0.000000 4.057610 ... 0.0 1.114127 0.202873 2.062192 0.287403 4.738382 2.831516 2.353076 4.581341 0.000000
2 CAM_SMPL_003388 1.534017 2.855767 0.973403 1.485892 4.867390 0.205852 1.106626 0.875543 4.395697 ... 0.0 0.249309 0.600689 0.452479 0.484652 2.253002 3.081664 0.733418 2.113862 0.439787
3 CAM_SMPL_003380 0.000000 4.149942 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 2.263685 ... 0.0 0.000000 0.000000 0.000000 0.000000 0.798198 0.000000 0.000000 0.000000 0.682459
4 CAM_SMPL_003361 1.724899 0.000000 1.542692 0.000000 3.255559 0.000000 0.000000 0.000000 4.002741 ... 0.0 0.000000 0.000000 1.445955 0.000000 1.043153 0.000000 0.000000 1.035755 0.000000

5 rows × 165 columns

In [252]:
merged.head()
Out[252]:
Sample AGS GE AGS - GE AGS (mb) SAMPLE_ACC Sample Type SAMPLE_VOLUME VOLUME_UNIT FILTER_MIN ... MATERIAL_ID sample depth - (m) Ammonium - (umol/kg) Silicate - (umol/kg) Total phosphorus salinity - (psu) temperature - (C) nitrite - (µmol/l) phosphate - (umol/kg) nitrate(NO3) - (umol/l)
0 CAM_SMPL_003378 1.109780e+06 41.574616 1.109738e+06 1.109780 CAM_SMPL_003378 metagenome 180 L 0.1 ... 3807 1 0.237 7.113 0.201 4.79 15.31 NaN 0.042 0.215
1 CAM_SMPL_003396 9.275912e+05 32.594154 9.275586e+05 0.927591 CAM_SMPL_003396 metagenome 200 L 0.1 ... 3825 16 0.390 9.919 0.189 4.51 10.30 NaN 0.030 0.344
2 CAM_SMPL_003377 1.412405e+06 19.604380 1.412385e+06 1.412405 CAM_SMPL_003377 metagenome 200 L 0.1 ... 3806 72 0.831 16.290 1.172 34.35 6.35 NaN 0.913 8.538
3 CAM_SMPL_003368 1.026569e+06 12.120798 1.026557e+06 1.026569 CAM_SMPL_003368 metagenome 200 L 0.1 ... 3797 20 0.801 5.109 0.360 31.50 15.00 0.035 0.107 1.011
4 CAM_SMPL_003371 1.456099e+06 18.592277 1.456081e+06 1.456099 CAM_SMPL_003371 metagenome 200 L 0.1 ... 3800 1 0.832 3.635 0.382 23.20 18.85 NaN 0.029 0.348

5 rows × 36 columns

In [253]:
df_pca=pd.merge(merged[["Sample","FILTER_MIN"]],df_pca,left_on="Sample",right_on="index").drop("index",axis=1)
In [254]:
df_pca.set_index(["Sample","FILTER_MIN"],inplace=True)
In [255]:
df_pca.head()
Out[255]:
COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. COG0050 - Translation elongation factor EF-Tu, a GTPase [Translation, ribosomal structure and biogenesis]. COG0064 - Asp-tRNAAsn/Glu-tRNAGln amidotransferase B subunit [Translation, ribosomal structure and biogenesis]. ... COG4270 - Uncharacterized membrane protein [Function unknown]. COG4567 - Two-component response regulator, ActR/RegA family, consists of REC and Fis-type HTH domains [Signal transduction mechanisms, Transcription]. COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown].
Sample FILTER_MIN
CAM_SMPL_003378 0.1 2.217853 3.104112 1.481556 4.185504 4.231023 2.261908 1.789779 1.927435 4.193771 2.973363 ... 0.485059 1.100603 0.00000 1.987316 0.053097 1.655937 2.973762 0.167036 2.066227 1.372959
CAM_SMPL_003396 0.1 2.196063 2.807988 2.328216 2.349802 4.251768 1.305508 1.702963 0.884343 4.409935 3.170220 ... 0.247442 0.396387 0.00000 1.750147 0.000000 0.447497 1.955898 0.000000 1.745357 0.617174
CAM_SMPL_003377 0.1 5.705442 4.945343 0.564016 3.891734 12.182144 0.975872 1.960872 0.555654 5.793683 5.142208 ... 0.000000 1.231687 0.30636 1.153844 0.333392 4.756444 3.005515 0.000000 4.137977 0.000000
CAM_SMPL_003368 0.1 2.968671 1.863295 1.602269 1.672952 6.762408 0.000000 1.227721 0.759965 1.260330 0.868506 ... 0.000000 3.062030 0.00000 0.327392 0.000000 2.792441 2.953601 0.000000 3.896846 0.000000
CAM_SMPL_003371 0.1 4.956612 3.899876 2.604266 4.769148 8.260903 1.841765 1.767899 0.655986 7.253030 3.634597 ... 0.000000 2.231739 0.00000 1.705530 0.000000 4.657529 1.095331 0.697874 6.642156 1.049109

5 rows × 164 columns

In [256]:
pca = PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

#scaling the values
df_pca_scaled = preprocessing.scale(df_pca)

projected=pca.fit_transform(df_pca_scaled)
In [257]:
print(pca.explained_variance_ratio_)
[0.42767701 0.09641663]
In [258]:
tmp=pd.DataFrame(projected)
tmp.rename(columns={0: 'Component 1',1: 'Component 2'}, inplace=True)
tmp.head()
Out[258]:
Component 1 Component 2
0 11.114657 -5.224373
1 7.793254 -6.231991
2 10.776208 1.192819
3 0.830403 -2.127450
4 11.567918 1.184126
In [259]:
final_pca=pd.merge(df_pca.reset_index(),tmp,left_index=True,right_index=True)
In [260]:
# in the final_pca dataframe, in adition to all the COGs, 
# we have new columns (0 and 1), with the PCA results (2 components)
In [261]:
final_pca.head()
Out[261]:
Sample FILTER_MIN COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. ... COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown]. Component 1 Component 2
0 CAM_SMPL_003378 0.1 2.217853 3.104112 1.481556 4.185504 4.231023 2.261908 1.789779 1.927435 ... 0.00000 1.987316 0.053097 1.655937 2.973762 0.167036 2.066227 1.372959 11.114657 -5.224373
1 CAM_SMPL_003396 0.1 2.196063 2.807988 2.328216 2.349802 4.251768 1.305508 1.702963 0.884343 ... 0.00000 1.750147 0.000000 0.447497 1.955898 0.000000 1.745357 0.617174 7.793254 -6.231991
2 CAM_SMPL_003377 0.1 5.705442 4.945343 0.564016 3.891734 12.182144 0.975872 1.960872 0.555654 ... 0.30636 1.153844 0.333392 4.756444 3.005515 0.000000 4.137977 0.000000 10.776208 1.192819
3 CAM_SMPL_003368 0.1 2.968671 1.863295 1.602269 1.672952 6.762408 0.000000 1.227721 0.759965 ... 0.00000 0.327392 0.000000 2.792441 2.953601 0.000000 3.896846 0.000000 0.830403 -2.127450
4 CAM_SMPL_003371 0.1 4.956612 3.899876 2.604266 4.769148 8.260903 1.841765 1.767899 0.655986 ... 0.00000 1.705530 0.000000 4.657529 1.095331 0.697874 6.642156 1.049109 11.567918 1.184126

5 rows × 168 columns

In [262]:
sns.set_style("white")
color_gen = cycle(('blue', 'green', 'red'))
plt.figure(figsize=(10,10))
for lab in set(final_pca["FILTER_MIN"]):
    plt.scatter(final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 1'], 
                final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 2'], 
                c=next(color_gen),
                label=lab)

    
    
plt.xlabel('component 1  - ' + str(pca.explained_variance_ratio_[0]*100)+ " % " )
plt.ylabel('component 2  - '+ str(pca.explained_variance_ratio_[1]*100)+ "  % " )
plt.legend(loc='best')
plt.show()
In [263]:
#pca.components_
In [264]:
components = pd.DataFrame(pca.components_, columns = df_pca.columns, index=[1, 2])
In [265]:
# Principal axes in feature space, representing the directions of maximum variance in the data. 
components
Out[265]:
COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. COG0050 - Translation elongation factor EF-Tu, a GTPase [Translation, ribosomal structure and biogenesis]. COG0064 - Asp-tRNAAsn/Glu-tRNAGln amidotransferase B subunit [Translation, ribosomal structure and biogenesis]. ... COG4270 - Uncharacterized membrane protein [Function unknown]. COG4567 - Two-component response regulator, ActR/RegA family, consists of REC and Fis-type HTH domains [Signal transduction mechanisms, Transcription]. COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown].
1 0.082970 0.084591 0.082658 0.098450 0.086372 0.068815 0.097695 0.066942 0.074851 0.078895 ... 0.051625 0.064301 0.019433 0.079009 0.011210 0.072247 0.068873 0.015234 0.087356 0.078763
2 0.046882 0.012706 -0.004711 0.001974 0.060956 -0.074238 -0.023298 -0.070354 0.075204 -0.023305 ... -0.072290 -0.043273 0.173289 -0.058145 0.157041 0.055173 0.084453 0.170450 0.060258 -0.099727

2 rows × 164 columns

In [266]:
import math

def get_important_features(transformed_features, components_, columns):
    """
    This function will return the most "important" 
    features so we can determine which have the most
    effect on multi-dimensional scaling
    """
    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    # Sort each column by it's length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    return important_features,xvector,yvector
In [267]:
important_features,xvector,yvector=get_important_features(projected, pca.components_, df_pca.columns.values)
In [268]:
def draw_vectors(transformed_features, components_, columns):
    """
    This funtion will project your *original* features
    onto your principal component feature-space, so that you can
    visualize how "important" each one was in the
    multi-dimensional scaling
    """

    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])
    
    # Sort each column by it's length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    
    ax = plt.axes()
    
    tmp=pd.DataFrame({"xvector":xvector,"columns":columns} )
    tmp2=pd.DataFrame({"yvector":yvector,"columns":columns} )

    
    azip=zip(xvector,columns)
    azip2=zip(yvector,columns) 
    col_array=[]
    
    # Use an arrow to project each original feature as a
    # labeled vector on your principal component axes
    for i in range(num_columns):
        
        if columns[i]==tmp[tmp["xvector"]==max(xvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)
            col_array.append(columns[i])
        if columns[i]==tmp[tmp["xvector"]==min(xvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)
            col_array.append(columns[i])
        if columns[i]==tmp2[tmp2["yvector"]==min(yvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75) 
            col_array.append(columns[i])
        if columns[i]==tmp2[tmp2["yvector"]==min(yvector)]["columns"].tolist()[0]:
            plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
            plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75) 
            col_array.append(columns[i])
    
    return ax,col_array
In [269]:
color_gen = cycle(('red', 'green', 'blue'))
plt.figure(figsize=(10,10))
for lab in set(final_pca["FILTER_MIN"]):
    plt.scatter(final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 1'], 
                final_pca.loc[final_pca['FILTER_MIN'] == lab, 'Component 2'], 
                c=next(color_gen),
                label=lab,alpha=0.8)

ax,cols=draw_vectors(projected, pca.components_, df_pca.columns.values)    
    
plt.xlabel('Component 1  - ' + str(pca.explained_variance_ratio_[0]*100)+ " % " )
plt.ylabel('Component 2  - '+ str(pca.explained_variance_ratio_[1]*100)+ "  % " )
plt.legend(loc='best')
plt.show()
/modules/software/jupyterhub/0.9.0/lib/python3.6/site-packages/matplotlib/figure.py:98: MatplotlibDeprecationWarning: 
Adding an axes using the same arguments as a previous axes currently reuses the earlier instance.  In a future version, a new instance will always be created and returned.  Meanwhile, this warning can be suppressed, and the future behavior ensured, by passing a unique label to each axes instance.
  "Adding an axes using the same arguments as a previous axes "

Hands-on:

  • Plot the same PCA, but fixing the text on arrows (keeping only COG number, removing the descriptions)
  • Add a vertical line and a horizontal line on zero values

Plotting the 4 COGs with extreme vectors on PCA

In [270]:
sns.set_style("white")
plt.figure(figsize=(15,10))
tmp=final_pca[final_pca['Sample'].isin(g1_["Sample"])]
tmp2=final_pca[final_pca['Sample'].isin(g2_["Sample"])]
tmp3=final_pca[final_pca['Sample'].isin(g3_["Sample"])]
for c in cols:
    plt.figure(figsize=(15,10))
    plt.bar(tmp.index,tmp[c],color="b")
    plt.bar(tmp2.index,tmp2[c],color="g")
    plt.bar(tmp3.index,tmp3[c],color="r")
    plt.ylabel(c,size="12") 
    plt.xticks(range(len(final_pca)), final_pca["Sample"], size='small',rotation="vertical")
    plt.legend(final_pca["FILTER_MIN"].drop_duplicates())

    #here we plot 3 horizontal lines (hlines) with the mean of AGS values for each group
    plt.hlines(tmp[c].astype(float).mean(),g1_.index[0],g1_.index[-1])
    plt.hlines(tmp2[c].astype(float).mean(),g2_.index[0],g2_.index[-1])
    plt.hlines(tmp3[c].astype(float).mean(),g3_.index[0],g3_.index[-1])
    plt.show()
<Figure size 1080x720 with 0 Axes>
In [271]:
sns.set_style("white")

for c in cols:
    plt.figure(figsize=(15,10))
    x=final_pca["FILTER_MIN"]
    y=final_pca[c]
    sns.violinplot(x,y)
plt.show()
In [272]:
## Kmeans

## clusterization (kmeans with euc distance)
## calculate distance btw centroid of clusters and each gene
## plot bar plots for the genes with shorter and bigger distance from one cluster

### a bit more about K-means: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
In [273]:
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)
In [274]:
f1 = final_pca["Component 1"].values
f2 = final_pca["Component 2"].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=15)
plt.show()
In [275]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
In [276]:
# k means determine k
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(X)
    kmeanModel.fit(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
In [277]:
from sklearn.cluster import KMeans

# Number of clusters
k=3
kmeans = KMeans(n_clusters=k)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
C = kmeans.cluster_centers_
In [278]:
print(C)
[[-8.89103651 -1.33884732]
 [ 9.92669966 -3.51000995]
 [ 2.028681    4.17284112]]
In [279]:
from copy import deepcopy
In [280]:
# colors = ['#FF0000', 'g', '#FF7D40', '#66CDAA', '#473C8B', '#9400D3','#8B8386','#FFC0CB','#B0171F']
# fig, ax = plt.subplots()
# for i in range(k):
#         points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
#         ax.scatter(points[:, 0], points[:, 1], s=15, c=colors[i])
# ax.scatter(C[:, 0], C[:, 1], marker='*', s=100, c='#050505')
# plt.show()
In [281]:
kmeans.labels_
Out[281]:
array([1, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 0,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
       0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)
In [282]:
final_pca.head()
Out[282]:
Sample FILTER_MIN COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. ... COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown]. Component 1 Component 2
0 CAM_SMPL_003378 0.1 2.217853 3.104112 1.481556 4.185504 4.231023 2.261908 1.789779 1.927435 ... 0.00000 1.987316 0.053097 1.655937 2.973762 0.167036 2.066227 1.372959 11.114657 -5.224373
1 CAM_SMPL_003396 0.1 2.196063 2.807988 2.328216 2.349802 4.251768 1.305508 1.702963 0.884343 ... 0.00000 1.750147 0.000000 0.447497 1.955898 0.000000 1.745357 0.617174 7.793254 -6.231991
2 CAM_SMPL_003377 0.1 5.705442 4.945343 0.564016 3.891734 12.182144 0.975872 1.960872 0.555654 ... 0.30636 1.153844 0.333392 4.756444 3.005515 0.000000 4.137977 0.000000 10.776208 1.192819
3 CAM_SMPL_003368 0.1 2.968671 1.863295 1.602269 1.672952 6.762408 0.000000 1.227721 0.759965 ... 0.00000 0.327392 0.000000 2.792441 2.953601 0.000000 3.896846 0.000000 0.830403 -2.127450
4 CAM_SMPL_003371 0.1 4.956612 3.899876 2.604266 4.769148 8.260903 1.841765 1.767899 0.655986 ... 0.00000 1.705530 0.000000 4.657529 1.095331 0.697874 6.642156 1.049109 11.567918 1.184126

5 rows × 168 columns

In [283]:
from sklearn.metrics import pairwise_distances_argmin_min
In [284]:
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
In [285]:
closest
Out[285]:
array([60, 13, 34])
In [286]:
final_pca.loc[[34]]
Out[286]:
Sample FILTER_MIN COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. ... COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown]. Component 1 Component 2
34 CAM_SMPL_003403 0.8 2.388587 3.668152 1.613449 2.843723 5.780277 0.797623 1.491319 0.656284 ... 0.863644 0.0 1.079436 1.825532 0.877082 0.572272 1.765964 0.185048 1.883236 3.674445

1 rows × 168 columns

In [287]:
final_pca.loc[[13]]
Out[287]:
Sample FILTER_MIN COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. ... COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown]. Component 1 Component 2
13 CAM_SMPL_003384 0.1 2.193711 4.225001 3.121637 2.988938 5.372445 0.793944 1.35856 1.466793 ... 0.09558 3.063213 0.0 1.374784 2.009722 0.178233 2.772713 0.850626 10.354176 -5.340018

1 rows × 168 columns

In [288]:
final_pca.loc[[60]]
Out[288]:
Sample FILTER_MIN COG0006 - Xaa-Pro aminopeptidase [Amino acid transport and metabolism]. COG0018 - Arginyl-tRNA synthetase [Translation, ribosomal structure and biogenesis]. COG0020 - Undecaprenyl pyrophosphate synthase [Lipid transport and metabolism]. COG0021 - Transketolase [Carbohydrate transport and metabolism]. COG0028 - Acetolactate synthase large subunit or other thiamine pyrophosphate-requiring enzyme [Amino acid transport and metabolism, Coenzyme transport and metabolism]. COG0035 - Uracil phosphoribosyltransferase [Nucleotide transport and metabolism]. COG0040 - ATP phosphoribosyltransferase [Amino acid transport and metabolism]. COG0049 - Ribosomal protein S7 [Translation, ribosomal structure and biogenesis]. ... COG4579 - Isocitrate dehydrogenase kinase/phosphatase [Signal transduction mechanisms]. COG4581 - Superfamily II RNA helicase [Replication, recombination and repair]. COG4618 - ABC-type protease/lipase transport system, ATPase and permease components [Intracellular trafficking, secretion, and vesicular transport]. COG4663 - TRAP-type mannitol/chloroaromatic compound transport system, periplasmic component [Secondary metabolites biosynthesis, transport and catabolism]. COG4666 - TRAP-type uncharacterized transport system, fused permease components [General function prediction only]. COG4977 - Transcriptional regulator GlxA family, contains an amidase domain and an AraC-type DNA-binding HTH domain [Transcription]. COG5009 - Membrane carboxypeptidase/penicillin-binding protein [Cell wall/membrane/envelope biogenesis]. COG5282 - Uncharacterized conserved protein, DUF2342 family [Function unknown]. Component 1 Component 2
60 CAM_SMPL_003389 3.0 1.068351 0.0 0.0 0.420866 1.759561 0.706198 0.0 0.0 ... 0.0 0.0 0.64598 1.496105 1.522249 0.0 2.237561 0.0 -8.821492 -1.807208

1 rows × 168 columns

Hands-on:

  • Try the kmeans clusterization with different number of cluters (4 and 5)

Lifespan analysis

In [289]:
from lifelines.datasets import load_waltons
df = load_waltons() # returns a Pandas DataFrame

print(df.head())


T = df['T']
E = df['E']
      T  E    group
0   6.0  1  miR-137
1  13.0  1  miR-137
2  13.0  1  miR-137
3  13.0  1  miR-137
4  19.0  1  miR-137
  • T is an array of durations, E is a either boolean or binary array representing whether the “death” was observed (alternatively an individual can be censored).
In [290]:
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)  # or, more succiently, kmf.fit(T, E)
Out[290]:
<lifelines.KaplanMeierFitter: fitted with 163 observations, 7 censored>
In [291]:
kmf.survival_function_
kmf.median_
kmf.plot()
plt.show()
In [292]:
groups = df['group']
ix = (groups == 'miR-137')

kmf.fit(T[~ix], E[~ix], label='control')
ax = kmf.plot()

kmf.fit(T[ix], E[ix], label='miR-137')
kmf.plot(ax=ax)
plt.show()
In [293]:
from lifelines.statistics import logrank_test

results = logrank_test(T[ix], T[~ix], E[ix], E[~ix], alpha=.99)

results.print_summary()
<lifelines.StatisticalResult>
               t_0 = -1
 null_distribution = chi squared
degrees_of_freedom = 1
             alpha = 0.99

---
 test_statistic      p  -log2(p)
         122.25 <0.005     91.99
In [294]:
from lifelines.datasets import load_regression_dataset
regression_dataset = load_regression_dataset()

regression_dataset.head()
Out[294]:
var1 var2 var3 T E
0 0.595170 1.143472 1.571079 14.785652 1
1 0.209325 0.184677 0.356980 7.335846 1
2 0.693919 0.071893 0.557960 5.269797 1
3 0.443804 1.364646 0.374221 11.684092 1
4 1.613324 0.125566 1.921325 7.639492 1
In [295]:
from lifelines import CoxPHFitter

# Using Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()
<lifelines.CoxPHFitter: fitted with 200 observations, 11 censored>
      duration col = 'T'
         event col = 'E'
number of subjects = 200
  number of events = 189
    log-likelihood = -807.62
  time fit was run = 2019-02-26 15:53:53 UTC

---
      coef  exp(coef)  se(coef)    z      p  -log2(p)  lower 0.95  upper 0.95
var1  0.22       1.25      0.07 2.99 <0.005      8.49        0.08        0.37
var2  0.05       1.05      0.08 0.61   0.54      0.89       -0.11        0.21
var3  0.22       1.24      0.08 2.88 <0.005      7.97        0.07        0.37
---
Concordance = 0.58
Log-likelihood ratio test = 15.54 on 3 df, -log2(p)=9.47
In [296]:
cph.plot()
plt.show()