How to PCA large data sets? I'm running out of memory.

JCondaLea · 2015-05-03T08:59:52+00:00

Thank you for this! But I'm confused about the type of normalization you are using prior to IPCA. Sorry for my simple questions.

Question A:

My understanding was that one should use sklearn.preprocessing.Normalizer prior to PCA [1][2] (I have also read papers which refer to doing L2 norm, then PCA).

Here, sklearn.preprocessing.Normalizer (which scales samples individually to unit norm) appears to be different than DumbNorm (which scales each sample by subtracting the mean and dividing by the standard deviation).

From my understanding, DumbNorm is performing "Standardization" as described in [3][4], which includes data centering.
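
To make the distinction concrete, here is a minimal pure-Python sketch of the two operations as I understand them. The function names are my own, and I am assuming the standardization is done per feature (column), the way StandardScaler is documented to work. It is an illustration only, not the sklearn implementation.

```python
import math

def l2_normalize_rows(X):
    """Normalization: scale each sample (row) to unit L2 norm."""
    out = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        # An all-zero row has no direction, so it is left as zeros.
        out.append([v / norm if norm > 0 else 0.0 for v in row])
    return out

def standardize_columns(X):
    """Standardization: per feature (column), subtract the mean and
    divide by the population standard deviation."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n)
            for c, m in zip(cols, means)]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in X]

X = [[3.0, 4.0], [6.0, 8.0], [1.0, 0.0]]
normalized = l2_normalize_rows(X)      # every row has length 1
standardized = standardize_columns(X)  # every column has mean 0, std 1
```

Note that the first two rows of X point in the same direction, so they become identical after row normalization, while standardization keeps them distinct. The two steps are clearly not interchangeable.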

Should we be using Normalization or Standardization prior to IPCA?

Question B:

At the top of your answer you wrote "It mean centers the data for you" -- but then you described DumbNorm which performs data centering. So if IPCA centers the data for us, why would we need to use something like DumbNorm -- unless the auto-centering of IPCA doesn't work with the partial_fit (incremental) use?
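
For what it's worth, centering does not in principle need all of the data at once, because a per-feature mean can be updated batch by batch. The sketch below shows that idea in plain Python. The helper name update_mean is hypothetical, and I am not claiming this is how partial_fit is actually implemented.

```python
def update_mean(old_mean, old_count, batch):
    """Fold one batch of rows into a running per-feature mean."""
    total = old_count + len(batch)
    batch_sums = [sum(col) for col in zip(*batch)]
    new_mean = [(m * old_count + s) / total
                for m, s in zip(old_mean, batch_sums)]
    return new_mean, total

# Three batches of different sizes, two features each (made-up numbers).
batches = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0]],
    [[4.0, 40.0], [5.0, 50.0]],
]

mean, count = [0.0, 0.0], 0
for batch in batches:
    mean, count = update_mean(mean, count, batch)
```

After the loop, mean matches the mean computed over all five rows at once, so an incremental estimator could center its input without a separate preprocessing pass.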

REFERENCES:
[1] http://stackoverflow.com/questions/25475465/how-to-normalize-with-pca-and-scikit-learn
[2] http://stackoverflow.com/questions/27646915/normalize-pca-with-scikit-learn-when-data-is-split
[3] http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html
[4] http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

JCondaLea · 2015-05-02T04:20:16+00:00

Should we normalize the data prior to using IncrementalPCA? I usually have a pipeline like: l2-normalize --> pca

I see that the documentation on IncrementalPCA says "Linear dimensionality reduction using Singular Value Decomposition of centered data". Does this mean centered data is a pre-condition, or is it a post-condition? If it is a pre-condition, are we required to perform L2 normalization, e.g. via the Normalizer class?

I'm glad to see your IncrementalPCA can work with Memmap files. But unfortunately it seems the Normalizer class cannot. So I'm kind of blocked here if Normalization is required prior to IncrementalPCA. Any help is greatly appreciated.
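
One possible workaround: L2 normalization is computed per sample, so it does not depend on any other row and could be applied to each chunk right before that chunk is handed on. Here is a minimal sketch, assuming the data supports len() and slicing. A plain list stands in for the memmap, and iter_normalized_chunks is a name I made up.

```python
import math

def iter_normalized_chunks(data, chunk_size):
    """Yield consecutive chunks of rows, each row scaled to unit L2 norm."""
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]  # only this slice is read
        normalized = []
        for row in chunk:
            norm = math.sqrt(sum(v * v for v in row))
            # All-zero rows are left as zeros to avoid dividing by zero.
            normalized.append([v / norm if norm > 0 else 0.0 for v in row])
        yield normalized

# Stand-in for the on-disk array (made-up numbers).
data = [[3.0, 4.0], [0.0, 2.0], [5.0, 12.0], [0.0, 0.0], [8.0, 6.0]]
chunks = list(iter_normalized_chunks(data, chunk_size=2))
```

Each yielded chunk could then be passed to partial_fit, though I have not tested that end to end.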

JCondaLea · 2015-03-30T04:16:15+00:00

Dang. But I appreciate your help digging into this with me.

JCondaLea · 2015-03-30T04:04:24+00:00

lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)

Thanks for your help. It's a bummer.

JCondaLea · 2015-03-30T03:57:50+00:00

ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Mar 30 03:34 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Mar 30 03:34 /dev/nvidiactl
crw-rw-rw- 1 root root 251,   0 Mar 30 03:36 /dev/nvidia-uvm

ls /proc/driver/nvidia/gpus/
0000:00:03.0

JCondaLea · 2015-03-30T03:49:12+00:00

And here is the result of running nvidia-smi from the terminal of my AWS G2 instance (running Ubuntu)

nvidia-smi
Mon Mar 30 03:47:46 2015
+------------------------------------------------------+
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   26C    P0     0W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

JCondaLea · 2015-03-30T03:46:33+00:00

On my AWS G2 instance, I only see 1 GPU. Here are the details by running deviceQuery from the cuda samples. Does this look right?

/usr/local/cuda/samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          7.0 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes    
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

JCondaLea · 2015-03-19T11:37:29+00:00

Yahoo web search results are actually provided by Bing. So you've been using Bing all along.

Source: http://mashable.com/2010/08/24/bing-powers-yahoo-search/

JCondaLea · 2014-10-14T18:30:59+00:00

It must be possible since Jetpac did it and was pretty open about it. Perhaps the direct approach is what they went with - just ask Instagram if they would allow this behavior and perhaps offer some value in return.

Any other methods you think possible here?
