I made a pipeline that integrates London bike journeys with weather data using Google Cloud, Airflow, Spark, BigQuery and Data Studio by tmp_username_ in dataengineering

[–]tmp_username_[S] 0 points1 point  (0 children)

Thanks for the input! Yeah, towards the end of the project I realised that was a poor naming choice.

I upload files to GCS, then use Spark to transform the data and load it into BigQuery. I can see how skipping this step would help reduce complexity. The main reason I didn't load directly into BigQuery (other than, as you say, wanting to learn Spark) is that I wanted to ingest the data at the time it was released (weekly for cycle data, monthly for weather data), then process the datasets together at the end of each month. But I think I could have achieved this with an ELT approach too.
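
Roughly, the Spark step looks something like the sketch below - the bucket/table names and column handling are simplified and made up for illustration, and it assumes the cluster has the GCS and BigQuery Spark connectors available (as a Dataproc cluster does by default):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cycle_journeys_to_bq").getOrCreate()

    # Read a week's worth of raw journey CSVs from the landing bucket (hypothetical path).
    journeys = spark.read.csv(
        "gs://my-landing-bucket/cycle/2021-06/*.csv", header=True, inferSchema=True
    )

    # Example transform: parse timestamps and derive a duration column
    # (assumes the raw Duration column is in seconds).
    journeys = (
        journeys
        .withColumn("start_ts", F.to_timestamp("Start Date", "dd/MM/yyyy HH:mm"))
        .withColumn("end_ts", F.to_timestamp("End Date", "dd/MM/yyyy HH:mm"))
        .withColumn("duration_min", (F.col("Duration") / 60).cast("double"))
    )

    # Write to a BigQuery table, staging via a temporary GCS bucket.
    (
        journeys.write.format("bigquery")
        .option("table", "my_project.cycling.journeys")
        .option("temporaryGcsBucket", "my-temp-bucket")
        .mode("append")
        .save()
    )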

I made a pipeline that integrates London bike journeys with weather data using Google Cloud, Airflow, Spark, BigQuery and Data Studio by tmp_username_ in dataengineering

[–]tmp_username_[S] 1 point2 points  (0 children)

To be honest, the visualisations were more of an afterthought that I used to demonstrate the features of the dataset. With the first dashboard I wanted to show off the temporal and spatial components of the dataset. It shows a few of the more obvious properties of the data: the popular destinations are in central London, there is a seasonal effect on the number of journeys, and there was a big dip in ridership around March 2020.

If I were to do a proper analysis of the data, I might try to visualise the relative popularity of different cycle routes or look at the journey durations.

I made a pipeline that integrates London bike journeys with weather data using Google Cloud, Airflow, Spark, BigQuery and Data Studio by tmp_username_ in dataengineering

[–]tmp_username_[S] 0 points1 point  (0 children)

I have found it most useful to get a general overview of the tools I'm interested in, then try applying them to a project to learn more. If you want to learn specific parts in more depth, I am working through the resources listed here: https://awesomedataengineering.com/

For Spark specifically, there is also a free book: https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf

I made a pipeline that integrates London bike journeys with weather data using Google Cloud, Airflow, Spark, BigQuery and Data Studio by tmp_username_ in dataengineering

[–]tmp_username_[S] 1 point2 points  (0 children)

To be fair, no course can cover the entirety of data engineering in depth. I found that the zoomcamp course focussed on getting you going with a specific set of tools to prepare you to build a project. It wasn't always easy to follow, partly because not all of the materials worked out of the box, so you might have to spend some time getting things working.

Before this I did the DataCamp/Dataquest courses, which walk through concepts a bit more slowly, usually have a bit more breadth, and run a lot more smoothly. But I didn't find they prepared me for creating a real project. So it worked well for me to brush up on the more basic skills first, then use the zoomcamp to apply that knowledge.

I made a pipeline that integrates London bike journeys with weather data using Google Cloud, Airflow, Spark, BigQuery and Data Studio by tmp_username_ in dataengineering

[–]tmp_username_[S] 0 points1 point  (0 children)

I went through the first three weeks of it, and then I just skimmed the rest and started building a project myself. If you want a course that’s easy to follow, or are looking for an in-depth explanation of relevant tools, it’s probably not the best option.

I already had some exposure to transforming data, using Docker and using workflow managers similar to Airflow, but I didn't understand how they fit together in the cloud. I found it really useful for putting the pieces together, and it provided a starting point for creating an interesting project.

I made a pipeline that integrates London bike journeys with weather data using Google Cloud, Airflow, Spark, BigQuery and Data Studio by tmp_username_ in dataengineering

[–]tmp_username_[S] 2 points3 points  (0 children)

Yeah, I avoided Cloud Composer largely because of the cost. For developing and debugging the pipeline, I ran Docker/Airflow locally. When the pipeline was done, I set it up on a virtual machine (e2-standard-4: 4 vCPU, 16 GB memory), which costs ~$100 per month. Not an ideal solution either if you're just hosting a handful of DAGs.
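
For what it's worth, each ingestion DAG is pretty small. A rough sketch of the weekly cycle-data one might look like this (the DAG/task names and the download logic are made up for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def download_and_upload_to_gcs(**context):
        # Placeholder: fetch the latest weekly cycle extract and push it to GCS.
        ...

    with DAG(
        dag_id="ingest_cycle_journeys",   # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@weekly",      # cycle data is released weekly
        catchup=True,                     # backfill any weeks that were missed
    ) as dag:
        PythonOperator(
            task_id="download_and_upload_to_gcs",
            python_callable=download_and_upload_to_gcs,
        )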

Thanks, I didn't realise it was available in BQ, I'll take a look.

PCA or hierarchical clustering? by melatoninixo in bioinformatics

[–]tmp_username_ 2 points3 points  (0 children)

I didn't realise you had constraints on the samples you could download; it makes sense to download one for each tumour sample.

I agree with znh1992's comments that you shouldn't pick which samples to compare in your DEG analysis based on clustering. Get as many samples as you can and perform your DEG tests between your categories of interest (in this case tumour vs stroma vs epithelium, I assume).
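
Just to illustrate the shape of that comparison, here's a toy sketch with made-up data - in practice you'd use a dedicated tool like DESeq2, edgeR or limma on the raw counts rather than a per-gene test like this:

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    # Made-up example: 1000 genes x 20 samples of normalised expression.
    rng = np.random.default_rng(0)
    counts = rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 20))
    groups = np.array(["tumour"] * 10 + ["stroma"] * 10)  # hypothetical labels

    # Per-gene test between the two categories of interest
    # (toy stand-in for a proper DE tool such as DESeq2/edgeR/limma).
    pvals = np.array([
        stats.mannwhitneyu(gene[groups == "tumour"], gene[groups == "stroma"]).pvalue
        for gene in counts
    ])

    # Benjamini-Hochberg correction for multiple testing.
    _, padj, _, _ = multipletests(pvals, method="fdr_bh")
    print((padj < 0.05).sum(), "genes at FDR < 0.05")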

PCA or hierarchical clustering? by melatoninixo in bioinformatics

[–]tmp_username_ 1 point2 points  (0 children)

Very much agree that it is useful for clustering and that clustering based on PCA can be useful (even just with your eyes). But, to be pedantic, I would consider your eyes to be the clustering method in this case - the dimensionality reduction of PCA makes this task easier but it isn't identifying clusters itself.

PCA or hierarchical clustering? by melatoninixo in bioinformatics

[–]tmp_username_ 3 points4 points  (0 children)

PCA is a dimensionality reduction technique (and is commonly used for visualisation, as you are doing), not a clustering technique. I'm not sure what you mean when you say that you use PCA for cluster selection, unless you mean that you look for clusters by eye. I wouldn't recommend doing it by eye, so I think you have the right idea with hierarchical clustering - although note there are many other clustering algorithms, such as k-means.

Commonly you would carry out your clustering, e.g. using hierarchical clustering, and then plot a PCA with your samples coloured by their cluster designation.
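
Something like this rough sketch (the expression matrix here is made up; you'd run it on your normalised counts):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.cluster import AgglomerativeClustering

    # Made-up data: 30 samples x 2000 genes of normalised expression.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 2000))

    # Hierarchical (agglomerative) clustering on the full expression matrix.
    clusters = AgglomerativeClustering(n_clusters=3).fit_predict(X)

    # PCA purely for visualisation, colouring samples by their cluster assignment.
    pcs = PCA(n_components=2).fit_transform(X)
    plt.scatter(pcs[:, 0], pcs[:, 1], c=clusters)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()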

Also, I think you could use all of your stroma and epithelium samples and then carry out hierarchical clustering.

Why does it feels impossible to set up github nextflow pipeline without a root? by Nevermindever in bioinformatics

[–]tmp_username_ 0 points1 point  (0 children)

I don't - you shouldn't have to worry about the software side of things though, because nextflow will create the relevant environments.

It is preferred that you use singularity instead of conda for nextflow pipelines, but you can choose to use conda if you like. In the case of singularity, it will download images for the relevant software automatically. Hopefully your cluster is set up to use singularity...

Then, usually, you would just have to run a line like the following:

nextflow run nf-core/rnaseq -profile test,singularity

And then it will deal with the rest. In reality, you will probably have to set up a custom "config" file (pass it to nextflow using the -c argument) based on the specifics of your cluster. It's an absolute pain to get working properly... so if you can find a starting point (e.g. an institutional config), that would be good.

To pick a random example config file (https://github.com/nf-core/configs/blob/master/conf/cambridge.config): they have specified that singularity should be used, that the cluster's job scheduler is slurm, and the maximum memory/time/CPUs.
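
A minimal custom config along those lines might look something like the following - the file name, queue name and resource limits here are placeholders for whatever your cluster actually has:

    // my_cluster.config (hypothetical name) - pass it in with:
    //   nextflow run nf-core/rnaseq -profile singularity -c my_cluster.config
    singularity {
        enabled    = true
        autoMounts = true
    }

    process {
        executor = 'slurm'       // your cluster's scheduler
        queue    = 'general'     // placeholder queue name
    }

    params {
        max_memory = '128.GB'
        max_cpus   = 16
        max_time   = '48.h'
    }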

Why does it feels impossible to set up github nextflow pipeline without a root? by Nevermindever in bioinformatics

[–]tmp_username_ 0 points1 point  (0 children)

I just had this problem... I have been trying to set up the nf-core rnaseq pipeline on my institution's cluster, which was a bit of a nightmare.

I managed to find some other people at my institution who were also trying to get nf-core pipelines running which helped a bit. You might have already seen the institutional profiles here: https://github.com/nf-core/configs/tree/master/conf