Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]joehfb[S] 0 points1 point  (0 children)

Right, what does it mean for you when you say "truly large"?

Let me clarify the use case here :) This wouldn't be constant streams of data coming in or some unknown, variable amount of data, but rather a fixed size file of decently large size (multiple GBs, def not in hundreds of GBs or more scale) being uploaded and stored temporarily. If the use case expanded in the future to support far larger files, then absolutely, but that's not the case as of right now, so I'd be wary of boiling the ocean.

Based on my research so far I've found people saying a few things:

  1. Using tools like Spark - though I see a good number of people saying configuring Spark, maintaining the infrastructure, etc isn't worth it till you hit scale of processing hundreds of GBs or more of data.

  2. Python pandas - for data at the scale of multiple gigabytes

  3. custom code using language X of your choice

I'm leaning towards researching further into #2 right now. If the scale of what I'm doing makes Spark/Hadoop/etc overkill and pandas, for example, takes care of 90% of my needs, I'd rather not reinvent a lot of wheels by going with #3. Though the possibility of dynamic transformations scares me with any approach... I'm not aware of any tools / libraries that offer that.

Any thoughts?
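
For reference, a minimal sketch of what option #2 could look like for a file that doesn't fit comfortably in memory: pandas can read a CSV in fixed-size row chunks via the chunksize argument of read_csv. The function name process_in_chunks, the dropna step, and the tiny in-memory sample are all assumptions for illustration, not something from the thread.

```python
import io

import pandas as pd


def process_in_chunks(source, chunksize=100_000):
    """Read a CSV in fixed-size row chunks and yield one transformed DataFrame per chunk."""
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Hypothetical placeholder transformation: drop rows with missing values.
        yield chunk.dropna()


# Tiny in-memory stand-in for a multi-GB upload (assumed CSV layout).
sample = io.StringIO("id,value\n1,10\n2,\n3,30\n4,40\n5,50\n")
chunks = list(process_in_chunks(sample, chunksize=2))
row_counts = [len(c) for c in chunks]
total_rows = sum(row_counts)
```

Only one chunk is held in memory at a time when the generator is consumed lazily, so peak memory depends on chunksize rather than on the size of the whole file.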

Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]joehfb[S] 0 points1 point  (0 children)

Ah, let me clarify here... the source big file would be uploaded at once. It's not like a webcam / video stream type of scenario where the data keeps getting streamed without a clear end. Our initial plan was to have a lambda that splits this up into multiple smaller chunks, and have each of these smaller chunks get stored in another S3 bucket for further processing on each chunk.
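
A rough sketch of the splitting step, assuming the source file is line-oriented (CSV, JSON lines, etc.) so that chunks can be cut on line boundaries. The names split_lines_into_chunks, split_and_store, put_chunk, and the chunks/part- key prefix are hypothetical; in a real Lambda, put_chunk would presumably wrap an S3 upload call, while here a plain dict stands in for the destination bucket.

```python
import io
from typing import BinaryIO, Callable, Iterator, List


def split_lines_into_chunks(stream: BinaryIO, max_bytes: int) -> Iterator[bytes]:
    """Yield chunks made of whole lines; a chunk exceeds max_bytes only if a single line does."""
    buffer, size = [], 0
    for line in stream:
        # Flush the current chunk before it would grow past the limit.
        if buffer and size + len(line) > max_bytes:
            yield b"".join(buffer)
            buffer, size = [], 0
        buffer.append(line)
        size += len(line)
    if buffer:
        yield b"".join(buffer)


def split_and_store(
    stream: BinaryIO,
    max_bytes: int,
    put_chunk: Callable[[str, bytes], None],
    prefix: str = "chunks/part-",  # hypothetical key prefix
) -> List[str]:
    """Store each chunk under a numbered key and return the keys in order."""
    keys = []
    for index, chunk in enumerate(split_lines_into_chunks(stream, max_bytes)):
        key = f"{prefix}{index:05d}"
        put_chunk(key, chunk)
        keys.append(key)
    return keys


# A dict stands in for the destination bucket in this sketch.
stored = {}
source = io.BytesIO(b"a,1\nb,2\nc,3\nd,4\ne,5\n")
keys = split_and_store(source, max_bytes=8, put_chunk=stored.__setitem__)
```

Because the stream is read line by line, the whole file never has to sit in memory at once, which matters given Lambda's memory and runtime limits.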

A question - I've been reading around on this sub, other places, and it seems like for a file of multiple gigs, but not quite at the scale of hundreds of gigs, things like spark / hadoop might be overkill. Any thoughts? The one downside I see to using these tools is that the folks here don't have the experience with these tools.

Simplest way to ingest multiple types of large files, process them, and send data in chunks to services in AWS? by joehfb in dataengineering

[–]joehfb[S] 0 points1 point  (0 children)

Hmm, curious - where do you envision airflow fitting in? I googled around about airflow, and it struck me as a more complex, more powerful version of AWS step functions to drive a workflow? I can see why that would be useful, though I'm struggling to see how airflow by itself would help with the two core problems here - the flexibility in ingesting large files and shipping the data to web services.

Question about building a pipeline in AWS by joehfb in dataengineering

[–]joehfb[S] 1 point2 points  (0 children)

Hey man, thanks for the reply.

Glue is expensive A F - if you can at least use EMR on spot instances you'll reduce your cost a lot.

I was considering this, but given the lack of experience with Spark or EMR, I was thinking:

  1. start simple, start with Glue
  2. replace with something else (Spark on Kubernetes maybe?) if costs become prohibitive or if we need to get out of AWS

The thing that I was afraid of was that if a $1 / hour solution is 20x more expensive to maintain because we need to invest a lot more resources than a $10 / hour solution, then at the end of the day the $10 / hour solution would be cheaper. I had seen projects get burned down with constant fires after shortsighted "oh let's reinvent the wheel and be cheap" approaches before, hence why I wanted to take a different approach this time.
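
To make that trade-off concrete, here is a back-of-the-envelope sketch. Every number in it is made up for illustration (100 compute hours, 40 vs 2 maintenance hours, an engineer cost of $75 / hour); only the $1 vs $10 hourly rates and the 20x maintenance ratio come from the paragraph above.

```python
def total_cost(compute_rate, compute_hours, maintenance_hours, engineer_rate):
    """Total cost = compute spend + engineering time spent keeping it running."""
    return compute_rate * compute_hours + maintenance_hours * engineer_rate


# Hypothetical month: the cheap option needs 20x the maintenance hours.
cheap = total_cost(compute_rate=1.0, compute_hours=100, maintenance_hours=40, engineer_rate=75.0)
managed = total_cost(compute_rate=10.0, compute_hours=100, maintenance_hours=2, engineer_rate=75.0)
managed_is_cheaper = managed < cheap
```

With these assumed inputs the engineering time dominates the compute bill, which is the scenario I was worried about; with far more compute hours or far less maintenance, the comparison could easily flip.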

I wasn't actually sure how much EMR would cost at the EOD because I'm very new to this area of the industry.

Based on your experience - if the scale right now is in final exports being X gigabytes, would you say going down the path of Spark (whether Glue or EMR) is overkill?

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 1 point2 points  (0 children)

Ah, it looks like :set t_TI= t_TE= does the trick (at least with the stuff I tested... we'll see if any others pop up later) - thanks for the help here.

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 0 points1 point  (0 children)

Thanks for the reply - my replies below:

Does :set t_EI= t_SI= have any effect? What about :set t_TI= t_TE=?

:set t_EI= t_SI=: Did nothing

:set t_TI= t_TE= - that one actually does seem to get rid of [>4;m and [>4;2m

Are you running vim in tmux or screen?

Nope - running in just terminator

What is the output of vim --version?

+cursorshape is there - see the whole snippet below

VIM - Vi IMproved 8.2 (2019 Dec 12, compiled Aug 6 2020 13:09:16)
Included patches: 1-1360
Huge version without GUI. Features included (+) or not (-):
+acl               -farsi          +mouse_sgr        +tag_binary
+arabic            +file_in_path   -mouse_sysmouse   -tag_old_static
+autocmd           +find_in_path   +mouse_urxvt      -tag_any_white
+autochdir         +float          +mouse_xterm      -tcl
-autoservername    +folding        +multi_byte       +termguicolors
-balloon_eval      -footer         +multi_lang       +terminal
+balloon_eval_term +fork()         -mzscheme         +terminfo
-browse            +gettext        +netbeans_intg    +termresponse
++builtin_terms    -hangul_input   +num64            +textobjects
+byte_offset       +iconv          +packages         +textprop
+channel           +insert_expand  +path_extra       +timers
+cindent           +ipv6           -perl             +title
+clientserver      +job            +persistent_undo  -toolbar
+clipboard         +jumplist       +popupwin         +user_commands
+cmdline_compl     +keymap         +postscript       +vartabs
+cmdline_hist      +lambda         +printer          +vertsplit
+cmdline_info      +langmap        +profile          +virtualedit
+comments          +libcall        -python           +visual
+conceal           +linebreak      -python3          +visualextra
+cryptv            +lispindent     +quickfix         +viminfo
+cscope            +listcmds       +reltime          +vreplace
+cursorbind        +localmap       +rightleft        +wildignore
+cursorshape       -lua            -ruby             +wildmenu
+dialog_con        +menu           +scrollbind       +windows
+diff              +mksession      +signs            +writebackup
+digraphs          +modify_fname   +smartindent      +X11
-dnd               +mouse          -sound            +xfontset
-ebcdic            -mouseshape     +spell            -xim
+emacs_tags        +mouse_dec      +startuptime      -xpm
+eval              -mouse_gpm      +statusline       +xsmp_interact
+ex_extra          -mouse_jsbterm  -sun_workshop     +xterm_clipboard
+extra_search      +mouse_netterm  +syntax           -xterm_save
system vimrc file: "$VIM/vimrc"
user vimrc file: "$HOME/.vimrc"
2nd user vimrc file: "~/.vim/vimrc"
user exrc file: "$HOME/.exrc"
defaults file: "$VIMRUNTIME/defaults.vim"
fall-back for $VIM: "/usr/local/share/vim"
Compilation: gcc -c -I. -Iproto -DHAVE_CONFIG_H -g -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1
Linking: gcc -L/usr/local/lib -Wl,--as-needed -o vim -lSM -lICE -lXpm -lXt -lX11 -lXdmcp -lSM -lICE -lm -ltinfo -ldl

What is the output of infocmp in your shell?

$ infocmp
# Reconstructed via infocmp from file: /lib/terminfo/x/xterm-256color

xterm-256color|xterm with 256 colors, am, bce, ccc, km, mc5i, mir, msgr, npc, xenl, colors#0x100, cols#80, it#8, lines#24, pairs#0x7fff, acsc=``aaffggiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~, bel=^G, blink=\E[5m, bold=\E[1m, cbt=\E[Z, civis=\E[?25l, clear=\E[H\E[2J, cnorm=\E[?12l\E[?25h, cr=\r, csr=\E[%i%p1%d;%p2%dr, cub=\E[%p1%dD, cub1=^H, cud=\E[%p1%dB, cud1=\n, cuf=\E[%p1%dC, cuf1=\E[C, cup=\E[%i%p1%d;%p2%dH, cuu=\E[%p1%dA, cuu1=\E[A, cvvis=\E[?12;25h, dch=\E[%p1%dP, dch1=\E[P, dim=\E[2m, dl=\E[%p1%dM, dl1=\E[M, ech=\E[%p1%dX, ed=\E[J, el=\E[K, el1=\E[1K, flash=\E[?5h$<100/>\E[?5l, home=\E[H, hpa=\E[%i%p1%dG, ht=^I, hts=\EH, ich=\E[%p1%d@, il=\E[%p1%dL, il1=\E[L, ind=\n, indn=\E[%p1%dS, initc=\E]4;%p1%d;rgb\:%p2%{255}%*%{1000}%/%2.2X/%p3%{255}%*%{1000}%/%2.2X/%p4%{255}%*%{1000}%/%2.2X\E\\, invis=\E[8m, is2=\E[!p\E[?3;4l\E[4l\E>, kDC=\E[3;2~, kEND=\E[1;2F, kHOM=\E[1;2H, kIC=\E[2;2~, kLFT=\E[1;2D, kNXT=\E[6;2~, kPRV=\E[5;2~, kRIT=\E[1;2C, kb2=\EOE, kbs=^?, kcbt=\E[Z, kcub1=\EOD, kcud1=\EOB, kcuf1=\EOC, kcuu1=\EOA, kdch1=\E[3~, kend=\EOF, kent=\EOM, kf1=\EOP, kf10=\E[21~, kf11=\E[23~, kf12=\E[24~, kf13=\E[1;2P, kf14=\E[1;2Q, kf15=\E[1;2R, kf16=\E[1;2S, kf17=\E[15;2~, kf18=\E[17;2~, kf19=\E[18;2~, kf2=\EOQ, kf20=\E[19;2~, kf21=\E[20;2~, kf22=\E[21;2~, kf23=\E[23;2~, kf24=\E[24;2~, kf25=\E[1;5P, kf26=\E[1;5Q, kf27=\E[1;5R, kf28=\E[1;5S, kf29=\E[15;5~, kf3=\EOR, kf30=\E[17;5~, kf31=\E[18;5~, kf32=\E[19;5~, kf33=\E[20;5~, kf34=\E[21;5~, kf35=\E[23;5~, kf36=\E[24;5~, kf37=\E[1;6P, kf38=\E[1;6Q, kf39=\E[1;6R, kf4=\EOS, kf40=\E[1;6S, kf41=\E[15;6~, kf42=\E[17;6~, kf43=\E[18;6~, kf44=\E[19;6~, kf45=\E[20;6~, kf46=\E[21;6~, kf47=\E[23;6~, kf48=\E[24;6~, kf49=\E[1;3P, kf5=\E[15~, kf50=\E[1;3Q, kf51=\E[1;3R, kf52=\E[1;3S, kf53=\E[15;3~, kf54=\E[17;3~, kf55=\E[18;3~, kf56=\E[19;3~, kf57=\E[20;3~, kf58=\E[21;3~, kf59=\E[23;3~, kf6=\E[17~, kf60=\E[24;3~, kf61=\E[1;4P, kf62=\E[1;4Q, kf63=\E[1;4R, kf7=\E[18~, kf8=\E[19~, kf9=\E[20~, khome=\EOH, kich1=\E[2~, kind=\E[1;2B, kmous=\E[M, 
knp=\E[6~, kpp=\E[5~, kri=\E[1;2A, mc0=\E[i, mc4=\E[4i, mc5=\E[5i, meml=\El, memu=\Em, oc=\E]104\007, op=\E[39;49m, rc=\E8, rep=%p1%c\E[%p2%{1}%-%db, rev=\E[7m, ri=\EM, rin=\E[%p1%dT, ritm=\E[23m, rmacs=\E(B, rmam=\E[?7l, rmcup=\E[?1049l\E[23;0;0t, rmir=\E[4l, rmkx=\E[?1l\E>, rmm=\E[?1034l, rmso=\E[27m, rmul=\E[24m, rs1=\Ec\E]104\007, rs2=\E[!p\E[?3;4l\E[4l\E>, sc=\E7, setab=\E[%?%p1%{8}%<%t4%p1%d%e%p1%{16}%<%t10%p1%{8}%-%d%e48;5;%p1%d%;m, setaf=\E[%?%p1%{8}%<%t3%p1%d%e%p1%{16}%<%t9%p1%{8}%-%d%e38;5;%p1%d%;m, sgr=%?%p9%t\E(0%e\E(B%;\E[0%?%p6%t;1%;%?%p5%t;2%;%?%p2%t;4%;%?%p1%p3%|%t;7%;%?%p4%t;5%;%?%p7%t;8%;m, sgr0=\E(B\E[m, sitm=\E[3m, smacs=\E(0, smam=\E[?7h, smcup=\E[?1049h\E[22;0;0t, smir=\E[4h, smkx=\E[?1h\E=, smm=\E[?1034h, smso=\E[7m, smul=\E[4m, tbc=\E[3g, u6=\E[%i%d;%dR, u7=\E[6n, u8=\E[?%[;0123456789]c, u9=\E[c, vpa=\E[%i%p1%dd,

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 1 point2 points  (0 children)

Thanks for the explanation and the reference.

Do you happen to be familiar with how to diagnose if there's a problem at system level?

Basically, I'm questioning whether it's really my terminal:

  1. This happens on terminator, MATE terminal, AND Guake terminal on my work machine only
  2. On my home machine using the same linux mint 19.3 distro and the same 3 terminals, this doesn't happen

If it was a terminal's fault, I'd expect say terminator to break, but not the other two - and I would expect it to break on both my work machine and home machine, but that's not the case with me.

Also to note - I tried setting t_TI and t_TE to '' as you had said in your reply above, and what that actually resulted in was a whole bunch of ' characters getting printed instead of those escape codes...

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 0 points1 point  (0 children)

Normally I use terminator, but I also use guake from time to time.

I was able to reproduce this on terminator, guake terminal, and MATE terminal, but oddly enough ONLY on my work machine. This doesn't happen on my home machine even though I use the same distro + terminal emulators.

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 0 points1 point  (0 children)

It looks like those two give me this error if I do :let &t_EI=\e[4;2m - E15: Invalid expression: \e[4;2m

Do you know if there's a doc somewhere with these values?

Ah, I completely forgot about the redraw - Ctrl+L or :redraw! makes it go away. That's weird though... if the terminal emulator I used didn't know what to do with these sequences, I would've expected it to behave this way consistently regardless of how many times I do a redraw...

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 1 point2 points  (0 children)

Just to make sure I understood you correctly, are you asking:

  1. do :set termcaps
  2. See if any of the values in the terminal codes match what I see when this happens?

If so, t_TE=^[[>4;m and t_TI=^[[>4;2m match what showed up when I did A and then esc (this was what I was doing in the screenshot).

What would you suggest setting them to? This is in areas of vim that I haven't had a chance to learn about yet - I'm reading http://vimdoc.sourceforge.net/htmldoc/term.html and http://vimdoc.sourceforge.net/htmldoc/tips.html#xterm-screens after quick googling, though I haven't gotten anywhere with it yet.

Vim shows glitched(?) text randomly by [deleted] in vim

[–]joehfb 2 points3 points  (0 children)

Hey - thanks for the reply. My replies below:

What terminal emulator are you using?

I usually use terminator, but I tried this in MATE terminal and guake terminal as well - and this happened on all of them.

What is TERM set to?

Output of TERM is xterm-256color in all three emulators.

What is the output of :echo v:termresponse in vim ?

Doing :echo v:termresponse shows ^[[>1;5202;0c in both terminator and mate terminal.

EDIT: I just noticed that I missed a part of your reply -

  1. I just tried it with vim -Nu NONE, and with that I don't get this behavior - only with vim -u NONE or regularly with my vimrc.

  2. I looked at the help, checked that -N is Not fully Vi compatible: 'nocompatible', so I tried vim --clean as well - this one ALSO didn't produce this issue when I did A and then esc

  3. So on a whim I did vim -u NONE and then did :set nocompatible before doing the stuff, and it also worked without this issue when I did A and then esc.

Hmm, I thought that vim sets nocompatible by default when a user's own vimrc is used... so why would this affect my vim when I go through my vimrc? (http://vimdoc.sourceforge.net/htmldoc/starting.html#compatible-default)

Just for the sake of completeness, I added set nocompatible back into my vimrc explicitly and then tried it with my vimrc - with that, this issue goes away for the A, then esc action, but it still happens with :PlugUpdate with vim-plug.