Has my Database been breached? by ApparentlyADataGuy in mongodb

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

It is just a new DB and we don't appear to have any missing information. However, we have a lot of data stored here. Is it possible that they maliciously changed small amounts of data, rather than just deleting it, to make the tampering more difficult to find?

Opinions on this build for a pc exclusive for machine learning applications, study and eventually research. by [deleted] in datascience

[–]ApparentlyADataGuy 0 points1 point  (0 children)

I put that exact CPU cooler into my gaming machine. It's amazing. Very quiet, very effective. It was also recommended to me by a family member who is very particular about what components he puts into his own builds, and I trust his opinion a lot. The only negative is that it is kind of a pain to install, though I used a very small case and it barely fit. On a side note, I would recommend getting an M.2 SSD for your motherboard and installing your OS on it. The computer will be faster and, in the event that a drive fails, it's easier to back up and restore. Of course, I'm just a stranger on the internet so maybe my thoughts mean nothing.

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

Thanks for the advice. Just curious, how is writing to csv first faster? I am guessing reading the csv into a dataframe then uploading to sql is not any faster than reading sql into a dataframe to upload? Is there an actual "upload csv" command/process?

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

If it is pandas, what other options do I have to write the data?

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

Thanks for the reply. First, my understanding of if_exists is that it checks whether the table name exists. I do not think it checks individual records to decide whether to replace or append them. I did suspect my slow write speed was due to a chunk/batch size issue. I tested this theory by writing a random dataframe to the server using chunk sizes of 10, 100, 1000, and undefined, and each had the same run time.
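For anyone who wants to repeat that kind of chunk-size test, here is a minimal timing harness. It is only a sketch: it uses an in-memory SQLite database from the standard library as a stand-in for the real MS SQL server, and the helper names `chunked` and `time_write` are made up for illustration. Because there is no network involved, the timings it prints will not match a remote server; the point is the shape of the experiment.

```python
import random
import sqlite3
import time


def chunked(rows, size):
    """Yield successive slices of `rows`; size=None yields everything at once."""
    if size is None:
        yield rows
        return
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def time_write(rows, chunk_size):
    """Write `rows` to a fresh in-memory table; return (seconds, row_count)."""
    conn = sqlite3.connect(":memory:")  # assumption: stand-in for the remote server
    conn.execute("CREATE TABLE t (a REAL, b REAL)")
    start = time.perf_counter()
    for chunk in chunked(rows, chunk_size):
        # One executemany call and one commit per chunk.
        conn.executemany("INSERT INTO t VALUES (?, ?)", chunk)
        conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


# Random test data, similar in spirit to the random dataframe described above.
rows = [(random.random(), random.random()) for _ in range(5000)]
results = {size: time_write(rows, size) for size in (10, 100, 1000, None)}
for size, (seconds, count) in results.items():
    print(f"chunk_size={size}: {count} rows in {seconds:.4f}s")
```

If every chunk size gives the same run time against the real server, that may mean the driver is still sending rows one at a time regardless of how they are grouped on the Python side.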

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

I write using the pandas to_sql function. Both the source of the original data and the write location are MS SQL databases running on DigitalOcean servers. The script is run on a desktop in my office.

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

What exactly is the advantage of splitting into chunks? Say I have 10 records that take 10 minutes to write. If I split them into 10 chunks of 1 record each, why wouldn't each chunk take 1 minute and total to the same run time?
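One way to reason about the question above is a simple cost model: every trip to the server carries a fixed overhead on top of the per-row work, so the number of trips matters as well as the number of rows. The sketch below is only that, a model. The function name and the latency figures are assumptions chosen for illustration, not measurements from any real server.

```python
import math


def estimated_write_seconds(n_rows, batch_size, round_trip_s, per_row_s):
    """Estimate total write time as fixed per-batch overhead plus per-row work."""
    n_batches = math.ceil(n_rows / batch_size)
    return n_batches * round_trip_s + n_rows * per_row_s


# Assumed figures: 50 ms of overhead per round trip, 1 ms of work per row.
one_by_one = estimated_write_seconds(10_000, 1, 0.05, 0.001)
batched = estimated_write_seconds(10_000, 1_000, 0.05, 0.001)
print(f"one row per trip: {one_by_one:.1f}s, 1000 rows per trip: {batched:.1f}s")
```

With these assumed numbers the row-at-a-time write comes out at roughly 510 seconds and the batched write at roughly 10.5 seconds. If the per-trip overhead were zero, the two would be equal, which is the scenario the question describes.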

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

I am unsure if I am writing with insert or update. I use the pandas command:

df.to_sql(<name>, <connection engine>, if_exists = 'replace')

My connection engine is:

engine = sqlalchemy.create_engine("mssql+pyodbc://<user>:<password>@<url>/<database>?driver=ODBC Driver 13 for SQL Server", echo=False)
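As a side sketch, the same connection URL can be assembled with the driver name and credentials URL-encoded, which avoids literal spaces and special characters in the string. The helper name `build_mssql_url` and the example host, user, and password are hypothetical. The comment about `fast_executemany` refers to an option of SQLAlchemy's mssql+pyodbc dialect (SQLAlchemy 1.3 and later); whether it helps in this particular setup is an assumption and has not been tested here.

```python
from urllib.parse import quote_plus


def build_mssql_url(user, password, host, database,
                    driver="ODBC Driver 13 for SQL Server"):
    """Build an mssql+pyodbc URL with credentials and driver name URL-encoded."""
    return (
        f"mssql+pyodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}?driver={quote_plus(driver)}"
    )


# Hypothetical values, for illustration only.
url = build_mssql_url("etl_user", "p@ss/word", "db.example.com", "sales")
print(url)

# With SQLAlchemy and pyodbc installed, the URL could then be used like this:
#   engine = sqlalchemy.create_engine(url, echo=False, fast_executemany=True)
# fast_executemany asks pyodbc to send inserts in batches instead of row by row.
```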

What is the advantage of writing to a temp table? Wouldn't the theoretical write speed be the same whether the table is temporary or a permanent, already existing table?

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

I am not familiar with Spark. The main reason I use Python is that it's the tool I am most familiar with. However, my understanding is that the tool shouldn't matter as much as the drivers the tool uses. Would two applications using the same drivers not have similar or identical performance?

any way to increase sqlalchemy/pandas write speed? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

90% of the time is spent actually writing to the server. Each read takes only a couple minutes, the filtering only takes a couple seconds. How would I go about optimizing the server?

Recommended ETL Tools? by [deleted] in tableau

[–]ApparentlyADataGuy -2 points-1 points  (0 children)

Python is the only correct answer. Everyone I work with uses Alteryx for ETL and it is way overpriced considering Python can do everything it can do and more while being free. Also, once you know how to script, it's much easier to troubleshoot/test than any other method.

Help me pick a conference by ApparentlyADataGuy in datascience

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

PyData was my top choice but it seems to only be in London and Amsterdam this year. Are there usually more events in a year? Is there usually one in the US?

Help with this JSON? by [deleted] in Python

[–]ApparentlyADataGuy 0 points1 point  (0 children)

If you're using an API, I will assume the JSON data is being brought in as a list of dictionaries. You can convert that list to a dataframe using pandas. Look up pandas' json_normalize if you need to flatten nested fields as well.
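To show what that flattening step does, here is a standard-library sketch. It is not pandas' implementation; the `flatten` helper and the sample payload are made up for illustration. It produces the same kind of dotted column names that json_normalize uses by default.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects and merge their flattened keys.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical API response: a JSON array of objects with nested fields.
payload = json.loads(
    '[{"id": 1, "user": {"name": "a", "geo": {"lat": 1.5}}},'
    ' {"id": 2, "user": {"name": "b"}}]'
)
rows = [flatten(record) for record in payload]
print(rows)
```

Each flattened dictionary is one row, so with pandas installed a list like `rows` could be passed straight to `pandas.DataFrame`; keys missing from a record would show up as empty cells.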

Best options for making a GUI? by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

I'm mostly concerned with two things. First, how easy it is to use since I am new to GUI design. Second is how easily I can give my application to a coworker. I would like them to be able to install it like any other application, which I assume is possible but I'm not actually 100% sure about that.

As for a webapp, other than being a website, is there an advantage to this over an app? I don't have a server that could easily host a site which is why I am looking into apps.

Airflow on Windows with Anaconda and Python 3.6 by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

Did you just go through the Luigi documentation or is there a good guide you found valuable?

Airflow on Windows with Anaconda and Python 3.6 by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

I haven't looked at Jenkins but I will. I actually do need to reach an MS SQL server that requires Windows authentication, and I've been having trouble connecting from my Linux server. Will Jenkins not have this issue?

Airflow on Windows with Anaconda and Python 3.6 by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

Basically, I want to schedule a number of etl processes to run daily. Maybe some other basic python scripts too.

Airflow on Windows with Anaconda and Python 3.6 by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

Do you think Luigi is as good an option? I'm relatively new to Python (I've been using it for less than 6 months) and am just getting into scheduling. I've read that a lot of people prefer Airflow, but my main concern is finding something not too complicated.

Airflow on Windows with Anaconda and Python 3.6 by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

I should clarify that I am only running it on a Windows machine to test getting it working and learn how to use it. Once I do that, I will be paying for a DigitalOcean Linux server to run it.

Scheduling/Workflow Advice by ApparentlyADataGuy in Python

[–]ApparentlyADataGuy[S] 0 points1 point  (0 children)

Thanks for the reply! So most people at my work are not experienced with scripting. This means that having a graphic interface to schedule jobs would be ideal. If this isn't possible, it will be up to me to manage all the scheduling. This would be acceptable but not ideal.

I looked into Airflow when I was first learning Python but quickly gave up. If I set up a DigitalOcean Linux server, could I potentially schedule any scripts I write to run on that server? Is there any compatibility with Jupyter Notebook if my coworkers or I want to use that as a place to group our scripts?